I am fetching a page with curl, and its markup is badly malformed. I am trying to parse one particular snippet of the page into paragraphs. The snippet may be divided by <p> and </p> tags, or separated by one or more <br> or <br/> tags. Where two <br> tags follow one another, I don't want them to produce two separate paragraphs.
My current code for parsing and displaying the paragraphs is:
$paragraphs = preg_split('/(<\s*p\s*\/?>)|(<\s*br\s*\/?>)|(\s\s+)|(<\s*\/p\s*\/?>)/', $article, -1, PREG_SPLIT_NO_EMPTY);
$paragraphcount = count($paragraphs);
for ($x = 1; $x <= $paragraphcount; $x++)
{
    echo "<p>".$paragraphs[$x-1]."</p>";
}
However, this is not working as expected. Some different inputs/outputs are as follows:
Input 1: first part </p> <p> second part </p> <p> third part </p> <p> fourth part <br/>
Output 1: <p>first part </p><p> </p><p>second part </p><p> </p><p> third part </p><p> </p><p>fourth part</p><p> </p>
My code is parsing the input into paragraphs; however, it's also adding extra paragraphs containing only a space.
Any help would be appreciated.
Input is UTF-8 if it makes a difference.
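For reference, here is a minimal sketch of one way the space-only paragraphs might be avoided. My understanding is that the lone space between </p> and <p> survives because (\s\s+) needs at least two whitespace characters, and PREG_SPLIT_NO_EMPTY only drops truly empty strings. The sketch treats a whole run of tags, plus any whitespace around them, as a single separator. The function name split_paragraphs is hypothetical, and it assumes the tags carry no attributes (so <p class="x"> would not be matched).

```php
<?php
// Hypothetical helper, not part of the original code.
// Assumes <p>, </p>, <br> and <br/> tags without attributes.
function split_paragraphs(string $article): array
{
    // One separator is a run of p/br tags (opening, closing or self-closing)
    // together with the whitespace around them, or a run of 2+ whitespace
    // characters, as in the original pattern. The trailing "+" makes
    // consecutive tags such as "<br><br>" count as a single separator.
    $pattern = '/(?:\s*<\s*\/?\s*(?:p|br)\s*\/?\s*>\s*|\s{2,})+/iu';

    $parts = preg_split($pattern, $article, -1, PREG_SPLIT_NO_EMPTY);
    if ($parts === false) {
        // preg_split() returns false on failure, e.g. invalid UTF-8 with /u.
        return [];
    }

    // Trim stray leading/trailing whitespace and drop anything left empty.
    $parts = array_map('trim', $parts);
    $parts = array_filter($parts, function ($p) {
        return $p !== '';
    });

    return array_values($parts);
}

$article = 'first part </p> <p> second part </p> <p> third part </p> <p> fourth part <br/>';

foreach (split_paragraphs($article) as $paragraph) {
    echo "<p>" . $paragraph . "</p>";
}
// prints: <p>first part</p><p>second part</p><p>third part</p><p>fourth part</p>
```

Because the separator is matched as one run, "a<br><br>b" yields two paragraphs rather than three pieces. Trimming afterwards is a design choice that also catches a single leading or trailing space, which the pattern alone would leave in place.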