I'm working on a way to parse an HTML document into a database, and I'm not allowed to change any of the document's formatting. In the following example I need to find which tags have the class "Category" and then grab the text inside that tag, in this case "Example Text".
How do I get the pattern to match more than just the tags that are closed immediately afterwards?
$tags = "<p class=Category style='margin-left:0in;text-indent:0in'><a name='_
Toc390163149'></a><a name='_Ref388370252'></a><a
name='_Toc122858606'><span lang=EN-GB>3.<span style='font:7.0pt 'Times New
Roman''> </span></span><span lang=EN-GB>Example Text</span></a></p>";
preg_match_all("/(<([\w]+)[^>]*>)(.*?)(<\/\\2>)/", $tags, $matches, PREG_SET_ORDER);
foreach ($matches as $val) {
echo "matched: " . htmlspecialchars($val[0]) . "<br>";
echo "part 1: " . htmlspecialchars($val[1]) . "<br>";
echo "part 2: " . htmlspecialchars($val[2]) . "<br>";
echo "part 3: " . htmlspecialchars($val[3]) . "<br>";
echo "part 4: " . htmlspecialchars($val[4]) . "<br><br>";
}
Outputs:
matched: <a name="_Toc390163149"></a>
part 1: <a name="_Toc390163149">
part 2: a
part 3:
part 4: </a>
matched: <a name="_Ref388370252"></a>
part 1: <a name="_Ref388370252">
part 2: a
part 3:
part 4: </a>
matched: <span lang=EN-GB>When not to follow Rules</span>
part 1: <span lang=EN-GB>
part 2: span
part 3: When not to follow Rules
part 4: </span>
Any ideas?
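One thing that may be relevant: by default `.` in PCRE does not match a newline, so if the line breaks in `$tags` are real newlines, `(.*?)` can never reach `</p>` and only tags closed on the same line will match. Below is a minimal sketch of that effect, using a shortened, hypothetical input string (`$html`) rather than the real document. It is an illustration of the `s` (dot-all) modifier under that assumption, not a confirmed fix.

```php
<?php
// Hypothetical, shortened input; the "\n" mimics a line break inside the <p> element.
$html = "<p class=Category><a name='x'></a>\n<span lang=EN-GB>Example Text</span></p>";

// Same pattern as in the question, written without delimiters so both variants can share it.
$pattern = '(<(\w+)[^>]*>)(.*?)(</\2>)';

// Without the "s" modifier, "." stops at "\n", so <p>...</p> cannot match as a whole.
preg_match('~' . $pattern . '~', $html, $withoutS);

// With "s" (dot-all), "." also matches newlines.
preg_match('~' . $pattern . '~s', $html, $withS);

echo $withoutS[2], "\n";                  // a
echo $withS[2], "\n";                     // p
echo trim(strip_tags($withS[3])), "\n";   // Example Text
```

Even with `s`, the lazy `(.*?)` stops at the first closing tag of the same name, so nested elements of the same type (such as the `<span>` inside a `<span>` in the real data) would still be cut short. That limitation is inherent to this kind of pattern.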