I am trying to index some content from a series of .html's that share the same format.
So I get a lot of lines like this: <a href="meh">[18]</a> blah blah blah <a...
And the idea is to extract the number (18) and the text next to it (blah...). Furthermore, I know that every qualifying line will start with "> and end with either <a or </p. The issue stems from the need to keep all other HTML tags as part of the text (<i>, <u>, etc.).
So then I have something like this:
$docString = file_get_contents("http://whatever.com/some.htm");
$regex="/\">\ [(.*?)\ ] (<\/a>)(.) *?(<)/";
preg_match_all($regex,$docString,$match);
Let's look at $regex for a sec. Ignore its spaces; I just put them there because otherwise some characters disappear. I specify that it will start with ">. Then I capture the number inside the []. Then I single out the </a>. So far so good.
At the end, I do a (.)*?(<). This is the turning point. By leaving the last bit, (<), like that, the text gets cut off as soon as an underline or italics tag is found. However, if I put (<a|</p) instead, the resulting array ends up empty. I've tried changing that to only (<a), but it seems that 2 characters mess up the whole thing.
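For reference, here is a minimal sketch of one direction that might work. It is not a confirmed fix. One likely culprit for the (<a|</p) variant is that the / in </p is unescaped inside a pattern delimited by /, which ends the pattern early. The sketch sidesteps that by using # as the delimiter and a lookahead for the two terminators. The function name extractEntries and the sample string are made up for illustration, and the sketch assumes each entry really looks like ">[number]</a> text.

```php
<?php
// Hypothetical helper (not from the original post): returns [number => text].
function extractEntries(string $html): array
{
    // '#' as delimiter, so the '/' in '</a>' and '</p' needs no escaping.
    // (.*?) lazily captures the text, inline tags included.
    // (?=<a\b|</p) is a lookahead: stop right before '<a' or '</p' without
    // consuming it, so the following entry can still match.
    // The 's' modifier lets '.' match newlines, in case an entry spans lines.
    $regex = '#">\[(\d+)\]</a>(.*?)(?=<a\b|</p)#s';

    $entries = [];
    if (preg_match_all($regex, $html, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $m) {
            $entries[(int) $m[1]] = trim($m[2]);
        }
    }
    return $entries;
}

// Made-up sample input modeled on the format described above.
$sample = '<p><a href="meh">[18]</a> blah <i>blah</i> blah '
        . '<a href="x">[19]</a> more <u>text</u></p>';

print_r(extractEntries($sample));
```

Because the lookahead does not consume the <a or </p, tags such as <i> and <u> stay inside the captured text, and the next entry's "> is still available for the following match.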
What can I do? I've been struggling with this all day.