As part of a data migration task, I am using PHP to extract the values of the alt and title attributes of img elements from some HTML.
An example of the source HTML is:
<img src='myimage.jpg' alt='Andy's garden vegetables' title='Andy's garden vegetables'/>
As you can see, in the source HTML the values of the alt and title attributes are delimited by single apostrophes (').
But within the text itself, an apostrophe is also used in the possessive sense, to indicate vegetables belonging to Andy.
A simple parser will therefore treat the apostrophe inside the text as the end of the value and return 'Andy' rather than 'Andy's garden vegetables'.
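To make the failure concrete, here is a minimal sketch of the kind of naive pattern described above. The pattern is only an illustration of the problem, not something taken from an existing parser.

```php
<?php
// Source markup from the question; the apostrophe in "Andy's" is unescaped.
$html = "<img src='myimage.jpg' alt='Andy's garden vegetables' title='Andy's garden vegetables'/>";

// Naive pattern: the value is assumed to run up to the first apostrophe after alt='
preg_match("/alt='([^']*)'/", $html, $m);

echo $m[1], PHP_EOL; // prints "Andy", not the full attribute value
```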
The only solution I can think of is to incorporate more of the surrounding text into a regex to pin down the start and end of the attribute value, such as the leading alt=' and the closing ' at the end. However, this would not work if there were spaces around the = or if double quotes were used. I think that unescaped single apostrophes like this may not be legal HTML, but that is the data I have to work with.
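The regex idea could be sketched roughly as follows. This is a heuristic under stated assumptions, not a robust parser: extractAttribute is a hypothetical helper name, and the pattern assumes a value only ends at a quote that is followed either by another attribute (whitespace, a name, then =) or by the end of the tag. It allows whitespace around the = and accepts either quote character.

```php
<?php
// Hypothetical helper (not a library function): returns the value of one
// attribute, or null if no match is found.
function extractAttribute(string $html, string $name): ?string
{
    $pattern = '/(?<![\w:-])' . preg_quote($name, '/')   // attribute name, not part of a longer name
        . '\s*=\s*(["\'])'                               // = with optional whitespace, opening quote
        . '(.*?)\1'                                      // shortest value ending in the same quote...
        . '(?=\s+[\w:-]+\s*=|\s*\/?>)/s';                // ...that is followed by another attribute or the tag end

    return preg_match($pattern, $html, $m) === 1 ? $m[2] : null;
}

$html = "<img src='myimage.jpg' alt='Andy's garden vegetables' title='Andy's garden vegetables'/>";

echo extractAttribute($html, 'alt'), PHP_EOL;   // prints "Andy's garden vegetables"
echo extractAttribute($html, 'title'), PHP_EOL; // prints "Andy's garden vegetables"
```

It will still be fooled by text that happens to look like the start of another attribute, for example a value containing an apostrophe followed by a space and something like x=, so it only narrows the problem rather than solving it.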
Is there a more robust solution than regex, perhaps one based on the HTML DOM, that can handle single apostrophes within the text and distinguish them from those used as delimiters?