drgm51600 2013-11-18 08:40

Looking for a robust HTML DOM method to correctly extract attribute text values that contain single apostrophes

As part of a data migration task, I am using PHP to extract the values of the alt and title attributes of img elements from some HTML.

An example of the source HTML is:

<img src='myimage.jpg' alt='Andy's garden vegetables' title='Andy's garden vegetables'/>

As you can see, in the source HTML the values of the alt and title attributes are delimited at the start and end by a single apostrophe ('). But within the text itself, an apostrophe is also used in the possessive sense, to say the vegetables belong to Andy.

So for a simple parser this is problematic: it would incorrectly treat the apostrophe within the text as the end of the value, giving 'Andy' rather than 'Andy's garden vegetables'.
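The truncation described above can be reproduced with a short sketch. It is written in Python rather than PHP purely for illustration (an assumption, not part of the original task), and the pattern is a deliberately naive one, not anything taken from a real parser.

```python
import re

# Sample line from the question.
html = "<img src='myimage.jpg' alt='Andy's garden vegetables' title='Andy's garden vegetables'/>"

# Naive pattern: the value is assumed to end at the first apostrophe
# after the opening one, so the possessive apostrophe cuts it short.
naive = re.search(r"alt='([^']*)'", html)

print(naive.group(1))  # prints: Andy
```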

The solution I can think of is to incorporate more of the surrounding text into a regex to pin down the start and end of the attribute value, such as the alt=' at the start and the ' at the end. However, this would not work if there are spaces around the = or if double quotes were used. I think unescaped single apostrophes inside a single-quoted value may not be legal HTML, but that is the data I have to work with.
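One way this idea might be made more tolerant is sketched below, in Python rather than PHP (an assumption made for illustration). The names ATTR_RE and extract_alt_title are hypothetical. The pattern allows spaces around the = sign, accepts either quote character, and only treats a quote as the closing delimiter when it is followed by another attribute (name=) or by the end of the tag.

```python
import re

# Hypothetical pattern: group 1 is the attribute name, group 2 the opening
# quote, group 3 the value. The closing quote must match the opening one and
# be followed by either "whitespace + name=" or the end of the tag.
ATTR_RE = re.compile(
    r"""\b(alt|title)\s*=\s*(['"])(.*?)\2(?=\s+[\w-]+\s*=|\s*/?>)""",
    re.IGNORECASE | re.DOTALL,
)


def extract_alt_title(html):
    """Return (attribute name, value) pairs in document order."""
    return [(m.group(1).lower(), m.group(3)) for m in ATTR_RE.finditer(html)]


sample = "<img src='myimage.jpg' alt='Andy's garden vegetables' title='Andy's garden vegetables'/>"
for name, value in extract_alt_title(sample):
    print(name, "->", value)
```

This is still a heuristic. It would misread a value whose text contains a quote followed by something that looks like name=, and it would also pick up alt= or title= appearing inside other text, so the results are worth spot-checking against the source data.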

Is there a more robust solution than regex, perhaps HTML DOM based, that can handle single apostrophes within the text and distinguish them from those used as delimiters?


2 answers

  • dougehe2022 2013-11-18 08:50

    I think this is what you're asking for:

    (?<=alt='|title=').+(?='\s)
    

    It uses a positive lookbehind to anchor on the opening alt=' or title=', and a positive lookahead to stop at a closing single apostrophe that is followed by whitespace.
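A minimal sketch for checking the lookaround pattern against the sample line from the question follows, written in Python rather than PHP (an assumption for illustration). Python's re module requires fixed-width lookbehinds, so the two alternatives are split into separate lookbehinds here. The names START, posted and variant are hypothetical, and the variant pattern is a suggested adjustment, not part of the answer as posted.

```python
import re

# Sample line from the question.
html = "<img src='myimage.jpg' alt='Andy's garden vegetables' title='Andy's garden vegetables'/>"

# Python's re needs fixed-width lookbehinds, so alt=' and title=' get one each.
START = r"(?:(?<=alt=')|(?<=title='))"

# The pattern as posted: greedy match, closing apostrophe followed by whitespace.
posted = re.findall(START + r".+(?='\s)", html)

# Hypothetical variant: lazy match, and the closing apostrophe may also be
# followed directly by the end of the tag (/> or >).
variant = re.findall(START + r".+?(?='(?:\s|/?>))", html)

print(posted)   # one match: the alt value only
print(variant)  # two matches: the alt value and the title value
```

On this sample the posted pattern finds only the alt value, because the title value is followed directly by /> rather than whitespace. The lazy variant finds both, but it would stop early on text that contains an apostrophe followed by a space, such as a plural possessive.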
