douzou0073 2017-04-23 20:56
Accepted

What is the best pattern for extracting parts of a string with preg_match_all?

Context:

• Using file_get_contents on a URL, I get lots of markup such as <item></item>, <url></url>, etc.

• I'm using preg_match_all to extract the url, title, etc.

Example:

$jStringToSubStract = '<a>stuffA</a><b>stuffB</b><url>http...</url>';
preg_match_all("#<url>(.*?)<\/url>#sx", $jStringToSubStract, $subItems, PREG_SET_ORDER);
foreach ($subItems as $subItem) {
    if (strlen($subItem[1]) > 0) {
        echo $subItem[1]; // prints the http... found inside <url></url>
    }
}

But it's slow on large inputs...

Is there a faster alternative to preg_match_all for extracting portions of a string?

2 answers

  • doupo2241 2017-05-25 06:53

    After seeing your posted solution, I now understand what you are trying to achieve. Since you are capturing only substrings in the format [attrname]=[attrvalue] (where the value may be single-quoted, double-quoted, or not quoted at all), these are optimized patterns for you...

    This one will get ALL attributes: \K\S+=["']?[^>"']+["']?>??

    This one will get specific attributes: \K(?:alt|title|src|href)=["']?[^>"']+["']?>??

    These patterns do not use capture groups. This means your code avoids unnecessary bloat in the result array and reads the substrings as full-string matches. Both of these patterns will run more efficiently than the patterns you have posted.
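    As a rough sketch of what "full-string matches" means in practice, the second pattern above can be run through preg_match_all and read from index 0 of the result. The sample markup, the ~ delimiters, and the variable names ($html, $pattern, $attrs) are assumptions made for illustration and are not from the original post.

```php
<?php
// Hypothetical sample markup (not from the original post).
$html = '<a href="http://example.com/a" title=\'Example\'>link</a>'
      . '<img alt="A picture" src=pic.png>';

// The "specific attributes" pattern from the answer, wrapped in ~ delimiters.
$pattern = '~\K(?:alt|title|src|href)=["\']?[^>"\']+["\']?>??~';

// No capture groups, so only index 0 (the full matches) is populated.
preg_match_all($pattern, $html, $matches);
$attrs = $matches[0];

foreach ($attrs as $attr) {
    echo $attr, PHP_EOL;
}
```

    The unquoted src value is deliberately the last attribute in its tag here: the character class [^>"'] also matches whitespace, so an unquoted value followed by another attribute in the same tag would be matched together with it.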

    I should also mention that neither my patterns nor yours are 100% reliable, because nothing checks that these substrings are actually inside HTML tags. This is why HTML parsers are so strongly recommended. If you are certain that the text you'll be reading won't contain any free-floating \S=\S formatted strings outside of the tags, then the results will be fine.
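    For comparison, here is a minimal sketch of the parser route using PHP's bundled DOMDocument class, assuming the DOM extension is enabled. The sample markup and the $found variable are hypothetical. Note that the stray alt=oops text outside any tag is not reported, which is the case the regex patterns cannot rule out.

```php
<?php
// Hypothetical sample markup with a free-floating "alt=oops" outside any tag.
$html = '<p>free text alt=oops</p>'
      . '<a href="http://example.com/a" title="Example">link</a>';

// Collect libxml parse warnings internally instead of emitting them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Walk every element and collect its real attributes as name=value strings.
$found = [];
foreach ($doc->getElementsByTagName('*') as $element) {
    foreach ($element->attributes as $attr) {
        $found[] = $attr->name . '=' . $attr->value;
    }
}

foreach ($found as $pair) {
    echo $pair, PHP_EOL;
}
```

    This trades some speed for correctness, so whether it suits a large input is something to measure rather than assume.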

    This answer was selected as the best answer by the asker.