ds34222 2018-05-21 19:54

Regular expression to select a specific HTML element [Curl / PHP]

I am trying to scrape some specific data and output it on my site.

What I want to extract is the capital.

I'm using cURL in PHP, and this is the regular expression I'm trying to use, but it gives me the error "Fatal error: Allowed memory size of ram bytes exhausted", which I assume means it is processing too much data.

Code:

preg_match_all('!<th scope="(\b[a-zA-Z]+\b)">(\b[a-zA-Z]+\b)<\/th><td><a href="\/wiki\/(\b[a-zA-Z]+\b)" title="(\b[a-zA-Z]+\b)">(\b[a-zA-Z]+\b)<\/a>!',$result,$cap_matches);
$cap_name = array_values(array_unique($cap_matches[0]));
echo $cap_name[0];

I've tried reducing the regular expression to match only the "a ..." tag, but then I get a lot of results back. I just want to grab the capital.


  • 红酒泡绿茶 2018-05-21 22:16

    Do not parse HTML with regex. Use a proper HTML parser instead, like DOMDocument.

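    // loadHTML() is an instance method; calling it statically fails on PHP 8.
    $domd = new DOMDocument ();
    @$domd->loadHTML ( $result ); // "@" hides warnings about malformed HTML
    unset($result); // free the raw HTML once it has been parsed
    $xp = new DOMXPath ( $domd );
    // first link in the cell that follows the "Capital" header cell
    $capital = $xp->query ( '//th[text()="Capital"]/following-sibling::td/a' )->item ( 0 )->getAttribute("title");
    unset($domd,$xp);
    var_dump ( $capital );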
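
    The snippet above assumes the query always finds a match; if it does not, item(0) returns null and the getAttribute() call fails. A minimal sketch of a more defensive variant follows. The function name extractCapital and the sample markup are hypothetical, made up for illustration, and libxml_use_internal_errors() is swapped in for the "@" operator.

    ```php
    <?php
    // Hypothetical sample markup standing in for the fetched page ($result above).
    $html = '<table><tr><th scope="row">Capital</th>'
          . '<td><a href="/wiki/Paris" title="Paris">Paris</a></td></tr></table>';

    // Hypothetical helper: returns the title of the first link after the
    // "Capital" header cell, or null when no such link exists.
    function extractCapital(string $html): ?string
    {
        $dom = new DOMDocument();
        // Collect libxml parse warnings internally instead of printing them.
        $previous = libxml_use_internal_errors(true);
        $dom->loadHTML($html);
        libxml_clear_errors();
        libxml_use_internal_errors($previous);

        $xpath = new DOMXPath($dom);
        $nodes = $xpath->query('//th[text()="Capital"]/following-sibling::td/a');
        if ($nodes === false || $nodes->length === 0) {
            return null; // no matching element on this page
        }
        $link = $nodes->item(0);
        return $link instanceof DOMElement ? $link->getAttribute('title') : null;
    }

    var_dump(extractCapital($html));
    ```

    Returning null instead of failing lets the caller decide what to do when a page has no "Capital" row.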
    

    As for avoiding OOM errors, try wrapping your most memory-hungry operations in smaller functions, letting PHP free their local variables on function exit, or unset() your big variables as soon as they're no longer needed. (I wouldn't normally use unset() in the code above, but since you were specifically complaining about OOM errors, I did.) Another obvious solution is to increase the memory limit, e.g.

    // ini_set() returns false when the setting could not be changed
    if(false===ini_set("memory_limit","1G")){
        throw new \RuntimeException('error, unable to change memory limit!');
    }
    

    should set the memory limit to 1 gigabyte, up from the default 128 megabytes.
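
    The earlier advice about wrapping memory-hungry work in a small function could be sketched roughly like this. The function name countLinks and the generated markup are hypothetical stand-ins for a real fetch-and-parse step; the point is only that the large string is local to the function and is released when it returns.

    ```php
    <?php
    // Hypothetical helper: the large string lives only inside this function,
    // so PHP can release it as soon as the function returns.
    function countLinks(int $repeat): int
    {
        // Stand-in for a large downloaded page.
        $html = str_repeat('<a href="/wiki/X">X</a>', $repeat);
        return substr_count($html, '<a ');
    }

    $before = memory_get_usage();
    $count  = countLinks(200000);
    $after  = memory_get_usage();

    // Only the small integer result survives the call.
    printf("links: %d, memory still held: %d bytes\n", $count, $after - $before);
    ```

    Only the integer result is kept, so memory use after the call should be close to what it was before.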
