ds34222 2018-05-21 19:54
Accepted

Regular expression to select a specific HTML element [cURL / PHP]

I am trying to scrape some specific data and display it on my site.

What I want to extract:

I'm using cURL in PHP, and this is the regular expression I'm trying to use, but it fails with "Fatal error: Allowed memory size of ram bytes exhausted", which I assume means it is processing too much data.

Code:

preg_match_all('!<th scope="(\b[a-zA-Z]+\b)">(\b[a-zA-Z]+\b)<\/th><td><a href="\/wiki\/(\b[a-zA-Z]+\b)" title="(\b[a-zA-Z]+\b)">(\b[a-zA-Z]+\b)<\/a>!',$result,$cap_matches);
$cap_name = array_values(array_unique($cap_matches[0]));
echo $cap_name[0];

I've tried narrowing the regular expression to only the "a ..." tag, but then I get a lot of results back. I just want to grab the capital.


1 answer

  • 红酒泡绿茶 2018-05-21 22:16

    Do not parse HTML with regex. Use a proper HTML parser instead, like DOMDocument.

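    // loadHTML() is an instance method; calling it statically is a fatal error on PHP 8.
    $domd = new DOMDocument();
    // @ suppresses the warnings libxml emits for malformed real-world markup.
    @$domd->loadHTML($result);
    unset($result); // free the raw HTML string as early as possible
    $xp = new DOMXPath($domd);
    // First link in the cell that follows the "Capital" header cell.
    $capital = $xp->query('//th[text()="Capital"]/following-sibling::td/a')->item(0)->getAttribute("title");
    unset($domd, $xp);
    var_dump($capital);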
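
The same lookup can be wrapped in a small function that returns null instead of failing when the page has no "Capital" row. This is only a sketch under some assumptions: the sample markup and the function name extractCapital are hypothetical, and it swaps text() for normalize-space() so that stray whitespace inside the header cell does not break the match.

```php
<?php
// Hypothetical sample markup standing in for the fetched page ($result in the answer).
$html = <<<'HTML'
<table>
  <tr><th scope="row">Country</th><td>France</td></tr>
  <tr><th scope="row">Capital</th><td><a href="/wiki/Paris" title="Paris">Paris</a></td></tr>
</table>
HTML;

/**
 * Return the title of the first link in the cell after the "Capital" header,
 * or null when the page has no such cell. (Illustrative helper, not from the answer.)
 */
function extractCapital(string $html): ?string
{
    $doc = new DOMDocument();
    // Collect libxml parse warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query('//th[normalize-space()="Capital"]/following-sibling::td/a');
    if ($nodes === false || $nodes->length === 0) {
        return null; // no matching row, so nothing to extract
    }
    $link = $nodes->item(0);
    return $link instanceof DOMElement ? $link->getAttribute('title') : null;
}

$capital = extractCapital($html);
var_dump($capital);
```

Checking the node list before calling item(0) avoids a "call to a member function on null" error when the layout of the scraped page changes.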
    

    As for avoiding out-of-memory (OOM) errors: try wrapping your most memory-hungry operations in smaller functions, so that PHP can free their local variables on function exit, or unset() your big variables as soon as they are no longer needed. (I wouldn't normally use unset() in the code above, but since you were specifically complaining about OOM errors, I did.) Another obvious solution is to increase the memory limit, e.g.

    // ini_set() returns false on failure, or the previous value on success.
    if (false === ini_set("memory_limit", "1G")) {
        throw new \RuntimeException('error, unable to change memory limit!');
    }
    

    This should set the memory limit to 1 gigabyte, up from the default of 128 megabytes.
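
The advice about wrapping memory-hungry work in a function can be illustrated with a rough sketch. The function name countRows and the generated string are hypothetical stand-ins for a large downloaded page, and the exact byte counts will vary between PHP versions and builds.

```php
<?php
/**
 * Do the memory-hungry work inside a function so that its large local
 * variables are released as soon as the function returns.
 * (Illustrative helper, not from the answer.)
 */
function countRows(int $rows): int
{
    // Hypothetical stand-in for a large page body held in memory.
    $big = str_repeat("<tr><td>row</td></tr>\n", $rows);
    // $big goes out of scope when the function returns, so only the small
    // integer result survives the call.
    return substr_count($big, "\n");
}

$before = memory_get_usage();
$count = countRows(200000);
$after = memory_get_usage();

printf("rows: %d, memory still held after the call: %d bytes\n", $count, $after - $before);
```

Only the integer result outlives the call, so the multi-megabyte string does not count against memory_limit for the rest of the script.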

    The asker selected this as the best answer.
