dsij89625 2015-12-22 01:08
浏览 49
已采纳

preg_match在使用cURL获取数据时错过了一些ID

For learning purposes, I'm trying to fetch data from the Steam Store, where if the image game_header_image_full exists, I've reached a game. Both alternatives are sort of working, but there's a catch. One is really slow, and the other seems to miss some data and therefore not writing the URL's to a text file.

For some reason, Simple HTML DOM managed to catch 9 URL's, whilst the 2nd one (cURL) only caught 8 URL's with preg_match.

Question 1.

Is $reg formatted in a way that $html->find('img.game_header_image_full') would catch, but not my preg_match? Or is the problem something else?

Question 2.

Am I doing things correctly here? Planning to go for the cURL alternative, but can I make it faster somehow?

Simple HTML DOM Parser (Time to search 100 ids: 1 min, 39s. Returned: 9 URL.)

<?php
    include('simple_html_dom.php');

    $i = 0;
    $times_to_run = 100;
    set_time_limit(0);

    while ($i++ < $times_to_run) {
        // Find target image
        $url = "http://store.steampowered.com/app/".$i;
        $html = file_get_html($url);
        $element = $html->find('img.game_header_image_full');

        if($i == $times_to_run) {
            echo "Success!";
        }

        foreach($element as $key => $value){
        // Check if image was found
            if (strpos($value,'img') == false) {
                // Do nothing, repeat loop with $i++;

            } else {
                // Add (don't overwrite) to file steam.txt
                file_put_contents('steam.txt', $url.PHP_EOL , FILE_APPEND);
            }
        }
    }
?>

vs. the cURL alternative.. (Time to search 100 ids: 34s. Returned: 8 URL.)

<?php

    $i = 0;
    $times_to_run = 100;
    set_time_limit(0);

    while ($i++ < $times_to_run) {

        $ch = curl_init();
        curl_setopt( $ch, CURLOPT_URL, 'http://store.steampowered.com/app/'.$i);
        curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true);
        $content = curl_exec($ch);

        $url = "http://store.steampowered.com/app/".$i;

        $reg = "/<\\s*img\\s+[^>]*class=['\"][^'\"]*game_header_image_full[^'\"]*['\"]/i";

        if(preg_match($reg, $content)) {
            file_put_contents('steam.txt', $url.PHP_EOL , FILE_APPEND);
        }

    }

?>
  • 写回答

1条回答 默认 最新

  • dsfds4551 2015-12-22 01:48
    关注

    Well you shouldn't use regex with HTML. It mostly works, but when it doesn't, you have to go through hundreds of pages and figuring out which one is the failing one, and why, and correct the regex, then hope and pray that in the future nothing like that will ever happen again. Spoiler alert: it will.

    Long story short, read this funny answer: RegEx match open tags except XHTML self-contained tags

    Don't use regex to parse HTML. Use HTML parsers, which are complicated algorithms that don't use regex, and are reliable (as long as the HTML is valid). You are using one already, in the first example. Yes, it's slow, because it does more than just searching for a string within a document. But it's reliable. You can also play with other implementations, especially the native ones, like http://php.net/manual/en/domdocument.loadhtml.php

    本回答被题主选为最佳回答 , 对您是否有帮助呢?
    评论

报告相同问题?

悬赏问题

  • ¥15 #MATLAB仿真#车辆换道路径规划
  • ¥15 java 操作 elasticsearch 8.1 实现 索引的重建
  • ¥15 数据可视化Python
  • ¥15 要给毕业设计添加扫码登录的功能!!有偿
  • ¥15 kafka 分区副本增加会导致消息丢失或者不可用吗?
  • ¥15 微信公众号自制会员卡没有收款渠道啊
  • ¥100 Jenkins自动化部署—悬赏100元
  • ¥15 关于#python#的问题:求帮写python代码
  • ¥20 MATLAB画图图形出现上下震荡的线条
  • ¥15 关于#windows#的问题:怎么用WIN 11系统的电脑 克隆WIN NT3.51-4.0系统的硬盘