preg_match misses some IDs when fetching data with cURL

For learning purposes, I'm trying to fetch data from the Steam Store: if the image game_header_image_full exists on a page, I've reached a game. Both alternatives sort of work, but there's a catch. One is really slow, and the other seems to miss some data and therefore doesn't write all the URLs to a text file.

For some reason, Simple HTML DOM managed to catch 9 URLs, whilst the second one (cURL) only caught 8 URLs with preg_match.

Question 1.

Is $reg formatted in a way that $html->find('img.game_header_image_full') would catch, but not my preg_match? Or is the problem something else?

Question 2.

Am I doing things correctly here? I'm planning to go with the cURL alternative, but can I make it faster somehow?

Simple HTML DOM Parser (Time to search 100 ids: 1 min 39 s. Returned: 9 URLs.)

<?php
    include('simple_html_dom.php');

    $i = 0;
    $times_to_run = 100;
    set_time_limit(0);

    while ($i++ < $times_to_run) {
        // Find target image
        $url = "http://store.steampowered.com/app/".$i;
        $html = file_get_html($url);

        if ($i == $times_to_run) {
            echo "Success!";
        }

        // file_get_html() returns false when the page could not be loaded
        if ($html === false) {
            continue;
        }

        $element = $html->find('img.game_header_image_full');

        foreach ($element as $key => $value) {
            // Check if image was found; strict comparison, because strpos()
            // can return 0, which == false would treat as "not found"
            if (strpos((string) $value, 'img') !== false) {
                // Add (don't overwrite) to file steam.txt
                file_put_contents('steam.txt', $url.PHP_EOL, FILE_APPEND);
            }
        }

        // Free the parsed DOM before the next iteration
        $html->clear();
    }
?>

vs. the cURL alternative (Time to search 100 ids: 34 s. Returned: 8 URLs.)

<?php

    $i = 0;
    $times_to_run = 100;
    set_time_limit(0);

    // Matches an <img> tag whose quoted class attribute contains game_header_image_full
    $reg = "/<\\s*img\\s+[^>]*class=['\"][^'\"]*game_header_image_full[^'\"]*['\"]/i";

    while ($i++ < $times_to_run) {

        $url = "http://store.steampowered.com/app/".$i;

        $ch = curl_init();
        curl_setopt( $ch, CURLOPT_URL, $url);
        curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true);
        $content = curl_exec($ch);

        // curl_exec() returns false on failure; skip this id in that case
        if ($content !== false && preg_match($reg, $content)) {
            file_put_contents('steam.txt', $url.PHP_EOL , FILE_APPEND);
        }

    }

?>

Answer from dsfds4551 (2015-12-21 17:48):
Well, you shouldn't use regex with HTML. It mostly works, but when it doesn't, you have to go through hundreds of pages, figure out which one is failing and why, correct the regex, and then hope and pray that nothing like that will ever happen again. Spoiler alert: it will.
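To make "when it doesn't" concrete, here is a minimal sketch that runs the pattern from the question against three made-up snippets. The snippets are hypothetical, not taken from the Steam Store, and this is not a diagnosis of the one missing URL. It only shows that markup a browser accepts can slip past the pattern.

```php
<?php
// The pattern from the question, unchanged.
$reg = "/<\\s*img\\s+[^>]*class=['\"][^'\"]*game_header_image_full[^'\"]*['\"]/i";

// Hypothetical snippets (assumptions for illustration only).
$samples = [
    'quoted class'     => '<img class="game_header_image_full" src="a.jpg">',
    // Unquoted attribute values are legal HTML, but the pattern requires a quote.
    'unquoted class'   => '<img class=game_header_image_full src="a.jpg">',
    // A ">" inside a quoted attribute is legal HTML, but [^>]* stops at it.
    'bracket in alt'   => '<img alt="1 > 0" class="game_header_image_full" src="a.jpg">',
];

$results = [];
foreach ($samples as $label => $html) {
    $results[$label] = preg_match($reg, $html) === 1;
    echo $label, ': ', $results[$label] ? 'match' : 'no match', PHP_EOL;
}
```

Only the first snippet matches. The other two contain the same image and the same class, so an HTML parser would still find them.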

Long story short, read this funny answer: RegEx match open tags except XHTML self-contained tags

Don't use regex to parse HTML. Use HTML parsers, which are complicated algorithms that don't use regex, and are reliable (as long as the HTML is valid). You are using one already, in the first example. Yes, it's slow, because it does more than just searching for a string within a document. But it's reliable. You can also play with other implementations, especially the native ones, like http://php.net/manual/en/domdocument.loadhtml.php
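As a rough sketch of the native route, the check could be written with DOMDocument and DOMXPath. The helper name hasImageWithClass is made up for this example, and it assumes the class name contains no quote characters. It has not been benchmarked against either script in the question.

```php
<?php
// Hypothetical helper: true if $content contains an <img> whose class list includes $class.
function hasImageWithClass(string $content, string $class): bool
{
    if ($content === '') {
        return false; // loadHTML() rejects an empty string
    }

    $doc = new DOMDocument();
    // Real-world pages are rarely valid; collect parse warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($content);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Common XPath idiom for "the class attribute contains this whole token".
    $query = sprintf(
        '//img[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
        $class
    );
    $nodes = $xpath->query($query);

    return $nodes !== false && $nodes->length > 0;
}
```

In the cURL loop from the question, the preg_match($reg, $content) test could then be swapped for hasImageWithClass($content, 'game_header_image_full'), provided $content is a string (curl_exec returns false on failure). Because the parser handles attribute quoting and ordering itself, the check no longer depends on how the tag happens to be written.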

This answer was selected by the asker as the best answer.
