du7979 2013-08-29 10:42

Combining DOM queries and file_get_contents

I have researched this quite a bit over the last few days, and I have found all the answers online for the various functions, so thank you.

I now have 3 separate bits of code that each fetch the contents of the same webpage (an e-commerce product page, a review page, or anything else with a product on it) to extract different information. I assume fetching the page 3 times is very inefficient!

The 3 bits of code do the following:
1) Get the webpage title
2) Get all the images from the page
3) Find figures that are (hopefully) the price of the item on that page

I would appreciate some help grouping these together so that the page contents only have to be fetched once. This is my current code:

1st Time:

function getDetails($Url){
    $str = file_get_contents($Url);
    if(strlen($str)>0){
        //preg_match("/\<title\>(.*)\<\/title\>/",$str,$title);
        //The above didn't work well enough (e.g. for <title id=... >), so I used the DOM below

        $priceRes = '';
        //Only read $price[0] when the pattern actually matched
        if(preg_match("/(\£[0-9]+(\.[0-9]{2})?)/",$str,$price)){ //£ for GBP
            $priceRes = preg_replace("/[^0-9,.]/", "", $price[0]);
        }

        //$pageDetail[0]=$title;
        $pageDetail[1]=$priceRes;
        return $pageDetail;
    }
}

$pageDetail = getDetails($newItem_URL);
//$itemTitle = $pageDetail[0];
$itemPrice = $pageDetail[1];

2nd Time:

$doc = new DOMDocument();
@$doc->loadHTMLFile($newItem_URL);
$xpath = new DOMXPath($doc);
$itemTitle = $xpath->query('//title')->item(0)->nodeValue."\n";

3rd Time:

include('../../code/simplehtmldom/simple_html_dom.php');
include('../../code/url_to_absolute/url_to_absolute.php');

$html = file_get_html($newItem_URL);
foreach($html->find('img') as $e){
    //Resolve each image src against the page URL
    $imgURL = url_to_absolute($newItem_URL, $e->src);
    //More code here
}

I can't seem to fetch the page once and then reuse that single copy for everything else. Any help would be appreciated! Thanks in advance.

1 answer

  • dqu92800 2013-08-29 11:40

    I prefer using cURL when scraping sites. Your price-fetching code doesn't look particularly efficient either, so I think you should use XPath there as well. The function could return an object holding the price, the title and an array of images.

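    function get_details($url) {
       // Fetch the page once; every query below reuses this one response.
       $ch = curl_init($url);
       curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
       curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
       // HTTP_USER_AGENT is only set for web requests, so fall back to a fixed string.
       $agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : 'Mozilla/5.0';
       curl_setopt($ch, CURLOPT_USERAGENT, $agent);
    
       $html = curl_exec($ch);
       if ($html === false || $html === '') {
          return null; // request failed or returned an empty body
       }
    
       $dom = new DOMDocument();
       @$dom->loadHTML($html); // @ suppresses warnings about malformed markup
       $xpath = new DOMXPath($dom);
    
       $titleNode = $xpath->query('//title')->item(0);
    
       $product         = new stdClass;
       $product->title  = $titleNode ? trim($titleNode->nodeValue) : null;
       $product->price  = null; // price query goes here
       $product->images = array();
    
       foreach($xpath->query('//img') as $image) {
          $product->images[] = $image->getAttribute('src');
       }
    
       return $product;
    }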
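
The price query is left open in the function above. One possible way to fill it in, offered as a minimal sketch and not as the answerer's method, is to let XPath select the text nodes that contain a pound sign and then apply a small regex to each one. The helper name `find_price`, the price format it expects (for example "£12" or "£1,299.99") and the demo markup are all assumptions made for illustration; real product pages vary a lot, and the first pound amount on a page is not guaranteed to be the item price. The nullable return type needs PHP 7.1 or later.

```php
<?php
// Hypothetical helper (not part of the original answer): return the first
// GBP-looking amount found in the page text, or null when none is found.
// Assumes prices look like "£12" or "£1,299.99".
function find_price(DOMXPath $xpath): ?string
{
    // Text nodes that contain a pound sign, ignoring script and style content.
    $query = '//text()[contains(., "£")][not(ancestor::script)][not(ancestor::style)]';

    foreach ($xpath->query($query) as $node) {
        // Capture the digits after the pound sign, allowing thousands separators.
        if (preg_match('/£\s*([0-9][0-9,]*(?:\.[0-9]{2})?)/u', $node->nodeValue, $m)) {
            return str_replace(',', '', $m[1]); // drop thousands separators
        }
    }

    return null;
}

// Demo on an inline HTML string (assumed markup), so no network access is needed.
$html = '<html><head><title>Demo</title></head>'
      . '<body><p>Now only <span class="price">&pound;1,299.99</span></p></body></html>';

$dom = new DOMDocument();
@$dom->loadHTML($html);

$price = find_price(new DOMXPath($dom));
```

Inside `get_details()` this would be called as `$product->price = find_price($xpath);`, reusing the same `DOMXPath` object as the title and image queries so the page is still fetched only once.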
    