dongwei1954 2014-09-15 09:57

Symfony DomCrawler empty object

I'm trying to scrape the rating score from review sites, using Laravel 4 and the Symfony DomCrawler. Take this site as an example: http://estorereview.com.au/s/5951/A-Supplements. I want to get the rating value, i.e. the 4.8 in "4.8 of 5 Stars".

This is partial code of my attempt:

<?php

use Symfony\Component\DomCrawler\Crawler;
use Symfony\Component\CssSelector\CssSelector;

function getRatingEstoreReview($url){
  $html = getHtmlCurl($url);
  $crawler = new Crawler($html);
  $crawler = $crawler->filter('span[itemprop="ratingValue"]'); 
  var_dump($crawler);
  die("test");
  return normalize($crawler,5);
}

The var_dump outputs the following:

object(Symfony\Component\DomCrawler\Crawler)[280]
  protected 'uri' => null
  private 'defaultNamespacePrefix' => string 'default' (length=7)
  private 'namespaces' => 
    array (size=0)
      empty

I tried this with other sites as well, but I always get an empty object. Accessing the value with $crawler->first() doesn't work either.

What am I doing wrong? Thank you.

Edit: Even if I filter for "div", the Crawler remains empty. The PHP Simple HTML DOM Parser works fine.

1 answer

  • dongmaopan5738 2014-09-16 12:09

    The full CSS path for that element is body > div:nth-child(3) > div > div > div.left-container.floatl > div.top > div.top-inner > div.store-rating-container.floatl > div.star-col.floatl.overall-rating-stars > div.rating-text.floatl > div > strong > span. Have you tried using that as your filter term instead?

    You can also use filterXPath() instead, in which case you're looking for /html/body/div[3]/div/div/div[4]/div[1]/div[2]/div[2]/div[1]/div[2]/div/strong/span.
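
    As a side note, an absolute path like the one above tends to break as soon as the page layout changes. A shorter, attribute-based expression such as //span[@itemprop="ratingValue"] expresses the same intent. The sketch below is only an illustration: it uses PHP's built-in DOMDocument and DOMXPath instead of DomCrawler so that it runs without Composer, and the inline HTML is a made-up stand-in for the real page. The same expression should be usable with filterXPath().

    ```php
    <?php
    // Hypothetical stand-in markup; the real page is far larger.
    $html = '<html><body><div class="rating-text"><div><strong>'
          . '<span itemprop="ratingValue">4.8</span></strong> of 5 Stars'
          . '</div></div></body></html>';

    $doc = new DOMDocument();
    // Suppress the warnings that real-world, non-validating HTML tends to trigger.
    @$doc->loadHTML($html);

    $xpath = new DOMXPath($doc);
    // Match on the microdata attribute instead of the element's position in the tree.
    $nodes = $xpath->query('//span[@itemprop="ratingValue"]');

    // Take the first match, or null if nothing matched.
    $rating = $nodes->length > 0 ? trim($nodes->item(0)->textContent) : null;
    echo $rating === null ? "no match\n" : $rating . "\n";
    ```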

    Edit: This doesn't appear to apply to this specific page, but one common web-crawling "gotcha" is worth mentioning. On some pages, the content is modified by JavaScript after the page has loaded. In that case, DomCrawler only sees the HTML returned by the server, so the elements you're looking for may not be visible to it at all.

    Update:

    Here are the results I see. I'm using Goutte rather than getHtmlCurl().

    Code:

    use Goutte\Client;
    use Symfony\Component\DomCrawler\Crawler;
    
    $client = new Client();
    $crawler = $client->request('GET', 'http://estorereview.com.au/s/5951/A-Supplements');
    var_dump($crawler->filter('span[itemprop="ratingValue"]')); 
    echo $crawler->filter('span[itemprop="ratingValue"]')->text();
    die("<br />test completed");
    

    Output:

    object(Symfony\Component\DomCrawler\Crawler)[177]
      protected 'uri' => string 'http://estorereview.com.au/s/5951/A-Supplements' (length=47)
      private 'defaultNamespacePrefix' => string 'default' (length=7)
      private 'namespaces' => 
        array (size=0)
          empty
    4.8
    test completed
    

    So, that works.
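
    Note that the dump above shows an empty namespaces array even though the filter clearly matched, so var_dump() on a Crawler is not a reliable way to tell whether it is empty. Counting the matched nodes is a safer check. The following is a minimal sketch, assuming symfony/dom-crawler and symfony/css-selector are installed through Composer; the autoload path, the inline HTML, and the extractRating() helper name are all assumptions made for illustration.

    ```php
    <?php
    require __DIR__ . '/vendor/autoload.php'; // assumed Composer autoloader location

    use Symfony\Component\DomCrawler\Crawler;

    // Hypothetical helper: returns the rating as a float, or null if nothing matched.
    function extractRating($html)
    {
        $crawler = new Crawler($html);
        $nodes = $crawler->filter('span[itemprop="ratingValue"]');

        // Crawler is countable; text() throws on an empty node list,
        // so check the count before reading the value.
        if (count($nodes) === 0) {
            return null;
        }

        return (float) trim($nodes->text());
    }

    // Made-up stand-in for the HTML fetched from the review page.
    $html = '<html><body><strong><span itemprop="ratingValue">4.8</span></strong>'
          . ' of 5 Stars</body></html>';

    var_dump(extractRating($html));
    var_dump(extractRating('<html><body><p>no rating here</p></body></html>'));
    ```

    If the count is zero even for a plain "div" filter, the HTML string passed to the Crawler is the first thing to inspect.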
