dongliqin6939 2012-08-22 19:58
58 views

Using curl to scrape large pages

I'm trying to scrape comments from a popular news site for an academic study using curl. It works fine for articles with fewer than 300 comments, but beyond that it struggles.

$handle = curl_init($url);
curl_setopt($handle, CURLOPT_RETURNTRANSFER, true); // return the response as a string instead of printing it
$html = curl_exec($handle);
curl_close($handle);
echo $html; // just to see what's been scraped
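To rule out transport-level problems (timeouts, truncated transfers) before blaming the page itself, it can help to check curl's error state and the response size. This diagnostic sketch is illustrative and not part of the original script:

$handle = curl_init($url);
curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($handle);
if ($html === false) {
    // curl itself failed (timeout, DNS, etc.)
    echo 'curl error: ' . curl_error($handle);
} else {
    // curl succeeded; see how much the server actually sent back
    echo 'HTTP ' . curl_getinfo($handle, CURLINFO_HTTP_CODE)
        . ', ' . strlen($html) . " bytes\n";
}
curl_close($handle);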

At the moment this page works fine: http://www.guardian.co.uk/commentisfree/2012/aug/22/letter-from-india-women-drink?commentpage=all#start-of-comments

But this one only returns 36 comments despite there being 700+ in total: http://www.guardian.co.uk/commentisfree/2012/aug/21/everyones-talking-about-rape?commentpage=all#start-of-comments

Why is it struggling for articles with a ton of comments?


1 answer

  • dongmao3131 2012-08-22 20:02

    Your comments page is paginated. Each page contains different comments, so you will have to request every pagination link.

    The parameter page=x is appended to the URL to request a different page.

    It might be best to fetch the base page first, search it for all links carrying the page parameter, and then request each of those in turn, as in the sketch below.
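    A minimal sketch of that loop, assuming the comment pagination is exposed through a commentpage=N parameter as in the URLs in the question. The regex, the page-discovery logic, and the politeness delay are illustrative assumptions, not the site's documented structure:

        <?php
        // Fetch a URL and return its body as a string.
        function fetch($url) {
            $handle = curl_init($url);
            curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
            curl_setopt($handle, CURLOPT_FOLLOWLOCATION, true); // follow any redirects
            $html = curl_exec($handle);
            curl_close($handle);
            return $html;
        }

        $base = 'http://www.guardian.co.uk/commentisfree/2012/aug/21/everyones-talking-about-rape';

        // Grab the first comment page, then discover the highest page number
        // by harvesting every commentpage=N value that appears in its links.
        $first = fetch($base . '?commentpage=1');
        preg_match_all('/commentpage=(\d+)/', $first, $matches);
        $lastPage = empty($matches[1]) ? 1 : max(array_map('intval', $matches[1]));

        $pages = array($first);
        for ($i = 2; $i <= $lastPage; $i++) {
            $pages[] = fetch($base . '?commentpage=' . $i);
            sleep(1); // be polite between requests
        }

        // $pages now holds the HTML of every comment page, ready to be parsed.
        echo count($pages) . " pages fetched\n";

    Since ?commentpage=all is evidently not honoured once an article has too many comments, fetching each numbered page explicitly is the safer route.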

    As Mike Christensen pointed out, if you can use Python and Scrapy, that functionality is built in: you just specify the element the comments live in, and Scrapy will crawl all the pagination links on the page for you. :)


