dongliqin6939 2012-08-22 19:58
58 views

Using curl to scrape large pages

I'm trying to scrape comments from a popular news site for an academic study using curl. It works fine for articles with fewer than 300 comments, but beyond that it struggles.

$handle = curl_init($url);
curl_setopt($handle, CURLOPT_RETURNTRANSFER, true); // return the response as a string instead of printing it
$html = curl_exec($handle);
curl_close($handle);
echo $html; // just to see what's been scraped
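To rule out transport-level problems (timeouts, truncated transfers) before blaming the page itself, it can help to check curl's error state and the response size. This diagnostic sketch is illustrative and not part of the original script:

$handle = curl_init($url);
curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($handle);
if ($html === false) {
    // curl itself failed (timeout, DNS, etc.)
    echo 'curl error: ' . curl_error($handle);
} else {
    // curl succeeded; see how much the server actually sent back
    echo 'HTTP ' . curl_getinfo($handle, CURLINFO_HTTP_CODE)
        . ', ' . strlen($html) . " bytes\n";
}
curl_close($handle);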

At the moment this page works fine: http://www.guardian.co.uk/commentisfree/2012/aug/22/letter-from-india-women-drink?commentpage=all#start-of-comments

But this one only returns 36 comments despite there being 700+ in total: http://www.guardian.co.uk/commentisfree/2012/aug/21/everyones-talking-about-rape?commentpage=all#start-of-comments

Why is it struggling for articles with a ton of comments?


1 answer

  • dongmao3131 2012-08-22 20:02

    Your comments page is paginated. Each page contains different comments, so you will have to request every pagination link.

    The parameter page=x is appended to the URL to request a different page.

    It might be best to fetch the base page first, search it for all links carrying the page parameter, and then request each of those in turn, as in the sketch below.
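    A minimal sketch of that loop, assuming the comment pagination is exposed through a commentpage=N parameter as in the URLs in the question. The regex, the page-discovery logic, and the politeness delay are illustrative assumptions, not the site's documented structure:

        <?php
        // Fetch a URL and return its body as a string.
        function fetch($url) {
            $handle = curl_init($url);
            curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
            curl_setopt($handle, CURLOPT_FOLLOWLOCATION, true); // follow any redirects
            $html = curl_exec($handle);
            curl_close($handle);
            return $html;
        }

        $base = 'http://www.guardian.co.uk/commentisfree/2012/aug/21/everyones-talking-about-rape';

        // Grab the first comment page, then discover the highest page number
        // by harvesting every commentpage=N value that appears in its links.
        $first = fetch($base . '?commentpage=1');
        preg_match_all('/commentpage=(\d+)/', $first, $matches);
        $lastPage = empty($matches[1]) ? 1 : max(array_map('intval', $matches[1]));

        $pages = array($first);
        for ($i = 2; $i <= $lastPage; $i++) {
            $pages[] = fetch($base . '?commentpage=' . $i);
            sleep(1); // be polite between requests
        }

        // $pages now holds the HTML of every comment page, ready to be parsed.
        echo count($pages) . " pages fetched\n";

    Since ?commentpage=all is evidently not honoured once an article has too many comments, fetching each numbered page explicitly is the safer route.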

    As Mike Christensen pointed out, if you can use Python and Scrapy, that functionality is built in: you just specify the element the comments live in, and Scrapy will crawl all the pagination links on the page for you. :)


