douanrang4728 2012-08-06 19:14
19 views
Accepted

How does a web crawler work?

Using some basic website scraping, I am trying to prepare a database for price comparison which will ease users' search experiences. Now, I have several questions:

Should I use file_get_contents() or curl to get the contents of the required web page?

$link = "http://xyz.com";
$res55 = curl_init($link);
curl_setopt($res55, CURLOPT_RETURNTRANSFER, true); // return the body as a string instead of printing it
curl_setopt($res55, CURLOPT_FOLLOWLOCATION, true); // follow HTTP redirects
$result = curl_exec($res55);
curl_close($res55); // free the handle when done

Further, every time I crawl a web page, I fetch a lot of links to visit next. This may take a long time (days if you crawl big websites like Ebay). In that case, my PHP code will time-out. What should be the automated way to do this? Is there a way to prevent PHP from timing out by making changes on the server, or is there another solution?


5 answers

  • douji6896 2012-08-06 19:23

    So, in that case my PHP code will time-out and it won't continue that long.

    Are you doing this in the code that's driving your web page? That is, when someone makes a request, are you crawling right then and there to build the response? If so, then yes there is definitely a better way.

    If you have a list of the sites you need to crawl, you can set up a scheduled job (using cron for example) to run a command-line application (not a web page) to crawl the sites. At that point you should parse out the data you're looking for and store it in a database. Your site would then just need to point to that database.
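As a sketch of the scheduling step, assuming the crawler lives in a CLI script at a hypothetical path like /var/www/crawler.php, a cron entry such as the following would run it nightly. Note that PHP's CLI SAPI has no execution time limit by default (max_execution_time is 0), so the timeout problem from the question disappears once the crawl runs outside the web server:

```shell
# Hypothetical crontab entry (edit with `crontab -e`):
# run the crawler every night at 02:00 and append its output to a log.
0 2 * * * /usr/bin/php /var/www/crawler.php >> /var/log/crawler.log 2>&1
```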

    This is an improvement for two reasons:

    1. Performance
    2. Code Design

    Performance: In a request/response system like a web site, you want to minimize I/O bottlenecks. The response should take as little time as possible. So you want to avoid in-line work wherever possible. By offloading this process to something outside the context of the website and using a local database, you turn a series of external service calls (slow) to a single local database call (much faster).

    Code Design: Separation of concerns. This setup modularizes your code a little bit more. You have one module which is in charge of fetching the data and another which is in charge of displaying the data. Neither of them should ever need to know or care about how the other accomplishes its tasks. So if you ever need to replace one (such as finding a better scraping method) you won't also need to change the other.
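The separation described above could be sketched as two small modules against a shared storage interface. This is only an illustration, not code from the answer; the names (PriceStore, ArrayStore, crawl, show) are made up, and the in-memory store stands in for the real database:

```php
<?php
// Illustrative sketch: the crawler writes scraped prices into storage,
// and the site only reads them back. Neither side knows how the other works.

interface PriceStore {
    public function save(string $item, float $price): void;
    public function lookup(string $item): ?float;
}

// In-memory stand-in for the database, so the example is self-contained.
class ArrayStore implements PriceStore {
    private array $prices = [];

    public function save(string $item, float $price): void {
        $this->prices[$item] = $price;
    }

    public function lookup(string $item): ?float {
        return $this->prices[$item] ?? null;
    }
}

// Fetching module: in real code this would scrape pages (e.g. with curl)
// and parse out prices; here we just store a dummy value.
function crawl(PriceStore $store): void {
    $store->save("widget", 9.99);
}

// Display module: reads from storage only, never touches the network.
function show(PriceStore $store, string $item): string {
    $price = $store->lookup($item);
    return $price === null ? "not found" : sprintf("%s: $%.2f", $item, $price);
}
```

Swapping the scraping method later would then only mean replacing crawl(); show() and the storage interface stay untouched.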

    This answer was accepted as the best answer by the asker.
