So, I want to scrape web pages? [duplicate]

This question already has answers elsewhere and was closed on 2011-04-20 (18:06:15Z).

Possible Duplicates:
How to write a crawler?
Best methods to parse HTML

I've always wondered how to do something like this. I am not the owner/admin/webmaster of the site (http://poolga.com/), but the information I wish to obtain is publicly available. This page (http://poolga.com/artists) is a directory of all of the artists that have contributed to the site. However, each link on this page goes to another page, and that page contains this anchor tag, which holds the link to the artist's actual website.

<a id="author-url" class="helv" target="_blank" href="http://aaaghr.com/">http://aaaghr.com/</a>

I hate having to command + click the links in the directory and then click the link to the artist's website. I would love a way to have a batch of 10 of the artist website links appear as tabs in the browser, just for temporary viewing. However, just getting these hrefs into some sort of array would be a feat in itself. Any idea, direction or Google search term, in any programming language, would be great! Would this even be referred to as "crawling"? Thanks for reading!

UPDATE

I ran this script with Simple HTML DOM on my local MAMP PHP server. It took a little while!

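// Assumption: the Simple HTML DOM library file sits next to this script.
include 'simple_html_dom.php';

// Collect the link to each artist page from the directory.
$artistPages = array();
foreach (file_get_html('http://poolga.com/artists')->find('div#artists ol li a') as $element) {
    $artistPages[] = $element->href;
}

// Visit each artist page and print the link to the artist's own website.
foreach ($artistPages as $artistPage) {
    foreach (file_get_html($artistPage)->find('a#author-url') as $element) {
        echo $element->href . '<br>';
    }
}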


2 Answers

weixin_33713503 2011-04-20 18:02
My favourite PHP library for navigating through the DOM is Simple HTML DOM.

set_time_limit(0);

$poolga = file_get_html('http://poolga.com/artists');
$inRefs = $poolga->find('div#artists ol li a');
$links = array();

foreach ($inRefs as $ref) {
    $site = file_get_html($ref->href);
    $links[] = $site->find('a#author-url', 0)->href;
}

print_r($links);

Code, I think, is pretty self-explanatory.

Edit: Had a spelling mistake. It would take the script a really, really long time to finish, seeing as how there are so many links; that's why I used set_time_limit(). Go do other stuff and let the script run.
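
Since the run is long, it may be worth guarding the extraction step so that a single artist page without the a#author-url link does not spoil the result. Below is a minimal sketch of that idea, not a replacement for the script above. It swaps Simple HTML DOM for PHP's built-in DOMDocument and DOMXPath, and it parses inline sample HTML instead of downloading pages, so it runs without network access. The extractAuthorUrl() helper and the $pages sample data are hypothetical names introduced only for illustration.

```php
<?php
// Sketch only: extractAuthorUrl() is a hypothetical helper, not part of
// Simple HTML DOM. It relies on PHP's built-in DOM extension.

/**
 * Return the href of the <a id="author-url"> element found in $html,
 * or null when the page has no such link.
 */
function extractAuthorUrl(string $html): ?string
{
    $doc = new DOMDocument();
    // Real-world pages are rarely valid HTML; collect libxml warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $loaded = $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$loaded) {
        return null;
    }

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query('//a[@id="author-url"]');
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return $nodes->item(0)->getAttribute('href');
}

// Inline sample pages stand in for the downloaded artist pages (assumption).
$pages = array(
    '<html><body><a id="author-url" class="helv" target="_blank" href="http://aaaghr.com/">http://aaaghr.com/</a></body></html>',
    '<html><body><p>No author link on this page.</p></body></html>',
);

$links = array();
foreach ($pages as $html) {
    $url = extractAuthorUrl($html);
    if ($url !== null) {
        $links[] = $url; // pages without the link are skipped, not fatal
    }
}

print_r($links);
```

In a real run, each entry of $pages would be the HTML fetched for one artist page, for example with file_get_contents(), and a failed download could be skipped in the same way.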