weixin_33695082 2011-04-20 17:51 Acceptance rate: 0%
Views: 16

So, I want to crawl a web page? [duplicate]

This question already has answers elsewhere; see the possible duplicates listed below.
Closed 2011-04-20 18:06:15Z.

Possible Duplicates:
How to write a crawler?
Best methods to parse HTML

I've always wondered how to do something like this. I am not the owner/admin/webmaster of the site (http://poolga.com/), but the information I want is publicly available. This page (http://poolga.com/artists) is a directory of all the artists who have contributed to the site. However, each link on this page goes to another page, and that page contains this anchor tag, which holds the link to the artist's actual website.

<a id="author-url" class="helv" target="_blank" href="http://aaaghr.com/">http://aaaghr.com/</a>

I hate having to command + click each link in the directory and then click the link to the artist's website. I would love a way to have a batch of 10 artist website links open as tabs in the browser, just for temporary viewing. However, just getting these hrefs into some sort of array would be a feat in itself. Any ideas, directions, or Google search terms, in any programming language, would be great! Would this even be referred to as "crawling"? Thanks for reading!

UPDATE

I used Simple HTML DOM on my local PHP server (MAMP) with this script. It took a little while!

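<?php
// Requires the Simple HTML DOM library; adjust the path to simple_html_dom.php as needed.
include 'simple_html_dom.php';

// Collect the link to every artist page from the directory.
$artistPages = array();
foreach (file_get_html('http://poolga.com/artists')->find('div#artists ol li a') as $element) {
    $artistPages[] = $element->href;
}

// Visit each artist page and print the link to the artist's own website.
foreach ($artistPages as $artistPage) {
    foreach (file_get_html($artistPage)->find('a#author-url') as $element) {
        echo $element->href . '<br>';
    }
}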
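
For comparison, the `a#author-url` lookup in the script above can probably also be done with PHP's bundled DOM extension (`DOMDocument` and `DOMXPath`) instead of Simple HTML DOM. The following is only a minimal sketch under assumptions: `$sampleHtml` is a made-up stand-in for a fetched artist page, built around the anchor tag quoted in the question, and `extractAuthorUrl` is a hypothetical helper name. It needs PHP 7.1 or later for the nullable return type.

```php
<?php
// Sketch only: uses PHP's bundled DOM extension, not Simple HTML DOM.
// $sampleHtml is a made-up stand-in for a fetched artist page.
$sampleHtml = '<html><body><p>Artist page</p>'
    . '<a id="author-url" class="helv" target="_blank" href="http://aaaghr.com/">http://aaaghr.com/</a>'
    . '</body></html>';

/**
 * Return the href of the first <a id="author-url"> in $html, or null if absent.
 * extractAuthorUrl is a hypothetical helper name.
 */
function extractAuthorUrl(string $html): ?string
{
    $doc = new DOMDocument();
    // Real-world markup is often invalid; collect libxml warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query('//a[@id="author-url"]');
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    $anchor = $nodes->item(0);
    return $anchor instanceof DOMElement ? $anchor->getAttribute('href') : null;
}

$url = extractAuthorUrl($sampleHtml);
echo $url === null ? "no author link found\n" : $url . "\n";
```

Returning null when the anchor is missing lets the caller skip artist pages that have no website link instead of failing on them.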
2 answers

  • weixin_33713503 2011-04-20 18:02

    My favourite PHP library for navigating the DOM is Simple HTML DOM.

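    // Requires Simple HTML DOM; adjust the path to simple_html_dom.php as needed.
    include 'simple_html_dom.php';

    set_time_limit(0); // no execution time limit: one request per artist is slow
    $poolga = file_get_html('http://poolga.com/artists');
    $inRefs = $poolga->find('div#artists ol li a');
    $links = array();

    foreach ($inRefs as $ref) {
        $site = file_get_html($ref->href);
        if (!$site) {
            continue; // page could not be fetched or parsed
        }
        $author = $site->find('a#author-url', 0);
        if ($author) {
            $links[] = $author->href; // the artist's own website
        }
    }

    print_r($links);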
    

    Code, I think, is pretty self-explanatory.

    Edit: Had a spelling mistake. The script takes a really long time to finish because it fetches one page per artist and there are a lot of artists; that's why I used set_time_limit(0), which removes PHP's execution time limit. Go do other stuff and let the script run.
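
Since the question asks for roughly 10 sites at a time, one way to handle the collected links might be to split them into batches. This is only a sketch under assumptions: `$links` stands in for the array the script above builds, the `example.com` entries are placeholders rather than real artist sites, and `$batchSize` is a hypothetical name.

```php
<?php
// Sketch only: $links stands in for the array built by the scraping script;
// the example.com entries are placeholders, not real artist sites.
$links = [];
for ($i = 1; $i <= 23; $i++) {
    $links[] = "http://example.com/artist-$i";
}

$batchSize = 10; // the question asked for roughly 10 sites at a time

// array_chunk splits the list into consecutive groups of at most $batchSize.
$batches = array_chunk($links, $batchSize);

foreach ($batches as $n => $batch) {
    echo 'Batch ' . ($n + 1) . ' (' . count($batch) . " links)\n";
    foreach ($batch as $link) {
        echo '  ' . $link . "\n";
    }
}
```

With 23 placeholder links this gives three batches of 10, 10, and 3. Each batch could then be printed as clickable links and opened together.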
