weixin_33695082 2011-04-20 17:51 Acceptance rate: 0%
Views: 16

So, do I want to crawl a web page? [duplicate]

This question already has answers elsewhere and was closed as a duplicate (2011-04-20 18:06:15Z).

Possible Duplicates:
How to write a crawler?
Best methods to parse HTML

I've always wondered how to do something like this. I am not the owner/admin/webmaster of the site (http://poolga.com/); however, the information I wish to obtain is publicly available. This page (http://poolga.com/artists) is a directory of all the artists who have contributed to the site. However, each link on this page goes to another page, which contains this anchor tag with the link to the artist's actual website.

<a id="author-url" class="helv" target="_blank" href="http://aaaghr.com/">http://aaaghr.com/</a>

I hate having to Command-click each link in the directory and then click the link to the artist's website. I would love a way to have a batch of 10 of the artist website links appear as tabs in the browser, just for temporary viewing. However, just getting these hrefs into some sort of array would be a feat in itself. Any ideas, directions, or Google search terms, in any programming language, would be great! Would this even be referred to as "crawling"? Thanks for reading!

UPDATE

I used Simple HTML DOM on my local MAMP PHP server with this script. It took a little while to run!

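// Requires the Simple HTML DOM library; adjust the path to your copy.
require_once 'simple_html_dom.php';

// Collect the link to every artist page from the directory.
$artistPages = array();
foreach (file_get_html('http://poolga.com/artists')->find('div#artists ol li a') as $element) {
    $artistPages[] = $element->href;
}

// Visit each artist page and print the link to the artist's own website.
foreach ($artistPages as $artistPage) {
    $html = file_get_html($artistPage);
    if (!$html) {
        continue; // skip pages that failed to load
    }
    foreach ($html->find('a#author-url') as $element) {
        echo $element->href . '<br>';
    }
}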
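
For the "batch of 10" idea in the question, one possible follow-up is to split the collected links into groups with PHP's built-in array_chunk(). This is only a minimal sketch: the helper name batchLinks is hypothetical, and the example.com URLs are placeholders standing in for the scraped artist links.

```php
<?php
// Hypothetical helper: split a flat list of links into batches of $size.
function batchLinks(array $links, int $size = 10): array
{
    // array_chunk() returns consecutive slices; the last one may be shorter.
    return array_chunk($links, $size);
}

// Placeholder data standing in for the scraped artist links.
$links = array();
for ($i = 1; $i <= 23; $i++) {
    $links[] = "http://example.com/artist-$i";
}

$batches = batchLinks($links);
foreach ($batches as $n => $batch) {
    echo 'Batch ' . ($n + 1) . ': ' . count($batch) . " links\n";
}
```

With the 23 placeholder links this yields two batches of 10 and a final batch of 3. Each batch could then be printed as a short list of links to open together.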

2 answers

  • weixin_33713503 2011-04-20 18:02

    My favourite PHP library for navigating through the DOM is Simple HTML DOM.

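    // Requires the Simple HTML DOM library; adjust the path to your copy.
    require_once 'simple_html_dom.php';

    set_time_limit(0); // the crawl fetches many pages, so lift PHP's execution time limit
    $poolga = file_get_html('http://poolga.com/artists');
    $inRefs = $poolga->find('div#artists ol li a');
    $links = array();

    foreach ($inRefs as $ref) {
        $site = file_get_html($ref->href);
        if (!$site) {
            continue; // skip artist pages that failed to load
        }
        $anchor = $site->find('a#author-url', 0);
        if ($anchor) {
            $links[] = $anchor->href; // only keep pages that actually have the link
        }
        $site->clear(); // free the parsed DOM before fetching the next page
    }

    print_r($links);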
    

    Code, I think, is pretty self-explanatory.

    Edit: Fixed a spelling mistake. The script will take a really, really long time to finish because there are so many links; that's why I used set_time_limit(0). Go do other stuff and let the script run.
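
If installing a third-party library is not an option, the same extraction step can probably be done with PHP's bundled DOM extension instead. This is a swapped-in technique, not what the answer above uses: a rough sketch with DOMDocument and DOMXPath, where the helper name extractAuthorUrl is hypothetical, PHP 7.1 or later is assumed for the nullable return type, and the sample markup is the anchor tag quoted in the question.

```php
<?php
// Hypothetical helper: return the href of <a id="author-url">, or null if absent.
function extractAuthorUrl(string $html): ?string
{
    $doc = new DOMDocument();
    // Real-world pages are rarely valid HTML, so collect parser warnings quietly.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    // Look up the anchor by its id attribute.
    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query('//a[@id="author-url"]');
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    $href = $nodes->item(0)->getAttribute('href');
    return $href !== '' ? $href : null;
}

// Sample markup taken from the question.
$sample = '<html><body><a id="author-url" class="helv" target="_blank" '
        . 'href="http://aaaghr.com/">http://aaaghr.com/</a></body></html>';
var_dump(extractAuthorUrl($sample));
var_dump(extractAuthorUrl('<html><body><p>No link here</p></body></html>'));
```

Returning null for a page without the anchor lets the calling loop skip that page instead of failing on a missing element.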
