dongmi1864 2011-11-18 04:46

PHP web crawler, data structure and storage: can this be done with PHPCrawl?

If there are other classes written to do this, a link would be awesome. If not, how can I do it with PHPCrawl?

Is it possible to store specific information from a crawled site based on a set of rules specific to that site? For example, [div.wantThis, img#defaultPicture] would be the array returned for site A, and only [div.shortTextContent] would be the array returned for site B.

In PHPCrawl, how can I get this information out of the $page_data array?

Needs

Must be able to target only certain elements.

Able to read the data storage rule from a variable (which could be an array specifying the element(s) to target).


1 answer

  • donglilian0061 2011-11-29 09:30

    What you are asking is how to parse specific content from site A and some other specific content from site B using PHPCrawl.

    For site-specific parsing, an if-else approach like the following can be used:

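    // Sketch only: crawl(), isType1(), isType2() and the extract*() helpers
    // are placeholders for your own site checks and extraction functions.
    foreach ($urls as $url) {
        $content = crawl($url);
        if (isType1($url)) {
            extractStyle1($content);
        } elseif (isType2($url)) {
            extractStyle2($content);
        } else {
            extractStyleDefault($content);
        }
    }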
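Since the question asks for the rules to come from a variable, the if-else chain can also be replaced by a lookup table keyed on the host name. The following is a minimal sketch under that assumption; the host names, the `$rules` array and the `rulesForUrl()` helper are hypothetical and not part of PHPCrawl.

```php
<?php
// Hypothetical rule table: host name => selectors to keep for that site.
$rules = [
    'site-a.example' => ['div.wantThis', 'img#defaultPicture'],
    'site-b.example' => ['div.shortTextContent'],
];

// Return the selector list for a URL, or an empty list for unknown hosts.
function rulesForUrl(string $url, array $rules): array
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host)) {
        // No host component could be parsed from the URL.
        return [];
    }
    return $rules[strtolower($host)] ?? [];
}

print_r(rulesForUrl('http://site-a.example/page.html', $rules));
```

Adding a new site then only means adding one entry to the array, with no change to the control flow.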
    


    To extract specific content, the following approach can be used:

    Note: There is a spectrum of parsing techniques available; I am using HTML DOM parsing here.

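    // Assumes the PHP Simple HTML DOM Parser (simple_html_dom.php) is available;
    // it provides str_get_html() and the find() method used below.
    include_once 'simple_html_dom.php';

    // Create a DOM from the PHPCrawl page source (a plain string)
    $html = str_get_html($page_data['source']);

    // Find all images
    foreach ($html->find('img') as $element) {
        echo $element->src . '<br>';
    }

    // Find all links
    foreach ($html->find('a') as $element) {
        echo $element->href . '<br>';
    }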
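To target only the elements named in a per-site rule array, the same idea can be sketched with PHP's built-in DOM extension (`DOMDocument` and `DOMXPath`) instead of the `find()` style shown above. This is an illustration under stated assumptions, not PHPCrawl's own API: `selectorToXPath()` and `extractBySelectors()` are hypothetical helpers, only `tag.class`, `tag#id` and bare tag selectors are handled, and `$source` stands in for the page source string delivered by the crawler.

```php
<?php
// Convert a simple "tag.class" or "tag#id" selector into an XPath query.
// Only these two forms (plus a bare tag name) are handled; not a full CSS parser.
function selectorToXPath(string $selector): string
{
    if (preg_match('/^(\w+)#([\w-]+)$/', $selector, $m)) {
        return "//{$m[1]}[@id='{$m[2]}']";
    }
    if (preg_match('/^(\w+)\.([\w-]+)$/', $selector, $m)) {
        // Match the class as a whole word inside the class attribute.
        return "//{$m[1]}[contains(concat(' ', normalize-space(@class), ' '), ' {$m[2]} ')]";
    }
    return '//' . $selector;
}

// Return the outer HTML of every node matched, grouped by selector.
function extractBySelectors(string $source, array $selectors): array
{
    // Real-world markup is rarely valid, so collect parser errors silently.
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($source);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $found = [];
    foreach ($selectors as $selector) {
        foreach ($xpath->query(selectorToXPath($selector)) as $node) {
            $found[$selector][] = $doc->saveHTML($node);
        }
    }
    return $found;
}

// Stand-in for the crawled page source.
$source = '<html><body><div class="wantThis big">Hello</div>'
        . '<img id="defaultPicture" src="a.png"><div class="other">No</div></body></html>';

print_r(extractBySelectors($source, ['div.wantThis', 'img#defaultPicture']));
```

The returned array is keyed by selector, so it can be stored as-is or mapped onto whatever storage structure each site needs.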
    

    Reference:

    HTML DOM
    PHPCrawl Example

    This answer was selected as the best answer by the asker.
