dongmi1864 2011-11-18 04:46
Accepted

PHP web crawler, data structure and storage: can it be done with PHPCrawl?

If there are other classes written to do this, a link would be awesome. If not, how can I do it with PHPCrawl?

Is it possible to store specific information from a crawled site based on a set of rules specific to that site? For example, [div.wantThis, img#defaultPicture] would be the array returned for site A, and only [div.shortTextContent] would be the array returned for site B.

In PHPCrawl, how can I get this information out of the $page_data array?

Needs

Must be able to target only certain elements.

Able to read the data storage rule from a variable (which could be an array specifying the element(s) to target).

1 answer

  • donglilian0061 2011-11-29 09:30

    What you are asking is how to parse specific content from site A and some other specific content from site B using PHPCrawl.

    For site-specific parsing, an if/else dispatch like the following can be used:

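    // Pseudocode: crawl(), is_type1(), is_type2() and the extract_*()
    // functions are placeholders for your own code.
    foreach ($urls as $url) {
        $content = crawl($url);
        if (is_type1($url)) {
            extract_style1($content);
        } elseif (is_type2($url)) {
            extract_style2($content);
        } else {
            extract_styledefault($content);
        }
    }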
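
    The if/else dispatch above can also be driven by data, which fits the requirement of reading the rule from a variable. The following is a minimal sketch, not part of PHPCrawl: the function name rules_for_url() and the host names site-a.example and site-b.example are made up for illustration, and the selectors are the ones from the question.

```php
<?php
// Hypothetical rule table: host => list of selectors to keep.
// Host names are placeholders; the selectors come from the question.
$rules = [
    'site-a.example' => ['div.wantThis', 'img#defaultPicture'],
    'site-b.example' => ['div.shortTextContent'],
];

// Return the selectors configured for a URL's host,
// or $default when the host is unknown or the URL has no host.
function rules_for_url(string $url, array $rules, array $default = []): array
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host)) {
        return $default; // malformed URL or no host component
    }
    return $rules[strtolower($host)] ?? $default;
}

// Usage: look up the rule set before extracting anything from the page.
print_r(rules_for_url('http://site-a.example/page.html', $rules));
```

    Keeping the rules in an array means a new site only needs a new entry, not a new branch in the crawler code.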
    


    To extract specific content, the following approach can be used:

    Note: There is a spectrum of parsing techniques available; I am using HTML DOM parsing here.

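    // Create a DOM from the PHPCrawl page source.
    // Assumes the PHP Simple HTML DOM Parser (simple_html_dom.php) is
    // included; it provides str_get_html() and the find() method.
    $html = str_get_html($page_data['source']);

    // Find all images
    foreach ($html->find('img') as $element) {
        echo $element->src . '<br>';
    }

    // Find all links
    foreach ($html->find('a') as $element) {
        echo $element->href . '<br>';
    }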
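
    To target only the elements named in a rule array, the same idea can be sketched with PHP's built-in DOMDocument and DOMXPath instead of the HTML DOM library used above. This is an illustration under assumptions: selector_to_xpath() and extract_by_rules() are hypothetical helpers, only the selector shapes "tag", "tag.class" and "tag#id" are handled, and the DOM extension is assumed to be enabled. In a PHPCrawl handler, $page_data['source'] would be the string passed as $html.

```php
<?php
// Convert a simple "tag", "tag.class" or "tag#id" selector to XPath.
// Any other selector shape is unsupported and yields null.
function selector_to_xpath(string $selector): ?string
{
    if (!preg_match('/^([a-z][a-z0-9]*)(?:([.#])([\w-]+))?$/i', $selector, $m)) {
        return null;
    }
    $tag = strtolower($m[1]);
    if (!isset($m[2])) {
        return "//{$tag}";
    }
    if ($m[2] === '#') {
        return "//{$tag}[@id='{$m[3]}']";
    }
    // Class test that also matches elements carrying several classes.
    return "//{$tag}[contains(concat(' ', normalize-space(@class), ' '), ' {$m[3]} ')]";
}

// Extract the text and src attribute of every element matching each selector.
// Returns an array keyed by selector; unsupported selectors are skipped.
function extract_by_rules(string $html, array $selectors): array
{
    if ($html === '') {
        return []; // loadHTML() rejects empty input
    }
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parse warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $result = [];
    foreach ($selectors as $selector) {
        $query = selector_to_xpath($selector);
        if ($query === null) {
            continue;
        }
        $result[$selector] = [];
        foreach ($xpath->query($query) as $node) {
            $result[$selector][] = [
                'text' => trim($node->textContent),
                'src'  => $node->getAttribute('src'), // empty string if absent
            ];
        }
    }
    return $result;
}

// Usage with the site A rule from the question on a small test document.
$html = '<div class="wantThis big">Hello</div>'
      . '<img id="defaultPicture" src="/a.png">'
      . '<div class="other">No</div>';
$data = extract_by_rules($html, ['div.wantThis', 'img#defaultPicture']);
print_r($data);
```

    The returned array is keyed by selector, so it can be stored per site as-is or mapped to database columns.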
    

    Reference:

    HTML DOM
    PHPCrawl Example

    This answer was selected as the best answer by the asker.
