dsjswclzh40259075 2012-04-24 09:58

Extract text from a DIV that occurs on multiple pages of a website, then output it to .txt?

Just to note from the start, the content is uncopyrighted and I would like to automate the process of acquiring the text for the purpose of a project.

I'd like to extract the text from a particular, recurring DIV (it has its own 'class', in case that makes it easier) that appears on each page of a simply designed website.

There is a single archive page on the site with a list of all of the pages containing the content I would like.

The site is www.zenhabits.net

I imagine this could be achieved with some sort of script, but have no idea where to start.

I appreciate any help.

-Nathan.


  • dongpao1905 2012-04-24 13:19

    This is pretty straightforward.

    First, get all the links from the archive page and put them into an array:

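    set_time_limit(0);        // this could take a while...
    ignore_user_abort(true);  // keep running in case the browser times out

    $html_output = file_get_contents("http://zenhabits.net/archives/");

    $links = array(); // initialise so the later loops are safe even if nothing matches

    # -- Do a preg_match_all on the HTML and grab all links.
    # -- Note: (.*) is greedy, so several links on one line are captured as a single match.
    if (preg_match_all('/<a href="http:\/\/zenhabits\.net\/(.*)">/', $html_output, $matches)) {
        # -- Append each link to the array
        foreach ($matches[1] as $secLink) {
            $links[] = "http://zenhabits.net/" . $secLink;
        }
    }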
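
    If the regex turns out to be too brittle, the same step can be sketched with PHP's DOM extension instead. This is only a sketch, not what I tested above: `extractLinks` and the `$sample` markup are made up for illustration, and it assumes the DOM extension is enabled.

```php
<?php
// Hypothetical helper (not part of the tested script): collect links via the DOM.
function extractLinks(string $html, string $prefix): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = array();
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = $anchor->getAttribute('href');
        // Keep only absolute links that start with the given prefix.
        if (strpos($href, $prefix) === 0) {
            $links[$href] = true; // array keys de-duplicate repeated links
        }
    }
    return array_keys($links);
}

// Made-up sample markup; in the crawler you would pass $html_output instead.
$sample = '<p><a href="http://zenhabits.net/one/">One</a> '
        . '<a class="x" href="http://zenhabits.net/two/">Two</a> '
        . '<a href="http://example.com/">Elsewhere</a> '
        . '<a href="http://zenhabits.net/one/">One again</a></p>';
$found = extractLinks($sample, 'http://zenhabits.net/');
print_r($found);
```

    Unlike the regex, this also picks up anchors that carry extra attributes and does not care how many links share a line, so the list it returns may differ from the regex result.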
    

    I tested this for you, and:

    // The first 3 matches return something weird, but you don't need them, so I'll remove them
    unset($links[0], $links[1], $links[2]);
    

    Now that's all done, it's time to go through all of those links (in the array $links) and grab each page's content:

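    $contentFromPage = array(); // initialise so the later loop is safe if nothing matches

    foreach ($links as $contLink) {

        $html_output_c = file_get_contents($contLink);

        # -- The "s" modifier lets . match newlines. (.*) is greedy, so the capture
        # -- runs to the last </div> on the page and may include trailing markup.
        if ($html_output_c !== false && preg_match('|<div class="post">(.*)</div>|s', $html_output_c, $c_matches)) {
            # -- Append data to array
            echo "data found <br>";
            $contentFromPage[] = $c_matches[1];
        } else {
            echo "no content found in: $contLink -- <br><br><br>";
        }
    } // end of foreach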
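
    Since you asked for the text rather than the HTML, here is a rough alternative for this step using DOMXPath. Again, it is a sketch under assumptions: `extractDivText` and the `$sample` markup are hypothetical, the class name is assumed to be a simple word with no quotes in it, and I have not run it against the actual site.

```php
<?php
// Hypothetical helper: plain text of the first <div> whose class list contains $class.
function extractDivText(string $html, string $class): ?string
{
    $doc = new DOMDocument();
    // Suppress warnings about imperfect markup while parsing.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Match the class as a whole word, so "post" does not also match "postmeta".
    // Assumes $class is a simple name without quotes.
    $query = sprintf(
        '//div[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
        $class
    );
    $nodes = $xpath->query($query);
    if ($nodes === false || $nodes->length === 0) {
        return null; // no matching div on this page
    }
    // textContent drops the tags and keeps the text of nested elements.
    return trim($nodes->item(0)->textContent);
}

// Made-up sample markup with a nested div inside the target div.
$sample = '<div class="header">Menu</div>'
        . '<div class="post"><h2>Title</h2><div class="inner">Body text</div></div>'
        . '<div class="footer">Footer</div>';
$text = extractDivText($sample, 'post');
echo $text, "\n";
```

    Because the parser knows where the div really closes, nested divs are handled and nothing after the post leaks in. Note that textContent joins the text of neighbouring elements without adding whitespace, so you may want to post-process the result.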
    

    I've basically just written a whole crawler script for you.

    Now loop over the content array and do whatever you want with it (here we put each entry into its own text file):

    // $contentFromPage now contains the content of every div class="post" (in an array), so do what you want with it

    foreach ($contentFromPage as $content) {

        # -- We need a name for each text file --
        $textName = rand() . "_content_" . rand() . ".txt"; // we'll just use some numbers and text

        // define the file path (where you want the .txt file to be saved)
        $path = "../"; // we'll just put it in the folder above the script
        $full_path = $path . $textName;

        // now save the file
        file_put_contents($full_path, $content);

        // and that's it

    } // end of foreach
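
    The random file names work, but they tell you nothing about which page a file came from. If that matters, one possible approach is to derive the name from the page URL. `fileNameFromUrl` below is a hypothetical helper, and the example URLs are invented, so treat it as a sketch:

```php
<?php
// Hypothetical helper: turn a page URL into a readable, filesystem-safe .txt name.
function fileNameFromUrl(string $url): string
{
    $path = parse_url($url, PHP_URL_PATH);
    $slug = is_string($path) ? trim($path, '/') : '';
    // Replace anything other than letters, digits, dash and underscore with a dash.
    $slug = preg_replace('/[^A-Za-z0-9_-]+/', '-', $slug);
    $slug = trim($slug, '-');
    if ($slug === '') {
        // Fall back to a hash of the URL when the path is empty.
        $slug = md5($url);
    }
    return $slug . '.txt';
}

// Invented example URLs, for illustration only.
echo fileNameFromUrl('http://zenhabits.net/some-post/'), "\n";
echo fileNameFromUrl('http://zenhabits.net/2012/04/another.post/'), "\n";
```

    To use it you would need to keep each URL next to its content, for example by storing the content as `$contentFromPage[$contLink]` and looping with `foreach ($contentFromPage as $url => $content)`.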
    