dpbsy60000 2015-11-12 22:30
Views: 38
Accepted

Scraping a DOMDocument table for content in PHP

I am really struggling to scrape a table, whether via XPath or any of the 'getElement' methods. I have searched around and tried several approaches to the problem below but have come up short, and I would really appreciate any help.

The HTML I am trying to scrape is the second table in the document and looks like this:

<table class="table2" border="1" cellspacing="0" cellpadding="3">
<tbody>
<tr><th colspan="8" align="left">Status Information</th></tr>
<tr><th align="left">Status</th><th align="left">Type</th><th align="left">Address</th><th align="left">LP</th><th align="left">Agent Info</th><th align="left">Agent Email</th><th align="left">Phone</th><th align="center">Email Tmplt</th></tr>
<tr></tr>
<tr>
<td align="left">Active</td>
<td align="left">Resale</td>
<td align="center">*Property Address*</td>
<td align="right">*Price*</td>
<td align="center">*Agent Info*</td>
<td align="center">*Agent Email*</td>
<td align="center">*Agent Phone*</td>
<td align="center">&nbsp;</td>
</tr>
<tr>
<td align="left">Active</td>
<td align="left">Resale</td>
<td align="center">*Property Address*</td>
<td align="right">*Price*</td>
<td align="center">*Agent Info*</td>
<td align="center">*Agent Email*</td>
<td align="center">*Agent Phone*</td>
<td align="center">&nbsp;</td>
</tr>
...etc

Additional trs follow, each containing 8 tds with the same kind of information as shown above.

What I need to do is iterate through the trs and internal tds to pick up each piece of information (inside the td) for each entry (inside of the tr).

Here is the code I have been struggling with:

<?php

$payload = array(
  'http'=>array(
     'method'=>"POST",
     'content'=>'key=value'
   )
);
stream_context_set_default($payload);
$dom = new DOMDocument();
libxml_use_internal_errors(TRUE);
$dom->loadHTMLFile('website-scraping-from.com');
libxml_clear_errors();

foreach ($dom->getElementsByTagName('tr') as $row){
    foreach($dom->$row->getElementsByTagName('td') as $node){
        echo $node->textContent . "<br/>";
    }

}


?>

This code does not return anything close to what I need, and I am having a lot of trouble figuring out how to fix it. Perhaps XPath is a better way to find the table and the information I need, but I have come up empty with that method as well. Any information is much appreciated.

If it matters, my end goal is to be able to take the table data and dump it into a database if the first td has a value of "Active".

1 answer

  • duan1979768678 2015-11-12 22:51

    Can this be of any help?

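    // Second table in the document (item() is zero-based).
    $table = $dom->getElementsByTagName('table')->item(1);
    foreach ($table->getElementsByTagName('tr') as $row) {
        // Search within the current row itself, not via $dom->$row.
        $cells = $row->getElementsByTagName('td');
        // Header rows and the empty <tr> contain no <td>, so check the length first.
        if ($cells->length > 0 && $cells->item(0)->nodeValue == 'Active') {
            foreach ($cells as $node) {
                echo $node->nodeValue . "<br/>";
            }
        }
    }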
    

    This fetches the second table and prints the contents of every row whose first cell is "Active".
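
    The question also asks whether XPath would be a better route. As a rough sketch of that alternative, the same filter can be written as a single DOMXPath query. The inline sample markup, the addresses in it, and the variable names are assumptions made so the snippet runs on its own; against the real page you would keep the loadHTMLFile() call from the question.

```php
<?php
// Sample markup standing in for the real page (assumption: the target is the 2nd <table>).
$html = <<<'HTML'
<html><body>
<table><tr><td>Active</td><td>ignored: first table</td></tr></table>
<table class="table2">
<tr><th colspan="3">Status Information</th></tr>
<tr><th>Status</th><th>Type</th><th>Address</th></tr>
<tr></tr>
<tr><td>Active</td><td>Resale</td><td>1 Example St</td></tr>
<tr><td>Sold</td><td>Resale</td><td>2 Example St</td></tr>
<tr><td>Active</td><td>New</td><td>3 Example St</td></tr>
</table>
</body></html>
HTML;

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// (//table)[2] is the second table in document order. The predicate keeps only
// rows whose first <td> reads "Active" after whitespace is normalized.
// Header rows and the empty <tr> have no td[1], so they never match.
$query = '(//table)[2]//tr[normalize-space(td[1]) = "Active"]';

$rows = array();
foreach ($xpath->query($query) as $tr) {
    $cells = array();
    // Relative query: only the <td> children of the current row.
    foreach ($xpath->query('td', $tr) as $td) {
        $cells[] = trim($td->textContent);
    }
    $rows[] = $cells;
}

print_r($rows);
```

    Putting the "Active" test inside the query means the PHP loop needs no null checks for rows without cells, at the cost of a slightly denser expression.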

    Edit: Here is a more complete example:

    // Collect one object per "Active" row.
    $arr = array();
    $table = $dom->getElementsByTagName('table')->item(1);
    foreach ($table->getElementsByTagName('tr') as $row) {
        $cells = $row->getElementsByTagName('td');
        // Skip header rows and the empty <tr>, which contain no <td>.
        if ($cells->length > 0 && $cells->item(0)->nodeValue == 'Active') {
            $obj = new stdClass;
            $obj->type    = $cells->item(1)->nodeValue;
            $obj->address = $cells->item(2)->nodeValue;
            $obj->price   = $cells->item(3)->nodeValue;
            $obj->agent   = $cells->item(4)->nodeValue;
            $obj->email   = $cells->item(5)->nodeValue;
            $obj->phone   = $cells->item(6)->nodeValue;
            array_push($arr, $obj);
        }
    }
    print_r($arr);
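
    Since the stated end goal is to store the "Active" rows in a database, here is one possible sketch of that last step using PDO with a prepared statement. Everything database-related is an assumption for illustration: the in-memory SQLite connection (which needs the pdo_sqlite extension), the hypothetical `listings` table, its column names, and the sample values in the stand-in object.

```php
<?php
// Stand-in for the $arr built above: one object per "Active" row (sample values).
$listing = new stdClass;
$listing->type    = 'Resale';
$listing->address = '1 Example St';
$listing->price   = '100000';
$listing->agent   = 'Example Agent';
$listing->email   = 'agent@example.com';
$listing->phone   = '555-0100';
$arr = array($listing);

// Assumption: in-memory SQLite database; swap the DSN for your real one.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// Hypothetical table mirroring the properties collected from each row.
$pdo->exec(
    'CREATE TABLE listings (
        type TEXT, address TEXT, price TEXT,
        agent TEXT, email TEXT, phone TEXT
    )'
);

// Prepare once, execute per row; bound parameters keep scraped text out of the SQL.
$stmt = $pdo->prepare(
    'INSERT INTO listings (type, address, price, agent, email, phone)
     VALUES (:type, :address, :price, :agent, :email, :phone)'
);

foreach ($arr as $obj) {
    $stmt->execute(array(
        ':type'    => $obj->type,
        ':address' => $obj->address,
        ':price'   => $obj->price,
        ':agent'   => $obj->agent,
        ':email'   => $obj->email,
        ':phone'   => $obj->phone,
    ));
}

$count = (int) $pdo->query('SELECT COUNT(*) FROM listings')->fetchColumn();
echo $count . " row(s) stored\n";
```

    Binding the values matters here because the cell contents come from a third-party page and should not be trusted as SQL.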
    
    This answer was selected by the asker as the best answer.
