dpbsy60000 2015-11-12 22:30

Scraping a DOMDocument table for content in PHP

I am struggling to scrape a table, whether via XPath or any of the 'getElement' methods. I have searched around and tried several approaches to the problem below, but have come up short, and I would really appreciate any help.

The HTML I am trying to scrape is the second table in the document and looks like this:

<table class="table2" border="1" cellspacing="0" cellpadding="3">
<tbody>
<tr><th colspan="8" align="left">Status Information</th></tr>
<tr><th align="left">Status</th><th align="left">Type</th><th align="left">Address</th><th align="left">LP</th><th align="left">Agent Info</th><th align="left">Agent Email</th><th align="left">Phone</th><th align="center">Email Tmplt</th></tr>
<tr></tr>
<tr>
<td align="left">Active</td>
<td align="left">Resale</td>
<td align="center">*Property Address*</td>
<td align="right">*Price*</td>
<td align="center">*Agent Info*</td>
<td align="center">*Agent Email*</td>
<td align="center">*Agent Phone*</td>
<td align="center">&nbsp;</td>
</tr>
<tr>
<td align="left">Active</td>
<td align="left">Resale</td>
<td align="center">*Property Address*</td>
<td align="right">*Price*</td>
<td align="center">*Agent Info*</td>
<td align="center">*Agent Email*</td>
<td align="center">*Agent Phone*</td>
<td align="center">&nbsp;</td>
</tr>
...etc

Additional trs follow, each containing 8 tds with the same kind of information as shown above.

What I need to do is iterate over the trs and their tds to pick up each piece of information (each td) for each entry (each tr).

Here is the code I have been struggling with:

<?php

$payload = array(
  'http'=>array(
     'method'=>"POST",
     'content'=>'key=value'
   )
);
stream_context_set_default($payload);
$dom = new DOMDocument();
libxml_use_internal_errors(TRUE);
$dom->loadHTMLFile('website-scraping-from.com');
libxml_clear_errors();

foreach ($dom->getElementsByTagName('tr') as $row){
    foreach($dom->$row->getElementsByTagName('td') as $node){
        echo $node->textContent . "<br/>";
    }

}


?>

This code does not return nearly what I need, and I am having a lot of trouble figuring out how to fix it. Perhaps XPath is a better route for finding the table and the information I need, but I have come up empty with that method as well. Any information is much appreciated.

If it matters, my end goal is to be able to take the table data and dump it into a database if the first td has a value of "Active".

1 answer

  • duan1979768678 2015-11-12 22:51

    Can this be of any help?

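    // item() is zero-based, so item(1) is the second table in the document.
    $table = $dom->getElementsByTagName('table')->item(1);
    foreach ($table->getElementsByTagName('tr') as $row) {
        $cells = $row->getElementsByTagName('td');
        // Header rows and the empty <tr> contain no <td>, so skip them.
        if ($cells->length > 0 && trim($cells->item(0)->nodeValue) === 'Active') {
            foreach ($cells as $node) {
                echo $node->nodeValue . "<br/>";
            }
        }
    }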
    

    This fetches the second table and prints the cell contents of every row whose first cell is "Active".
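
Since the question mentions XPath, the same filter can also be written as a single query. This is a minimal sketch, assuming the table keeps its class="table2" attribute; the inline $html sample and its addresses are made-up stand-ins for the real page, which would be loaded with loadHTMLFile() as in the question.

```php
<?php
// Stand-in markup (hypothetical data) so the sketch runs without the real site.
$html = <<<'HTML'
<html><body>
<table><tr><td>first table</td></tr></table>
<table class="table2">
<tr><th colspan="3">Status Information</th></tr>
<tr><th>Status</th><th>Type</th><th>Address</th></tr>
<tr></tr>
<tr><td>Active</td><td>Resale</td><td>1 Example St</td></tr>
<tr><td>Sold</td><td>Resale</td><td>2 Example St</td></tr>
</table>
</body></html>
HTML;

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Select only the rows of the table2 table whose first <td> reads "Active".
// Rows without any <td> (headers, the empty <tr>) simply do not match.
$rows = $xpath->query('//table[@class="table2"]//tr[normalize-space(td[1]) = "Active"]');

$result = [];
foreach ($rows as $row) {
    $cells = [];
    // The second argument makes the query relative to the current row.
    foreach ($xpath->query('td', $row) as $cell) {
        $cells[] = trim($cell->textContent);
    }
    $result[] = $cells;
}

print_r($result);
```

Selecting by class rather than by position means the query keeps working if another table is added earlier in the page, though that only helps if the class name itself is stable.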

    Edit: here is a more complete example that collects each "Active" row into an object:

    $arr = array();
    // item() is zero-based, so item(1) is the second table in the document.
    $table = $dom->getElementsByTagName('table')->item(1);
    foreach ($table->getElementsByTagName('tr') as $row) {
        $cells = $row->getElementsByTagName('td');
        // Header rows and the empty <tr> contain no <td>, so skip them.
        if ($cells->length > 0 && trim($cells->item(0)->nodeValue) === 'Active') {
            $obj = new stdClass;
            $obj->type    = $cells->item(1)->nodeValue;
            $obj->address = $cells->item(2)->nodeValue;
            $obj->price   = $cells->item(3)->nodeValue;
            $obj->agent   = $cells->item(4)->nodeValue;
            $obj->email   = $cells->item(5)->nodeValue;
            $obj->phone   = $cells->item(6)->nodeValue;
            $arr[] = $obj;
        }
    }
    print_r( $arr );
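
For the stated end goal of storing the rows in a database, one possible sketch uses PDO with a prepared statement. The in-memory SQLite DSN, the listings table, its column names, and the sample rows are all assumptions for illustration, and it presumes the pdo_sqlite extension is available.

```php
<?php
// Hypothetical rows in the shape built by the loop above (one object per "Active" listing).
$arr = [
    (object) ['type' => 'Resale', 'address' => '1 Example St', 'price' => '100000',
              'agent' => 'Agent A', 'email' => 'a@example.com', 'phone' => '555-0100'],
    (object) ['type' => 'Resale', 'address' => '2 Example St', 'price' => '200000',
              'agent' => 'Agent B', 'email' => 'b@example.com', 'phone' => '555-0101'],
];

// In-memory SQLite keeps the sketch self-contained; swap the DSN for the real database.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// Table and column names are assumptions, not part of the original question.
$pdo->exec('CREATE TABLE listings (type TEXT, address TEXT, price TEXT, agent TEXT, email TEXT, phone TEXT)');

// A prepared statement lets the driver handle escaping of the scraped values.
$stmt = $pdo->prepare(
    'INSERT INTO listings (type, address, price, agent, email, phone)
     VALUES (:type, :address, :price, :agent, :email, :phone)'
);

// One transaction for the whole batch: either every row is stored or none is.
$pdo->beginTransaction();
foreach ($arr as $obj) {
    $stmt->execute([
        ':type'    => trim($obj->type),
        ':address' => trim($obj->address),
        ':price'   => trim($obj->price),
        ':agent'   => trim($obj->agent),
        ':email'   => trim($obj->email),
        ':phone'   => trim($obj->phone),
    ]);
}
$pdo->commit();

$count = (int) $pdo->query('SELECT COUNT(*) FROM listings')->fetchColumn();
echo $count . " rows stored\n";
```

Binding the values instead of concatenating them into the SQL string matters here because the scraped text comes from a third-party page and should be treated as untrusted input.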
    
    This answer was accepted by the asker as the best answer.
