doulu3865 2014-11-23 21:20

PHP - Digging deeper into XPath queries

I have been trying to learn how to use XPath queries from this video: https://www.youtube.com/watch?v=632ql93H90g

Now that I am starting to understand the basics, I wanted to take it a bit further and write nested loops that pull out nested elements and then categorize them. I have been using Craigslist as an example because the video starts with it, and the data is listed on their "sites" page.

I had to rewrite this because the earlier version had an infinite loop. If anyone knows a better way of writing this, I would love the input, but this is what I have.

All I am trying to do is get my results into the following format:

Country - State - CityNameTEXT - CityNameHREF

where CityNameHREF is the link to the city.

Right now I just print_r the results of the inner query, which holds the actual cities, since the markup on Craigslist looks like this:

<h1>CountryName</h1>
<div class="colmask">
 <div>
  <h4>StateName</h4>
  <ul>
   <li>
    <a href="CityNameHREF">CityName</a>
   </li>
   <li>
    <a href="CityNameHREF">CityName</a>
   </li>
   <li>
    <a href="CityNameHREF">CityName</a>
   </li>
   <li>
    <a href="CityNameHREF">CityName</a>
   </li>
  </ul>
 </div>
</div>

As you can see, the nesting is fairly complicated. I have spent about 12 hours trying to get this to work. This is the closest I have gotten: it displays the nodeValue of each UL, which contains the actual city names. But I have no clue how to get these cities to display in the format I listed above.

Now on to the code I have...

$url = 'http://www.craigslist.org/about/sites';
$output = file_get_contents($url);
$doc = new DOMDocument();

libxml_use_internal_errors(true);  // Suppress warnings caused by HTML5 markup
$doc->loadHTML($output);
libxml_use_internal_errors(false); // Start showing errors again

$xpath = new DOMXpath($doc);

foreach ($xpath->query('//h1') as $e) {
    $country = $e->nodeValue;
    $list = array();

    foreach ($xpath->query('//div[@class="colmask"]/div', $e) as $li) {
        $state = $li->nodeValue;
        echo "<pre>";

        $result = $xpath->query('//div[@class="colmask"]/div/ul', $e);

        for ($i = 0; $i <= 10; $i++) { // Capped at 10 so it doesn't lag out
            print_r($result->item($i)); // Displays the UL nodeValue
        }
    }
}

Here's my example

1 answer

  • douwudie8060 2014-11-23 22:15

    Try this:

    $url = 'http://www.craigslist.org/about/sites';
    $output = file_get_contents($url);
    $doc = new DOMDocument();

    libxml_use_internal_errors(true);  // Suppress warnings caused by HTML5 markup
    $doc->loadHTML($output);
    libxml_use_internal_errors(false); // Start showing errors again

    $xpath = new DOMXpath($doc);

    // Each h1 holds a country name
    foreach ($xpath->query('//h1') as $e) {
        $country = trim($e->textContent);

        // The states are the h4 nodes inside the first div that follows the h1
        foreach ($xpath->query('following-sibling::div[1]//h4', $e) as $h4) {
            $state = trim($h4->textContent);

            // The cities are the links in the first ul that follows the h4
            foreach ($xpath->query('following-sibling::ul[1]//li/a', $h4) as $a) {
                $town = trim($a->textContent);
                $href = trim($a->getAttribute('href'));

                echo "$country - $state - $town - $href<br>";
            }
        }
    }
    

    EDIT

    So that's how I did it.
    First of all, I'm using Firefox with Firebug and FirePath (I guess you can find similar tools for other web browsers).
    These tools let me try XPath expressions without writing PHP code.

    With Firebug you can see the DOM tree, which is really useful for knowing what you can reach, and then you can try XPath expressions with FirePath.

    To start, I selected all H1 nodes in the document with //h1. Then, for each H1, you need all of its H4 nodes to get the states. Unfortunately, an H4 node is not a child of the H1 node, so you need another way to reach it when starting from the H1 node.

    If you look at the DOM tree, you will see that a div (which contains the H4 nodes) is one of the following siblings of the H1 node, so let's select it with following-sibling::div[1] (this is the <div class="colmask"> for the current h1 node only).
    We want all H4 nodes inside it (//h4), which gives following-sibling::div[1]//h4
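
    A detail worth spelling out here: when a context node is passed as the second argument to query(), an expression that starts with // is still evaluated from the document root, so only a relative expression such as following-sibling::div[1]//h4 stays limited to the current h1. The sketch below shows the difference on a small, made-up HTML fragment. The country and state names and the helper countH4 are placeholders for illustration, not part of the Craigslist page.

    ```php
    <?php
    // Made-up sample that mimics the h1 / div.colmask / h4 layout described above.
    $html = '<html><body>'
          . '<h1>CountryA</h1><div class="colmask"><div><h4>StateA1</h4></div><div><h4>StateA2</h4></div></div>'
          . '<h1>CountryB</h1><div class="colmask"><div><h4>StateB1</h4></div></div>'
          . '</body></html>';

    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $xpath = new DOMXPath($doc);

    // Hypothetical helper: counts the h4 nodes that $expr matches for a given h1.
    function countH4(DOMXPath $xpath, DOMNode $h1, string $expr): int
    {
        return $xpath->query($expr, $h1)->length;
    }

    foreach ($xpath->query('//h1') as $h1) {
        // Absolute path: searches the whole document, whatever the context node is.
        $absolute = countH4($xpath, $h1, '//h4');
        // Relative path: starts at the current h1 and stays inside its own div.
        $relative = countH4($xpath, $h1, 'following-sibling::div[1]//h4');
        echo trim($h1->textContent), ": //h4 matches $absolute, relative path matches $relative\n";
    }
    // Prints:
    // CountryA: //h4 matches 3, relative path matches 2
    // CountryB: //h4 matches 3, relative path matches 1
    ```

    This is probably why a query like //div[@class="colmask"]/div returns the same nodes for every h1, even with a context node.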

    Now we do the same thing for the <a href...> nodes of each H4: we select all A nodes in all LI nodes that are in the next sibling UL of the H4, which gives following-sibling::ul[1]//li/a
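
    If you would rather collect the results than echo them, the three expressions can be chained in one function that returns rows. This is only a sketch under a few assumptions: the helper name extractCities, the array keys and the sample HTML (with example.com links) are made up for illustration, and the sample only mimics the structure shown in the question.

    ```php
    <?php
    // Hypothetical helper: returns one country/state/city/href row per link.
    function extractCities(string $html): array
    {
        $doc = new DOMDocument();
        $previous = libxml_use_internal_errors(true); // silence warnings from real-world HTML
        $doc->loadHTML($html);
        libxml_clear_errors();
        libxml_use_internal_errors($previous);        // restore the earlier setting

        $xpath = new DOMXPath($doc);
        $rows = [];

        foreach ($xpath->query('//h1') as $h1) {
            foreach ($xpath->query('following-sibling::div[1]//h4', $h1) as $h4) {
                foreach ($xpath->query('following-sibling::ul[1]//li/a', $h4) as $a) {
                    $rows[] = [
                        'country' => trim($h1->textContent),
                        'state'   => trim($h4->textContent),
                        'city'    => trim($a->textContent),
                        'href'    => trim($a->getAttribute('href')),
                    ];
                }
            }
        }

        return $rows;
    }

    // Made-up sample with the same nesting as the markup in the question.
    $sample = '<html><body><h1>CountryA</h1><div class="colmask"><div><h4>StateA1</h4><ul>'
            . '<li><a href="http://example.com/city1">City1</a></li>'
            . '<li><a href="http://example.com/city2">City2</a></li>'
            . '</ul></div></div></body></html>';

    foreach (extractCities($sample) as $row) {
        echo implode(' - ', $row), "\n";
    }
    // Prints:
    // CountryA - StateA1 - City1 - http://example.com/city1
    // CountryA - StateA1 - City2 - http://example.com/city2
    ```

    Returning an array keeps the parsing separate from the output, so the same rows can be printed, grouped by country or stored somewhere else.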

    I hope this is understandable (and useful, of course). Sorry for the mistakes; English is not my native language.

    This answer was selected by the asker as the best answer.