doulu3865 2014-11-23 21:20
Accepted

PHP - Digging deeper into XPath queries

I have been trying to learn how to use XPath queries from this video: https://www.youtube.com/watch?v=632ql93H90g

While I have started to understand the basics, I wanted to take it a bit further and write a nested loop that pulls out nested elements and then categorizes them. I have been using craigslist as an example because the video starts with it, and the data is listed on their "sites" page.

I had to rewrite this because the earlier version had an infinite loop. If anyone knows a better way of writing this I would love the input, but this is what I have.

All I am trying to do is get my results into the following format:

Country - State - CityNameTEXT - CityNameHREF

where CityNameHREF is the link to the city.

Right now I just print_r the results of the inner query, which holds the actual cities, since the markup on craigslist looks like this:

<h1>CountryName</h1>
<div class="colmask">
 <div>
  <h4>StateName</h4>
  <ul>
   <li>
    <a href="CityNameHREF">CityName</a>
   </li>
   <li>
    <a href="CityNameHREF">CityName</a>
   </li>
   <li>
    <a href="CityNameHREF">CityName</a>
   </li>
   <li>
    <a href="CityNameHREF">CityName</a>
   </li>
  </ul>
 </div>
</div>

As you can see, the nesting is quite complicated. I have been trying for literally 12 hours to get this to work. This is the closest I have gotten: it displays the nodeValue of each UL, which is the actual city names. But I have no clue how to get these cities to display in the format I listed above.

Now on to the code I have...

$url = 'http://www.craigslist.org/about/sites';
$output = file_get_contents($url); 
$doc = new DOMDocument();

  libxml_use_internal_errors(true); //Suppress Warnings for HTML5 conversion issue
  $doc->loadHTML($output);
  libxml_use_internal_errors(false); //Start Showing Errors

  $xpath = new DOMXpath($doc);


foreach ($xpath->query('//h1') as $e) 
    {
            $country = $e->nodeValue;
            $list = array();


            foreach ($xpath->query('//div[@class="colmask"]/div', $e) as $li) 
            {

                $state = $li->nodeValue;    
                    echo "<pre>";


                    $result = $xpath->query('//div[@class="colmask"]/div/ul', $e);


                    for ($i = 0; $i <= 10; $i++) //10 instead so it doesn't lag out
                    {


                    print_r($result->item($i));   //Displays the UL nodeValue
                    }


            }
    }  

Here's my example

1 answer

  • douwudie8060 2014-11-23 22:15

    Try this:

    $url = 'http://www.craigslist.org/about/sites';
    $output = file_get_contents($url);
    $doc = new DOMDocument();

    libxml_use_internal_errors(true);  // Suppress warnings caused by HTML5 markup
    $doc->loadHTML($output);
    libxml_clear_errors();             // Discard the collected parser errors
    libxml_use_internal_errors(false); // Show errors again

    $xpath = new DOMXPath($doc);

    // Each h1 holds a country name
    foreach ($xpath->query('//h1') as $e) {
        $country = trim($e->textContent);

        // The states are the h4 nodes inside the first div that follows the h1
        foreach ($xpath->query('following-sibling::div[1]//h4', $e) as $h4) {
            $state = trim($h4->textContent);

            // The cities are the links in the first ul that follows the h4
            foreach ($xpath->query('following-sibling::ul[1]//li/a', $h4) as $a) {
                $town = trim($a->textContent);
                $href = trim($a->getAttribute('href'));

                echo "$country - $state - $town - $href<br>";
            }
        }
    }
    

    EDIT

    Here is how I worked it out.
    First of all, I'm using Firefox with Firebug and FirePath (I guess you can find similar tools for other web browsers).
    These tools let me try XPath expressions without writing any PHP code.

    With Firebug you can see the DOM tree, which is really useful for knowing what you can reach, and then you can try XPath expressions with FirePath.

    To start, I selected all H1 nodes in the document with //h1. Then you need all the H4 nodes for each H1 to get the states. Unfortunately the H4 node is not a child of the H1 node, so you need another way to reach it when starting from the H1 node.

    If you look at the DOM tree, you will see that a div (which contains the H4 nodes) is one of the following siblings of the H1 node, so let's select it with following-sibling::div[1] (this is the <div class="colmask"> for the current h1 node only).
    We want all H4 nodes inside it (//h4), which gives following-sibling::div[1]//h4
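
    As a minimal sketch of why the relative path matters: an expression that starts with // is evaluated from the document root even when a context node is passed to query(), whereas following-sibling:: starts from that context node. The markup and URLs below are made up for illustration and only loosely mirror the structure in the question.

```php
<?php
// Hypothetical sample markup: two countries, one state and one city each.
$html = <<<HTML
<html><body>
<h1>US</h1>
<div class="colmask"><div><h4>Alabama</h4><ul><li><a href="http://auburn.example/">auburn</a></li></ul></div></div>
<h1>Canada</h1>
<div class="colmask"><div><h4>Alberta</h4><ul><li><a href="http://calgary.example/">calgary</a></li></ul></div></div>
</body></html>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Use the first h1 ("US") as the context node for both queries.
$firstH1 = $xpath->query('//h1')->item(0);

// Absolute path: starts at the document root, so the context node is ignored.
$absolute = $xpath->query('//h4', $firstH1);

// Relative path: starts at the context node and stays inside its next div.
$relative = $xpath->query('following-sibling::div[1]//h4', $firstH1);

echo $absolute->length, "\n";               // 2 (every h4 in the document)
echo $relative->length, "\n";               // 1 (only the h4 under "US")
echo $relative->item(0)->textContent, "\n"; // Alabama
```

    This is also why a query such as //div[@class="colmask"]/div returns the same nodes for every h1, no matter which context node is passed.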

    Now we do the same thing for the <a href...> nodes of each H4: we select all A nodes in all LI nodes that are in the next sibling UL of the H4, which gives following-sibling::ul[1]//li/a
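
    Putting the three queries together, the rows can also be collected into an array rather than echoed right away. This is only a sketch: extractCityRows() is a hypothetical helper name and the sample markup is invented, assuming the page keeps the h1 / div / h4 / ul / li / a layout shown in the question.

```php
<?php
/**
 * Collect country/state/city/href rows from markup shaped like the
 * h1 / div.colmask / h4 / ul / li / a layout described in the question.
 * extractCityRows() is a hypothetical helper, not part of any library.
 */
function extractCityRows(string $html): array
{
    $doc = new DOMDocument();

    // Silence parser warnings while loading, then restore the old setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $rows = [];

    foreach ($xpath->query('//h1') as $h1) {
        $country = trim($h1->textContent);

        foreach ($xpath->query('following-sibling::div[1]//h4', $h1) as $h4) {
            $state = trim($h4->textContent);

            foreach ($xpath->query('following-sibling::ul[1]//li/a', $h4) as $a) {
                $rows[] = [
                    'country' => $country,
                    'state'   => $state,
                    'city'    => trim($a->textContent),
                    'href'    => trim($a->getAttribute('href')),
                ];
            }
        }
    }

    return $rows;
}

// Made-up sample data standing in for the downloaded page.
$sample = '<html><body><h1>US</h1><div class="colmask"><div>'
        . '<h4>Alabama</h4><ul>'
        . '<li><a href="http://auburn.example/">auburn</a></li>'
        . '<li><a href="http://bham.example/">birmingham</a></li>'
        . '</ul></div></div></body></html>';

foreach (extractCityRows($sample) as $row) {
    echo implode(' - ', $row), "\n";
}
```

    Returning an array keeps the extraction separate from the output, so the same rows could be printed, sorted, or stored without touching the XPath part.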

    I hope this is understandable (and useful, of course). Sorry for the mistakes; English is not my native language.

    This answer was selected as the best answer by the asker.