PHP - Digging deeper into XPath queries

I've been trying to learn how to use XPath queries from this video: https://www.youtube.com/watch?v=632ql93H90g

While I have started to understand the basics, I wanted to take it a bit further and write a nested loop that pulls out nested elements and then categorizes them. I've just been using craigslist as an example because the video starts with it, and the data is listed on their "sites" page.

I've had to rewrite this because my earlier version had an infinite loop. If ANYONE knows a better way of writing this I would love the input, but this is what I have.

All I've been trying to do is get my results into the following format:

Country - State - CityNameTEXT - CityNameHREF

where CityNameHREF is the link to the city.

Right now I just print_r the results of the inner query that has the actual cities listed, since the markup from craigslist looks like this:

<h1>CountryName</h1>
<div class="colmask">
 <div>
  <h4>StateName</h4>
  <ul>
   <li>
    <a href="CityNameHREF">CityName</a>
   </li>
   <li>
    <a href="CityNameHREF">CityName</a>
   </li>
   <li>
    <a href="CityNameHREF">CityName</a>
   </li>
   <li>
    <a href="CityNameHREF">CityName</a>
   </li>
  </ul>
 </div>
</div>

As you can see, the nesting is fairly complicated. I have been trying for literally 12 hours to get this to work. This is the closest I've gotten: it displays the UL nodeValues, which are the actual city names. But I have NO CLUE how to get these cities to display in the format I listed above.

Now on to the code I have...

$url = 'http://www.craigslist.org/about/sites';
$output = file_get_contents($url);
$doc = new DOMDocument();

libxml_use_internal_errors(true); // Suppress warnings for HTML5 conversion issue
$doc->loadHTML($output);
libxml_use_internal_errors(false); // Start showing errors

$xpath = new DOMXpath($doc);

foreach ($xpath->query('//h1') as $e)
{
    $country = $e->nodeValue;
    $list = array();

    foreach ($xpath->query('//div[@class="colmask"]/div', $e) as $li)
    {
        $state = $li->nodeValue;
        echo "<pre>";

        $result = $xpath->query('//div[@class="colmask"]/div/ul', $e);

        for ($i = 0; $i <= 10; $i++) // 10 instead so it doesn't lag out
        {
            print_r($result->item($i)); // Displays the UL nodeValue
        }
    }
}

Here's my example

1 answer

douwudie8060 2014-11-23 22:15
Try this:

$url = 'http://www.craigslist.org/about/sites';
$output = file_get_contents($url);
$doc = new DOMDocument();

libxml_use_internal_errors(true); // Suppress warnings for HTML5 conversion issue
$doc->loadHTML($output);
libxml_use_internal_errors(false); // Start showing errors

$xpath = new DOMXpath($doc);

foreach ($xpath->query('//h1') as $e) {
    $country = trim($e->textContent);

    foreach ($xpath->query('following-sibling::div[1]//h4', $e) as $h4) {
        $state = trim($h4->textContent);

        foreach ($xpath->query('following-sibling::ul[1]//li/a', $h4) as $a) {
            $town = $a->textContent;
            $attributeNodeMap = $a->attributes;
            $nodeAttribute = $attributeNodeMap->getNamedItem("href");
            $href = trim($nodeAttribute->nodeValue);

            echo "$country - $state - $town - $href<br>";
        }
    }
}

EDIT

So here's how I did it.
First of all, I'm using Firefox with Firebug and FirePath (I guess you can find similar tools for other web browsers).
These tools let me try out XPath expressions without writing any PHP code.

With Firebug you can see the DOM tree, which is really useful for knowing what you can reach, and then you can try XPath expressions with FirePath.

To start, I selected all H1 nodes in the document with //h1. Then you need to get all the H4 nodes for each H1 to get the states. Unfortunately, the H4 node is not a child of the H1 node, so you need to find another way to reach it if you want to start from the H1 node.

If you look at the DOM tree, you will see that a div (which contains the H4 nodes) is one of the following siblings of the H1 node, so let's select it with following-sibling::div[1] (this is the <div class="colmask"> for the current H1 node only).
We want all the H4 nodes inside it (//h4), which gives us following-sibling::div[1]//h4
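
To make the role of the context node concrete, here is a minimal sketch. The sample markup and the example.org links are made up for illustration; they only mimic the structure described in the question, not the real craigslist page. It compares an absolute query (leading //), which searches the whole document even when a context node is passed, with the relative query built above:

```php
<?php
// Hypothetical sample markup that mimics the structure from the question.
$html = <<<HTML
<html><body>
<h1>CountryA</h1>
<div class="colmask"><div>
  <h4>StateA1</h4><ul><li><a href="https://a1.example.org/">CityA1</a></li></ul>
</div></div>
<h1>CountryB</h1>
<div class="colmask"><div>
  <h4>StateB1</h4><ul><li><a href="https://b1.example.org/">CityB1</a></li></ul>
  <h4>StateB2</h4><ul><li><a href="https://b2.example.org/">CityB2</a></li></ul>
</div></div>
</body></html>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$firstH1 = $xpath->query('//h1')->item(0);

// Absolute path: starts at the document root, so the context node is ignored.
$absolute = $xpath->query('//h4', $firstH1)->length;

// Relative path: starts at the context node (the first h1).
$relative = $xpath->query('following-sibling::div[1]//h4', $firstH1)->length;

echo "absolute: $absolute, relative: $relative\n"; // absolute: 3, relative: 1
```

The absolute query returns every H4 in the document for each H1, which is why a query that starts with // inside a nested loop repeats the same nodes for every country.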

Now we do the same thing to get the <a href...> nodes for each H4: we select all A nodes in all LI nodes that are in the next sibling UL of the H4, which gives following-sibling::ul[1]//li/a
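
If you would rather collect the results than echo them, the three nested queries could be wrapped up as in the sketch below. The function name extractCities, the sample markup, and the example.org links are all hypothetical and exist only for illustration. It also reads the link with DOMElement::getAttribute('href'), which is a shorter alternative to going through the attribute map:

```php
<?php
// Hypothetical helper: returns one row per city as
// ['country' => ..., 'state' => ..., 'city' => ..., 'href' => ...].
function extractCities(string $html): array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence HTML parse warnings
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous); // restore the previous setting

    $xpath = new DOMXPath($doc);
    $rows = [];

    foreach ($xpath->query('//h1') as $h1) {
        // Relative to the current h1: its first following div, then every h4 in it.
        foreach ($xpath->query('following-sibling::div[1]//h4', $h1) as $h4) {
            // Relative to the current h4: its first following ul, then every li/a in it.
            foreach ($xpath->query('following-sibling::ul[1]//li/a', $h4) as $a) {
                $rows[] = [
                    'country' => trim($h1->textContent),
                    'state'   => trim($h4->textContent),
                    'city'    => trim($a->textContent),
                    'href'    => trim($a->getAttribute('href')),
                ];
            }
        }
    }

    return $rows;
}

// Made-up sample input that follows the structure shown in the question.
$sample = <<<HTML
<html><body>
<h1>CountryA</h1>
<div class="colmask"><div>
  <h4>StateA1</h4>
  <ul><li><a href="https://a1.example.org/">CityA1</a></li></ul>
</div></div>
<h1>CountryB</h1>
<div class="colmask"><div>
  <h4>StateB1</h4>
  <ul>
    <li><a href="https://b1.example.org/">CityB1</a></li>
    <li><a href="https://b2.example.org/">CityB2</a></li>
  </ul>
</div></div>
</body></html>
HTML;

$rows = extractCities($sample);
foreach ($rows as $row) {
    echo implode(' - ', $row), "\n"; // Country - State - CityNameTEXT - CityNameHREF
}
```

Returning an array keeps the parsing separate from the output, so the same rows can be printed, stored, or grouped by country later.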

I hope this is understandable (and useful, of course), and sorry for the mistakes; English is not my native language.
This answer was selected by the asker as the best answer.