PHP - Digging deeper into XPath queries

I've been trying to learn how to write XPath queries from this video: https://www.youtube.com/watch?v=632ql93H90g

I'm starting to understand the basics, so I wanted to take it a bit further and write nested loops that pull out nested elements and categorize them. I've been using craigslist as an example because the video starts with it, and the data is listed on their "sites" page.

I've had to rewrite this because my earlier version had an infinite loop. If anyone knows a better way of writing this I would love the input, but this is what I have.

All I'm trying to do is get my results into the following format:

Country - State - CityNameTEXT - CityNameHREF

where CityNameHREF is the link to the city.

Right now I just print_r the results of the inner query, which holds the actual cities, since the markup from craigslist looks like this:

<h1>CountryName</h1>
<div class="colmask">
 <div>
  <h4>StateName</h4>
  <ul>
   <li>
    <a href="CityNameHREF">CityName</a>
   </li>
   <li>
    <a href="CityNameHREF">CityName</a>
   </li>
   <li>
    <a href="CityNameHREF">CityName</a>
   </li>
   <li>
    <a href="CityNameHREF">CityName</a>
   </li>
  </ul>
 </div>
</div>

As you can see, the nesting is fairly complicated. I've been trying for literally 12 hours to get this to work. This is the closest I've gotten: it displays the nodeValue of each UL, which is the actual city names. But I have no clue how to get these cities to display in the format I listed above.

Now on to the code I have:

$url = 'http://www.craigslist.org/about/sites';
$output = file_get_contents($url);
$doc = new DOMDocument();

libxml_use_internal_errors(true);  // Suppress warnings for HTML5 conversion issue
$doc->loadHTML($output);
libxml_use_internal_errors(false); // Start showing errors

$xpath = new DOMXpath($doc);

foreach ($xpath->query('//h1') as $e)
{
    $country = $e->nodeValue;
    $list = array();

    foreach ($xpath->query('//div[@class="colmask"]/div', $e) as $li)
    {
        $state = $li->nodeValue;
        echo "<pre>";

        $result = $xpath->query('//div[@class="colmask"]/div/ul', $e);

        for ($i = 0; $i <= 10; $i++) // 10 instead so it doesn't lag out
        {
            print_r($result->item($i)); // Displays the UL nodeValue
        }
    }
}

Here's my example

douwudie8060 2014-11-23 22:15
Try this:

$url = 'http://www.craigslist.org/about/sites';
$output = file_get_contents($url);
$doc = new DOMDocument();

libxml_use_internal_errors(true);  // Suppress warnings for HTML5 conversion issue
$doc->loadHTML($output);
libxml_use_internal_errors(false); // Start showing errors

$xpath = new DOMXpath($doc);

foreach ($xpath->query('//h1') as $e)
{
    $country = trim($e->textContent);
    foreach ($xpath->query('following-sibling::div[1]//h4', $e) as $h4)
    {
        $state = trim($h4->textContent);
        foreach ($xpath->query('following-sibling::ul[1]//li/a', $h4) as $a)
        {
            $town = $a->textContent;
            $attributeNodeMap = $a->attributes;
            $nodeAttribute = $attributeNodeMap->getNamedItem("href");
            $href = trim($nodeAttribute->nodeValue);
            echo "$country - $state - $town - $href<br>";
        }
    }
}

EDIT

Here is how I worked it out.
First of all, I'm using Firefox with Firebug and FirePath (I guess you can find similar tools for other web browsers).
These tools let me try out XPath expressions without writing PHP code.

With Firebug you can see the DOM tree, which is really useful for knowing what you can reach, and then you can try XPath expressions with FirePath.

To start, I selected all H1 nodes in the document with //h1. Then you need all the H4 nodes for each H1 to get the states. Unfortunately the H4 node is not a child of the H1 node, so you need another way to reach it if you want to start from the H1 node.
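
The point that H4 is not reachable as a child of H1 can be checked directly. The sketch below is only an illustration, assuming a small hand-written HTML sample that mirrors the structure in the question (not the live craigslist page); sampleHtml() and countFromFirstH1() are hypothetical helper names. It also shows that an expression starting with // is evaluated from the document root even when a context node is passed, so it matches every H4 in the document rather than only those that belong to the current H1.

```php
<?php
// Hypothetical sample mirroring the markup in the question (not the live page).
function sampleHtml(): string
{
    return <<<'HTML'
<html><body>
<h1>US</h1>
<div class="colmask"><div>
  <h4>Alabama</h4><ul><li><a href="http://auburn.example/">auburn</a></li></ul>
  <h4>Alaska</h4><ul><li><a href="http://anchorage.example/">anchorage</a></li></ul>
</div></div>
<h1>Canada</h1>
<div class="colmask"><div>
  <h4>Alberta</h4><ul><li><a href="http://calgary.example/">calgary</a></li></ul>
</div></div>
</body></html>
HTML;
}

// Hypothetical helper: count the nodes matched by $expr, using the first h1 as context.
function countFromFirstH1(string $html, string $expr): int
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // keep parser warnings out of the output
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $h1 = $xpath->query('//h1')->item(0);
    $result = $xpath->query($expr, $h1);

    return $result === false ? 0 : $result->length;
}

$html = sampleHtml();
// Relative path: h4 elements below the h1 itself (there are none).
echo countFromFirstH1($html, './/h4'), "\n";
// Absolute path: the context node is ignored, so every h4 in the document matches.
echo countFromFirstH1($html, '//h4'), "\n";
```

This is why the inner queries need a path that is relative to the context node instead of one that begins with //.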

If you look at the DOM tree, you will see that a div (which contains the H4 nodes) is one of the next siblings of the H1 node, so let's select it with following-sibling::div[1] (this is the <div class="colmask"> for the current H1 node only).
We want all H4 nodes inside it (//h4), which gives following-sibling::div[1]//h4
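
As a rough sketch of that step in isolation, the code below groups state names under each country using following-sibling::div[1]//h4 with the H1 as context node. The HTML sample is hand-written to mirror the question's structure, and sampleHtml() and statesByCountry() are hypothetical names chosen for this illustration.

```php
<?php
// Hypothetical sample mirroring the markup in the question (not the live page).
function sampleHtml(): string
{
    return <<<'HTML'
<html><body>
<h1>US</h1>
<div class="colmask"><div>
  <h4>Alabama</h4><ul><li><a href="http://auburn.example/">auburn</a></li></ul>
  <h4>Alaska</h4><ul><li><a href="http://anchorage.example/">anchorage</a></li></ul>
</div></div>
<h1>Canada</h1>
<div class="colmask"><div>
  <h4>Alberta</h4><ul><li><a href="http://calgary.example/">calgary</a></li></ul>
</div></div>
</body></html>
HTML;
}

// Hypothetical helper: map each country (h1 text) to its list of states (h4 text).
function statesByCountry(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // keep parser warnings out of the output
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $states = [];
    foreach ($xpath->query('//h1') as $h1) {
        $country = trim($h1->textContent);
        $states[$country] = [];
        // Relative to the current h1: its first following sibling div,
        // then every h4 anywhere inside that div.
        foreach ($xpath->query('following-sibling::div[1]//h4', $h1) as $h4) {
            $states[$country][] = trim($h4->textContent);
        }
    }

    return $states;
}

print_r(statesByCountry(sampleHtml()));
```

The [1] predicate matters here: without it, the query would also pick up the div blocks that belong to later countries.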

Now we do the same thing for the <a href...> elements of each H4: we select all A nodes in all LI nodes that are in the next sibling UL of the H4, which gives following-sibling::ul[1]//li/a
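
Putting the three expressions together, here is a self-contained sketch that runs without a network request. It assumes a hand-written HTML sample shaped like the markup in the question; sampleHtml() and extractCities() are hypothetical names, and it reads the link with DOMElement::getAttribute('href') as a shorter alternative to going through the attribute map.

```php
<?php
// Hypothetical sample mirroring the markup in the question (not the live page).
function sampleHtml(): string
{
    return <<<'HTML'
<html><body>
<h1>US</h1>
<div class="colmask"><div>
  <h4>Alabama</h4>
  <ul>
    <li><a href="http://auburn.example/">auburn</a></li>
    <li><a href="http://bham.example/">birmingham</a></li>
  </ul>
  <h4>Alaska</h4>
  <ul><li><a href="http://anchorage.example/">anchorage</a></li></ul>
</div></div>
<h1>Canada</h1>
<div class="colmask"><div>
  <h4>Alberta</h4>
  <ul><li><a href="http://calgary.example/">calgary</a></li></ul>
</div></div>
</body></html>
HTML;
}

// Hypothetical helper: return one [country, state, city, href] row per city link.
function extractCities(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // keep parser warnings out of the output
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $rows = [];
    foreach ($xpath->query('//h1') as $h1) {
        $country = trim($h1->textContent);
        foreach ($xpath->query('following-sibling::div[1]//h4', $h1) as $h4) {
            $state = trim($h4->textContent);
            // First ul after this h4, then the link inside each li.
            foreach ($xpath->query('following-sibling::ul[1]//li/a', $h4) as $a) {
                if (!($a instanceof DOMElement)) {
                    continue;
                }
                $rows[] = [$country, $state, trim($a->textContent), trim($a->getAttribute('href'))];
            }
        }
    }

    return $rows;
}

// Print each row in the "Country - State - CityNameTEXT - CityNameHREF" format.
foreach (extractCities(sampleHtml()) as $row) {
    echo implode(' - ', $row), "\n";
}
```

Returning rows from a function instead of echoing inside the loops keeps the extraction separate from the output format, so the same data could be printed, stored, or grouped differently.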

I hope this is understandable (and useful, of course), and sorry for the mistakes; English is not my native language.
This answer was selected by the asker as the best answer.
