Extracting data from HTML using PHP and XPath

I am trying to extract data from a webpage and insert it into a database. The data I'm interested in is in the divs that have class="company". Each page has 15 or fewer such divs, and there are many pages to process, so I am looking for an automated way to extract the data.

A div with class="company" looks like this (each page has up to 15 of them, with different data):

<div class="company" id="company-6666"> <!-- EXTRACT 'company-6666' from id="company-6666" -->

  <div class="top clearfix">
    <div class="name clearfix">
      <h2>
        <a href="/company-name">Company Name</a>&nbsp; <!-- EXTRACT 'Company Name' from contents of A element and EXTRACT '/company-name' from href attribute -->
        <a href="/branches-list-link?parent_id=6666" class="branches">Branches <span>(5)</span></a> <!-- EXTRACT '/branches-list-link?parent_id=6666' from href attribute -->               
      </h2>
    </div>
  </div>

  <div class="inner clearfix has-logo">

    <div class="clearfix">          
      <div class="logo">
        <a href="/company-name">
          <img src="/graphics/company/logo/listing/123456.jpg?_ts=1365390237" border="0" alt="" /> <!-- EXTRACT '/graphics/company/logo/listing/123456.jpg?_ts=1365390237' from src attribute -->
        </a>
      </div>
      <div class="info">
        <div class="address">StreetName 500, 7777 City, County</div> <!-- EXTRACT 'StreetName 500, 7777 City, County' from contents of class="address" div -->
        <div class="clearfix">
          <div class="slogan">Lorem ipsum dolor sit amet, consectetur adipiscing elit. Morbi ac condimentum mi.</div> <!-- EXTRACT 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Morbi ac condimentum mi.' from contents of class="slogan" div -->
        </div>
      </div>
    </div>

    <div class="actions-bar clearfix">
      <ul>              
        <li><span class="phone-number">6666666</span></li> <!-- EXTRACT '6666666' from contents of class="phone-number" span -->
        <li><a href="mailto:mail@mail.com" target="_blank" title="mail@mail.com" class="email">mail@mail.com</a></li> <!-- EXTRACT 'mail@mail.com' from contents of class="email" A element -->
        <li><a href="http://www.webpage.com" target="_blank" title="www.webpage.com" class="redirect url">www.webpage.com</a></li> <!-- EXTRACT 'www.webpage.com' from contents of class="redirect url" A element -->
      </ul>
    </div>

  </div>

</div>

So far I have the following PHP code ($output holds the webpage's HTML):

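<?php

$doc = new DOMDocument();
$doc->preserveWhiteSpace = false; // set before loading; it has no effect afterwards
@$doc->loadHTML($output);         // @ suppresses warnings about malformed HTML

$xpath = new DOMXPath($doc);

// Select every element whose class attribute is exactly "company"
$elements = $xpath->query("//*[@class='company']");

// query() returns false (not null) when the expression is invalid
if ($elements !== false) {
    foreach ($elements as $element) {
        echo $element->nodeValue; // all text inside the div, concatenated
    }
}

?>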

It seems to find all 15 divs with class="company", but I have no idea how to extract the individual values mentioned in the comments of the HTML above.

Not every company div contains all of these values. So for each value I have to check whether the element holding it exists inside the company div and, if it does, whether it is non-empty (has text between its tags). If it exists and is not empty, I add it to a variable.

Once the values are extracted, I would like to assign them to PHP variables so that I can work with them afterwards. It would be even better if the extracted values were put in an array like this:

$result = array(
    // first div's data
    0 => array(
        'company name'     => 'company name',
        'company link'     => 'company link',
        'company id'       => 'company id',
        'company branches' => 'branches link',
        'company logo'     => 'logo',
        'company address'  => 'address',
        'company slogan'   => 'slogan',
        'company webpage'  => 'webpage',
        'company email'    => 'email',
        'company phone'    => 'phone',
    ),
    // second div's data
    1 => array(
        'company name'     => 'company name',
        'company link'     => 'company link',
        'company id'       => 'company id',
        'company branches' => 'branches link',
        'company logo'     => 'logo',
        'company address'  => 'address',
        'company slogan'   => 'slogan',
        'company webpage'  => 'webpage',
        'company email'    => 'email',
        'company phone'    => 'phone',
    ),
    // ...
);

dqysi86208 2013-04-12 23:10

Each company can be represented by a context node, with each property given by an XPath expression relative to that node:

Company company-6666:
 ->id ....... = "company-6666"    --    string(@id)
 ->name ..... = "Company Name"    --    .//a[1]/text()
 ->href ..... = "/company-name"    --    .//a[1]/@href
 ->img ...... = "/graphics/company/logo/listing/123456.jpg?_ts=1365390237"    --    .//img[1]/@src
 ->address .. = "StreetName 500, 7777 City, County"    --    .//*[@class="address"]/text()
 ...
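
Before wrapping anything into objects, the context-node idea can be tried with plain DOMXPath. The following is only a minimal sketch: the sample markup is cut down from the question's HTML, and each expression is wrapped in string() so that a missing node yields an empty string instead of an error. DOMXPath::evaluate() takes the context node as its second argument.

```php
<?php
// Sample markup, cut down from the question's HTML (an assumption for the demo).
$html = <<<'HTML'
<div class="company" id="company-6666">
  <h2><a href="/company-name">Company Name</a></h2>
  <div class="address">StreetName 500, 7777 City, County</div>
</div>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect parse warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

foreach ($xpath->query('//*[@class="company"]') as $company) {
    // Passing $company as the second argument makes it the context node,
    // so "@id" and ".//a[1]" are resolved inside this div only.
    $id      = $xpath->evaluate('string(@id)', $company);
    $name    = $xpath->evaluate('string(.//a[1]/text())', $company);
    $href    = $xpath->evaluate('string(.//a[1]/@href)', $company);
    $address = $xpath->evaluate('string(.//*[@class="address"]/text())', $company);

    printf("%s | %s | %s | %s\n", $id, $name, $href, $address);
}
```

Because the expressions start with "." or "@", the same set of expressions can be reused unchanged for every company div on the page.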

If you wrap that into objects, this is pretty nifty to use:

$doc = new DOMDocument();
$doc->loadHTML($html);

/* @var $companies DOMValueObject[] */
$companies = new Companies($doc);

foreach ($companies as $company) {
    printf("Company %s:\n", $company->id);
    foreach ($company->getObjectProperties() as $name => $value) {
        $expression = $company->getPropertyExpression($name);
        printf(" ->%'.-10s = \"%s\"    --    %s\n", $name.' ', $value, $expression);
    }
}

This relies on two helper classes, DOMValueCollection and DOMValueObject (custom classes, not part of PHP's DOM extension), on top of which you define your own type:

class Companies extends DOMValueCollection
{
    public function __construct(DOMDocument $doc) {
        parent::__construct($doc, '//*[@class="company"]');
    }

    /**
     * @return DOMValueObject
     */
    public function current() {
        $object = parent::current();
        $object->defineProperty('id', 'string(@id)');
        $object->defineProperty('name', './/a[1]/text()');
        $object->defineProperty('href', './/a[1]/@href');
        $object->defineProperty('img', './/img[1]/@src');
        $object->defineProperty('address', './/*[@class="address"]/text()');
        # ... add your definitions
        return $object;
    }
}
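
The source of the two helper classes is not reproduced here. To make the idea concrete, below is a hypothetical stand-in written for illustration only. The class name XPathRecord and its internals are my own assumptions and are not the DOMValueObject used above; only the defineProperty() and getArrayCopy() method names are borrowed from it. It maps property names to XPath expressions and evaluates them lazily against one context node.

```php
<?php
// Hypothetical stand-in, NOT the DOMValueObject from the answer.
class XPathRecord
{
    private $xpath;
    private $node;
    private $expressions = array();

    public function __construct(DOMXPath $xpath, DOMNode $node)
    {
        $this->xpath = $xpath;
        $this->node  = $node;
    }

    // Register an XPath expression (relative to the context node) under a name.
    public function defineProperty($name, $expression)
    {
        $this->expressions[$name] = $expression;
    }

    // Evaluate one registered expression; unknown names give null.
    public function get($name)
    {
        if (!isset($this->expressions[$name])) {
            return null;
        }
        // string(...) yields the text of the first matching node, or '' if none.
        $expression = sprintf('string(%s)', $this->expressions[$name]);
        return $this->xpath->evaluate($expression, $this->node);
    }

    // Allows $record->name as shorthand for $record->get('name').
    public function __get($name)
    {
        return $this->get($name);
    }

    public function getArrayCopy()
    {
        $values = array();
        foreach (array_keys($this->expressions) as $name) {
            $values[$name] = $this->get($name);
        }
        return $values;
    }
}

$doc = new DOMDocument();
$doc->loadHTML('<div class="company" id="company-6666"><a href="/company-name">Company Name</a></div>');
$xpath = new DOMXPath($doc);

$record = new XPathRecord($xpath, $xpath->query('//*[@class="company"]')->item(0));
$record->defineProperty('id', 'string(@id)');
$record->defineProperty('name', './/a[1]/text()');
$record->defineProperty('href', './/a[1]/@href');

print_r($record->getArrayCopy());
```

Evaluating on access keeps the object cheap to create; a collection class would only need to create one such record per node returned by the company query.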

And for your array requirements there is a getArrayCopy() method:

echo "\nGet Array Copy:\n\n";

print_r($companies->getArrayCopy());

Output:

Get Array Copy:

Array
(
    [0] => Array
        (
            [id] => company-6666
            [name] => Company Name
            [href] => /company-name
            [img] => /graphics/company/logo/listing/123456.jpg?_ts=1365390237
            [address] => StreetName 500, 7777 City, County
        )

)
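
If you would rather not depend on helper classes, the array from the question can also be built directly. This is a sketch under assumptions: the function name extractCompanies() is made up, the array keys are taken from the question, and the selectors are guesses based on the sample markup, so they may need adjusting for the real pages. A value that is missing or empty is stored as null, which covers the case where a company div lacks some of the elements.

```php
<?php
// Returns one associative array per company div; missing or empty values become null.
function extractCompanies($html)
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // real-world HTML is rarely valid; keep warnings quiet
    $doc->loadHTML($html);
    libxml_clear_errors();
    $xpath = new DOMXPath($doc);

    // Key => XPath expression relative to the company div (selectors are assumptions).
    $fields = array(
        'company name'     => './/h2/a[1]',
        'company link'     => './/h2/a[1]/@href',
        'company id'       => '@id',
        'company branches' => './/a[@class="branches"]/@href',
        'company logo'     => './/div[@class="logo"]//img/@src',
        'company address'  => './/div[@class="address"]',
        'company slogan'   => './/div[@class="slogan"]',
        // class="redirect url" holds two tokens, so match the "url" token explicitly
        'company webpage'  => './/a[contains(concat(" ", normalize-space(@class), " "), " url ")]',
        'company email'    => './/a[@class="email"]',
        'company phone'    => './/span[@class="phone-number"]',
    );

    $result = array();
    foreach ($xpath->query('//div[@class="company"]') as $company) {
        $row = array();
        foreach ($fields as $key => $expression) {
            // string() gives '' when nothing matches, so no separate existence check is needed
            $value = trim($xpath->evaluate("string($expression)", $company));
            $row[$key] = ($value === '') ? null : $value;
        }
        $result[] = $row;
    }
    return $result;
}

// Demo input: one fairly complete company and one with most fields missing.
$html = <<<'HTML'
<div class="company" id="company-6666">
  <h2><a href="/company-name">Company Name</a>
      <a href="/branches-list-link?parent_id=6666" class="branches">Branches <span>(5)</span></a></h2>
  <div class="address">StreetName 500, 7777 City, County</div>
  <ul><li><span class="phone-number">6666666</span></li>
      <li><a href="http://www.webpage.com" class="redirect url">www.webpage.com</a></li></ul>
</div>
<div class="company" id="company-7777">
  <h2><a href="/other">Other Company</a></h2>
</div>
HTML;

print_r(extractCompanies($html));
```

Keeping the expressions in one table means a change in the page's markup only requires editing that table, not the loop.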

This answer was accepted by the asker as the best answer.
