Extracting data from HTML using PHP and XPath

I am trying to extract data from a webpage and insert it into a database. The data I'm interested in is in the divs that have class="company". Each page has 15 or fewer divs like that, and there are many pages to process, so I am looking for an automated way to extract the data.

A div with class="company" looks as follows (there are 15 or fewer divs like this on one page, each with different data):

<div class="company" id="company-6666"> <!-- EXTRACT 'company-6666' from id="company-6666" -->

  <div class="top clearfix">
    <div class="name clearfix">
      <h2>
        <a href="/company-name">Company Name</a>&nbsp; <!-- EXTRACT 'Company Name' from contents of A element and EXTRACT '/company-name' from href attribute -->
        <a href="/branches-list-link?parent_id=6666" class="branches">Branches <span>(5)</span></a> <!-- EXTRACT '/branches-list-link?parent_id=6666' from href attribute -->               
      </h2>
    </div>
  </div>

  <div class="inner clearfix has-logo">

    <div class="clearfix">          
      <div class="logo">
        <a href="/company-name">
          <img src="/graphics/company/logo/listing/123456.jpg?_ts=1365390237" border="0" alt="" /> <!-- EXTRACT '/graphics/company/logo/listing/123456.jpg?_ts=1365390237' from src attribute -->
        </a>
      </div>
      <div class="info">
        <div class="address">StreetName 500, 7777 City, County</div> <!-- EXTRACT 'StreetName 500, 7777 City, County' from contents of class="address" div -->
        <div class="clearfix">
          <div class="slogan">Lorem ipsum dolor sit amet, consectetur adipiscing elit. Morbi ac condimentum mi.</div> <!-- EXTRACT 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Morbi ac condimentum mi.' from contents of class="slogan" div -->
        </div>
      </div>
    </div>

    <div class="actions-bar clearfix">
      <ul>              
        <li><span class="phone-number">6666666</span></li> <!-- EXTRACT '6666666' from contents of class="phone-number" span -->
        <li><a href="mailto:mail@mail.com" target="_blank" title="mail@mail.com" class="email">mail@mail.com</a></li> <!-- EXTRACT 'mail@mail.com' from contents of class="email" link -->
        <li><a href="http://www.webpage.com" target="_blank" title="www.webpage.com" class="redirect url">www.webpage.com</a></li> <!-- EXTRACT 'www.webpage.com' from contents of class="redirect url" link -->
      </ul>
    </div>

  </div>

</div>

So far I have the following PHP code ($output contains the webpage's HTML):

<?php

$doc = new DomDocument();
@$doc->loadHTML($output);
$doc->preserveWhiteSpace = false; 

$xpath = new DomXPath($doc);

$elements = $xpath->query("//*[@class='company']");

if (!is_null($elements)) {
    foreach ($elements as $element) {
        echo $element->nodeValue;
    }
}

?>

It seems to get all 15 divs with class="company", but I have no idea how to extract the individual values marked in the comments of the HTML above.

Not every div with class="company" contains all of the values shown in the HTML block. So for each value I have to check whether the element holding it exists inside the company div and, if it does, whether it is non-empty (contains text between its tags). Only if it exists and is not empty do I add it to a variable.

Once the values are extracted I would like to assign them to PHP variables so that I can work with them afterwards. It would be even better if the extracted values were put in an array like this:

$result = array(
    // 1st div's data
    0 => array(
        'company name' => 'company name',
        'company link' => 'company link',
        'company id' => 'company id',
        'company branches' => 'branches link',
        'company logo' => 'logo',
        'company address' => 'address',
        'company slogan' => 'slogan',
        'company webpage' => 'webpage',
        'company email' => 'email',
        'company phone' => 'phone',
    ),

    // 2nd div's data
    1 => array(
        'company name' => 'company name',
        'company link' => 'company link',
        'company id' => 'company id',
        'company branches' => 'branches link',
        'company logo' => 'logo',
        'company address' => 'address',
        'company slogan' => 'slogan',
        'company webpage' => 'webpage',
        'company email' => 'email',
        'company phone' => 'phone',
    ),
    // ...
);
dorbmd1177: Wait for it... I know it's coming... who has "the answer"? We all know which answer I mean!
(more than 7 years ago)

2 answers

Each company can be represented by a context node, with each property defined by an XPath expression relative to that node:

Company company-6666:
 ->id ....... = "company-6666"    --    string(@id)
 ->name ..... = "Company Name"    --    .//a[1]/text()
 ->href ..... = "/company-name"    --    .//a[1]/@href
 ->img ...... = "/graphics/company/logo/listing/123456.jpg?_ts=1365390237"    --    .//img[1]/@src
 ->address .. = "StreetName 500, 7777 City, County"    --    .//*[@class="address"]/text()
 ...
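
Without any helper classes, the same idea can be sketched with PHP's built-in DOMXPath by passing each company node as the context node. This is only a minimal illustration: the trimmed sample markup and the variable names ($expressions, $companies) are assumptions made for the example, and the expressions are wrapped in string() so that evaluate() returns plain strings.

```php
<?php
// Sample markup in the shape of the question's HTML (trimmed for brevity).
$html = <<<'HTML'
<div class="company" id="company-6666">
  <h2><a href="/company-name">Company Name</a></h2>
  <div class="logo"><a href="/company-name"><img src="/graphics/company/logo/listing/123456.jpg?_ts=1365390237" alt="" /></a></div>
  <div class="address">StreetName 500, 7777 City, County</div>
</div>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Property name => XPath expression, evaluated relative to each company node.
$expressions = array(
    'id'      => 'string(@id)',
    'name'    => 'string(.//a[1])',
    'href'    => 'string(.//a[1]/@href)',
    'img'     => 'string(.//img[1]/@src)',
    'address' => 'string(.//*[@class="address"])',
);

$companies = array();
foreach ($xpath->query('//*[@class="company"]') as $node) {
    $company = array();
    foreach ($expressions as $name => $expression) {
        // The second argument makes $node the context node for the expression.
        $company[$name] = trim($xpath->evaluate($expression, $node));
    }
    $companies[] = $company;
}

print_r($companies);
```

A property whose element is missing simply comes back as an empty string here, because string() of an empty node set is ''.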

If you wrap that into objects, this is pretty nifty to use:

$doc = new DOMDocument();
$doc->loadHTML($html);

/* @var $companies DOMValueObject[] */
$companies = new Companies($doc);

foreach ($companies as $company) {
    printf("Company %s:\n", $company->id);
    foreach ($company->getObjectProperties() as $name => $value) {
        $expression = $company->getPropertyExpression($name);
        printf(" ->%'.-10s = \"%s\"    --    %s\n", $name.' ', $value, $expression);
    }
}

This works with DOMValueCollection and DOMValueObject (helper classes that are not part of PHP's DOM extension; see the gist mentioned in the comments), defining your own type:

class Companies extends DOMValueCollection
{
    public function __construct(DOMDocument $doc) {
        parent::__construct($doc, '//*[@class="company"]');
    }

    /**
     * @return DOMValueObject
     */
    public function current() {
        $object = parent::current();
        $object->defineProperty('id', 'string(@id)');
        $object->defineProperty('name', './/a[1]/text()');
        $object->defineProperty('href', './/a[1]/@href');
        $object->defineProperty('img', './/img[1]/@src');
        $object->defineProperty('address', './/*[@class="address"]/text()');
        # ... add your definitions
        return $object;
    }
}

And for your array requirements there is a getArrayCopy() method:

echo "\nGet Array Copy:\n\n";

print_r($companies->getArrayCopy());

Output:

Get Array Copy:

Array
(
    [0] => Array
        (
            [id] => company-6666
            [name] => Company Name
            [href] => /company-name
            [img] => /graphics/company/logo/listing/123456.jpg?_ts=1365390237
            [address] => StreetName 500, 7777 City, County
        )

)
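Comments:

douyan8027: You were right. My PHP version is 5.3.23, but with the updated classes it works perfectly. Thank you very much anyway! This is exactly the solution I was looking for, and I got more help than I had hoped for. :)
(more than 7 years ago)

dongyili5843: That is probably because you are not using PHP 5.4. But 5.4 is not required; take a look at the updated gist.
(more than 7 years ago)

duandange7480: Additional definitions
(more than 7 years ago)

douyan4243: Thank you for your answer! I put the DOMValueCollection and DOMValueObject class definitions into a separate PHP file, but when checking for errors it tells me: Parse error: syntax error, unexpected '[' in /class.php on line 24. This is the line: $this->definitions[$name] = [$expression, $default];
(more than 7 years ago)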



To check if a node exists, verify that the length property is equal to 1 in the returned query result:

if ($company_name->length == 1) {
   $object->company_name = trim($company_name->item(0)->nodeValue);
}
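
The check can be wrapped in a small helper so that missing or empty elements never reach the result array. The sketch below is an illustration under stated assumptions: firstValue is a hypothetical helper name, the sample markup is invented for the example, and the helper takes the first match rather than requiring a length of exactly 1.

```php
<?php
/**
 * Returns the trimmed text of the first node matched by $expression relative
 * to $context, or null when nothing matches or the text is empty.
 * (firstValue is a hypothetical helper name used only for this sketch.)
 */
function firstValue(DOMXPath $xpath, $expression, DOMNode $context)
{
    $nodes = $xpath->query($expression, $context);
    if ($nodes === false || $nodes->length === 0) {
        return null;                      // invalid expression or no such node
    }
    $value = trim($nodes->item(0)->nodeValue);
    return $value === '' ? null : $value; // treat empty elements as missing
}

$html = <<<'HTML'
<div class="company" id="company-1">
  <span class="phone-number">6666666</span>
  <a href="mailto:mail@mail.com" class="email">mail@mail.com</a>
</div>
<div class="company" id="company-2">
  <span class="phone-number">   </span>
</div>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true);         // keep parser warnings out of the output
$doc->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

$fields = array(
    'company phone' => './/*[@class="phone-number"]',
    'company email' => './/*[@class="email"]',
);

$result = array();
foreach ($xpath->query('//*[@class="company"]') as $company) {
    $row = array('company id' => $company->getAttribute('id'));
    foreach ($fields as $key => $expression) {
        $value = firstValue($xpath, $expression, $company);
        if ($value !== null) {
            $row[$key] = $value;          // keep only values that exist and are non-empty
        }
    }
    $result[] = $row;
}

print_r($result);
```

With this sample, the second company ends up with only its id, since its phone span is blank and it has no email link.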
