dragonsun2005 2013-04-12 12:42

Extracting data from HTML using PHP and XPath

I am trying to extract data from a webpage in order to insert it into a database. The data I'm interested in is in the divs that have class="company". Each page has 15 or fewer such divs, and there are many pages to process, so I am looking for an automated way to extract the data.

The div with class="company" looks like this (each page has up to 15 of them, with different data):

<div class="company" id="company-6666"> <!-- EXTRACT 'company-6666' from id="company-6666" -->

  <div class="top clearfix">
    <div class="name clearfix">
      <h2>
        <a href="/company-name">Company Name</a>&nbsp; <!-- EXTRACT 'Company Name' from contents of A element and EXTRACT '/company-name' from href attribute -->
        <a href="/branches-list-link?parent_id=6666" class="branches">Branches <span>(5)</span></a> <!-- EXTRACT '/branches-list-link?parent_id=6666' from href attribute -->               
      </h2>
    </div>
  </div>

  <div class="inner clearfix has-logo">

    <div class="clearfix">          
      <div class="logo">
        <a href="/company-name">
          <img src="/graphics/company/logo/listing/123456.jpg?_ts=1365390237" border="0" alt="" /> <!-- EXTRACT '/graphics/company/logo/listing/123456.jpg?_ts=1365390237' from src attribute -->
        </a>
      </div>
      <div class="info">
        <div class="address">StreetName 500, 7777 City, County</div> <!-- EXTRACT 'StreetName 500, 7777 City, County' from contents of class="address" div -->
        <div class="clearfix">
          <div class="slogan">Lorem ipsum dolor sit amet, consectetur adipiscing elit. Morbi ac condimentum mi.</div> <!-- EXTRACT 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Morbi ac condimentum mi.' from contents of class="slogan" div -->
        </div>
      </div>
    </div>

    <div class="actions-bar clearfix">
      <ul>              
        <li><span class="phone-number">6666666</span></li> <!-- EXTRACT '6666666' from contents of class="phone-number" span -->
        <li><a href="mailto:mail@mail.com" target="_blank" title="mail@mail.com" class="email">mail@mail.com</a></li> <!-- EXTRACT 'mail@mail.com' from contents of class="email" link -->
        <li><a href="http://www.webpage.com" target="_blank" title="www.webpage.com" class="redirect url">www.webpage.com</a></li> <!-- EXTRACT 'www.webpage.com' from contents of class="redirect url" link -->
      </ul>
    </div>

  </div>

</div>

So far I have the following PHP code ($output holds the page's HTML):

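<?php

// Collect libxml warnings instead of silencing them with @
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->preserveWhiteSpace = false; // only takes effect if set before loading
$doc->loadHTML($output);

$xpath = new DOMXPath($doc);

$elements = $xpath->query("//*[@class='company']");

// query() returns false (not null) when the expression is invalid
if ($elements !== false) {
    foreach ($elements as $element) {
        echo $element->nodeValue;
    }
}

?>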

It seems to find all 15 divs with class="company", but I have no idea how to extract the individual values marked in the HTML comments above.

Not every company div contains all of the values shown in the HTML block. So for each value I need to check whether the element that holds it exists inside the company div and, if it does, whether it is non-empty (has text between its tags). Only if it exists and is non-empty should it be added to a variable.

Once the values are extracted, I would like to assign them to PHP variables so that I can work with them afterwards. It would be even better if the extracted values were put in an array like this:

$result = array(
    // 1st div's data
    0 => array(
        'company name'     => 'company name',
        'company link'     => 'company link',
        'company id'       => 'company id',
        'company branches' => 'branches link',
        'company logo'     => 'logo',
        'company address'  => 'address',
        'company slogan'   => 'slogan',
        'company webpage'  => 'webpage',
        'company email'    => 'email',
        'company phone'    => 'phone',
    ),

    // 2nd div's data
    1 => array(
        'company name'     => 'company name',
        'company link'     => 'company link',
        'company id'       => 'company id',
        'company branches' => 'branches link',
        'company logo'     => 'logo',
        'company address'  => 'address',
        'company slogan'   => 'slogan',
        'company webpage'  => 'webpage',
        'company email'    => 'email',
        'company phone'    => 'phone',
    ),
    // ...
);

2 answers

  • dqysi86208 2013-04-12 23:10

    Each company can be represented by a context node, with each property defined by an XPath expression that is evaluated relative to that node:

    Company company-6666:
     ->id ....... = "company-6666"    --    string(@id)
     ->name ..... = "Company Name"    --    .//a[1]/text()
     ->href ..... = "/company-name"    --    .//a[1]/@href
     ->img ...... = "/graphics/company/logo/listing/123456.jpg?_ts=1365390237"    --    .//img[1]/@src
     ->address .. = "StreetName 500, 7777 City, County"    --    .//*[@class="address"]/text()
     ...
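
    For reference, the same idea can be tried with nothing but the built-in DOMXPath class. This is only a minimal sketch: the trimmed $html sample and the $props array are made up for illustration. DOMXPath::evaluate() accepts a context node as its second argument, so relative expressions like the ones listed above are resolved against each company element:

    ```php
    <?php
    // Illustrative input only: a trimmed version of the markup from the question.
    $html = '<div class="company" id="company-6666">
      <h2><a href="/company-name">Company Name</a></h2>
      <div class="address">StreetName 500, 7777 City, County</div>
    </div>';

    libxml_use_internal_errors(true); // collect parser warnings instead of printing them
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $xpath = new DOMXPath($doc);

    $props = array();
    foreach ($xpath->query('//*[@class="company"]') as $company) {
        // The second argument is the context node; "." and "@" resolve against it.
        // string() returns the first match as text, or "" when nothing matches.
        $props[] = array(
            'id'   => $xpath->evaluate('string(@id)', $company),
            'name' => $xpath->evaluate('string(.//a[1]/text())', $company),
            'href' => $xpath->evaluate('string(.//a[1]/@href)', $company),
            'logo' => $xpath->evaluate('string(.//img[1]/@src)', $company), // no <img> in this sample
        );
    }
    print_r($props);
    ```

    Wrapping each path in string() means a missing element simply yields an empty string, which is easy to test for afterwards.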
    

    If you wrap that into objects, this is pretty nifty to use:

    $doc = new DOMDocument();
    $doc->loadHTML($html);
    
    /* @var $companies DOMValueObject[] */
    $companies = new Companies($doc);
    
    foreach ($companies as $company) {
        printf("Company %s:\n", $company->id);
        foreach ($company->getObjectProperties() as $name => $value) {
            $expression = $company->getPropertyExpression($name);
            printf(" ->%'.-10s = \"%s\"    --    %s\n", $name.' ', $value, $expression);
        }
    }
    

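    This builds on two helper classes, DOMValueCollection and DOMValueObject, which are not part of PHP's built-in DOM extension and have to be supplied separately. You define your own type on top of them: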

    class Companies extends DOMValueCollection
    {
        public function __construct(DOMDocument $doc) {
            parent::__construct($doc, '//*[@class="company"]');
        }
    
        /**
         * @return DOMValueObject
         */
        public function current() {
            $object = parent::current();
            $object->defineProperty('id', 'string(@id)');
            $object->defineProperty('name', './/a[1]/text()');
            $object->defineProperty('href', './/a[1]/@href');
            $object->defineProperty('img', './/img[1]/@src');
            $object->defineProperty('address', './/*[@class="address"]/text()');
            # ... add your definitions
            return $object;
        }
    }
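
    One caveat about the '//*[@class="company"]' expression used above: it is an exact string comparison, so it would miss an element written as class="company featured". The sample markup uses exactly "company", so this may never matter, but if the real pages add extra class names, the usual XPath 1.0 idiom for matching a single class token looks roughly like this. The hasClass() helper and the three-div markup are hypothetical, for illustration only:

    ```php
    <?php
    // Hypothetical helper: builds an XPath 1.0 predicate that matches one class token.
    // $class is assumed to be a plain class name without quotes or spaces.
    function hasClass($class)
    {
        return "contains(concat(' ', normalize-space(@class), ' '), ' $class ')";
    }

    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    // Made-up markup: the second div carries an extra class token,
    // the third only shares a prefix with "company".
    $doc->loadHTML('<div class="company" id="a"></div>
    <div class="company featured" id="b"></div>
    <div class="company-list" id="c"></div>');
    $xpath = new DOMXPath($doc);

    $exact = $xpath->query('//*[@class="company"]');
    $token = $xpath->query('//*[' . hasClass('company') . ']');

    $ids = array();
    foreach ($token as $node) {
        $ids[] = $node->getAttribute('id');
    }
    printf("exact: %d, token: %d (%s)\n", $exact->length, $token->length, implode(', ', $ids));
    // prints "exact: 1, token: 2 (a, b)"
    ```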
    

    And for your array requirements there is a getArrayCopy() method:

    echo "\nGet Array Copy:\n\n";
    
    print_r($companies->getArrayCopy());
    

    Output:

    Get Array Copy:
    
    Array
    (
        [0] => Array
            (
                [id] => company-6666
                [name] => Company Name
                [href] => /company-name
                [img] => /graphics/company/logo/listing/123456.jpg?_ts=1365390237
                [address] => StreetName 500, 7777 City, County
            )
    
    )
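
    If you would rather not pull in helper classes, roughly the same array can be built with plain DOMXPath. The following is a sketch under assumptions: the function name extractCompanies() is invented, the keys are taken from the question, and the XPath expressions for the fields not covered above are guesses based on the sample markup, so adjust them to the real pages. Values that are missing or empty are left out of each row, which is what the question asks for:

    ```php
    <?php
    // Sketch only: extractCompanies() and the expression map are assumptions.
    function extractCompanies($html)
    {
        // result key => XPath expression relative to one company div
        $fields = array(
            'company name'     => './/h2/a[1]',
            'company link'     => './/h2/a[1]/@href',
            'company id'       => '@id',
            'company branches' => './/a[@class="branches"]/@href',
            'company logo'     => './/*[@class="logo"]//img/@src',
            'company address'  => './/*[@class="address"]',
            'company slogan'   => './/*[@class="slogan"]',
            'company webpage'  => './/a[@class="redirect url"]',
            'company email'    => './/a[@class="email"]',
            'company phone'    => './/*[@class="phone-number"]',
        );

        $previous = libxml_use_internal_errors(true); // real-world HTML is rarely valid
        $doc = new DOMDocument();
        $doc->loadHTML($html);
        libxml_clear_errors();
        libxml_use_internal_errors($previous);

        $xpath  = new DOMXPath($doc);
        $result = array();
        foreach ($xpath->query('//div[@class="company"]') as $company) {
            $row = array();
            foreach ($fields as $key => $expression) {
                // normalize-space() yields "" for a missing or whitespace-only node
                $value = $xpath->evaluate("normalize-space($expression)", $company);
                if ($value !== '') {
                    $row[$key] = $value; // keep only values that exist and are non-empty
                }
            }
            $result[] = $row;
        }
        return $result;
    }

    // Made-up sample: the first company has an empty slogan, the second has almost nothing.
    $sample = '
    <div class="company" id="company-6666">
      <h2><a href="/company-name">Company Name</a>
          <a href="/branches-list-link?parent_id=6666" class="branches">Branches <span>(5)</span></a></h2>
      <div class="address">StreetName 500, 7777 City, County</div>
      <div class="slogan"></div>
      <span class="phone-number">6666666</span>
    </div>
    <div class="company" id="company-7777">
      <h2><a href="/other">Other Company</a></h2>
    </div>';

    $companies = extractCompanies($sample);
    print_r($companies);
    ```

    Each row then contains only the keys that were actually found, so isset($row['company slogan']) is enough to tell whether a value was present on the page.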
    
    This answer was accepted by the asker as the best answer.