dragonsun2005 2013-04-12 12:42
Views: 110
Accepted

Extracting data from HTML with PHP and XPath

I am trying to extract data from a webpage and insert it into a database. The data I'm interested in is in the divs that have class="company". Each page has 15 or fewer of these divs, and there are many pages to process, so I am looking for an automated way to extract the data.

A div with class="company" looks like this (each page has 15 or fewer such divs, each with different data):

<div class="company" id="company-6666"> <!-- EXTRACT 'company-6666' from id="company-6666" -->

  <div class="top clearfix">
    <div class="name clearfix">
      <h2>
        <a href="/company-name">Company Name</a>&nbsp; <!-- EXTRACT 'Company Name' from contents of A element and EXTRACT '/company-name' from href attribute -->
        <a href="/branches-list-link?parent_id=6666" class="branches">Branches <span>(5)</span></a> <!-- EXTRACT '/branches-list-link?parent_id=6666' from href attribute -->               
      </h2>
    </div>
  </div>

  <div class="inner clearfix has-logo">

    <div class="clearfix">          
      <div class="logo">
        <a href="/company-name">
          <img src="/graphics/company/logo/listing/123456.jpg?_ts=1365390237" border="0" alt="" /> <!-- EXTRACT '/graphics/company/logo/listing/123456.jpg?_ts=1365390237' from src attribute -->
        </a>
      </div>
      <div class="info">
        <div class="address">StreetName 500, 7777 City, County</div> <!-- EXTRACT 'StreetName 500, 7777 City, County' from contents of class="address" div -->
        <div class="clearfix">
          <div class="slogan">Lorem ipsum dolor sit amet, consectetur adipiscing elit. Morbi ac condimentum mi.</div> <!-- EXTRACT 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Morbi ac condimentum mi.' from contents of class="slogan" div -->
        </div>
      </div>
    </div>

    <div class="actions-bar clearfix">
      <ul>              
        <li><span class="phone-number">6666666</span></li> <!-- EXTRACT '6666666' from contents of class="phone-number" span -->
        <li><a href="mailto:mail@mail.com" target="_blank" title="mail@mail.com" class="email">mail@mail.com</a></li> <!-- EXTRACT 'mail@mail.com' from contents of class="email" link -->
        <li><a href="http://www.webpage.com" target="_blank" title="www.webpage.com" class="redirect url">www.webpage.com</a></li> <!-- EXTRACT 'www.webpage.com' from contents of class="redirect url" link -->
      </ul>
    </div>

  </div>

</div>

So far I have the following PHP code ($output holds the webpage's HTML):

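<?php

$doc = new DOMDocument();
// Parser options are read at load time, so set them before loadHTML().
$doc->preserveWhiteSpace = false;
// @ hides the warnings libxml raises for markup that is not well-formed.
@$doc->loadHTML($output);

$xpath = new DOMXPath($doc);

// Matches only elements whose class attribute is exactly "company".
$elements = $xpath->query("//*[@class='company']");

// query() returns false for a malformed expression, never null.
if ($elements !== false) {
    foreach ($elements as $element) {
        echo $element->nodeValue;
    }
}

?>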

It seems to get all 15 divs with class="company", but I have no idea how to extract the individual values marked in the comments of the HTML above.

Not every company div (the div with class="company") contains all of the values shown in the HTML block. So for each value I have to check whether the element holding it exists inside the company div and, if it does, whether it is non-empty (has text between its tags). Only if it exists and is not empty do I add it to a variable.

Once the values are extracted, I would like to assign them to PHP variables so that I can work with them afterwards. It would be even better if the extracted values were put into an array like this:

$result = array(
    // 1st div's data
    0 => array(
        'company name' => 'company name',
        'company link' => 'company link',
        'company id' => 'company id',
        'company branches'  => 'branches link',
        'company logo'  => 'logo',
        'company address'  => 'address',
        'company slogan'  => 'slogan',
        'company webpage'  => 'webpage',
        'company email'  => 'email',
        'company phone'  => 'phone',
    ),

    // 2nd div's data
    1 => array(
        'company name' => 'company name',
        'company link' => 'company link',
        'company id' => 'company id',
        'company branches'  => 'branches link',
        'company logo'  => 'logo',
        'company address'  => 'address',
        'company slogan'  => 'slogan',
        'company webpage'  => 'webpage',
        'company email'  => 'email',
        'company phone'  => 'phone',
    ),
    // ...
);

2 answers

  • dqysi86208 2013-04-12 23:10

    Each company can be represented by a context node, with each property defined by an XPath expression relative to that node:

    Company company-6666:
     ->id ....... = "company-6666"    --    string(@id)
     ->name ..... = "Company Name"    --    .//a[1]/text()
     ->href ..... = "/company-name"    --    .//a[1]/@href
     ->img ...... = "/graphics/company/logo/listing/123456.jpg?_ts=1365390237"    --    .//img[1]/@src
     ->address .. = "StreetName 500, 7777 City, County"    --    .//*[@class="address"]/text()
     ...
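
The mapping above can be tried without any wrapper classes. The following is a minimal sketch using only PHP's built-in DOMXPath, assuming a reduced copy of the question's markup; the `$expressions` array and the `string()` wrappers are illustrative choices, not part of the original answer.

```php
<?php
// Sample markup reduced from the question; real pages carry more fields.
$html = <<<'HTML'
<div class="company" id="company-6666">
  <h2><a href="/company-name">Company Name</a></h2>
  <img src="/graphics/company/logo/listing/123456.jpg?_ts=1365390237" alt="" />
  <div class="address">StreetName 500, 7777 City, County</div>
</div>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Property name => XPath expression relative to one company node.
$expressions = [
    'id'      => 'string(@id)',
    'name'    => 'string(.//a[1])',
    'href'    => 'string(.//a[1]/@href)',
    'img'     => 'string(.//img[1]/@src)',
    'address' => 'string(.//*[@class="address"])',
];

$companies = [];
foreach ($xpath->query('//*[@class="company"]') as $node) {
    $row = [];
    foreach ($expressions as $name => $expression) {
        // The second argument makes $node the context node for "." and "@id".
        $row[$name] = trim($xpath->evaluate($expression, $node));
    }
    $companies[] = $row;
}

print_r($companies);
```

Passing the company node as the second argument to `evaluate()` is what keeps each expression scoped to a single company instead of the whole document.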
    

    If you wrap that into objects, this is pretty nifty to use:

    $doc = new DOMDocument();
    $doc->loadHTML($html);
    
    /* @var $companies DOMValueObject[] */
    $companies = new Companies($doc);
    
    foreach ($companies as $company) {
        printf("Company %s:\n", $company->id);
        foreach ($company->getObjectProperties() as $name => $value) {
            $expression = $company->getPropertyExpression($name);
            printf(" ->%'.-10s = \"%s\"    --    %s\n", $name.' ', $value, $expression);
        }
    }
    

    This works with DOMValueCollection and DOMValueObject (helper classes, not part of PHP's built-in DOM extension), from which you define your own type:

    class Companies extends DOMValueCollection
    {
        public function __construct(DOMDocument $doc) {
            parent::__construct($doc, '//*[@class="company"]');
        }
    
        /**
         * @return DOMValueObject
         */
        public function current() {
            $object = parent::current();
            $object->defineProperty('id', 'string(@id)');
            $object->defineProperty('name', './/a[1]/text()');
            $object->defineProperty('href', './/a[1]/@href');
            $object->defineProperty('img', './/img[1]/@src');
            $object->defineProperty('address', './/*[@class="address"]/text()');
            # ... add your definitions
            return $object;
        }
    }
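
Since the helper classes themselves are not shown here, the following is a hypothetical stand-in that illustrates the idea of a value object backed by XPath expressions. The class name `XPathValueObject` and its internals are assumptions made for illustration (PHP 7.4+ for typed properties); it is not the answer's DOMValueObject, only a sketch of how `defineProperty()` and `getArrayCopy()` could behave.

```php
<?php
/**
 * Hypothetical stand-in, not the answer's DOMValueObject: maps property
 * names to XPath expressions and evaluates them against one context node.
 */
class XPathValueObject
{
    private DOMXPath $xpath;
    private DOMNode $context;
    /** @var array<string, string> property name => XPath expression */
    private array $expressions = [];

    public function __construct(DOMXPath $xpath, DOMNode $context)
    {
        $this->xpath = $xpath;
        $this->context = $context;
    }

    public function defineProperty(string $name, string $expression): void
    {
        $this->expressions[$name] = $expression;
    }

    /** Evaluates the expression lazily, on property access. */
    public function __get(string $name): string
    {
        if (!isset($this->expressions[$name])) {
            throw new OutOfBoundsException("Undefined property: $name");
        }
        // Wrapping in string() turns a node-set into its first node's text.
        $expression = sprintf('string(%s)', $this->expressions[$name]);
        return trim($this->xpath->evaluate($expression, $this->context));
    }

    public function getArrayCopy(): array
    {
        $copy = [];
        foreach (array_keys($this->expressions) as $name) {
            $copy[$name] = $this->__get($name);
        }
        return $copy;
    }
}

$doc = new DOMDocument();
$doc->loadHTML('<div class="company" id="company-6666"><a href="/company-name">Company Name</a></div>');
$xpath = new DOMXPath($doc);

$company = new XPathValueObject($xpath, $xpath->query('//*[@class="company"]')->item(0));
$company->defineProperty('id', 'string(@id)');
$company->defineProperty('name', './/a[1]/text()');
$company->defineProperty('href', './/a[1]/@href');

print_r($company->getArrayCopy());
```

Evaluating on access keeps the object cheap to create; a collection class would only need to build one such object per node matched by `//*[@class="company"]`.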
    

    And for your array requirements there is a getArrayCopy() method:

    echo "\nGet Array Copy:\n\n";
    
    print_r($companies->getArrayCopy());
    

    Output:

    Get Array Copy:
    
    Array
    (
        [0] => Array
            (
                [id] => company-6666
                [name] => Company Name
                [href] => /company-name
                [img] => /graphics/company/logo/listing/123456.jpg?_ts=1365390237
                [address] => StreetName 500, 7777 City, County
            )
    
    )
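
The question also asks how to skip values that are missing or empty and how to cover the remaining fields (branches, logo, slogan, webpage, email, phone). Below is a minimal sketch with plain DOMXPath. The function name `extractCompanies()` is hypothetical, the XPath expressions are assumptions based on the markup in the question, and the sample HTML is invented to show two companies with different fields present.

```php
<?php
/**
 * Hypothetical helper: returns one array per company div and leaves out
 * every field that is absent or empty after trimming.
 */
function extractCompanies(string $html): array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence markup warnings
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // Keys follow the array layout requested in the question.
    $fields = [
        'company name'     => 'string(.//h2/a[1])',
        'company link'     => 'string(.//h2/a[1]/@href)',
        'company id'       => 'string(@id)',
        'company branches' => 'string(.//a[@class="branches"]/@href)',
        'company logo'     => 'string(.//div[@class="logo"]//img/@src)',
        'company address'  => 'string(.//div[@class="address"])',
        'company slogan'   => 'string(.//div[@class="slogan"])',
        // Token match, because the class attribute here is "redirect url".
        'company webpage'  => 'string(.//a[contains(concat(" ", normalize-space(@class), " "), " url ")])',
        'company email'    => 'string(.//a[@class="email"])',
        'company phone'    => 'string(.//span[@class="phone-number"])',
    ];

    $result = [];
    foreach ($xpath->query('//div[@class="company"]') as $company) {
        $row = [];
        foreach ($fields as $key => $expression) {
            // string() yields '' when nothing matches, so one test covers
            // both "element missing" and "element present but empty".
            $value = trim($xpath->evaluate($expression, $company));
            if ($value !== '') {
                $row[$key] = $value;
            }
        }
        $result[] = $row;
    }
    return $result;
}

// Invented sample: the first company has a blank slogan, the second has no address.
$html = <<<'HTML'
<div class="company" id="company-1">
  <h2><a href="/first">First</a></h2>
  <div class="address">Street 1</div>
  <div class="slogan">   </div>
</div>
<div class="company" id="company-2">
  <h2><a href="/second">Second</a></h2>
  <ul><li><span class="phone-number">6666666</span></li>
  <li><a href="http://www.webpage.com" class="redirect url">www.webpage.com</a></li></ul>
</div>
HTML;

print_r(extractCompanies($html));
```

If every row should carry the same keys (for example, to map directly onto database columns), storing `null` in place of skipping the key would be a reasonable alternative.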
    