dragonsun2005 2013-04-12 12:42

Extracting data from HTML using PHP and XPath

I am trying to extract data from a webpage and insert it into a database. The data I'm interested in is in the divs that have class="company". Each page contains 15 or fewer such divs, and there are many pages to extract from, so I am looking for an automated solution.

A div with class="company" looks as follows (each page has 15 or fewer of these, with different data):

<div class="company" id="company-6666"> <!-- EXTRACT 'company-6666' from id="company-6666" -->

  <div class="top clearfix">
    <div class="name clearfix">
      <h2>
        <a href="/company-name">Company Name</a>&nbsp; <!-- EXTRACT 'Company Name' from contents of A element and EXTRACT '/company-name' from href attribute -->
        <a href="/branches-list-link?parent_id=6666" class="branches">Branches <span>(5)</span></a> <!-- EXTRACT '/branches-list-link?parent_id=6666' from href attribute -->               
      </h2>
    </div>
  </div>

  <div class="inner clearfix has-logo">

    <div class="clearfix">          
      <div class="logo">
        <a href="/company-name">
          <img src="/graphics/company/logo/listing/123456.jpg?_ts=1365390237" border="0" alt="" /> <!-- EXTRACT '/graphics/company/logo/listing/123456.jpg?_ts=1365390237' from src attribute -->
        </a>
      </div>
      <div class="info">
        <div class="address">StreetName 500, 7777 City, County</div> <!-- EXTRACT 'StreetName 500, 7777 City, County' from contents of class="address" div -->
        <div class="clearfix">
          <div class="slogan">Lorem ipsum dolor sit amet, consectetur adipiscing elit. Morbi ac condimentum mi.</div> <!-- EXTRACT 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Morbi ac condimentum mi.' from contents of class="slogan" div -->
        </div>
      </div>
    </div>

    <div class="actions-bar clearfix">
      <ul>              
        <li><span class="phone-number">6666666</span></li> <!-- EXTRACT '6666666' from contents of class="phone-number" span -->
        <li><a href="mailto:mail@mail.com" target="_blank" title="mail@mail.com" class="email">mail@mail.com</a></li> <!-- EXTRACT 'mail@mail.com' from contents of class="email" A element -->
        <li><a href="http://www.webpage.com" target="_blank" title="www.webpage.com" class="redirect url">www.webpage.com</a></li> <!-- EXTRACT 'www.webpage.com' from contents of class="redirect url" A element -->
      </ul>
    </div>

  </div>

</div>

So far I have the following PHP code ($output holds the webpage's HTML):

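<?php

// Collect libxml parse warnings instead of silencing them with @.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($output);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

$elements = $xpath->query("//*[@class='company']");

// query() returns false (not null) when the expression is invalid.
if ($elements !== false) {
    foreach ($elements as $element) {
        echo $element->nodeValue;
    }
}

?>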

It seems to get all 15 divs with class="company", but I have no idea how to extract the individual values mentioned in the HTML comments above.

Not every div with class="company" contains all of the values shown in the HTML block. So for each company div I have to check whether the inner element holding the data I'm interested in exists and, if it does, whether it is non-empty (i.e. contains text between its tags). If it exists and is not empty, I add its value to a variable.

Once the values are extracted, I would like to assign them to PHP variables so that I can work with them afterwards. It would be even better if the extracted values were put into an array like so:

$result = array(
    // 1st div's data
    0 => array(
        'company name' => 'company name',
        'company link' => 'company link',
        'company id' => 'company id',
        'company branches'  => 'branches link',
        'company logo'  => 'logo',
        'company address'  => 'address',
        'company slogan'  => 'slogan',
        'company webpage'  => 'webpage',
        'company email'  => 'email',
        'company phone'  => 'phone'
    ),
    // 2nd div's data
    1 => array(
        'company name' => 'company name',
        'company link' => 'company link',
        'company id' => 'company id',
        'company branches'  => 'branches link',
        'company logo'  => 'logo',
        'company address'  => 'address',
        'company slogan'  => 'slogan',
        'company webpage'  => 'webpage',
        'company email'  => 'email',
        'company phone'  => 'phone'
    ),
    // ...
);

2 answers

  • dqysi86208 2013-04-12 23:10

    Each company can be represented by a context node, with each of its properties represented by an XPath expression relative to that node:

    Company company-6666:
     ->id ....... = "company-6666"    --    string(@id)
     ->name ..... = "Company Name"    --    .//a[1]/text()
     ->href ..... = "/company-name"    --    .//a[1]/@href
     ->img ...... = "/graphics/company/logo/listing/123456.jpg?_ts=1365390237"    --    .//img[1]/@src
     ->address .. = "StreetName 500, 7777 City, County"    --    .//*[@class="address"]/text()
     ...
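
The same mapping can be tried out with nothing but PHP's built-in DOMXPath, by passing each company element as the context node to evaluate(). This is only a minimal sketch: the $html sample is cut down from the question's markup, and the $expressions and $result names are assumptions made for illustration. Each expression is wrapped in string() so that a missing node yields an empty string.

```php
<?php
// Sample markup reduced from the question; in practice $html is the fetched page.
$html = '<div class="company" id="company-6666">'
      . '<h2><a href="/company-name">Company Name</a></h2>'
      . '<div class="logo"><a href="/company-name">'
      . '<img src="/graphics/company/logo/listing/123456.jpg?_ts=1365390237" alt="" />'
      . '</a></div>'
      . '<div class="address">StreetName 500, 7777 City, County</div>'
      . '</div>';

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Property name => XPath expression, evaluated relative to one company node.
// (Illustrative map; extend it with further properties as needed.)
$expressions = array(
    'id'      => 'string(@id)',
    'name'    => 'string(.//a[1]/text())',
    'href'    => 'string(.//a[1]/@href)',
    'img'     => 'string(.//img[1]/@src)',
    'address' => 'string(.//*[@class="address"]/text())',
);

$result = array();
foreach ($xpath->query('//*[@class="company"]') as $company) {
    $row = array();
    foreach ($expressions as $name => $expression) {
        // The second argument makes $company the context node for "." and "@id".
        $row[$name] = trim($xpath->evaluate($expression, $company));
    }
    $result[] = $row;
}

print_r($result);
```

Because every expression starts from the context node, the same map works unchanged for each of the company divs on a page.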
    

    If you wrap that into objects, this is pretty nifty to use:

    $doc = new DOMDocument();
    $doc->loadHTML($html);
    
    /* @var $companies DOMValueObject[] */
    $companies = new Companies($doc);
    
    foreach ($companies as $company) {
        printf("Company %s:\n", $company->id);
        foreach ($company->getObjectProperties() as $name => $value) {
            $expression = $company->getPropertyExpression($name);
            printf(" ->%'.-10s = \"%s\"    --    %s\n", $name.' ', $value, $expression);
        }
    }
    

    This works with the helper classes DOMObjectCollection and DOMValueObject (custom classes, not part of PHP's DOM extension), defining your own type:

    class Companies extends DOMValueCollection
    {
        public function __construct(DOMDocument $doc) {
            parent::__construct($doc, '//*[@class="company"]');
        }
    
        /**
         * @return DOMValueObject
         */
        public function current() {
            $object = parent::current();
            $object->defineProperty('id', 'string(@id)');
            $object->defineProperty('name', './/a[1]/text()');
            $object->defineProperty('href', './/a[1]/@href');
            $object->defineProperty('img', './/img[1]/@src');
            $object->defineProperty('address', './/*[@class="address"]/text()');
            # ... add your definitions
            return $object;
        }
    }
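
The helper classes themselves are not shown in this answer. Purely to illustrate how such a wrapper could work, here is a hypothetical stand-in. It is not the author's implementation, and the names SketchValueObject, SketchValueCollection and SketchCompanies are invented for this sketch. It assumes each property is the string value of an XPath expression evaluated relative to the wrapped node.

```php
<?php
// Hypothetical stand-ins written for this sketch; they are NOT the helper
// classes the answer refers to, only one way similar behaviour could be built.
class SketchValueObject
{
    private $xpath;
    private $node;
    private $expressions = array();

    public function __construct(DOMXPath $xpath, DOMNode $node)
    {
        $this->xpath = $xpath;
        $this->node  = $node;
    }

    public function defineProperty($name, $expression)
    {
        $this->expressions[$name] = $expression;
    }

    public function getPropertyExpression($name)
    {
        return $this->expressions[$name];
    }

    // Evaluate the stored expression relative to the wrapped node.
    // string() yields "" when nothing matches.
    public function __get($name)
    {
        $expression = 'string(' . $this->expressions[$name] . ')';
        return $this->xpath->evaluate($expression, $this->node);
    }

    public function getObjectProperties()
    {
        $values = array();
        foreach (array_keys($this->expressions) as $name) {
            $values[$name] = $this->__get($name);
        }
        return $values;
    }
}

// Wraps the node list of a query and hands out one value object per node.
class SketchValueCollection extends IteratorIterator
{
    private $xpath;

    public function __construct(DOMDocument $doc, $expression)
    {
        $this->xpath = new DOMXPath($doc);
        parent::__construct($this->xpath->query($expression));
    }

    #[\ReturnTypeWillChange]
    public function current()
    {
        return new SketchValueObject($this->xpath, parent::current());
    }

    public function getArrayCopy()
    {
        $rows = array();
        foreach ($this as $object) {
            $rows[] = $object->getObjectProperties();
        }
        return $rows;
    }
}

class SketchCompanies extends SketchValueCollection
{
    public function __construct(DOMDocument $doc)
    {
        parent::__construct($doc, '//*[@class="company"]');
    }

    #[\ReturnTypeWillChange]
    public function current()
    {
        $object = parent::current();
        $object->defineProperty('id', '@id');
        $object->defineProperty('name', './/a[1]/text()');
        return $object;
    }
}

$doc = new DOMDocument();
$doc->loadHTML('<div class="company" id="company-6666"><h2><a href="/company-name">Company Name</a></h2></div>');
$rows = (new SketchCompanies($doc))->getArrayCopy();
print_r($rows);
```

Extending IteratorIterator keeps the sketch short, since the node list returned by query() is already traversable and only current() needs to be overridden.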
    

    And for your array requirements there is a getArrayCopy() method:

    echo "\nGet Array Copy:\n\n";
    
    print_r($companies->getArrayCopy());
    

    Output:

    Get Array Copy:
    
    Array
    (
        [0] => Array
            (
                [id] => company-6666
                [name] => Company Name
                [href] => /company-name
                [img] => /graphics/company/logo/listing/123456.jpg?_ts=1365390237
                [address] => StreetName 500, 7777 City, County
            )
    
    )
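
The question also asks to keep a value only when its element exists and is not empty. A possible way to handle that with plain DOMXPath is sketched below. The firstValue() and hasClass() helpers are hypothetical names introduced here, and the expressions in $fields are assumptions derived from the sample markup in the question. hasClass() matches one token of a multi-valued class attribute such as "redirect url", which an exact @class="..." comparison would miss.

```php
<?php
// Hypothetical helper: string value of the first node matching $expression
// relative to $context, or null when it is missing or only whitespace.
function firstValue(DOMXPath $xpath, DOMNode $context, $expression)
{
    $value = trim($xpath->evaluate('string(' . $expression . ')', $context));
    return $value === '' ? null : $value;
}

// Hypothetical helper: XPath test for one token in a space-separated class list.
function hasClass($token)
{
    return 'contains(concat(" ", normalize-space(@class), " "), " ' . $token . ' ")';
}

// Two sample companies; the second one lacks most fields and has a blank address.
$html = '<div class="company" id="company-1">'
      . '<h2><a href="/one">One</a></h2>'
      . '<div class="address">StreetName 500</div>'
      . '<ul><li><span class="phone-number">6666666</span></li>'
      . '<li><a href="http://www.webpage.com" class="redirect url">www.webpage.com</a></li></ul>'
      . '</div>'
      . '<div class="company" id="company-2">'
      . '<h2><a href="/two">Two</a></h2>'
      . '<div class="address">   </div>'
      . '</div>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

// Array key => expression relative to a company div (assumed from the sample markup).
$fields = array(
    'company name'    => './/h2/a[1]',
    'company link'    => './/h2/a[1]/@href',
    'company id'      => '@id',
    'company address' => './/*[' . hasClass('address') . ']',
    'company phone'   => './/*[' . hasClass('phone-number') . ']',
    'company webpage' => './/a[' . hasClass('url') . ']',
);

$result = array();
foreach ($xpath->query('//div[' . hasClass('company') . ']') as $company) {
    $row = array();
    foreach ($fields as $key => $expression) {
        $value = firstValue($xpath, $company, $expression);
        if ($value !== null) {
            $row[$key] = $value;   // keep only values that exist and are non-empty
        }
    }
    $result[] = $row;
}

print_r($result);
```

Note that trim() strips ordinary whitespace only, so an element containing just a non-breaking space would still count as non-empty in this sketch.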
    