dongshi2141
2017-10-04 21:55
Views: 102
Accepted

Scraping specific text from a web page using XPath

I've searched and tried multiple ways to get this but I'm not sure why it won't find most of the information on the webpage.

Page to scrape: https://m.safeguardproperties.com/

Info needed: Version number for PhotoDirect for Apple (currently 4.4.0)

XPath to the text needed (I think): /html/body/div[1]/div[2]/div[1]/div[4]/div[3]/a

Attempts:

<?php

$file = "https://m.safeguardproperties.com/";
$doc = new DOMDocument();
$doc->loadHTMLFile($file);

$xpath = new DOMXpath($doc);

$elements = $xpath->query("/html/body/div[1]/div[2]/div[1]/div[4]/div[3]/a");

echo "<PRE>";

// query() returns false on an invalid expression, never null
if ($elements !== false) {
  foreach ($elements as $element) {
    var_dump($element);
    echo "<br/>[" . $element->nodeName . "]";

    $nodes = $element->childNodes;
    foreach ($nodes as $node) {
      echo $node->nodeValue . "\n";
    }
  }
}

echo "</PRE>";

?>

Second Attempt:

<?PHP
$file = "https://m.safeguardproperties.com/";
$doc = new DOMDocument();
$doc->loadHTMLFile($file);

echo '<pre>';

  // trying to find all links in document to see if I can see the correct one
  $links = [];
  $arr = $doc->getElementsByTagName("a");

  foreach($arr as $item) { 
    $href =  $item->getAttribute("href");
    $text = trim(preg_replace("/[\r\n]+/", " ", $item->nodeValue));
    $links[] = [
      'href' => $href,
      'text' => $text
    ];
  }

var_dump($links);
echo '</pre>';
?>


1 answer

  • donglin4636 2017-10-04 22:18
    Accepted

    For that particular website, the versions are loaded client-side from JSON data, so you won't find them in the base document that PHP downloads.

    http://m.safeguardproperties.com/js/photodirect.json

    This was located by comparing the original document source to the finished DOM and inspecting the network activity in the developer console.

    $url = 'https://m.safeguardproperties.com/js/photodirect.json';
    $json = file_get_contents( $url );
    $object = json_decode( $json );
    echo $object->ios->version; //4.4.0
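
    If the request fails or the JSON changes shape, the snippet above will emit warnings. Here is a slightly more defensive sketch of the same idea. It assumes the file keeps the `ios` / `version` layout used above; the helper names `extractIosVersion` and `fetchIosVersion` are made up for illustration.

    ```php
    <?php
    // Hypothetical helper: return ios->version from a JSON string,
    // or null if the JSON is invalid or the key is missing.
    function extractIosVersion(string $json): ?string
    {
        $data = json_decode($json, true); // decode to an associative array
        if (!is_array($data)) {
            return null; // invalid JSON, or not an object/array at the top level
        }
        $version = $data['ios']['version'] ?? null;
        return is_string($version) ? $version : null;
    }

    // Hypothetical helper: download the JSON and extract the version.
    // Returns null if the request fails.
    function fetchIosVersion(string $url): ?string
    {
        $json = @file_get_contents($url); // false on failure
        return $json === false ? null : extractIosVersion($json);
    }
    ```

    Calling `fetchIosVersion('https://m.safeguardproperties.com/js/photodirect.json')` would then give either the version string or `null`, which you can check before printing.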
    

    Please respect other websites and cache your GET request.
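
    One way to do that caching is a small file cache with a time-to-live. This is only a sketch: the function name `cachedGet`, the cache file path and the one-hour default are assumptions, and the optional `$fetch` callback exists only so the download step can be swapped out.

    ```php
    <?php
    // Hypothetical helper: return the body for $url, reusing a local copy
    // for $ttlSeconds before asking the remote server again.
    function cachedGet(string $url, string $cacheFile, int $ttlSeconds = 3600, ?callable $fetch = null): ?string
    {
        clearstatcache(true, $cacheFile); // avoid stale filemtime() results

        // Serve the cached copy while it is still fresh.
        if (is_file($cacheFile) && (time() - filemtime($cacheFile)) < $ttlSeconds) {
            $cached = file_get_contents($cacheFile);
            if ($cached !== false) {
                return $cached;
            }
        }

        // Default downloader: a plain GET via file_get_contents().
        $fetch = $fetch ?? static function (string $u) {
            return @file_get_contents($u);
        };

        $body = $fetch($url);
        if (!is_string($body)) {
            // Download failed: fall back to a stale copy if one exists.
            $stale = is_file($cacheFile) ? file_get_contents($cacheFile) : false;
            return $stale === false ? null : $stale;
        }

        file_put_contents($cacheFile, $body, LOCK_EX);
        return $body;
    }
    ```

    For example, `cachedGet($url, sys_get_temp_dir() . '/photodirect.json')` would hit the remote server at most about once an hour, and the result can be passed to `json_decode()` as before.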
