PHP: include any href links contained in the text echoed from an XPath query

I'm trying to get good at PHP web scraping. I've been running some tests and have nailed scraping information from one site and echoing it on another, but I'm unable to also include the original links from the source markup, which is what I'd ideally like to do. Any thoughts on how to accomplish this with what I've got thus far? (I'm very new to PHP, by the way.)

This is the PHP code:

// news
$doc = new DOMDocument;

// We don't want to bother with white spaces
$doc->preserveWhiteSpace = false;

// Most HTML Developers are chimps and produce invalid markup...
$doc->strictErrorChecking = false;
$doc->recover = true;

$doc->loadHTMLFile('https://www.usatoday.com/');

$xpath = new DOMXPath($doc);

$query = "//ul[@class='hfwmm-list hfwmm-4uphp-list hfwmm-light-list']";

$entries = $xpath->query($query);
foreach ($entries as $entry) {
 echo trim($entry->textContent);  // use `trim` to eliminate spaces
}

That code is spitting out this: NBA Cavs win record-breaking Game 4 behind Irving's 40 Entertain This Watch: 'Black Panther' trailer unleashes a fearsome king News Police: London Bridge terrorists planned more bloodshed How Trump is highlighting divisions amo..........

What I'd really like is to have those headlines as working links, which is what they were in the original page. This is what the source markup for this information looked like:

<div class="partner-heroflip-ad partner-placement ui-flip-panel size-xxs"><a 
href="#" class="partner-close"></a></div></div><p class="hfwmm-tertiary-
list-title hfwmm-light-tertiary-list-title">TOP STORIES</p><ul class="hfwmm-
list hfwmm-4uphp-list hfwmm-light-list"
data-track-prefix="flex4uphphero"><li class="hfwmm-item hfwmm-secondary-item 
hfwmm-item-2 sports-theme-bg hfwmm-first-secondary-item hfwmm-4uphp-
secondary-item"
data-asset-position="1"
data-asset-id="102694848"
 ><a class="js-asset-link hfwmm-list-link hfwmm-light-list-link hfwmm-image-
link hfwmm-secondary-link
href="/story/sports/nba/2017/06/10/kyrie-irving-lebron-james-cavs-win-game-
4/102694848/"
data-track-display-type="thumb"
data-ht="flex4uphpherostack1"
data-asset-id="102694848"                 
><span class="hfwmm-image-gradient hfwmm-secondary-image-gradient"></span>
<span class="js-asset-section theme-bg-ssts-label hfwmm-ssts-label-top-left 
hfwmm-ssts-label-secondary sports-theme-bg">NBA</span><img 
src="https://www.gannett-cdn.com/-
mm-/cd17823b265aa373c83094fc75525710f645ec90/c=0-178-4072-
81338209183-USP-NBA-FINALS-GOLDEN-STATE-WARRIORS-AT-CLEVELAND-91573076.JPG"
 class="hfwmm-image hfwmm-secondary-image js-asset-image placeholder-hide"
  alt="Kyrie Irving reacts after making a basket against the"
  data-id="102695338"
  data-crop="16_9"
  width="239"
  height="135" /><span class="hfwmm-secondary-hed-wrap hfwmm-secondary-text-
hed-wrap"><span class="hfwmm-text-hed-icon js-asset-disposable"></span><span
  title="Cavs win record-breaking Game 4 behind Irving&#39;s 40"
  class="js-asset-headline hfwmm-list-hed hfwmm-secondary-hed placeholder-
hide">
      Cavs win record-breaking Game 4 behind Irving&#39;s 40
     hfwmm-item-3 life-theme-bg hfwmm-4uphp-secondary-item"
   data-asset-position="2"

For sanity, the href above is href="/story/sports/nba/2017/06/10/kyrie-irving-lebron-james-cavs-win-game-4/102694848/"

Any thoughts on how this might be accomplished in this test scenario, would be hugely helpful. Thank you very much. -wilson

duanpasi6287 2017-06-10 07:14
You need to output the element as a string; you're just extracting the text of the element, which is not the same thing in XML. If the element is <a>some text</a>, its text is simply "some text", without the tag or its attributes.

To output the tags, use...

$query = "//ul[@class='hfwmm-list hfwmm-4uphp-list hfwmm-light-list']//a";
$entries = $xpath->query($query);
foreach ($entries as $entry) {
    $newdoc = new DOMDocument();
    $cloned = $entry->cloneNode(TRUE);
    $newdoc->appendChild($newdoc->importNode($cloned, TRUE));
    echo $newdoc->saveHTML();
    //echo trim((string)$entry);  // use `trim` to eliminate spaces
}

Also note that I've added //a to the end of the XPath expression to limit the selection to the links inside the segment you were fetching. This may or may not be what you want, so look at the results and check.
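
As a possible shorter variant of the loop above: DOMDocument::saveHTML() accepts an optional node argument and serializes just that subtree, so the clone-and-import step may not be needed. The sketch below is only an illustration; the inline markup, the `stories` class, and the helper name `outerHtmlOfMatches` are made up for the example and are not from the original page.

```php
<?php
// Hypothetical inline markup standing in for the fetched page.
$html = '<ul class="stories">'
      . '<li><a href="/story/1/">First <span>headline</span></a></li>'
      . '<li><a href="/story/2/">Second headline</a></li>'
      . '</ul>';

/**
 * Return the serialized HTML (tags and attributes included) of every
 * node matched by the XPath expression $query.
 */
function outerHtmlOfMatches(string $html, string $query): array
{
    $doc = new DOMDocument();

    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $out = [];
    foreach ($xpath->query($query) as $node) {
        // Passing a node to saveHTML() serializes only that subtree.
        $out[] = $doc->saveHTML($node);
    }
    return $out;
}

foreach (outerHtmlOfMatches($html, "//ul[@class='stories']//a") as $link) {
    echo $link, "\n";
}
```

With textContent the same loop would print only the headline text; serializing the node keeps the <a> tag and its href.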

Edit:

To manipulate the href in the <a> element, use something like...

foreach ($entries as $entry) {
    $oldHref = (string)$entry->getAttribute("href");
    $entry->setAttribute("href", "http://someserver.com" . $oldHref);
    $newdoc = new DOMDocument();
    $cloned = $entry->cloneNode(TRUE);
    $newdoc->appendChild($newdoc->importNode($cloned, TRUE));
    echo $newdoc->saveHTML();
}
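
The loop above prefixes every href unconditionally, which would also alter links that are already absolute or that are just "#". A slightly more careful sketch might rewrite only root-relative paths (those starting with a single "/"). The helper name `absolutizeLinks` and the sample markup are assumptions for illustration; the base URL is the site from the question.

```php
<?php
/**
 * Return the HTML of each matched <a> element, with root-relative
 * hrefs (e.g. "/story/...") prefixed by $baseUrl.
 */
function absolutizeLinks(string $html, string $query, string $baseUrl): array
{
    $doc = new DOMDocument();

    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $out = [];
    foreach ($xpath->query($query) as $link) {
        if (!$link instanceof DOMElement) {
            continue; // skip anything that is not an element node
        }
        $href = $link->getAttribute('href');

        // Rewrite "/path" but leave "//host/path", full URLs, and "#" alone.
        $isRootRelative = strncmp($href, '/', 1) === 0
            && strncmp($href, '//', 2) !== 0;
        if ($isRootRelative) {
            $link->setAttribute('href', rtrim($baseUrl, '/') . $href);
        }
        $out[] = $doc->saveHTML($link);
    }
    return $out;
}

// Hypothetical sample markup: one relative link, one absolute link.
$sample = '<ul><li><a href="/story/1/">A</a></li>'
        . '<li><a href="https://example.org/x">B</a></li></ul>';

foreach (absolutizeLinks($sample, '//a', 'https://www.usatoday.com/') as $link) {
    echo $link, "\n";
}
```

This does not handle document-relative paths such as "story/1/" or "../x"; those would need proper URL resolution against the page address.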

This answer was accepted by the asker as the best answer.
