douou6696 2016-11-20 14:44

PHP DOMDocument: extract links with their anchor text or image alt

I wish to extract all the links on a page, each with its anchor text, or with the alt attribute of an image inside the link if the image comes first.

$html = '<a href="lien.fr">Anchor</a>';

Must return "lien.fr;Anchor"

$html = '<a href="lien.fr"><img alt="Alt Anchor">Anchor</a>';

Must return "lien.fr;Alt Anchor"

$html = '<a href="lien.fr">Anchor<img alt="Alt Anchor"></a>';

Must return "lien.fr;Anchor"

I did:

$doc = new DOMDocument();
$doc->loadHTML($html);

$out = []; // must be an array: $out[$n]['link'] is assigned below
$n = 0;
$links = $doc->getElementsByTagName('a');

foreach ($links as $element) {
    $href = $img_alt = $anchor = "";
    $href = $element->getAttribute('href');
    $n++;
    // strpos() returns 0 (falsy) for a match at position 0, so compare with false explicitly
    if (strpos($href, "panier?") === false) {

        if ($element->firstChild->nodeName == "img") {

            $imgs = $element->getElementsByTagName('img');

            foreach ($imgs as $img) {
                if ($anchor = $img->getAttribute('alt')) {
                    break;
                }
            }
        }

        if (($anchor == "") && ($element->nodeValue)) {
            $anchor = $element->nodeValue;
        }

        $out[$n]['link'] = $href;
        $out[$n]['anchor'] = $anchor;
    }
}

This seems to work, but it fails when there is whitespace or indentation inside the link, as in:

$html = '<a href="link.fr">
                    <img src="ceinture-gris" alt="alt anchor"/>
                </a>';

Here $element->firstChild is the whitespace text node, so its nodeName is "#text" rather than "img".
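
The difference between the two cases can be checked with a small diagnostic. This is only a sketch: the helper name firstChildName is hypothetical and not part of the original code, and it assumes the markup contains at least one <a> element with a child node.

```php
<?php
// Hypothetical helper (not from the original post): returns the nodeName of
// the first child of the first <a> element found in $html.
function firstChildName(string $html): string
{
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    // Assumes the markup contains at least one <a> with at least one child.
    $link = $doc->getElementsByTagName('a')->item(0);
    return $link->firstChild->nodeName;
}

// Image directly after the opening tag: the first child is the <img> element.
echo firstChildName('<a href="link.fr"><img alt="alt anchor"/></a>'), "\n";

// Newline and indentation before the image: the first child is a text node.
echo firstChildName("<a href=\"link.fr\">\n    <img alt=\"alt anchor\"/>\n</a>"), "\n";
```

Text nodes always report "#text" as their nodeName, which is why the comparison against "img" fails as soon as the link body starts with whitespace.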


1 answer

  • dongluanjie8678 2016-11-20 15:13

    Something like this:

    $doc = new DOMDocument();
    $doc->loadHTML($html);
    
    // Output texts that will later be joined with ';'
    $out = [];
    // Maximum number of items to add to $out
    $max_out_items = 2;
    // List of img tag attributes that will be parsed by the loop below
    // (in the order specified in this array!)
    $img_attributes = ['alt', 'src', 'title'];
    
    $links = $doc->getElementsByTagName('a');
    foreach ($links as $element) {
      if ($href = trim($element->getAttribute('href'))) {
        $out []= $href;
        if (count($out) >= $max_out_items)
          break;
      }
    
      foreach ($element->childNodes as $child) {
        if ($child->nodeType === XML_TEXT_NODE &&
          $text = trim($child->nodeValue))
        {
          $out []= $text;
          if (count($out) >= $max_out_items)
            break 2; // leave both loops; a plain break would let the next link append another item
        } elseif ($child->nodeName == 'img') {
          foreach ($img_attributes as $attr_name) {
            if ($attr_value = trim($child->getAttribute($attr_name))) {
              $out []= $attr_value;
              if (count($out) >= $max_out_items)
                goto Result;
            }
          }
        }
      }
    }
    
    Result:
    echo $out = implode(';', $out);
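
If one "href;anchor" pair is wanted for every link on the page rather than only the first, the same idea can be wrapped in a function. This is a minimal sketch, not the answer's own method: the name extractLinks is hypothetical, it only looks at the alt attribute, and the fallback to the link's full text content when no direct child yields an anchor is an assumption about the desired behavior.

```php
<?php
// Hypothetical helper (not from the original answer): returns one
// "href;anchor" string per <a> element that has a non-empty href.
function extractLinks(string $html): array
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $results = [];
    foreach ($doc->getElementsByTagName('a') as $link) {
        $href = trim($link->getAttribute('href'));
        if ($href === '') {
            continue;
        }

        $anchor = '';
        foreach ($link->childNodes as $child) {
            if ($child->nodeType === XML_TEXT_NODE) {
                // Whitespace-only text nodes trim to '' and are skipped.
                $anchor = trim($child->nodeValue);
            } elseif ($child->nodeName === 'img') {
                $anchor = trim($child->getAttribute('alt'));
            }
            if ($anchor !== '') {
                break; // the first meaningful child decides the anchor
            }
        }

        // Assumed fallback: text nested in other elements, e.g. <span>.
        if ($anchor === '') {
            $anchor = trim($link->textContent);
        }

        $results[] = $href . ';' . $anchor;
    }
    return $results;
}

print_r(extractLinks(
    '<a href="lien.fr">Anchor</a>' .
    "<a href=\"link.fr\">\n    <img src=\"ceinture-gris\" alt=\"alt anchor\"/>\n</a>"
));
```

Trimming each text node before testing it is what makes the indentation case behave like the unindented one; the rest is bookkeeping to keep the results of different links separate.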
    
    This answer was selected as the best answer by the asker.
