dongzhui9936 2013-05-08 11:48

Walking the DOM in PHP to replace a list of strings found in the HTML text

I would like to replace each word from a list (an array) with a link taken from a matching list of URLs (an array of hrefs) inside an HTML page.

I think I have mainly two options:

  1. Doing this with regular expressions (strongly discouraged for parsing and modifying HTML).

  2. Using an HTML parser and walking the DOM, replacing each word in the list with its link.

The problems with the second option are as follows:

  1. I don't want to touch links that already exist in the HTML page, so for each word found I have to know which tag it is located in.

  2. I don't want to replace the words in every node of the DOM, only in nodes that have no children, i.e. only in the leaves.
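Both constraints above can be expressed as a single XPath query, which may be simpler than tracking the enclosing tag by hand during a recursive walk. The following is a minimal sketch, not a definitive implementation; the sample HTML and the variable names are assumptions made for illustration. It relies on the fact that text nodes are always leaves and uses the `ancestor::a` axis to skip text that already sits inside a link.

```php
<?php
// Sample input (hypothetical): one word outside a link, one inside an existing link,
// and one nested in a <b> element.
$html = '<html><body><div>Google <a href="#">Google link</a> and <b>Facebook</b></div></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);

$xpath = new DOMXPath($dom);

// text()                -> text nodes only (these never have children)
// [not(ancestor::a)]    -> skip text that is already inside an <a> element
// [normalize-space()]   -> skip whitespace-only text nodes
$nodes = $xpath->query('//body//text()[not(ancestor::a)][normalize-space()]');

$found = array();
foreach ($nodes as $textNode) {
    $found[] = $textNode->nodeValue;
}

// $found now holds the candidate text for word replacement, in document order.
print_r($found);
```

The text "Google link" is not returned because its parent is an `<a>` element, while "Facebook" is returned even though it is nested inside `<b>`.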

Simple example:

$aURLlist = array('www.google.com','www.facebook.com');
$aWordList = array('Google', 'Facebook');
$htmlContent='<html><body><div>Google Inc. is an American multinational corporation specializing in Internet-related services and products.</div><div>Facebook is an online social networking service, whose name stems from the colloquial name for the book given to students at the start of the academic year by some university administrations in the United States to help students get to know each other.</div></body></html>';
$dom = new DOMDocument();
$dom->loadHTML($htmlContent);
$htmlContent = walkingDom($dom, $aURLlist, $aWordList); // replace every word of $aWordList found in a text node of $dom with a link whose href is the matching URL in $aURLlist

Result:

$htmlContent="<html><body><div><a href='www.google.com'>Google</a> Inc. is an American multinational corporation specializing in Internet-related services and products.</div><div><a href='www.facebook.com'>Facebook</a> is an online social networking service, whose name stems from the colloquial name for the book given to students at the start of the academic year by some university administrations in the United States to help students get to know each other.</div></body></html>";

I have a recursive function that walks the DOM with the DOMDocument library, but I can't insert an anchor node in place of a word found in a leaf text node.

function walkDom($dom, $node, $element, $sRel, $sTarget, $iSearchLinks, $iQuantityTopics, $level = 0, $bLink = false) {
    $indent = '';
    if ($node->nodeName == 'a') {
        $bLink = true;
    }
    for ($i = 0; $i < $level; $i++)
        $indent .= '&nbsp;&nbsp;';
    if ($node->nodeType != XML_TEXT_NODE) {
        //echo $indent . '<b>' . $node->nodeName . '</b>';
        //echo $indent . '<b>' . $node->nodeValue . '</b>';

        if ($node->nodeType == XML_ELEMENT_NODE) {
            $attributes = $node->attributes;
            foreach ($attributes as $attribute) {
                //echo ', ' . $attribute->name . '=' . $attribute->value;
            }
            //echo '<br>';
        }
    } else {
        if ($bLink || $node->nodeName == 'img' || $node->nodeName == '#cdata-section' || $node->nodeName == '#comment' || trim($node->nodeValue) == '') {
            return; // nothing to replace in this node ("continue" is not valid outside a loop)
            //echo $indent;
            //echo 'NO replace: ';
            //var_dump($node->nodeValue);
            //echo '<br><br>';
        } elseif (!$bLink && $node->nodeName != 'img' && trim($node->nodeValue) != '') {
            //echo $indent;
            //echo "TEXT TO REPLACE: $element, $replace, $node->nodeValue, $iSearchLinks  <br>";
            $i = 0;
            $n = 1;
            while ($i != $iSearchLinks && $n > 0) {
                //echo "Create link? <br>";

                // preg_quote() keeps regex metacharacters in the name from being interpreted
                $node->nodeValue = preg_replace('/' . preg_quote($element->name, '/') . '/', '', $node->nodeValue, 1, $n);
                if ($n > 0) {
                    //echo "Creating link with $element->name <br>";
                    $link = $dom->createElement("a", $element->name);
                    $link->setAttribute("class", "nl_tag");
                    $link->setAttribute("id", "@@ID@@");
                    $link->setAttribute("hreflang", $element->type);
                    $link->setAttribute("title", $element->altname);
                    $link->setAttribute("href", $element->resource);
                    if ($sRel == "nofollow") $link->setAttribute("rel", $sRel);
                    if ($sTarget == "_blank") $link->setAttribute("target", $sTarget);
                    $node->parentNode->appendChild($link);
                    //var_dump($node->parentNode);
                    $dom->encoding = 'UTF-8';
                    $dom->saveHTML();
                    $iQuantityTopics++;
                }
                $i++;
                //saveHTML?
                //echo '<br><br>';
            }
        }
    }
}

This solution doesn't work, because appendChild() only adds the new child at the end of the parent's children, but I want to insert it at the position where the word to replace is located.
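One way around the appendChild() limitation described above is to rebuild the text node as a sequence of text and link nodes and swap that sequence in with replaceChild(). The sketch below is only an illustration under stated assumptions: `linkWordsInTextNode()` and the `$map` array (word => URL) are hypothetical names that do not appear in the code above, the map is assumed to be non-empty, and matching is case-sensitive.

```php
<?php
// Hypothetical helper: replaces every listed word inside ONE text node with an <a> element,
// keeping each link at the position where the word occurred.
function linkWordsInTextNode(DOMDocument $dom, DOMText $textNode, array $map)
{
    $alternatives = array();
    foreach (array_keys($map) as $word) {
        $alternatives[] = preg_quote($word, '/');
    }
    $pattern = '/\b(' . implode('|', $alternatives) . ')\b/u';

    // Split the text around the words; PREG_SPLIT_DELIM_CAPTURE keeps the words
    // themselves, so the pieces alternate: text, word, text, word, text ...
    $pieces = preg_split($pattern, $textNode->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
    if ($pieces === false || count($pieces) < 2) {
        return; // no listed word in this node: leave it untouched
    }

    $fragment = $dom->createDocumentFragment();
    foreach ($pieces as $i => $piece) {
        if ($i % 2 === 1) {
            // odd positions hold the captured words
            $link = $dom->createElement('a');
            $link->setAttribute('href', $map[$piece]);
            $link->appendChild($dom->createTextNode($piece));
            $fragment->appendChild($link);
        } elseif ($piece !== '') {
            $fragment->appendChild($dom->createTextNode($piece));
        }
    }

    // The fragment's children take the place of the original text node, in order.
    $textNode->parentNode->replaceChild($fragment, $textNode);
}

// Usage on a single text node (sample HTML is an assumption).
$dom = new DOMDocument();
$dom->loadHTML('<html><body><div>Google Inc. and Facebook are companies.</div></body></html>');
$map = array('Google' => 'www.google.com', 'Facebook' => 'www.facebook.com');

$div = $dom->getElementsByTagName('div')->item(0);
linkWordsInTextNode($dom, $div->firstChild, $map);

$result = $dom->saveHTML($div);
echo $result, "\n";
```

Because the anchors are created with createElement() rather than written into nodeValue, they end up as real element nodes instead of escaped text.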

I've also tried to add the link directly with preg_replace() inside the leaf text node, but then the anchor is added as plain text in the text node, and I need it to be a real link node that replaces the word where it is located within the leaf text node.

My question is: is it possible to do this with an HTML parser in PHP, or do I necessarily have to resort to regular expressions? Thanks in advance!

1 answer

  • duanjue7508 2014-02-17 19:56

    @Suamere:

    "I'm not sure what the PHP engine doesn't support: (?i)(?<!<[^>]*|>)(strWord)(?!<|[^<]*>)"
    (?i) - Yes, although it would be easier to just put i at the end:

    /(someregex)/i

    (?<!<[^>]*|>)
    

    You're looking for a leading tag here; I got this to work by deleting the first < (sort of)

    So here's what the final regex looked like that was as close as possible to what you're trying to do:

    /(?!<[^>]*>).*(strWord).*(?!<\/[^<]*>)/i
    

    However, a much simpler approach would be something like:

    $text = "...";
    $words = array('him', 'her', ...);
    $links = array('<a href="...">$0</a>', ...);

    $regexes = array(); // must be initialized before elements are added
    foreach ($words as $word) {
        // preg_quote() escapes any regex metacharacters in the word
        $regexes[] = '/\b' . preg_quote($word, '/') . '\b/i';
    }
    $modified_text = preg_replace($regexes, $links, $text);
    

    It's important that $words and $links have the exact same number of elements; no error is raised otherwise, but any pattern left without a matching replacement is replaced with an empty string.

    $0 references the entire match of the corresponding regex; in this case, just the word you are looking for.

    Also, preg_replace() replaces every match by default (its $limit argument defaults to -1, meaning no limit), so there is no need for a /g modifier on each regex; PHP's PCRE functions do not have one. :-)
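To make the three remarks above concrete, here is a small runnable sketch of the array form of preg_replace(). The sample sentence and the `/profile/...` URLs are placeholders invented for illustration. Note that this operates on plain text only: it does not skip words that are already inside tags or existing links, so it is not a substitute for a DOM-based approach on real HTML.

```php
<?php
$text  = 'Ask him or her, then ask him again.';
$words = array('him', 'her');
// Placeholder URLs; $0 is replaced by whatever the corresponding pattern matched.
$links = array('<a href="/profile/1">$0</a>', '<a href="/profile/2">$0</a>');

$regexes = array();
foreach ($words as $word) {
    $regexes[] = '/\b' . preg_quote($word, '/') . '\b/i';
}

// Every occurrence is replaced (no /g needed); patterns are applied one after another.
$modified = preg_replace($regexes, $links, $text);
echo $modified, "\n";

// With fewer replacements than patterns, the pattern that has no replacement ('her')
// is replaced by an empty string rather than raising an error.
$short = preg_replace($regexes, array('[him]'), 'him and her');
echo $short, "\n";
```

Since the patterns run in sequence over the already-modified string, a later word could match inside markup inserted by an earlier replacement; choosing replacement text that does not contain any of the words avoids this in the sketch.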
