I would like to turn each word from a word list (an array) into a link, using a matching list of URLs (an array of hrefs), inside an HTML page.
I think I have two main options:
1. Using regular expressions (strongly discouraged for parsing and modifying HTML).
2. Using an HTML parser and walking the DOM, replacing each word from the list with its link.
The problems with the second option are as follows:
I don't want to touch links that already exist in the HTML page, so for each word I find I have to know which tag it is located in.
I don't want to replace the words in every node of the DOM, only in nodes that have no children, i.e. only in the leaves.
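Both constraints could perhaps be expressed as a single XPath query, since text nodes are always leaves and an ancestor test can exclude anything inside an existing link. A minimal sketch, assuming the DOMXPath class that ships with PHP's DOM extension; the sample markup and the variable names ($html, $candidates, $found) are made up for illustration:

```php
<?php
// Illustrative markup: one word outside a link, one inside an existing link.
$html = '<html><body><div>Google <a href="x">Google link</a> text</div><p>Facebook</p></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);

$xpath = new DOMXPath($dom);
// text()                   selects text nodes (which never have children)
// not(ancestor::a)         skips text that is already inside a link
// normalize-space(.) != "" skips whitespace-only text nodes
$candidates = $xpath->query('//text()[not(ancestor::a)][normalize-space(.) != ""]');

$found = [];
foreach ($candidates as $textNode) {
    $found[] = trim($textNode->nodeValue);
}
print_r($found); // lists the text outside the existing link only
```

With a query like this, the text "Google link" inside the existing anchor would not be among the candidates, so it would never be replaced.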
Easy Example:
$aURLlist = array('www.google.com','www.facebook.com');
$aWordList = array('Google', 'Facebook');
$htmlContent='<html><body><div>Google Inc. is an American multinational corporation specializing in Internet-related services and products.</div><div>Facebook is an online social networking service, whose name stems from the colloquial name for the book given to students at the start of the academic year by some university administrations in the United States to help students get to know each other.</div></body></html>';
$dom = new DOMDocument();
$dom->loadHTML($htmlContent);
$htmlContent=walkingDom($dom,$aURLlist,$aWordList); // replace every word from $aWordList found in the text nodes of $dom with a link whose href is the matching URL in $aURLlist
Result:
$htmlContent='<html><body><div><a href="www.google.com">Google</a> Inc. is an American multinational corporation specializing in Internet-related services and products.</div><div><a href="www.facebook.com">Facebook</a> is an online social networking service, whose name stems from the colloquial name for the book given to students at the start of the academic year by some university administrations in the United States to help students get to know each other.</div></body></html>';
I have a recursive function that walks the DOM using the DOMDocument library, but I can't insert an anchor node in place of a word found in a leaf text node.
function walkDom($dom, $node, $element, $sRel, $sTarget, $iSearchLinks, $iQuantityTopics, $level = 0, $bLink = false) {
    $indent = '';
    // Remember that we are inside an existing link, so its text is left alone.
    if ($node->nodeName == 'a') {
        $bLink = true;
    }
    // Indentation is only used by the debug output below.
    for ($i = 0; $i < $level; $i++) {
        $indent .= ' ';
    }
    if ($node->nodeType != XML_TEXT_NODE) {
        //echo $indent . '<b>' . $node->nodeName . '</b>';
        //echo $indent . '<b>' . $node->nodeValue . '</b>';
        if ($node->nodeType == XML_ELEMENT_NODE) {
            $attributes = $node->attributes;
            foreach ($attributes as $attribute) {
                //echo ', ' . $attribute->name . '=' . $attribute->value;
            }
            //echo '<br>';
        }
    } else {
        if ($bLink || $node->nodeName == 'img' || $node->nodeName == '#cdata-section' || $node->nodeName == '#comment' || trim($node->nodeValue) == '') {
            //echo $indent;
            //echo 'NO replace: ';
            //var_dump($node->nodeValue);
            //echo '<br><br>';
            return; // nothing to replace in this node
        } elseif (!$bLink && $node->nodeName != 'img' && trim($node->nodeValue) != '') {
            //echo $indent;
            //echo "TEXT TO REPLACE: $element, $replace, $node->nodeValue, $iSearchLinks <br>";
            $i = 0;
            $n = 1;
            // Replace at most $iSearchLinks occurrences, stopping when none is left.
            while ($i != $iSearchLinks && $n > 0) {
                //echo "Create link? <br>";
                // Remove one occurrence of the word; $n receives the number of replacements made.
                $node->nodeValue = preg_replace('/' . preg_quote($element->name, '/') . '/', '', $node->nodeValue, 1, $n);
                if ($n > 0) {
                    //echo "Creating link with $element->name <br>";
                    $link = $dom->createElement("a", $element->name);
                    $link->setAttribute("class", "nl_tag");
                    $link->setAttribute("id", "@@ID@@");
                    $link->setAttribute("hreflang", $element->type);
                    $link->setAttribute("title", $element->altname);
                    $link->setAttribute("href", $element->resource);
                    if ($sRel == "nofollow") $link->setAttribute("rel", $sRel);
                    if ($sTarget == "_blank") $link->setAttribute("target", $sTarget);
                    // Problem: the link ends up as the LAST child of the parent,
                    // not at the position where the word was removed.
                    $node->parentNode->appendChild($link);
                    //var_dump($node->parentNode);
                    $dom->encoding = 'UTF-8';
                    $dom->saveHTML();
                    $iQuantityTopics++;
                }
                $i++;
                //saveHTML?
                //echo '<br><br>';
            }
        }
    }
    // ... recursion into $node->childNodes omitted ...
}
This solution doesn't work, because appendChild only adds the new child at the end of the parent's children, whereas I want to insert it at the position where the word being replaced was found.
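One thing that might be worth trying instead of appendChild is insertBefore, which places a node in front of a given sibling. A minimal sketch, assuming the text node can be cut at the first occurrence of the word; linkFirstOccurrence is a hypothetical helper name and the sample markup is made up, neither comes from my real code:

```php
<?php
// Hypothetical helper: turns the first occurrence of $word inside $textNode
// into <a href="$url">$word</a>, keeping the text before and after it in place.
function linkFirstOccurrence(DOMDocument $dom, DOMText $textNode, string $word, string $url): bool
{
    // Cut the text into the part before the word and the part after it.
    $parts = explode($word, $textNode->data, 2);
    if (count($parts) < 2) {
        return false; // word not present in this text node
    }
    $parent = $textNode->parentNode;

    $link = $dom->createElement('a');
    $link->setAttribute('href', $url);
    $link->appendChild($dom->createTextNode($word));

    // Insert "before" text and the link in front of the original node ...
    if ($parts[0] !== '') {
        $parent->insertBefore($dom->createTextNode($parts[0]), $textNode);
    }
    $parent->insertBefore($link, $textNode);
    // ... and let the original node keep only the text after the word.
    $textNode->data = $parts[1];
    return true;
}

// Usage with illustrative markup.
$dom = new DOMDocument();
$dom->loadHTML('<html><body><div>I use Google daily.</div></body></html>');
$div = $dom->getElementsByTagName('div')->item(0);
$done = linkFirstOccurrence($dom, $div->firstChild, 'Google', 'http://www.google.com');
$out = $dom->saveHTML($div);
echo $out, "\n";
```

Because the original text node keeps the remainder of the text, the same helper could be called on it again to handle further occurrences.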
I've also tried inserting the link markup directly into the leaf text node with preg_replace, but then the anchor is added as plain text inside the text node; I need it to be a real link node that replaces the word at its position within that text node.
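To show what I mean by "text format", here is a minimal sketch of that attempt; the sample markup and variable names are made up for illustration. The markup written into the text node stays text, so it is escaped on output instead of becoming an element:

```php
<?php
$dom = new DOMDocument();
$dom->loadHTML('<html><body><div>Google Inc.</div></body></html>');
$div = $dom->getElementsByTagName('div')->item(0);
$text = $div->firstChild; // the leaf text node "Google Inc."

// Writing markup into a text node does not create an element:
// the string is stored as character data.
$text->nodeValue = preg_replace(
    '/Google/',
    '<a href="http://www.google.com">Google</a>',
    $text->nodeValue,
    1
);

$out = $dom->saveHTML($div);
echo $out, "\n";                                          // angle brackets appear as &lt; and &gt;
echo $div->getElementsByTagName('a')->length, "\n";       // no <a> element was created
```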
My question is: is it possible to do this with an HTML parser in PHP, or do I necessarily have to resort to regular expressions? Thanks in advance!