dpxua26604 2015-05-21 12:32

Can the meta tag content value returned by an XPath query be trusted?

I have a PHP function that extracts meta tags from a URL with XPath queries,

e.g. $xpath->query('/html/head/meta[@name="my_target"]/@content')

My question:

Can I trust the returned value, or should I verify it?

=> Is there any possible XSS exploit?

=> Should the HTML content be purified before loading it into the DOMDocument?

To put it another way, with some code:

    $doc = new DOMDocument;
    $doc->preserveWhiteSpace = false;
    libxml_use_internal_errors(true);

    // Option 1: is loading the remote page directly trustworthy?
    $doc->loadHTMLFile($url);

    // Option 2: or is purifying the markup first a better practice?
    $html  = file_get_contents($url);
    $trust = $purifier->purify($html);
    $doc->loadHTML($trust);

    libxml_use_internal_errors(false);
    $xpath = new DOMXPath($doc);

    // Can this value be trusted?
    $trustable = $xpath->query('/html/head/meta[@name="my_target"]/@content')->item(0);

===== UPDATE =========================================

Yes: never trust external sources.

Use $be_sure = htmlspecialchars($trustable->textContent) or strip_tags($trustable->textContent) before outputting the value.

1 answer

  • drsb77336 2015-05-21 13:16

    If you pull in HTML content from a source that you don't control, then yes, I would consider that piece of code potentially troublesome!
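    To see why, here is a minimal sketch (the page content is made up for illustration) in which the payload sits entity-encoded inside the content attribute. The parser decodes entities in attribute values, so the XPath result comes back as live markup, and it only becomes safe to print after escaping:

```php
<?php
// Hypothetical remote page: the payload is entity-encoded inside the attribute.
$html = '<html><head>'
      . '<meta name="my_target" content="&lt;script&gt;alert(1)&lt;/script&gt;">'
      . '</head><body></body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$attr  = $xpath->query('/html/head/meta[@name="my_target"]/@content')->item(0);

// item(0) returns null when nothing matched, so guard before reading the value.
$raw = $attr !== null ? $attr->nodeValue : '';

// Escape at the point of output; ENT_QUOTES also covers attribute contexts.
$safe = htmlspecialchars($raw, ENT_QUOTES, 'UTF-8');

echo $raw, "\n";   // decoded markup, unsafe to print into a page
echo $safe, "\n";  // escaped text
```

    Escaping when the value is printed, rather than when it is extracted, keeps the stored value unchanged for other uses such as logging or comparison.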

    You could use htmlspecialchars() to convert any special characters to HTML entities. Or, if you want to keep parts of the markup, you could use strip_tags() with its allowed-tags argument; note that it leaves the attributes of allowed tags untouched. Another option is filter_var(), which gives you more control over the filtering.
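    As a rough comparison, the sketch below runs those functions over one made-up input string. The choice of FILTER_SANITIZE_FULL_SPECIAL_CHARS for filter_var() is an assumption on my part; other filter constants behave differently:

```php
<?php
// Made-up input containing a tag, an event-handler attribute and special characters.
$input = '<b onclick="evil()">Tom & "Jerry"</b>';

// Escape everything: the markup ends up displayed as literal text.
$escaped = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');

// Remove every tag and keep only the text between them.
$stripped = strip_tags($input);

// Allow <b>: the tag is kept, and so is its onclick attribute.
$partly = strip_tags($input, '<b>');

// filter_var() with an escaping filter, similar in spirit to htmlspecialchars().
$filtered = filter_var($input, FILTER_SANITIZE_FULL_SPECIAL_CHARS);

echo $escaped, "\n", $stripped, "\n", $partly, "\n", $filtered, "\n";
```

    The third result is the one to watch: allowing a tag through strip_tags() also allows whatever attributes it carries.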

    Or you could use a library like HTML Purifier, but that might be overkill for your needs. It all depends on the type of content you are working with.

    Now, to sanitise the element, you first need a string representation of your XPath result. Apply your filtering to that string, then put the result back into the document. The following example should do what you want:

    <?php
    // The following HTML is what you fetch from your remote source:
    $html = <<<EOL
    <html>
     <body>
        <h1>Foo, bar!</h1>
        <div id="my-target">
            Here is some <strong>text</strong> <script>javascript:alert('some malicious script!');</script> that we want to sanitize.
        </div>
     </body>
    </html>
    EOL;

    // We instantiate a DOMDocument so we can work with it:
    $original = new DOMDocument("1.0", 'UTF-8');
    $original->formatOutput = true;
    $original->loadHTML($html);

    $body = $original->getElementsByTagName('body')->item(0);

    // Find the element we need using XPath:
    $xpath = new DOMXPath($original);
    $divs  = $xpath->query("//body/div[@id='my-target']");

    // The XPath query returns a DOMNodeList of DOMElement objects. Check ->length
    // rather than count(): DOMNodeList only became Countable in PHP 7.2.
    if ($divs->length > 0)
    {
        $div = $divs->item(0);

        // Build the innerHTML of this element as a string that we can manipulate
        $innerHTML = '';
        foreach ($div->childNodes as $child) {
            $innerHTML .= $original->saveXML($child);
        }

        // Remove it from the original document because we want to replace it anyway
        $div->parentNode->removeChild($div);

        // Sanitize our string by removing all tags except <strong>
        // (strip_tags() leaves attributes on allowed tags untouched)
        $innerHTML = strip_tags($innerHTML, '<strong>');
        // or htmlspecialchars() or filter_var or HTML Purifier ..

        // Now re-import the sanitized string into a blank DOMDocument
        $sanitized = new DOMDocument("1.0", 'UTF-8');
        $sanitized->formatOutput = true;
        $sanitized->loadXML('<div id="my-target">' . $innerHTML . '</div>');

        // Now add the sanitized DOMElement back into the original document as a child of <body>
        $body->appendChild($original->importNode($sanitized->documentElement, true));
    }

    echo $original->saveHTML();
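
    Because strip_tags() keeps the attributes of any tag it lets through, one possible follow-up step is to drop those attributes with the DOM itself. This is only a sketch under my own assumptions: the fragment below is made up, and only <strong> elements are cleaned:

```php
<?php
// Hypothetical fragment as it might look after strip_tags($innerHTML, '<strong>').
$fragment = '<div id="my-target">Some <strong onclick="evil()" class="x">text</strong></div>';

$doc = new DOMDocument();
$doc->loadXML($fragment);

foreach ($doc->getElementsByTagName('strong') as $element) {
    // The attribute map is live, so keep removing the first entry until it is empty.
    while ($element->attributes->length > 0) {
        $element->removeAttribute($element->attributes->item(0)->nodeName);
    }
}

// Serialize only the root element, without the XML declaration.
$clean = $doc->saveXML($doc->documentElement);
echo $clean, "\n";
```

    For anything beyond a small whitelist like this, a dedicated library such as HTML Purifier is probably the safer choice.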
    

    Hope that helps.

    This answer was selected as the best answer by the asker.
