doufan9805 2016-04-09 17:48

Symfony dom-crawler: keeping strings inside script tags as UTF-8

I have this HTML content:

<div>测试</div>
<script charset="utf-8" type="text/javascript">
    function drawCharts(){
        console.log('测试');
    }
</script>

When I load this with Symfony's dom-crawler, the text inside the script tag comes back HTML-encoded. How can I prevent this? $crawler->html() returns:

<div>测试</div>
<script>
    function drawCharts(){
        console.log('&#27979;&#35797;');
    }
</script>

  • dquv73115 2016-12-25 15:02

    Let's see how symfony/dom-crawler works. Here's an example to start with:

    <?php
    
    require 'vendor/autoload.php';
    
    use Symfony\Component\DomCrawler\Crawler;
    
    $html = <<<HTML
    <div>测试</div>
    <script charset="utf-8" type="text/javascript">
        function drawCharts(){
            console.log('测试');
        }
    </script>
    HTML;
    
    $crawler = new Crawler($html);
    
    print $crawler->html();
    

    It outputs:

    <div>æµè¯</div>
    <script charset="utf-8" type="text/javascript">
        function drawCharts(){
            console.log('&aelig;&micro;&#139;&egrave;&macr;&#149;');
        }
    </script>
    

    When you pass the content through the constructor, the Crawler class does its best to figure out the encoding. If it cannot determine one, it falls back to ISO-8859-1, which is the default charset defined by the HTTP 1.1 specification.
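
    To make the garbled output above less mysterious, here is a minimal sketch of what the ISO-8859-1 fallback does to the bytes. It uses only the mbstring extension, not the Crawler itself, and assumes the file is saved as UTF-8; the variable names are illustrative.

```php
<?php

// UTF-8 bytes of the sample text used in the examples (file saved as UTF-8).
$utf8 = '测试';

// Simulate the fallback: treat every byte as one ISO-8859-1 character
// and re-encode the result as UTF-8.
$misread = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo bin2hex($utf8), "\n";    // e6b58be8af95
echo bin2hex($misread), "\n"; // c3a6c2b5c28bc3a8c2afc295
```

    The six original bytes turn into six separate Latin-1 characters (æ, µ, 0x8B, è, ¯, 0x95), which is exactly the sequence seen in the div and, as entities, in the script tag.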

    If your HTML content contains a charset meta tag, the Crawler class reads the charset from it and converts the content accordingly. Here's the same example with a charset meta tag prepended to the HTML content:

    <?php
    
    require 'vendor/autoload.php';
    
    use Symfony\Component\DomCrawler\Crawler;
    
    $html = <<<HTML
    <meta charset="utf-8">
    <div>测试</div>
    <script charset="utf-8" type="text/javascript">
        function drawCharts(){
            console.log('测试');
        }
    </script>
    HTML;
    
    $crawler = new Crawler($html);
    
    print $crawler->html();
    

    Now it prints:

    <div>测试</div>
    <script charset="utf-8" type="text/javascript">
        function drawCharts(){
            console.log('&#27979;&#35797;');
        }
    </script>
    

    If you don't want to add the charset meta tag, there's another way: the addHTMLContent() method accepts a charset as its second argument, which defaults to UTF-8. Instead of passing the HTML content through the constructor, first instantiate the class and then add the content using this method:

    <?php
    
    require 'vendor/autoload.php';
    
    use Symfony\Component\DomCrawler\Crawler;
    
    $html = <<<HTML
    <div>测试</div>
    <script charset="utf-8" type="text/javascript">
        function drawCharts(){
            console.log('测试');
        }
    </script>
    HTML;
    
    $crawler = new Crawler;
    
    // The second argument defaults to 'UTF-8', so it can be omitted
    $crawler->addHTMLContent($html, 'UTF-8');
    
    print $crawler->html();
    

    Now, without a charset meta tag, it prints:

    <div>测试</div>
    <script charset="utf-8" type="text/javascript">
        function drawCharts(){
            console.log('&#27979;&#35797;');
        }
    </script>
    

    OK, you may already know all of this. So, what's with the &#27979;&#35797;? Why does the div content show as is, while the same content in the script tag gets HTML-encoded?

    Symfony's Crawler class, as its own source comments explain, converts the content to HTML entities to work around a bug in DOMDocument::loadHTML():

    When using loadHTML() to process UTF-8 pages, you may meet the problem that the output of DOM functions are not like the input. For example, if you want to get "Cạnh tranh", you will receive "Cáº¡nh tranh". I suggest we use mb_convert_encoding before loading UTF-8 page.
    https://php.net/manual/en/domdocument.loadhtml.php#74777
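
    The workaround also explains the difference between the div and the script tag: the HTML parser decodes character references in ordinary text, but it treats script content as raw text and leaves them alone. Below is a rough sketch of that effect using plain DOMDocument. It is not the Crawler's actual code; encodeNonAscii() is a hypothetical helper that only mirrors the idea, and the mbstring and dom extensions are assumed.

```php
<?php

// Hypothetical helper mirroring the idea of the workaround: replace every
// non-ASCII character with a numeric entity so loadHTML() cannot misread it.
function encodeNonAscii(string $html): string
{
    return mb_encode_numericentity($html, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');
}

$encoded = encodeNonAscii('<div>测试</div><script>console.log("测试");</script>');

libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML($encoded);

$div = $dom->getElementsByTagName('div')->item(0)->textContent;
$script = $dom->getElementsByTagName('script')->item(0)->textContent;

echo $div, "\n";    // 测试
echo $script, "\n"; // console.log("&#27979;&#35797;");
```

    The entities in the div are turned back into characters during parsing, while the ones in the script survive as literal text, which is what $crawler->html() then prints.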

    Some suggest adding an HTML4 Content-Type meta tag to the head element. Others suggest prepending <?xml encoding="UTF-8"> to the HTML content before passing it to loadHTML(). As your HTML structure is incomplete (it lacks head, body, etc.), I recommend simply passing the output to html_entity_decode():

    <?php
    
    require 'vendor/autoload.php';
    
    use Symfony\Component\DomCrawler\Crawler;
    
    $html = <<<HTML
    <div>测试</div>
    <script charset="utf-8" type="text/javascript">
        function drawCharts(){
            console.log('测试');
        }
    </script>
    HTML;
    
    $crawler = new Crawler();
    $crawler->addHTMLContent($html, 'UTF-8');
    
    print html_entity_decode($crawler->html());
    

    Outputs:

    <div>测试</div>
    <script charset="utf-8" type="text/javascript">
        function drawCharts(){
            console.log('测试');
        }
    </script>
    

    Which is what you want.
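
    One caveat: html_entity_decode() decodes every entity, including &lt; and &amp;, so markup that was escaped on purpose gets unescaped too. If that matters for your content, a possible alternative (a sketch, assuming the mbstring extension; the sample string and variable names are made up for illustration) is to decode only the numeric entities for non-ASCII characters:

```php
<?php

// Map for mb_decode_numericentity(): only code points from 0x80 upward.
$nonAscii = [0x80, 0x10FFFF, 0, 0x1FFFFF];

// Illustrative output containing an intentionally escaped "<".
$output = "<p>1 &lt; 2</p><script>console.log('&#27979;&#35797;');</script>";

// Decodes every entity, so the escaped "<" becomes a literal "<" again.
$all = html_entity_decode($output);

// Decodes only the non-ASCII numeric entities and leaves &lt; untouched.
$safe = mb_decode_numericentity($output, $nonAscii, 'UTF-8');

echo $all, "\n";  // <p>1 < 2</p><script>console.log('测试');</script>
echo $safe, "\n"; // <p>1 &lt; 2</p><script>console.log('测试');</script>
```

    For the snippet in the question both calls give the same result, since it contains no other entities.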

    You might also want to read:
    PHP DOMDocument loadHTML not encoding UTF-8 correctly
