doufan9805 2016-04-09 17:48

Symfony dom-crawler: string in script tag converted to UTF-8

I have this HTML content:

<div>测试</div>
<script charset="utf-8" type="text/javascript">
    function drawCharts(){
        console.log('测试');
    }
</script>

When I use Symfony's dom-crawler, the text inside the script tag is HTML-encoded. How can I prevent this? $crawler->html() returns:

<div>测试</div>
<script>
    function drawCharts(){
        console.log('&#27979;&#35797;');
    }
</script>

1 answer

  • dquv73115 2016-12-25 15:02

    Let's see how symfony/dom-crawler works. Here's an example to start with:

    <?php
    
    require 'vendor/autoload.php';
    
    use Symfony\Component\DomCrawler\Crawler;
    
    $html = <<<HTML
    <div>测试</div>
    <script charset="utf-8" type="text/javascript">
        function drawCharts(){
            console.log('测试');
        }
    </script>
    HTML;
    
    $crawler = new Crawler($html);
    
    print $crawler->html();
    

    It outputs:

    <div>æµè¯</div>
    <script charset="utf-8" type="text/javascript">
        function drawCharts(){
            console.log('&aelig;&micro;&#139;&egrave;&macr;&#149;');
        }
    </script>
    

    When you pass the content through the constructor, the Crawler class does its best to figure out the encoding. If it cannot determine one, it falls back to ISO-8859-1, which is the default charset defined by the HTTP/1.1 specification.
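
    To see what the ISO-8859-1 fallback does to UTF-8 input, here is a minimal sketch that uses only mb_convert_encoding() from the mbstring extension, not the Crawler itself. It assumes the source file is saved as UTF-8; the variable names are illustrative.

```php
<?php

// '测试' is two characters, stored as six bytes in UTF-8.
$original = '测试';

// Misread each byte as one ISO-8859-1 character, then re-encode as UTF-8.
// This mimics what happens when UTF-8 input is assumed to be ISO-8859-1.
$misread = mb_convert_encoding($original, 'UTF-8', 'ISO-8859-1');

echo strlen($original), " bytes, ", mb_strlen($original, 'UTF-8'), " characters\n";
echo mb_strlen($misread, 'UTF-8'), " characters after misreading\n";

// The damage is reversible as long as no byte was dropped along the way.
$restored = mb_convert_encoding($misread, 'ISO-8859-1', 'UTF-8');
var_dump($restored === $original);
```

    Each of the six bytes becomes its own character, which is why the two-character string turns into a longer run of accented letters.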

    If your HTML content contains a charset meta tag, the Crawler class reads the charset from it and converts the content accordingly. Here's the same example with a charset meta tag prepended to the HTML content:

    <?php
    
    require 'vendor/autoload.php';
    
    use Symfony\Component\DomCrawler\Crawler;
    
    $html = <<<HTML
    <meta charset="utf-8">
    <div>测试</div>
    <script charset="utf-8" type="text/javascript">
        function drawCharts(){
            console.log('测试');
        }
    </script>
    HTML;
    
    $crawler = new Crawler($html);
    
    print $crawler->html();
    

    Now it prints:

    <div>测试</div>
    <script charset="utf-8" type="text/javascript">
        function drawCharts(){
            console.log('&#27979;&#35797;');
        }
    </script>
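
    The detection step can be pictured roughly like this. It is a simplified sketch, not the Crawler's actual code: guessCharset() is a hypothetical helper, and the regular expression is an assumption about what a reasonable charset sniffing pattern looks like.

```php
<?php

/**
 * Hypothetical helper: return the charset named in a meta tag,
 * or the given fallback when no charset is declared.
 */
function guessCharset(string $html, string $fallback = 'ISO-8859-1'): string
{
    // Matches both <meta charset="..."> and
    // <meta http-equiv="Content-Type" content="text/html; charset=...">.
    if (preg_match('/<meta[^>]+charset\s*=\s*["\']?([a-zA-Z0-9_:.\-]+)/i', $html, $matches)) {
        return $matches[1];
    }

    return $fallback;
}

echo guessCharset('<meta charset="utf-8"><div>测试</div>'), "\n"; // utf-8
echo guessCharset('<div>测试</div>'), "\n";                       // ISO-8859-1
```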
    

    If you don't want to add the charset meta tag, there's another way: the addHTMLContent() method accepts a charset as its second argument, which defaults to UTF-8. Instead of passing the HTML content through the constructor, instantiate the class first and then add the content with this method:

    <?php
    
    require 'vendor/autoload.php';
    
    use Symfony\Component\DomCrawler\Crawler;
    
    $html = <<<HTML
    <div>测试</div>
    <script charset="utf-8" type="text/javascript">
        function drawCharts(){
            console.log('测试');
        }
    </script>
    HTML;
    
    $crawler = new Crawler;
    
    // The 2nd argument defaults to 'UTF-8', so it can be omitted
    $crawler->addHTMLContent($html, 'UTF-8');
    
    print $crawler->html();
    

    Now, without a charset meta tag, it prints:

    <div>测试</div>
    <script charset="utf-8" type="text/javascript">
        function drawCharts(){
            console.log('&#27979;&#35797;');
        }
    </script>
    

    OK, you may already know all of this. So what's with the &#27979;&#35797;? Why is the div content shown as is, while the same content in the script tag is HTML-encoded?

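    Symfony's Crawler class, as its source explains, converts the content to HTML entities before parsing, to work around a bug in DOMDocument::loadHTML(). The parser decodes those entities in ordinary text such as the div, but it treats the content of a script element as raw text, so the entities there are left untouched. The bug is described in the PHP manual comments: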

    When using loadHTML() to process UTF-8 pages, you may meet the problem that the output of DOM functions are not like the input. For example, if you want to get "Cạnh tranh", you will receive "Cáº¡nh tranh". I suggest we use mb_convert_encoding before loading UTF-8 page.
    https://php.net/manual/en/domdocument.loadhtml.php#74777
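
    The effect can be reproduced with DOMDocument alone. The sketch below uses mb_encode_numericentity() as a stand-in for the Crawler's conversion step (an assumption; the Crawler's own implementation may differ) and then compares what the parser stored for the div and for the script.

```php
<?php

$html = '<div>测试</div><script>console.log("测试");</script>';

// Stand-in for the Crawler's pre-processing step: turn every non-ASCII
// character into a numeric entity so the parser only ever sees ASCII.
$map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
$encoded = mb_encode_numericentity($html, $map, 'UTF-8');

libxml_use_internal_errors(true); // keep parser warnings off the output
$dom = new DOMDocument();
$dom->loadHTML($encoded);

$divText    = $dom->getElementsByTagName('div')->item(0)->textContent;
$scriptText = $dom->getElementsByTagName('script')->item(0)->textContent;

echo $encoded, "\n";
echo $divText, "\n";    // entities were decoded by the parser
echo $scriptText, "\n"; // entities are still there, as literal text
```

    The div comes back as real characters, while the script keeps the entity text, which matches the output above.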

    Some suggest adding an HTML4 Content-Type meta tag to the head element. Others suggest prepending <?xml encoding="UTF-8"> to the HTML content before passing it to loadHTML(). As your HTML structure is incomplete (it lacks head, body, etc.), I recommend simply passing the output to html_entity_decode():

    <?php
    
    require 'vendor/autoload.php';
    
    use Symfony\Component\DomCrawler\Crawler;
    
    $html = <<<HTML
    <div>测试</div>
    <script charset="utf-8" type="text/javascript">
        function drawCharts(){
            console.log('测试');
        }
    </script>
    HTML;
    
    $crawler = new Crawler();
    $crawler->addHTMLContent($html, 'UTF-8');
    
    print html_entity_decode($crawler->html());
    

    Outputs:

    <div>测试</div>
    <script charset="utf-8" type="text/javascript">
        function drawCharts(){
            console.log('测试');
        }
    </script>
    

    Which is what you want.
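
    Keep in mind that html_entity_decode() decodes every entity in the string, including ones such as &lt; that the markup may need. If that matters for your content, one possible variation is to decode only the numeric entities for non-ASCII characters with mb_decode_numericentity(). The sample string below is made up for illustration and is not real Crawler output.

```php
<?php

// Made-up sample: an escaped "<" in the div, numeric entities in the script.
$output = "<div>1 &lt; 2 测试</div>\n<script>console.log('&#27979;&#35797;');</script>";

// Decodes everything, so the escaped "<" becomes a bare "<" in the markup.
$everything = html_entity_decode($output, ENT_QUOTES | ENT_HTML5, 'UTF-8');

// Decodes only numeric entities for code points U+0080 and above.
$map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
$numericOnly = mb_decode_numericentity($output, $map, 'UTF-8');

echo $everything, "\n\n";
echo $numericOnly, "\n";
```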

    You might also want to read:
    PHP DOMDocument loadHTML not encoding UTF-8 correctly
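
    For comparison, the Content-Type meta alternative mentioned earlier can be tried against DOMDocument directly. This is a minimal sketch that bypasses the Crawler, so it shows how loadHTML() itself behaves once it knows the input is UTF-8; the variable names are illustrative.

```php
<?php

$html = '<div>测试</div><script>console.log("测试");</script>';

// Prepend an HTML4 Content-Type meta tag so libxml reads the input as UTF-8.
$meta = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';

libxml_use_internal_errors(true); // keep parser warnings off the output
$dom = new DOMDocument();
$dom->loadHTML($meta . $html);

$divText    = $dom->getElementsByTagName('div')->item(0)->textContent;
$scriptText = $dom->getElementsByTagName('script')->item(0)->textContent;

echo $divText, "\n";
echo $scriptText, "\n";
```

    Because no entity conversion happens before parsing, the script content stays as plain UTF-8 text.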
