doufan9805 2016-04-09 17:48
Accepted

Symfony dom-crawler converts strings in script tags to UTF8

I have this HTML content:

<div>测试</div>
<script charset="utf-8" type="text/javascript">
    function drawCharts(){
        console.log('测试');
    }
</script>

When I use Symfony's dom-crawler, the text inside the script tag comes back HTML-encoded. How can I prevent this? $crawler->html() returns:

<div>测试</div>
<script>
    function drawCharts(){
        console.log('&#27979;&#35797;');
    }
</script>

1 answer

  • dquv73115 2016-12-25 15:02

    Let's see how symfony/dom-crawler works. Here's an example to start with:

    <?php
    
    require 'vendor/autoload.php';
    
    use Symfony\Component\DomCrawler\Crawler;
    
    $html = <<<HTML
    <div>测试</div>
    <script charset="utf-8" type="text/javascript">
        function drawCharts(){
            console.log('测试');
        }
    </script>
    HTML;
    
    $crawler = new Crawler($html);
    
    print $crawler->html();
    

    It outputs:

    <div>æµè¯</div>
    <script charset="utf-8" type="text/javascript">
        function drawCharts(){
            console.log('&aelig;&micro;&#139;&egrave;&macr;&#149;');
        }
    </script>
    

    When you pass the content through the constructor, the Crawler class does its best to figure out the encoding. If it cannot determine one, it falls back to ISO-8859-1, which is the default charset defined by the HTTP 1.1 specification. The UTF-8 bytes are then read as ISO-8859-1 characters, which produces the garbled output above.
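To illustrate that fallback, here is a minimal sketch using only mbstring (no Symfony involved). It reinterprets the six UTF-8 bytes of 测试 as six ISO-8859-1 characters, which is effectively what a wrong charset guess does. The variable names are illustrative only.

```php
<?php
// Build the UTF-8 string from code points so the example does not depend
// on how this file itself is saved.
$utf8 = "\u{6D4B}\u{8BD5}"; // 测试

// Treat each byte as one ISO-8859-1 character and re-encode the result
// as UTF-8: two characters become six.
$misread = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo bin2hex($utf8), "\n";               // e6b58be8af95 (6 bytes)
echo mb_strlen($utf8, 'UTF-8'), "\n";    // 2
echo mb_strlen($misread, 'UTF-8'), "\n"; // 6
```

Two of the six resulting characters (U+008B and U+0095) are control characters, which is why only four visible characters show up inside the div in the output above.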

    If your HTML content contains a charset meta tag, the Crawler class reads the charset from it and converts the content accordingly. Here's the same example with a charset meta tag prepended to the HTML content:

    <?php
    
    require 'vendor/autoload.php';
    
    use Symfony\Component\DomCrawler\Crawler;
    
    $html = <<<HTML
    <meta charset="utf-8">
    <div>测试</div>
    <script charset="utf-8" type="text/javascript">
        function drawCharts(){
            console.log('测试');
        }
    </script>
    HTML;
    
    $crawler = new Crawler($html);
    
    print $crawler->html();
    

    Now it prints:

    <div>测试</div>
    <script charset="utf-8" type="text/javascript">
        function drawCharts(){
            console.log('&#27979;&#35797;');
        }
    </script>
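The charset lookup described above can be approximated with a regular expression. This is only a sketch of the idea: sniffCharset() is a hypothetical helper, not Symfony's actual code, and the pattern is an assumption about what a reasonable lookup might match.

```php
<?php
// Hypothetical helper: look for a charset declared in a <meta> tag and
// fall back to ISO-8859-1 when none is found. Not Symfony's actual code.
function sniffCharset(string $html, string $fallback = 'ISO-8859-1'): string
{
    $pattern = '/<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_:.\-]+)/i';
    if (preg_match($pattern, $html, $m)) {
        return strtoupper($m[1]);
    }
    return $fallback;
}

echo sniffCharset('<meta charset="utf-8"><div>x</div>'), "\n";   // UTF-8
echo sniffCharset('<script charset="utf-8"></script>'), "\n";    // ISO-8859-1
```

Under this sketch, a charset attribute on the script tag is not picked up, because only meta tags are inspected. That is consistent with the first example, where the script tag's charset="utf-8" did not prevent the fallback.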
    

    If you don't want to add the charset meta tag, there's another way: the addHtmlContent() method accepts a charset as its second argument, which defaults to UTF-8. Instead of passing the HTML content through the constructor, first instantiate the class and then add the content using this method:

    <?php
    
    require 'vendor/autoload.php';
    
    use Symfony\Component\DomCrawler\Crawler;
    
    $html = <<<HTML
    <div>测试</div>
    <script charset="utf-8" type="text/javascript">
        function drawCharts(){
            console.log('测试');
        }
    </script>
    HTML;
    
    $crawler = new Crawler;
    
    // The 2nd argument can be dropped: it defaults to UTF-8
    $crawler->addHtmlContent($html, 'UTF-8');
    
    print $crawler->html();
    

    Now, without a charset meta tag, it prints:

    <div>测试</div>
    <script charset="utf-8" type="text/javascript">
        function drawCharts(){
            console.log('&#27979;&#35797;');
        }
    </script>
    

    OK, you may already know all of this. So what's with the &#27979;&#35797;? Why is the div content shown as is, while the same content in the script tag is HTML-encoded?

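    As its own source explains, Symfony's Crawler class converts non-ASCII characters to HTML entities before parsing, to work around a bug in DOMDocument::loadHTML(). The parser decodes those entities again in ordinary elements such as div, but it treats the content of a script element as raw text, so the entities there survive into the output. The bug is described in this comment on php.net: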

    When using loadHTML() to process UTF-8 pages, you may meet the problem that the output of DOM functions are not like the input. For example, if you want to get "Cạnh tranh", you will receive "Cáº¡nh tranh". I suggest we use mb_convert_encoding before loading UTF-8 page.
    https://php.net/manual/en/domdocument.loadhtml.php#74777
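The effect can be reproduced with plain DOMDocument, without Symfony. The sketch below assumes the workaround amounts to replacing every non-ASCII character with a numeric entity before parsing; toNumericEntities() is a hypothetical helper written for this example, not the Crawler's own code.

```php
<?php
// Hypothetical helper mimicking the workaround: turn every non-ASCII
// character into a decimal numeric HTML entity before parsing.
function toNumericEntities(string $utf8): string
{
    return mb_encode_numericentity($utf8, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');
}

$html = "<div>\u{6D4B}\u{8BD5}</div>"
      . "<script>console.log('\u{6D4B}\u{8BD5}');</script>";

$encoded = toNumericEntities($html); // now pure ASCII

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($encoded);

$div    = $doc->getElementsByTagName('div')->item(0)->textContent;
$script = $doc->getElementsByTagName('script')->item(0)->textContent;

// Entities were decoded back to characters inside <div> ...
var_dump($div === "\u{6D4B}\u{8BD5}");
// ... but were kept verbatim inside <script>, which is parsed as raw text.
var_dump(strpos($script, '&#27979;&#35797;') !== false);
```

Because the encoded input is pure ASCII, the parser no longer has to guess a charset, which is the point of the workaround. The price is that raw-text elements such as script keep the entities.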

    Some suggest adding an HTML4 Content-Type meta tag to the head element. Others suggest prepending <?xml encoding="UTF-8"> to the HTML content before passing it to loadHTML(). As your HTML structure is not complete (it lacks head, body, etc.), I recommend you simply pass the output to html_entity_decode():

    <?php
    
    require 'vendor/autoload.php';
    
    use Symfony\Component\DomCrawler\Crawler;
    
    $html = <<<HTML
    <div>测试</div>
    <script charset="utf-8" type="text/javascript">
        function drawCharts(){
            console.log('测试');
        }
    </script>
    HTML;
    
    $crawler = new Crawler();
    $crawler->addHtmlContent($html, 'UTF-8');
    
    print html_entity_decode($crawler->html());
    

    Outputs:

    <div>测试</div>
    <script charset="utf-8" type="text/javascript">
        function drawCharts(){
            console.log('测试');
        }
    </script>
    

    Which is what you want.
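One thing to keep in mind is that html_entity_decode() decodes every entity in the string, including ones such as &lt; that may need to stay escaped elsewhere in the markup. If that matters for your content, a narrower option is mb_decode_numericentity(), which can be limited to numeric entities for non-ASCII code points. The following is a sketch on a made-up string, not on real Crawler output:

```php
<?php
// Made-up markup: an escaped "<" in the div, numeric entities in the script.
$markup = "<div>1 &lt; 2</div><script>console.log('&#27979;&#35797;');</script>";

// html_entity_decode() decodes everything, including &lt;.
$all = html_entity_decode($markup, ENT_QUOTES | ENT_HTML5, 'UTF-8');

// With this map, only numeric entities for code points U+0080 and above
// are decoded, so &lt; stays escaped.
$numericOnly = mb_decode_numericentity($markup, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

echo $all, "\n";
echo $numericOnly, "\n";
```

For the HTML in the question both calls give the same result, since it contains no other entities.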

    You might also want to read:
    PHP DOMDocument loadHTML not encoding UTF-8 correctly
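For reference, the XML-declaration workaround mentioned earlier in this answer can be sketched with plain DOMDocument. This bypasses the Crawler entirely, so it is an illustration of the technique rather than a drop-in replacement, and it assumes a libxml build that honours the declaration as an encoding hint.

```php
<?php
$html = "<div>\u{6D4B}\u{8BD5}</div>"
      . "<script>console.log('\u{6D4B}\u{8BD5}');</script>";

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();

// The leading declaration hints the parser to read the bytes as UTF-8,
// so no conversion to entities is needed beforehand.
$doc->loadHTML('<?xml encoding="UTF-8">' . $html);

$div    = $doc->getElementsByTagName('div')->item(0)->textContent;
$script = $doc->getElementsByTagName('script')->item(0)->textContent;

echo $div, "\n";
echo $script, "\n";
```

Since nothing was converted to entities, the script text comes back with the original characters.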
