doufan9805 2016-04-09 17:48

Symfony dom-crawler: converting strings inside a script tag to UTF-8

I have this HTML content:

<div>测试</div>
<script charset="utf-8" type="text/javascript">
    function drawCharts(){
        console.log('测试');
    }
</script>

When I use Symfony's dom-crawler, the text inside the script tag is HTML-encoded. How can I prevent this? $crawler->html() returns:

<div>测试</div>
<script>
    function drawCharts(){
        console.log('&#27979;&#35797;');
    }
</script>

  • dquv73115 2016-12-25 15:02

    Let's see how symfony/dom-crawler works. Here's an example to start with:

    <?php
    
    require 'vendor/autoload.php';
    
    use Symfony\Component\DomCrawler\Crawler;
    
    $html = <<<HTML
    <div>测试</div>
    <script charset="utf-8" type="text/javascript">
        function drawCharts(){
            console.log('测试');
        }
    </script>
    HTML;
    
    $crawler = new Crawler($html);
    
    print $crawler->html();
    

    It outputs:

    <div>æµè¯</div>
    <script charset="utf-8" type="text/javascript">
        function drawCharts(){
            console.log('&aelig;&micro;&#139;&egrave;&macr;&#149;');
        }
    </script>
    

    When you pass the content through the constructor, the Crawler class does its best to figure out the encoding. If it cannot determine one, it falls back to ISO-8859-1, which is the default charset defined by the HTTP 1.1 specification.
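
    To see where the garbled characters come from, here is a minimal sketch that reads the UTF-8 bytes of 测试 as ISO-8859-1, which is what the fallback amounts to. It assumes the mbstring extension is loaded and that the file is saved as UTF-8; it does not use the Crawler at all:

    ```php
    <?php

    // Assumes a UTF-8 source file and the mbstring extension.
    $utf8 = '测试';

    // The two characters occupy six bytes in UTF-8.
    $bytes = array_values(unpack('C*', $utf8));

    // Treating the input as ISO-8859-1 turns every byte into its own character.
    $misread = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

    printf("bytes: %s\n", implode(' ', $bytes));
    printf("characters after misreading: %d\n", mb_strlen($misread, 'UTF-8'));
    ```

    The byte values 139 and 149 are the ones that appear as &#139; and &#149; in the script output above.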

    If your HTML content contains a charset meta tag, the Crawler class reads the charset from it and converts the content accordingly. Here's the same example with a charset meta tag prepended to the HTML content:

    <?php
    
    require 'vendor/autoload.php';
    
    use Symfony\Component\DomCrawler\Crawler;
    
    $html = <<<HTML
    <meta charset="utf-8">
    <div>测试</div>
    <script charset="utf-8" type="text/javascript">
        function drawCharts(){
            console.log('测试');
        }
    </script>
    HTML;
    
    $crawler = new Crawler($html);
    
    print $crawler->html();
    

    Now it prints:

    <div>测试</div>
    <script charset="utf-8" type="text/javascript">
        function drawCharts(){
            console.log('&#27979;&#35797;');
        }
    </script>
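
    The charset lookup described above can be pictured roughly as follows. This is only a sketch: sniffCharset() is a hypothetical helper written for illustration, and its regular expression is a simplification, not the code symfony/dom-crawler actually ships.

    ```php
    <?php

    /**
     * Hypothetical helper, not part of symfony/dom-crawler: returns the charset
     * named in a <meta> tag, or ISO-8859-1 when no such tag is found.
     */
    function sniffCharset(string $html): string
    {
        if (preg_match('/<meta[^>]+charset\s*=\s*["\']?([a-zA-Z0-9_:.\-]+)/i', $html, $m)) {
            return strtoupper($m[1]);
        }

        return 'ISO-8859-1';
    }

    echo sniffCharset('<meta charset="utf-8"><div>测试</div>'), PHP_EOL;
    echo sniffCharset('<div>测试</div>'), PHP_EOL;
    ```

    The first call finds the declared charset; the second finds nothing and falls back, which corresponds to the garbled first example.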
    

    If you don't want to add the charset meta tag, there's another way: the addHTMLContent() method accepts a charset as its second argument, which defaults to UTF-8. Instead of passing the HTML content through the constructor, first instantiate the class and then add the content using this method:

    <?php
    
    require 'vendor/autoload.php';
    
    use Symfony\Component\DomCrawler\Crawler;
    
    $html = <<<HTML
    <div>测试</div>
    <script charset="utf-8" type="text/javascript">
        function drawCharts(){
            console.log('测试');
        }
    </script>
    HTML;
    
    $crawler = new Crawler();
    
    // The second argument defaults to 'UTF-8', so it can safely be omitted
    $crawler->addHTMLContent($html, 'UTF-8');
    
    print $crawler->html();
    

    Now, without a charset meta tag, it prints:

    <div>测试</div>
    <script charset="utf-8" type="text/javascript">
        function drawCharts(){
            console.log('&#27979;&#35797;');
        }
    </script>
    

    OK, you may already know all of this. So what's with the &#27979;&#35797;? Why does the div content show up as is, while the same content inside the script tag gets HTML-encoded?

    Symfony's Crawler class, as it explains itself, converts the content to HTML entities due to a bug in DOMDocument::loadHTML():

    When using loadHTML() to process UTF-8 pages, you may meet the problem that the output of DOM functions are not like the input. For example, if you want to get "Cạnh tranh", you will receive "Cáº¡nh tranh". I suggest we use mb_convert_encoding before loading UTF-8 page.
    https://php.net/manual/en/domdocument.loadhtml.php#74777
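
    The effect can be reproduced without the Crawler. The sketch below uses mb_encode_numericentity() as a stand-in for the entity conversion; that choice is an assumption made for illustration, since the exact call differs between Symfony versions. It then loads the result with DOMDocument and compares the text of the two elements. It assumes the dom and mbstring extensions and a UTF-8 source file.

    ```php
    <?php

    $html = "<div>测试</div><script>console.log('测试');</script>";

    // Stand-in for the Crawler's pre-processing: every non-ASCII character
    // becomes a numeric entity before the markup reaches loadHTML().
    $encoded = mb_encode_numericentity($html, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // keep parser warnings out of the output
    $dom->loadHTML($encoded);

    // Entities in ordinary elements are decoded by the parser...
    $div = $dom->getElementsByTagName('div')->item(0)->textContent;
    // ...but the body of a <script> element is kept as raw text.
    $script = $dom->getElementsByTagName('script')->item(0)->textContent;

    echo $encoded, PHP_EOL;
    echo $div, PHP_EOL;
    echo $script, PHP_EOL;
    ```

    The parser decodes entities in ordinary elements but treats the body of a script element as raw text, so the entities written before parsing survive there.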

    Some suggest adding an HTML4 Content-Type meta tag to the head element. Others suggest prepending <?xml encoding="UTF-8"> to the HTML content before passing it to loadHTML(). As your HTML structure is incomplete (it lacks head, body, etc.), I recommend simply passing the output to html_entity_decode():

    <?php
    
    require 'vendor/autoload.php';
    
    use Symfony\Component\DomCrawler\Crawler;
    
    $html = <<<HTML
    <div>测试</div>
    <script charset="utf-8" type="text/javascript">
        function drawCharts(){
            console.log('测试');
        }
    </script>
    HTML;
    
    $crawler = new Crawler();
    $crawler->addHTMLContent($html, 'UTF-8');
    
    print html_entity_decode($crawler->html());
    

    Outputs:

    <div>测试</div>
    <script charset="utf-8" type="text/javascript">
        function drawCharts(){
            console.log('测试');
        }
    </script>
    

    Which is what you want.
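
    One thing to keep in mind is that html_entity_decode() decodes every entity in the string, including ones that were escaped on purpose elsewhere in the markup. If that matters, a possible variant is to decode only the text between the script tags. The following is a sketch under that assumption; the sample string and the regular expression are illustrative and will not cover every way a script element can be written:

    ```php
    <?php

    // Markup shaped like the Crawler's output, with an escaped "<" added to the
    // <div> for illustration.
    $output = "<div>1 &lt; 2</div>\n<script>console.log('&#27979;&#35797;');</script>";

    // Decoding the whole string also rewrites the entity in the <div>.
    $all = html_entity_decode($output, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Decoding only the text between <script> and </script> leaves it alone.
    $scoped = preg_replace_callback(
        '~(<script\b[^>]*>)(.*?)(</script>)~is',
        static function (array $m): string {
            return $m[1] . html_entity_decode($m[2], ENT_QUOTES | ENT_HTML5, 'UTF-8') . $m[3];
        },
        $output
    );

    echo $all, PHP_EOL;
    echo $scoped, PHP_EOL;
    ```

    The scoped version still decodes any entity that the script source contained intentionally, so it narrows the side effect without removing it.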

    You might also want to read:
    PHP DOMDocument loadHTML not encoding UTF-8 correctly
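
    For comparison, here is a sketch of the Content-Type meta tag approach mentioned earlier, applied to DOMDocument directly instead of the Crawler. It assumes the dom extension and a UTF-8 source file; the $hint variable is only a name chosen for this example.

    ```php
    <?php

    $html = "<div>测试</div><script>console.log('测试');</script>";

    // Content-Type hint prepended so that the parser reads the bytes as UTF-8.
    $hint = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';

    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // keep parser warnings out of the output
    $dom->loadHTML($hint . $html);

    $div    = $dom->getElementsByTagName('div')->item(0)->textContent;
    $script = $dom->getElementsByTagName('script')->item(0)->textContent;

    echo $div, PHP_EOL;
    echo $script, PHP_EOL;
    ```

    Because nothing is converted to entities before parsing, the script body has no entities left to decode afterwards.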
