Let's see how symfony/dom-crawler works. Here's an example to start with:
<?php
require 'vendor/autoload.php';
use Symfony\Component\DomCrawler\Crawler;
$html = <<<HTML
<div>测试</div>
<script charset="utf-8" type="text/javascript">
function drawCharts(){
console.log('测试');
}
</script>
HTML;
$crawler = new Crawler($html);
print $crawler->html();
It outputs:
<div>æµè¯</div>
<script charset="utf-8" type="text/javascript">
function drawCharts(){
console.log('测试');
}
</script>
When you pass the content through the constructor, the Crawler class does its best to figure out the encoding. If it cannot determine one, it falls back to ISO-8859-1, which is the default charset defined by the HTTP 1.1 specification.
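If you want to see where the garbled div text comes from, here is a small sketch that uses only mbstring. It is not the Crawler's own code; it just reinterprets the UTF-8 bytes of 测试 as ISO-8859-1, which is what the fallback amounts to. It assumes the file itself is saved as UTF-8.

```php
<?php
// Sketch only: mimics the effect of the ISO-8859-1 fallback with plain mbstring calls.
$utf8 = '测试'; // 2 characters, 6 bytes in UTF-8

// Treat every byte as one ISO-8859-1 character and re-encode the result as UTF-8.
$misread = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo bin2hex($utf8), "\n";               // e6b58be8af95
echo mb_strlen($utf8, 'UTF-8'), "\n";    // 2
echo mb_strlen($misread, 'UTF-8'), "\n"; // 6

// The damage is reversible as long as no bytes were dropped along the way.
$restored = mb_convert_encoding($misread, 'ISO-8859-1', 'UTF-8');
var_dump($restored === $utf8);           // bool(true)
```

Six single-byte "characters" instead of two multi-byte ones is exactly the kind of output shown above.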
If your HTML content contains a charset meta tag, the Crawler class reads the charset from it and converts the content accordingly. Here's the same example as above, with a charset meta tag prepended to the HTML content:
<?php
require 'vendor/autoload.php';
use Symfony\Component\DomCrawler\Crawler;
$html = <<<HTML
<meta charset="utf-8">
<div>测试</div>
<script charset="utf-8" type="text/javascript">
function drawCharts(){
console.log('测试');
}
</script>
HTML;
$crawler = new Crawler($html);
print $crawler->html();
Now it prints:
<div>测试</div>
<script charset="utf-8" type="text/javascript">
function drawCharts(){
console.log('&#27979;&#35797;');
}
</script>
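The detection step described above can be pictured roughly like this. Note that guessCharset() and its regular expression are hypothetical and written for illustration; they are not copied from Symfony, whose actual detection may differ in detail.

```php
<?php
// Hypothetical helper for illustration; the name and regex are not taken from Symfony.
function guessCharset(string $html, string $fallback = 'ISO-8859-1'): string
{
    // Covers <meta charset="..."> and the HTML4 form
    // <meta http-equiv="Content-Type" content="text/html; charset=...">.
    if (preg_match('/<meta[^>]+charset\s*=\s*["\']?([a-zA-Z0-9_:.-]+)/i', $html, $m)) {
        return strtoupper($m[1]);
    }

    return $fallback; // nothing declared, so fall back
}

echo guessCharset('<meta charset="utf-8"><div>测试</div>'), "\n"; // UTF-8
echo guessCharset('<div>测试</div>'), "\n";                        // ISO-8859-1
```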
If you don't want to add the charset meta tag, there's another way: the addHTMLContent() method accepts a charset as its second argument, which defaults to UTF-8. Instead of passing the HTML content through the constructor, first instantiate the class and then add the content using this method:
<?php
require 'vendor/autoload.php';
use Symfony\Component\DomCrawler\Crawler;
$html = <<<HTML
<div>测试</div>
<script charset="utf-8" type="text/javascript">
function drawCharts(){
console.log('测试');
}
</script>
HTML;
$crawler = new Crawler();
// The second argument defaults to 'UTF-8', so you can safely drop it
$crawler->addHTMLContent($html, 'UTF-8');
print $crawler->html();
Now, without a charset meta tag, it prints:
<div>测试</div>
<script charset="utf-8" type="text/javascript">
function drawCharts(){
console.log('&#27979;&#35797;');
}
</script>
OK, you may already know all of this. So what's with the &#27979;&#35797;? Why is the div content shown as is, while the same content inside the script tag gets HTML-encoded?
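Symfony's Crawler class, as its own source explains, converts the content to HTML entities to work around a bug in DOMDocument::loadHTML(). The parser decodes those entities again in ordinary text such as the div, but it treats script content as raw text, so the entities there survive into the output. The bug is described in the PHP manual comments: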
When using loadHTML() to process UTF-8 pages, you may meet the problem that the output of DOM functions are not like the input. For example, if you want to get "Cạnh tranh", you will receive "Cáº¡nh tranh". I suggest we use mb_convert_encoding before loading UTF-8 page.
– https://php.net/manual/en/domdocument.loadhtml.php#74777
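The conversion step can be sketched without the Crawler. The helper below, toNumericEntities(), is a hypothetical stand-in for what the quoted comment achieves with mb_convert_encoding; it only illustrates why a script body ends up holding &#27979;&#35797; and why html_entity_decode() brings the characters back.

```php
<?php
// Hypothetical stand-in for the entity conversion; not Symfony's code.
function toNumericEntities(string $utf8): string
{
    // Replace each non-ASCII code point with its decimal numeric entity.
    return preg_replace_callback(
        '/[^\x00-\x7F]/u',
        function (array $m): string {
            return '&#' . mb_ord($m[0], 'UTF-8') . ';';
        },
        $utf8
    );
}

$js = "console.log('测试');";
$encoded = toNumericEntities($js);

echo $encoded, "\n";                                          // console.log('&#27979;&#35797;');
echo html_entity_decode($encoded, ENT_QUOTES, 'UTF-8'), "\n"; // console.log('测试');
```

Because the entities are plain ASCII, they pass through loadHTML() untouched regardless of which encoding the parser assumes.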
Some suggest adding an HTML4 Content-Type meta tag to the head element. Others suggest prepending <?xml encoding="UTF-8"> to the HTML content before passing it to loadHTML(). As your HTML structure is incomplete (it lacks head, body, etc.), I recommend you simply pass the output to html_entity_decode():
<?php
require 'vendor/autoload.php';
use Symfony\Component\DomCrawler\Crawler;
$html = <<<HTML
<div>测试</div>
<script charset="utf-8" type="text/javascript">
function drawCharts(){
console.log('测试');
}
</script>
HTML;
$crawler = new Crawler();
$crawler->addHTMLContent($html, 'UTF-8');
print html_entity_decode($crawler->html());
Outputs:
<div>测试</div>
<script charset="utf-8" type="text/javascript">
function drawCharts(){
console.log('测试');
}
</script>
Which is what you want.
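For comparison, here is a rough sketch of the Content-Type meta tag workaround mentioned earlier, run against DOMDocument directly rather than through the Crawler. It assumes ext-dom is available and that the meta tag is prepended as a plain string; treat it as an experiment to run, not as the recommended fix.

```php
<?php
// Sketch of the meta tag workaround using DOMDocument directly (needs ext-dom).
$html = "<div>测试</div><script>console.log('测试');</script>";

// Declare the charset up front so the parser does not have to guess.
$meta = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($meta . $html);

$div = $doc->getElementsByTagName('div')->item(0);
$script = $doc->getElementsByTagName('script')->item(0);

echo $div->textContent, "\n";    // 测试
echo $script->textContent, "\n"; // console.log('测试');
```

No entity conversion happens here, so there is nothing to decode afterwards; the trade-off is that you work with DOMDocument yourself instead of the Crawler.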
You might also want to read:
PHP DOMDocument loadHTML not encoding UTF-8 correctly