doudanma9706 2018-01-29 07:23

Rogue element when parsing HTML with DOMDocument

Let's assume my $html looks like this:

<!DOCTYPE html>
<html>
<head>
    <script type="text/javascript">document.createElement("video");document.createElement("audio");document.createElement("track");</script>
    <script type="text/javascript" src="/gui/default/tinymcecontent.js"></script>
    <script type="text/javascript" src="/includes/js/video-js/video.min.js"></script>
    <link rel="stylesheet" href="/includes/js/video-js/video-js.css" />
    <script type="text/javascript">document.createElement("video");document.createElement("audio");document.createElement("track");</script>
    <script type"text/javascript" src="/includes/js/video-js/video.js"></script/>
    <link rel="stylesheet" href="/includes/js/video-js/video-js.css" />
</head>
<body style="font-family: arial;font-size: 12px;">
    <p> </p>
    <table width="100%">        
    </table>
</body>
</html>

When I try to extract only the elements inside the body tag with these commands:

$dom = new DOMDocument();

libxml_use_internal_errors(true);
$dom->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));
libxml_use_internal_errors(false);

$full_dom = $dom->getElementsByTagName('body')->item(0);

The result of

$dom->saveHTML($full_dom)

is

<body>
<p>\/&gt;<link rel=\"stylesheet\" href=\"\/includes\/js\/video-js\/video-js.css\"><\/p>
<p>\u00a0<\/p>
<table width=\"100%\"><\/table>
<\/body>

Where does the element

<p>\/&gt;<link rel=\"stylesheet\" href=\"\/includes\/js\/video-js\/video-js.css\"><\/p>

come from? Everything else is fine; only this element gets transferred from the head tag into the body.

1 answer

  • douxin1163 2018-01-29 07:32

    It comes from this line:

    <script type"text/javascript" src="/includes/js/video-js/video.js"></script/>
    

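    It is malformed: the = after type is missing, and the closing tag is written </script/> instead of </script>. The parser treats the stray /> as text, which appears to force the body open early, so the <link> that follows ends up inside a <p> in the body. The line should be: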

    <script type="text/javascript" src="/includes/js/video-js/video.js"></script>
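
To confirm the diagnosis, here is a minimal sketch that runs the same body extraction on a shortened copy of the markup with the corrected script tag. The helper name extractBodyHtml and the trimmed-down sample are assumptions for illustration, not part of the original post, and the encoding conversion from the question is left out because the sample is plain ASCII.

```php
<?php
// Hypothetical helper (not from the original post): parse an HTML string
// and return the serialized <body> element, or '' if there is none.
function extractBodyHtml(string $html): string
{
    // Buffer libxml messages instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML($html);

    // Discard buffered messages and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    $out = $dom->saveHTML($body);
    return $out === false ? '' : $out;
}

// Shortened sample with the script tag corrected.
$fixed = <<<'HTML'
<!DOCTYPE html>
<html>
<head>
    <script type="text/javascript" src="/includes/js/video-js/video.js"></script>
    <link rel="stylesheet" href="/includes/js/video-js/video-js.css" />
</head>
<body style="font-family: arial;font-size: 12px;">
    <p>text</p>
    <table width="100%"></table>
</body>
</html>
HTML;

$bodyHtml = extractBodyHtml($fixed);
echo $bodyHtml, PHP_EOL;
```

With the tag corrected, the link element should stay in the head and no longer show up in the serialized body.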
    

    Check the errors after $dom->loadHTML() to see what happened:

    foreach (libxml_get_errors() as $error) {
        print_r($error);
    }
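
The loop above dumps whole LibXMLError objects. A slightly more readable variant, sketched below, prints only the line, column, and message. The helper name collectHtmlParseErrors and the shortened sample string are assumptions for illustration. The sketch reads the errors before restoring the previous error setting, since switching internal errors back off may discard whatever has been buffered. Which messages appear, if any, depends on the installed libxml2 version.

```php
<?php
// Hypothetical helper (not from the original post): load an HTML string
// and return libxml's parse messages as readable strings.
function collectHtmlParseErrors(string $html): array
{
    // Buffer libxml messages instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $dom = new DOMDocument();
    $dom->loadHTML($html);

    // Read the buffered errors before touching the error setting again.
    $messages = [];
    foreach (libxml_get_errors() as $error) {
        $messages[] = sprintf(
            'line %d, column %d: %s',
            $error->line,
            $error->column,
            trim($error->message)
        );
    }

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $messages;
}

// Shortened sample containing the malformed script tag from the question.
$html = '<!DOCTYPE html><html><head>'
      . '<script type"text/javascript" src="/includes/js/video-js/video.js"></script/>'
      . '</head><body><p>text</p></body></html>';

foreach (collectHtmlParseErrors($html) as $message) {
    echo $message, PHP_EOL;
}
```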
    
    Accepted by the asker as the best answer.
