dounangqie4819 2016-05-12 08:52

DOMDocument is missing HTML tags

I play an online game called Tribalwars, and am now trying to write a report parser. A typical report looks like this:

https://enp2.tribalwars.net/public_report/395cf3cc373a3b8873c20fa018f1aa07

I have two functions adapted from php.net that now look as follows:

function has_child($p)
{
    if ($p->hasChildNodes())
    {
        foreach ($p->childNodes as $c)
        {
            if ($c->nodeType == XML_ELEMENT_NODE)
            {
                return true;
            }
        }
    }
    return false;
}

function show_node($x)
{
    foreach ($x->childNodes as $p)
    {
        if ($this->has_child($p))
        {
            $this->show_node($p);
        }
        elseif ($p->nodeType == XML_ELEMENT_NODE)
        {
            if (trim($p->nodeValue) !== '')
            {
                $temp = explode("\n", $p->nodeValue);
                if (count($temp) == 1)
                {
                    $this->reportdata[] = trim($temp[0]);
                }
                else
                {
                    foreach ($temp as $k => $v)
                    {
                        if (trim($v) !== '')
                        {
                            $this->reportdata[] = trim($v);
                        }
                    }
                }
            }
        }
    }
}

It returns the result in the following format:

Array
(
    [0] => MASHAD (27000) attacks 40-014-Devil...
    [1] => May 11, 2016  19:27:12
    [2] => MASHAD has won
    [3] => Attacker's luck
    ...
    [76] => Espionage
    [77] => Resources scouted:
    [78] => Building
    ...
    [112] => Haul:
    [113] => .
    [114] => .
    [115] => .
    [116] => .
    [117] => .
    ...
    [120] => https://enp2.tribalwars.net/public_report/395...
)

For the most part this works, but some data is lost in the parsing. If you look at the report at the link, you will see "Resources scouted" and "Haul" sections. Both of these sections contain <span> elements, incidentally. For some reason the values from those two sections are missing from the array that the functions return (see array item 77 and array items 113 - 118). Items 113 - 118 contain only the . of the strangely formatted number, and item 77 has the label but none of the values.

In the function where I call show_node(), I do a little bit of processing to strip out markup that is not needed:

$temp = explode('<h1>Publicized report</h1>', $report[0]['reportdata']);
$rep = $temp[1];
$temp = explode('For quick copy and paste', $rep);
$rep = '<report>' . $temp[0] . '</report>';
$x = new DOMDocument();
$x->loadHTML($rep);
$this->show_node($x->getElementsByTagName('report')->item(0));

If I print $rep before calling show_node(), the information I need for "Haul" and "Resources scouted" is present.
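
A check like the one described here can be taken one step further by also printing what DOMDocument holds after loadHTML(), which helps separate "the parser dropped it" from "the traversal skipped it". The following is only a sketch: the contents of $rep are a made-up stand-in, not the real report markup.

```php
<?php
// Hypothetical stand-in for $rep; the real report markup may differ.
$rep = '<report><p>Resources scouted: <span class="icon wood"></span>500</p></report>';

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$x = new DOMDocument();
$x->loadHTML($rep);

// List what the parser complained about (for example the non-HTML <report> tag).
foreach (libxml_get_errors() as $error) {
    echo trim($error->message), PHP_EOL;
}
libxml_clear_errors();

// Serialize the subtree exactly as DOMDocument holds it after parsing.
$root = $x->getElementsByTagName('report')->item(0);
$serialized = $x->saveHTML($root);
echo $serialized, PHP_EOL;
```

If the missing values still appear in the serialized output, the parser kept them and the loss is more likely to happen in the code that walks the tree.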

What could be the problem?

1 answer

  • dqpu4988 2016-05-29 09:36

    It appears that DOMDocument has some limit on how deep into the document it goes, or something similar. Either that, or the recursive code above is wrong. I therefore identified the piece of markup that was not being parsed, confirmed that it is well-formed, and then used str_replace() to remove the child elements I do not need. After that the values showed up in my array, so the problem is now resolved.
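
On the "recursive code is wrong" possibility: show_node() only records elements that have no element children, so text that sits directly beside a <span> inside the same parent is never read. That may explain why only the "." survived, although this has not been checked against the real report. A minimal sketch of a traversal that keeps such text follows; the markup and the helper name collect_values() are illustrative assumptions, not the original code.

```php
<?php
// Hypothetical fragment standing in for one "Haul" row; the real markup may differ.
$html = '<report><table><tr><th>Haul:</th>'
      . '<td><span class="icon wood"></span>1<span class="grey">.</span>234 </td>'
      . '</tr></table></report>';

libxml_use_internal_errors(true);   // <report> is not an HTML tag, so keep warnings quiet
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// collect_values() is an illustrative helper, not part of the original code.
// An element that directly contains non-blank text is recorded as a whole
// (textContent joins its text with that of its children); otherwise the
// function descends into its child elements.
function collect_values(DOMNode $node, array &$out): void
{
    foreach ($node->childNodes as $child) {
        if ($child->nodeType === XML_TEXT_NODE && trim($child->nodeValue) !== '') {
            $out[] = trim($node->textContent);
            return;
        }
    }
    foreach ($node->childNodes as $child) {
        if ($child->nodeType === XML_ELEMENT_NODE) {
            collect_values($child, $out);
        }
    }
}

$data = [];
collect_values($doc->getElementsByTagName('report')->item(0), $data);
print_r($data);
```

With this fragment the cell is recorded as the single value 1.234 rather than a lone ".", which avoids having to strip the inner elements with str_replace() first.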

    Selected by the asker as the best answer.