dounangqie4819 2016-05-12 08:52
Views: 37
Accepted

DOMDocument is missing HTML tags

I play an online game called Tribalwars, and am now trying to write a report parser. A typical report looks like this:

https://enp2.tribalwars.net/public_report/395cf3cc373a3b8873c20fa018f1aa07

I have two functions adapted from php.net that now look as follows:

function has_child($p)
{
    if ($p->hasChildNodes())
    {
        foreach ($p->childNodes as $c)
        {
            if ($c->nodeType == XML_ELEMENT_NODE)
            {
                return true;
            }
        }
    }
    return false;
}

function show_node($x)
{
    foreach ($x->childNodes as $p)
    {
        if ($this->has_child($p))
        {
            $this->show_node($p);
        }
        elseif ($p->nodeType == XML_ELEMENT_NODE)
        {
            if (trim($p->nodeValue) !== '')
            {
                $temp = explode("\n", $p->nodeValue);
                if (count($temp) == 1)
                {
                    $this->reportdata[] = trim($temp[0]);
                }
                else
                {
                    foreach ($temp as $k => $v)
                    {
                        if (trim($v) !== '')
                        {
                            $this->reportdata[] = trim($v);
                        }
                    }
                }
            }
        }
    }
}

It returns the result in the following format:

Array
(
    [0] => MASHAD (27000) attacks 40-014-Devil...
    [1] => May 11, 2016  19:27:12
    [2] => MASHAD has won
    [3] => Attacker's luck
    ...
    [76] => Espionage
    [77] => Resources scouted:
    [78] => Building
    ...
    [112] => Haul:
    [113] => .
    [114] => .
    [115] => .
    [116] => .
    [117] => .
    ...
    [120] => https://enp2.tribalwars.net/public_report/395...
)

For the most part this works, but some data gets lost in the parsing. If you look at the report at the link, you will see "Resources scouted" and "Haul" sections. Both of these sections contain <span> elements, incidentally. For some reason the values of those two sections are missing from the array that the functions return (see array item 77 and array items 113 - 118). Items 113 - 118 show only the "." of the strangely formatted number, and item 77 has no value at all.

In the function where I call show_node(), I do a little preprocessing to strip out markup that is not needed:

$temp = explode('<h1>Publicized report</h1>', $report[0]['reportdata']);
$rep = $temp[1];
$temp = explode('For quick copy and paste', $rep);
$rep = '<report>' . $temp[0] . '</report>';
$x = new DOMDocument();
$x->loadHTML($rep);
$this->show_node($x->getElementsByTagName('report')->item(0));

If I print $rep before calling show_node(), the information I need for Haul and Resources scouted is present.
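
One way to narrow this down is to inspect what DOMDocument holds after parsing, rather than $rep before parsing. The sketch below is only an illustration: the contents of $rep are a made-up stand-in for the report fragment, and libxml_use_internal_errors() is used so that any parser messages about the non-HTML <report> tag can be listed instead of being raised as warnings.

```php
<?php
// Hypothetical stand-in for $rep (an assumption; the real report markup may differ).
$rep = '<report><div>Haul: 1<span class="grey">.</span>234</div></report>';

libxml_use_internal_errors(true);   // collect parser messages instead of emitting warnings
$doc = new DOMDocument();
$doc->loadHTML($rep);

// Any messages the parser produced (for example about the unknown tag) are listed here.
foreach (libxml_get_errors() as $error) {
    echo trim($error->message), "\n";
}
libxml_clear_errors();

$report = $doc->getElementsByTagName('report')->item(0);

// What DOMDocument actually holds for <report>: once as markup, once as plain text.
$parsedHtml = $doc->saveHTML($report);
$parsedText = $report->textContent;
echo $parsedHtml, "\n", $parsedText, "\n";
```

If the serialized node still contains the values, the parser has kept them and the loss happens later, in the traversal.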

What could be the problem?


1 answer

  • dqpu4988 2016-05-29 09:36

    It appears that DOMDocument has some limit on how deep into the document it goes. Either that, or the recursive code above is wrong. I therefore identified the piece of markup that was not being parsed, confirmed that it was well-formed, and used str_replace() to remove the child elements I did not need; after that the values showed up in my array. Anyway, this problem is now resolved.
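
A possible alternative explanation, judging only from the code in the question: show_node() records the value of leaf elements only. When an element contains both text and a child element, the function recurses into it and then skips the text nodes, because they are neither elements nor parents of elements. The sketch below reproduces this on hypothetical markup (the <td>1<span>.</span>234</td> shape is an assumption about how the report formats numbers, not taken from the real page) and compares it with a variant, collect_text(), that keeps the full text of any element holding direct text. The function names are made up for this example, and unlike the original it does not split values on newlines.

```php
<?php
// Hypothetical markup (an assumption): a number whose separator sits in a <span>.
$html = '<report><table><tr><th>Haul:</th>'
      . '<td>1<span class="grey">.</span>234</td></tr></table></report>';

libxml_use_internal_errors(true);   // <report> is not an HTML tag; keep warnings quiet
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();
$root = $doc->getElementsByTagName('report')->item(0);

// Same test as has_child() in the question: does $node have an element child?
function has_element_child(DOMNode $node): bool
{
    if ($node->hasChildNodes()) {
        foreach ($node->childNodes as $c) {
            if ($c->nodeType === XML_ELEMENT_NODE) {
                return true;
            }
        }
    }
    return false;
}

// Free-function port of show_node(): only leaf elements are recorded.
function leaf_elements_only(DOMNode $node, array &$out): void
{
    foreach ($node->childNodes as $p) {
        if (has_element_child($p)) {
            leaf_elements_only($p, $out);   // text nodes directly under $p are skipped inside
        } elseif ($p->nodeType === XML_ELEMENT_NODE && trim($p->nodeValue) !== '') {
            $out[] = trim($p->nodeValue);
        }
    }
}

// Does $node hold non-blank text of its own, next to any child elements?
function has_direct_text(DOMNode $node): bool
{
    foreach ($node->childNodes as $c) {
        if ($c->nodeType === XML_TEXT_NODE && trim($c->nodeValue) !== '') {
            return true;
        }
    }
    return false;
}

// Variant: an element with direct text is recorded whole, child spans included.
function collect_text(DOMNode $node, array &$out): void
{
    foreach ($node->childNodes as $child) {
        if ($child->nodeType !== XML_ELEMENT_NODE) {
            continue;
        }
        if (has_direct_text($child)) {
            $out[] = trim($child->textContent);
        } else {
            collect_text($child, $out);
        }
    }
}

$before = [];
leaf_elements_only($root, $before);
print_r($before);

$after = [];
collect_text($root, $after);
print_r($after);
```

On this input the first walk yields "Haul:" and ".", while the second yields "Haul:" and "1.234".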

    This answer was selected as the best answer by the asker.
