doucai6663 2013-11-14 13:53

Symfony2 DomCrawler and an FB2 book format parser

Hi all,

How do I correctly parse the XML file described below with the Symfony2 DomCrawler component?

I need to split the book into its sections and, for each section, collect the inner tags (epigraph, p, poem, etc.) that belong to that section only.

I have a standard FB2 book in the XML format shown below:

<?xml version="1.0" encoding="utf-8"?>
<FictionBook xmlns="http://www.gribuser.ru/xml/fictionbook/2.0" xmlns:l="http://www.w3.org/1999/xlink">
<description></description>
<body>
<section>
    <title><p><strong>Level 1, section 1</strong></p></title>
    <section>
        <title><p><strong>Level 2, section 2</strong></p></title>
        <section>
            <title><p><strong>Level 3, section 3</strong></p></title>
            <p>Level 3, section 3, paragraph 1</p>
            <poem>
                <stanza>
                    <v>bla-bla-bla 1</v>
                    <v>bla-bla-bla 2</v>
                    <v>bla-bla-bla 3</v>
                </stanza>
            </poem>
            <p>Level3, section 3, paragraph 2</p>
            <subtitle><strong>x x x</strong></subtitle>
        </section>
        <section>
            <title><p><strong>Level 3, section 4</strong></p></title>
            <p>Level 3, section 4, paragraph 1</p>
            <p>Level 3, section 4, paragraph 2</p>
            <subtitle><strong>x x x</strong></subtitle>
        </section>
        <section>
            <title><p><strong>Level 3, section 5</strong></p></title>
            <p>Level 3, section 5, paragraph 1</p>
            <p>Level 3, section 5, paragraph 2</p>
            <p>Level 3, section 5, paragraph 3</p>
            <empty-line/>
            <subtitle>This file was created</subtitle>
            <subtitle>with BookDesigner program</subtitle>
            <subtitle>bookdesigner@the-ebook.org</subtitle>
            <subtitle>22.04.2004</subtitle>
        </section>
    </section>
</section>
</body>
</FictionBook>

The code below does not work. Could somebody help me solve this? The title is parsed correctly, but the section's inner tags are not.

private function loadBookSections(Crawler $crawler)
{
    $sections = $crawler->filter('section')->each(function(Crawler $node) {
        $c = $node->filter('section')->reduce(function(Crawler $node, $i) {
            return ($i == 0);
        });

        return array(
            'title' => $node->filter('title')->text(),
            'inner' => $c->html(),
        );
    });

    echo "*******************************************\n";

    foreach ($sections as $section) {
        echo ">>> ".$section['title']."\n";
        echo "!!! ".$section['inner']."\n";
    }
}

Thanks for your help!


2 answers

  • dq1685513999 2013-11-20 15:12

    After four days, I found a solution using XPath:

    private function loadBookSections(Crawler $crawler)
    {
    
        $sections = $crawler->filter('section')->each(function(Crawler $node) {
            return array(
                'title' => $node->filter('title')->text(),
                'inner' => $node->filterXPath("//*[not(section)]")->html(),
            );
        });
    
        foreach ($sections as $section) {
            echo "TITLE: ".$section['title']."\n";
            echo "INNER: ".$section['inner']."\n";
        }
    }
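
    For comparison, the same idea (per section, keep only the direct children that are neither the title nor a nested section) can be sketched with PHP's built-in DOM extension instead of DomCrawler. This is a minimal sketch under assumptions, not the method used above: the function name `collectSections` and the `fb` prefix are made up for illustration, and the namespace is registered by hand because the FB2 file declares a default namespace that plain XPath names do not match.

    ```php
    <?php
    // Sketch only: uses DOMDocument/DOMXPath from ext-dom, not Symfony DomCrawler.
    // "collectSections" and the "fb" prefix are hypothetical names.
    function collectSections($xml)
    {
        $dom = new DOMDocument();
        $dom->loadXML($xml);

        $xpath = new DOMXPath($dom);
        // Bind a prefix to the FB2 default namespace so XPath can address its elements.
        $xpath->registerNamespace('fb', 'http://www.gribuser.ru/xml/fictionbook/2.0');

        $sections = array();
        foreach ($xpath->query('//fb:section') as $section) {
            // Text of this section's own title (first <title> child).
            $title = trim($xpath->evaluate('string(fb:title)', $section));

            // Direct children only, skipping the title and any nested sections.
            $inner = '';
            $children = $xpath->query('*[not(self::fb:section) and not(self::fb:title)]', $section);
            foreach ($children as $child) {
                $inner .= $dom->saveXML($child);
            }

            $sections[] = array('title' => $title, 'inner' => $inner);
        }

        return $sections;
    }

    // Small sample in the same shape as the FB2 file from the question.
    $xml = '<FictionBook xmlns="http://www.gribuser.ru/xml/fictionbook/2.0"><body>'
         . '<section><title><p>Outer</p></title>'
         . '<section><title><p>Inner</p></title><p>Text</p></section>'
         . '</section></body></FictionBook>';

    foreach (collectSections($xml) as $section) {
        echo "TITLE: ".$section['title']."\n";
        echo "INNER: ".$section['inner']."\n";
    }
    ```

    Here a section that only wraps other sections ends up with an empty 'inner', while a leaf section keeps its own paragraphs, poems and subtitles.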
    
    This answer was accepted by the asker as the best answer.
