dongshanxiao7328 2018-05-19 16:56
413 views
Accepted

XPath: remove <br> and push multi-line text into a single array element

I've been searching Stack Overflow for a possible answer for hours, and although I found some solutions, none worked in my case.

I need to get the text of each div and run it through a foreach loop to eventually create a new database record for each div's content.

Everything works until I face divs with multi-line content and <br> tags.

I have tried:

$quotes = $finder->query("//*[contains(@class, normalize-space('$quote'))]//text()");

But normalize-space() doesn't seem to have any effect: instead of pushing the whole text into one array element, it creates a new element after every <br>.

More code:

$quotes = $finder->query("//*[contains(@class, normalize-space('$quote'))]//text()");
$authors = $finder->query("//*[starts-with(@class,'$author')]/child::a");

foreach ($quotes as $key => $quote) {
    $quote = trim($quote->textContent);
    $dataArr[] = $quote;
    // Split the author link text on whitespace, commas and "@"
    $authorName = preg_split("/[\s,-,@]+/", $authors[$key]->textContent);

    // Rebuild the name from the parts; the larger count is tested first so every branch is reachable
    if (count($authorName) < 5) {
        $authorName = $authorName[1];
    } else if (count($authorName) > 6) {
        $authorName = $authorName[1] . ' ' . $authorName[2] . ' ' . $authorName[3] . ' ' . $authorName[4];
    } else if (count($authorName) > 5) {
        $authorName = $authorName[1] . ' ' . $authorName[2] . ' ' . $authorName[3];
    } else {
        $authorName = $authorName[1] . ' ' . $authorName[2];
    }
    array_push($dataArr, $authorName);
}

HTML structure that is extracted correctly:

<div class="b-list-quote2__item "><a href="/" class="b-list-quote2__item-text js-quote-text">
    A random quote here...
</a><div class="b-list-quote2__item-category">
    <a href="/quotes/albert-einshtein?q=17856">Albert Einstein</a>

In this case, I get an array with the quote and the author, which I later chunk by 2 and use in other functions:

[0] => A random quote here... [1] => Albert Einstein

HTML structure I'm having the problem with:

<div class="b-list-quote2__item "><a href="/" class="b-list-quote2__item-text js-quote-text" style="position: relative; max-height: none;">
    Quote line 0,
    <br>Quote line 1,
    <br>Quote line 2,
    <br>Quote line 3,
</a><div class="b-list-quote2__item-category">
    <a href="/quotes/karmelita-kruglaia?q=249176">Tesla</a>

In this case, a new array item is added for every line of text, producing something like:

[0] => Quote line 0 [1] => Quote line 1 [2] => Quote line 2 [3] => Quote line 3

There is no author in the array, which in this case should be "Tesla".

What a good array should look like:

[0] => Quote line 0 Quote line 1 Quote line 2 Quote line 3 [1] => Tesla

1 answer

  • dougong8012 2018-05-19 17:44

    The last part of your XPath expression (the //text() on the end) asks for each text node to be returned separately. What you want instead is the text of the whole element. In the DOM, each run of text between tags is a separate node, so

    Quote line 0,
    <br>Quote line 1,
    

    is two separate text nodes. Your query returns them (as you've found) as two separate items.
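The difference can be checked with a minimal sketch. The sample markup and the class name held in $class are assumptions modelled on the question's HTML, not the asker's real page or the real value of $quote. Note also that normalize-space() applied to a string literal only normalizes that literal (here, the class name); it does not touch the text of the matched nodes, which may be why it appeared to have no effect.

```php
<?php
// Hypothetical sample markup, modelled on the multi-line quote in the question.
$html = '<div class="b-list-quote2__item"><a href="/" class="b-list-quote2__item-text js-quote-text">
    Quote line 0,
    <br>Quote line 1,
    <br>Quote line 2,
</a></div>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // tolerate imperfect real-world HTML
$doc->loadHTML($html);
libxml_clear_errors();
$finder = new DOMXPath($doc);

// Assumed class name; the question never shows the value of $quote.
$class = 'b-list-quote2__item-text';

// One result per text node: each <br> splits the text into another piece.
$textNodes = $finder->query("//*[contains(@class, '$class')]//text()");

// One result per matching element: textContent joins all descendant text.
$elements = $finder->query("//*[contains(@class, '$class')]");

echo $textNodes->length, " result(s) with //text()\n";
echo $elements->length, " result(s) without //text()\n";
```

With the element query, the loop runs once per quote, so the index still lines up with the matching author node.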

    So using

    $quotes = $finder->query("//*[contains(@class, normalize-space('$quote'))]");
    

    should give you one node per element, and its textContent holds all of the text. That text will still contain line breaks, so you can replace them with spaces:

    $dataArr[] = str_replace("\n", " ", $quote);
    
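Putting the pieces together, the whole extraction could look roughly like the sketch below. It is an illustration under assumptions, not the asker's actual code: the sample HTML, the class names in the two queries, and the variable names $quoteNode and $text are all made up for the example. It also swaps str_replace for preg_replace('/\s+/', ' ', ...), which collapses the indentation left around each line break as well as the newline itself.

```php
<?php
// Hypothetical sample markup: one single-line quote and one multi-line quote.
$html = <<<HTML
<div class="b-list-quote2__item "><a href="/" class="b-list-quote2__item-text js-quote-text">
    A random quote here...
</a><div class="b-list-quote2__item-category">
    <a href="/quotes/albert-einshtein?q=17856">Albert Einstein</a>
</div></div>
<div class="b-list-quote2__item "><a href="/" class="b-list-quote2__item-text js-quote-text">
    Quote line 0,
    <br>Quote line 1,
    <br>Quote line 2,
    <br>Quote line 3,
</a><div class="b-list-quote2__item-category">
    <a href="/quotes/karmelita-kruglaia?q=249176">Tesla</a>
</div></div>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // tolerate imperfect real-world HTML
$doc->loadHTML($html);
libxml_clear_errors();
$finder = new DOMXPath($doc);

// Select whole elements (no trailing //text()), so each quote is one node.
// Both class names are assumptions taken from the sample markup.
$quotes  = $finder->query("//a[contains(@class, 'b-list-quote2__item-text')]");
$authors = $finder->query("//div[starts-with(@class, 'b-list-quote2__item-category')]/a");

$dataArr = [];
foreach ($quotes as $key => $quoteNode) {
    // Collapse newlines and runs of whitespace into single spaces.
    $text = trim(preg_replace('/\s+/', ' ', $quoteNode->textContent));
    $dataArr[] = $text;
    // Same index in the author list; assumes every quote has exactly one author link.
    $dataArr[] = trim($authors->item($key)->textContent);
}

print_r(array_chunk($dataArr, 2));
```

Pairing by index only holds while every quote has exactly one author link; if some quotes lack one, querying the author relative to each quote's parent div would be a safer design.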
    The asker selected this answer as the best answer.
