XPath: remove <br> and push multi-line text into one array

I've been searching Stack Overflow for a possible answer for hours, and although I found some solutions, none worked in my case.

I need to get the text of each div and run it through a foreach loop to eventually create a new database record for each div's content.

Everything works until I face divs with multi-line content and <br> tags.

I have tried:

$quotes = $finder->query("//*[contains(@class, normalize-space('$quote'))]//text()");

But normalize-space() doesn't seem to have any effect: instead of pushing the whole text into one array element, it creates a new array element after every <br>.

More code:

$quotes = $finder->query("//*[contains(@class, normalize-space('$quote'))]//text()");
$authors = $finder->query("//*[starts-with(@class,'$author')]/child::a");

    foreach ($quotes as $key => $quote) {
        $quote = trim($quote->textContent);
        $dataArr[] = $quote;
        $authorName = preg_split("/[\s,-,@]+/", $authors[$key]->textContent);

        // Check the larger counts first so that every branch is reachable
        if (count($authorName) < 5) {
            $authorName = $authorName[1];
        } else if (count($authorName) > 6) {
            $authorName = $authorName[1] . ' ' . $authorName[2] . ' ' . $authorName[3] . ' ' . $authorName[4];
        } else if (count($authorName) > 5) {
            $authorName = $authorName[1] . ' ' . $authorName[2] . ' ' . $authorName[3];
        } else {
            $authorName = $authorName[1] . ' ' . $authorName[2];
        }
        array_push($dataArr, $authorName);
    }

HTML structure that is extracted correctly:

<div class="b-list-quote2__item "><a href="/" class="b-list-quote2__item-text js-quote-text">
    A random quote here...
</a><div class="b-list-quote2__item-category">
    <a href="/quotes/albert-einshtein?q=17856">Albert Einstein</a>

In this case, I get an array with the quote and the author, which I later chunk by 2 and use in other functions:

[0] => A random quote here... [1] => Albert Einstein

HTML structure I'm having the problem with:

<div class="b-list-quote2__item "><a href="/" class="b-list-quote2__item-text js-quote-text" style="position: relative; max-height: none;">
    Quote line 0,
    <br>Quote line 1,
    <br>Quote line 2,
    <br>Quote line 3,
</a><div class="b-list-quote2__item-category">
    <a href="/quotes/karmelita-kruglaia?q=249176">Tesla</a>

In this case, a new array item is added for every line of text, giving something like

[0] => Quote line 0 [1] => Quote line 1 [2] => Quote line 2 [3] => Quote line 3

with no "author" in the array, which in this case should be "Tesla".

How a good array should look:

[0] => Quote line 0 Quote line 1 Quote line 2 Quote line 3 [1] => Tesla

1 answer

dougong8012 2018-05-19 17:44
The last part of your XPath query, //text(), asks for every text node under the matched element to be returned separately. What you want instead is the text of the whole element. In the DOM, each run of text is its own node, so

Quote line 0, <br>Quote line 1,

is two separate text nodes, split by the <br> element. Your query returns them (as you've found) as two separate results.
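To make the difference concrete, here is a minimal sketch that runs both forms of the query against a shortened copy of the problem markup. The class name assigned to $class is an assumption about what the question's $quote variable holds, and the closing </div> was added so the fragment parses cleanly.

```php
<?php
// Minimal sketch: <br> elements split one link's text into several text nodes.
$html = '<div class="b-list-quote2__item "><a href="/" class="b-list-quote2__item-text js-quote-text">
    Quote line 0,
    <br>Quote line 1,
    <br>Quote line 2,
</a></div>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // suppress warnings about imperfect HTML
$doc->loadHTML($html);
$finder = new DOMXPath($doc);

// Assumption: this is the class name stored in the question's $quote variable.
$class = 'b-list-quote2__item-text';

// One result per text node under the matched element.
$textNodes = $finder->query("//*[contains(@class, '$class')]//text()");

// One result per matched element; its textContent joins all descendant text.
$elements = $finder->query("//*[contains(@class, '$class')]");

echo $textNodes->length, " text nodes, ", $elements->length, " element\n";
// prints "3 text nodes, 1 element"
```

With //text() the loop in the question therefore runs once per line of the quote, which is also why the author index no longer lines up.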

So using

$quotes = $finder->query("//*[contains(@class, normalize-space('$quote'))]");

should give you all of the text of each element in a single result. The text will still contain line breaks, so you can replace them with spaces:

$dataArr[] = str_replace("\n", " ", $quote);
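Putting the pieces together, the following is a rough sketch of the whole loop rather than a definitive implementation. It assumes the two class names shown (stand-ins for the question's $quote and $author variables), adds the closing </div> tags missing from the sample markup, and leaves out the question's author-name splitting. It also swaps str_replace for preg_replace with /\s+/, because replacing only "\n" would leave the indentation spaces between the lines. As a side note, normalize-space('$quote') in the original query only normalizes the class-name string, not the node text, so it is dropped here.

```php
<?php
// Sketch: one array entry per quote (whitespace collapsed), followed by its author.
$html = '<div class="b-list-quote2__item "><a href="/" class="b-list-quote2__item-text js-quote-text">
    Quote line 0,
    <br>Quote line 1,
    <br>Quote line 2,
    <br>Quote line 3,
</a><div class="b-list-quote2__item-category">
    <a href="/quotes/karmelita-kruglaia?q=249176">Tesla</a>
</div></div>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // suppress warnings about imperfect HTML
$doc->loadHTML($html);
$finder = new DOMXPath($doc);

// Assumptions: stand-ins for the question's $quote and $author class names.
$quoteClass  = 'b-list-quote2__item-text';
$authorClass = 'b-list-quote2__item-category';

// Select the elements themselves, not their individual text nodes.
$quotes  = $finder->query("//*[contains(@class, '$quoteClass')]");
$authors = $finder->query("//*[starts-with(@class, '$authorClass')]/child::a");

$dataArr = [];
foreach ($quotes as $key => $quoteNode) {
    // Collapse newlines, indentation and repeated spaces into single spaces.
    $dataArr[] = trim(preg_replace('/\s+/', ' ', $quoteNode->textContent));
    $dataArr[] = trim($authors->item($key)->textContent);
}

print_r($dataArr);
// [0] => Quote line 0, Quote line 1, Quote line 2, Quote line 3,
// [1] => Tesla
```

Because each quote now produces exactly one entry, the quote and author lists stay aligned and the array can still be chunked by 2 afterwards.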
This answer was accepted by the asker as the best answer.

Question tagged php, asked 2018-05-19 16:56.