Return an XPath regex capturing group as a string using an XPath query

CONTEXT

Suppose the following HTML:

....
<p>Whatever</p>
<div>Whatever DIV78232 Everwhat</div>
....

Question:

How can I return a plain-text string containing DIVnnnnn, where nnnnn stands for any sequence of digits?

My investigation so far:

The XPath replace() function replaces a pattern found in the content of the current DOM node:

replace(.,'.*?(DIV\d+).*','$1') => DIV78232
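
For comparison, here is a rough Python analogue of the expression above. It is only a sketch: `re.sub` stands in for XPath `replace()`, `\1` plays the role of `$1`, and the variable names are made up for illustration. It shows that the call produces a new string and leaves its input as it was.

```python
import re

# Text content of the <div> from the HTML above.
div_text = "Whatever DIV78232 Everwhat"

# Analogue of replace(., '.*?(DIV\d+).*', '$1'): the pattern consumes the
# whole string, and the replacement keeps only the first capturing group.
captured = re.sub(r".*?(DIV\d+).*", r"\1", div_text)

print(captured)  # DIV78232
print(div_text)  # Whatever DIV78232 Everwhat (the input is unchanged)
```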

Why am I blocked?

Because I want the query to return "DIV78232" as a string, without actually replacing anything in the DOM, just as the query /p/text() would return "Whatever". (I am trying all this in the FirePath Firefox extension.)

Note: according to the official docs:

"replace() Returns the value of the first argument with every substring matched by the regular expression that is the value of the second argument replaced by the replacement string that is the value of the third argument."

FINAL PURPOSE:

My final purpose is to get the image URL, as a string, that matches '.*?image:.*?"(.+?)".*' in the script block shown below (which is inside the HTML).

In this case, the query //*[matches(.,'.*?image:.*?"(.+?)".*','i')] returns the whole node, but I only want the first capturing group, which would be the image URL:

<script>...vp&output=xml_vast2&unviewed_position_start=1&
url='+encodeURIComponent(location.href)+'
description_url='+encodeURIComponent(location.href)+'&
image:   "https://domain.com/xxxxxxx/public_images/2015.12/article/56797be1c46188ac438b45c3.jpg", // stretching: 'fi..</script>
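
The gap between "the node matches" and "give me the group" can be sketched in Python. This is an analogue under stated assumptions, not XPath: `node_text` is a hypothetical stand-in holding only the `image:` line of the script above, and `re.IGNORECASE` mirrors the 'i' flag. A match test yields only true or false, which is all a predicate needs to select the node, while the capturing group has to be requested separately.

```python
import re

# Hypothetical stand-in: only the "image:" line of the <script> content above.
node_text = (
    'image:   "https://domain.com/xxxxxxx/public_images/2015.12/article/'
    '56797be1c46188ac438b45c3.jpg", // stretching: \'fi..'
)

pattern = re.compile(r'.*?image:.*?"(.+?)".*', re.IGNORECASE)
match = pattern.search(node_text)

# A match test only answers yes/no, which is enough to filter nodes.
node_matches = match is not None

# The first capturing group must be asked for explicitly.
image_url = match.group(1) if match else None

print(node_matches)
print(image_url)
```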

1 answer

dongyan2469 2015-12-22 22:17
It took me a long while, but this is the result I got by combining replace() and tokenize():

tokenize(replace(.,'.*?image:.*?"(.+?)".*?',':@:$1:@:'),':@:')[2]

It returns the image URL from the snippet shown in the question.

Why/How does this work?

replace() matches the image entry and wraps the capturing group in my own token separator ':@:' (it could be any string unlikely to occur in the document).

tokenize() splits the replaced string into three parts, the second of which is the capturing group I was looking for. (There will be exactly three parts because it is highly improbable that the document contains ':@:' anywhere else.)
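
The two steps can be mimicked in Python to see why the second token is the URL. This is a sketch, not the XPath itself: `re.sub` stands in for replace(), `str.split` for tokenize(), and the names `script_text`, `SEP` and `capture_via_separator` are invented for illustration. Note that XPath sequences are 1-based (hence `[2]`), while Python lists are 0-based (hence `parts[1]`).

```python
import re

# Text of the <script> node from the question, copied as-is (including its elisions).
script_text = (
    "...vp&output=xml_vast2&unviewed_position_start=1&\n"
    "url='+encodeURIComponent(location.href)+'\n"
    "description_url='+encodeURIComponent(location.href)+'&\n"
    'image:   "https://domain.com/xxxxxxx/public_images/2015.12/article/'
    '56797be1c46188ac438b45c3.jpg", // stretching: \'fi..'
)

SEP = ":@:"  # separator assumed not to occur anywhere else in the text


def capture_via_separator(text, pattern, sep=SEP):
    """Wrap group 1 of each match in `sep`, then split on `sep`."""
    # Step 1 (replace): the matched text becomes sep + group 1 + sep.
    wrapped = re.sub(pattern, sep + r"\1" + sep, text)
    # Step 2 (tokenize): text before the match, the group, text after the match.
    parts = wrapped.split(sep)
    return parts[1] if len(parts) >= 3 else None


url = capture_via_separator(script_text, r'.*?image:.*?"(.+?)".*?')
print(url)
```

If the separator did occur elsewhere in the text, the split would produce extra parts and the second one might no longer be the group, which is why an unusual separator matters.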

Is there any faster way to achieve this?

Thanks. All the best. Peace.
This answer was selected by the asker as the best answer.
