dqf42223 2015-12-22 20:42

Returning an XPath regex capturing group as a string using an XPath query

CONTEXT

Suppose the following HTML:

....
<p>Whatever</p>
<div>Whatever DIV78232 Everwhat</div>
....

Question:

How can I return a plain-text string containing DIVnnnnn, where nnnnn stands for any sequence of digits?

My investigation so far:

The XPath replace() function will replace a pattern found inside the current DOM.

replace(.,'.*?(DIV\d+).*','$1') => DIV78232

Why am I blocked?

Because I want the query to return "DIV78232" as a string without replacing anything in the DOM, just as the query /p/text() would return "Whatever". (I am trying all this in the FirePath Firefox extension.)

Note: According to the official DOCS

"replace() Returns the value of the first argument with every substring matched by the regular expression that is the value of the second argument replaced by the replacement string that is the value of the third argument."
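As the quoted definition says, replace() returns a value; it does not write anything back into the document. A minimal sketch of the same behaviour, using Python's re module as a stand-in for the XPath 2.0 regex functions (an assumption for illustration only; the two regex dialects differ in some details):

```python
import re

# Text content of the <div> from the question.
text = "Whatever DIV78232 Everwhat"

# Like XPath replace(), re.sub returns a NEW string.
# The pattern consumes the whole input, so only the captured group remains.
result = re.sub(r".*?(DIV\d+).*", r"\1", text)

print(result)  # DIV78232
print(text)    # Whatever DIV78232 Everwhat (the input is untouched)
```

The source string is left unchanged, which is the same guarantee the XPath function gives for the node it reads from.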

FINAL PURPOSE:

My final goal is to get the image URL (as a string) captured by '.*?image:.*?"(.+?)".*'.

The query //*[matches(.,'.*?image:.*?"(.+?)".*','i')] returns the whole node, but I only want the first capturing group, which is the image URL. The relevant text sits inside this <script> element in the HTML:

<script>...vp&output=xml_vast2&unviewed_position_start=1&
url='+encodeURIComponent(location.href)+'
description_url='+encodeURIComponent(location.href)+'&
image:   "https://domain.com/xxxxxxx/public_images/2015.12/article/56797be1c46188ac438b45c3.jpg", // stretching: 'fi..</script>

1 answer

  • dongyan2469 2015-12-22 22:17

    It took me a long while, but this is the result I got by combining replace() and tokenize():

    tokenize(replace(.,'.*?image:.*?"(.+?)".*?',':@:$1:@:'),':@:')[2]

    This returns the image URL from the snippet above.

    Why/How does this work?

    • replace() matches the image entry and wraps the capturing group in my own separator token ':@:' (it could be any string unlikely to occur in the document).
    • tokenize() splits the resulting string into three parts, the second of which is the capturing group I was looking for. (There will be three parts because it is highly improbable that the document contains ':@:' anywhere else.)
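    The two steps above can be traced outside XPath. The following is only a sketch in Python, with re.sub standing in for replace() and str.split standing in for tokenize(); the script_text sample is a shortened, hypothetical stand-in for the real <script> content, and Python's regex dialect is assumed to behave like the XPath one for this pattern.

```python
import re

# Shortened stand-in for the <script> text in the question (hypothetical sample).
script_text = (
    "description_url='+encodeURIComponent(location.href)+'&\n"
    'image:   "https://domain.com/xxxxxxx/public_images/2015.12/article/'
    '56797be1c46188ac438b45c3.jpg", // stretching\n'
)

SEP = ":@:"  # separator token, assumed not to occur in the document

# Step 1 (replace): wrap the captured URL in separator tokens.
# The text before and after the match is left in place.
marked = re.sub(r'.*?image:.*?"(.+?)".*?', SEP + r"\1" + SEP, script_text)

# Step 2 (tokenize): split on the separator, giving
# [text before, captured group, text after].
parts = marked.split(SEP)

# XPath sequences are 1-based, so XPath's [2] is index 1 here.
url = parts[1]
print(url)
```

    Note that the separator only works if it never appears in the input; otherwise the split yields more than three parts and the index no longer points at the captured group.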

    Is there any faster way to achieve this?

    Thanks. All the best. Peace.

    Accepted by the asker as the best answer.
