dongyi6543 2013-03-28 15:07

Extracting image paths from Wiki XML markup

I am trying to parse Wikipedia XML that I get from Wikipedia's XML export.

In one case I need to extract all image paths. The raw markup looks like this:

  [[Bild:nameOfImage.png|image description]]

"Bild" can also be "Image", "File", or "Datei".

To extract the text for an image I use this regex:

'|\[\[.*\|.*\]\]|U'

This works fine unless the image description contains another '[[ .. ]]', for example:

[[Bild:nameOfImage.png|image Description with a [[new wiki link]] ]]

My question is: how can I modify the regex to get all the text between the first "[[" and the last "]]" without counting every '[' and ']' character?

Thanks in advance.

1 answer

  • donljt2606 2013-03-28 15:50

    Since you're using PHP, you can probably use a recursive pattern.
    Assuming you don't need to capture anything:

    /\[\[(((?>[^\[\]])|(?R))*)\]\]/U
    

    Note that I haven't tried this regex since I have no way to use PHP.
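To make the recursion easier to follow, here is a sketch of the same idea written with the `x` (extended) modifier so each part can carry a comment. It drops the extra capturing groups and the `U` modifier, which is an editorial simplification rather than part of the answer above; the variable names `$pattern` and `$sample` are only illustrative.

```php
<?php
// Annotated form of the recursive pattern. With the x modifier, whitespace
// and "#" comments inside the pattern are ignored by PCRE.
$pattern = '/
    \[\[              # opening "[["
    (?>               # atomic group: one step of the link body
        [^\[\]]       #   a single character that is not a bracket
      |               #   or
        (?R)          #   a complete nested "[[...]]", matched by recursing into the whole pattern
    )*                # repeat any number of times
    \]\]              # closing "]]"
/x';

$sample = '[[Bild:nameOfImage.png|image Description with a [[new wiki link]] ]]';

if (preg_match($pattern, $sample, $m)) {
    // $m[0] holds the full outer link, including the nested one.
    var_dump($m[0]);
}
```

Because the body only accepts non-bracket characters or a whole nested link, a description containing a single unpaired `[` or `]` will not match from the outer `[[`; that case would need extra handling.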

    Edit:

    preg_match('/\[\[(?>[^\[\]]|(?R))*\]\]/U', '[[Bild:nameOfImage.png|image Description with a [[new wiki link]] ]]', $array);
    var_dump($array);
    

    This seems to work.
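Building on the pattern above, the following is a minimal sketch of how the image paths themselves might be pulled out of a larger piece of markup. The function name `extractImagePaths` and the sample text are hypothetical, and the prefix check assumes the four namespace names from the question ("Bild", "Image", "File", "Datei") written exactly in that case.

```php
<?php
// Hypothetical helper (not from the original answer): returns the file names
// of all image links found in a chunk of wiki markup.
function extractImagePaths(string $wikiText): array
{
    // Recursive pattern: "[[", then any mix of non-bracket characters or
    // nested "[[...]]" groups (via (?R)), then "]]".
    $pattern = '/\[\[(?>[^\[\]]|(?R))*\]\]/';

    preg_match_all($pattern, $wikiText, $matches);

    $paths = [];
    foreach ($matches[0] as $link) {
        // Keep only links that start with one of the image namespaces, and
        // capture the file name up to the first "|" or closing bracket.
        if (preg_match('/^\[\[(?:Bild|Image|File|Datei):([^|\]]+)/', $link, $m)) {
            $paths[] = trim($m[1]);
        }
    }
    return $paths;
}

$sample = 'Intro [[Bild:nameOfImage.png|image Description with a [[new wiki link]] ]] '
        . 'and [[plain link]] and [[File:other.jpg]].';

print_r(extractImagePaths($sample));
```

Splitting the work into two steps (match balanced links first, then test the prefix) keeps the recursive pattern unchanged; ordinary links such as `[[plain link]]` are matched by the first step and simply discarded by the second.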

    This answer was accepted by the asker as the best answer.
