dongyi6543 2013-03-28 15:07

Extracting image paths from Wiki XML markup

I am trying to parse the Wikipedia XML that I get from the Wikipedia XML export.

In one case I need to extract all image paths. The raw markup looks like this:

  [[Bild:nameOfImage.png|image description]]

"Bild" can also be "Image", "File" or "Datei".

To extract the text for an image I use this regex:

'|\[\[.*\|.*\]\]|U'

This works fine as long as the image description does not contain another '[[ .. ]]', as in:

[[Bild:nameOfImage.png|image Description with a [[new wiki link]] ]]
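The failure can be reproduced with a minimal sketch like the one below. It assumes the regex is run through PHP's preg_match; the variable names are only for illustration.

```php
<?php
// The ungreedy pattern from the question, unchanged.
$pattern = '|\[\[.*\|.*\]\]|U';

// The two sample links from the question.
$simple = '[[Bild:nameOfImage.png|image description]]';
$nested = '[[Bild:nameOfImage.png|image Description with a [[new wiki link]] ]]';

preg_match($pattern, $simple, $simpleMatch);
preg_match($pattern, $nested, $nestedMatch);

// $simpleMatch[0] is the whole link.
// $nestedMatch[0] stops at the first "]]", so the trailing " ]]" is lost.
var_dump($simpleMatch[0], $nestedMatch[0]);
```

Because the U modifier makes every quantifier lazy, the match ends at the first "]]" it meets, which is the one closing the inner link.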

My question is: how can I modify the regex to get all the text between the first "[[" and the last "]]" without counting every '[' and ']' character?

Thanks in advance.

1 answer

  • donljt2606 2013-03-28 15:50

    Since you're using PHP, you should be able to use recursive patterns.
    Assuming you only need the overall match and no specific capture groups:

    /\[\[(((?>[^\[\]])|(?R))*)\]\]/U
    

    Note that I haven't tested this regex, since I have no way to run PHP.

    Edit:

    preg_match('/\[\[(?>[^\[\]]|(?R))*\]\]/U', '[[Bild:nameOfImage.png|image Description with a [[new wiki link]] ]]', $array);
    var_dump($array);
    

    This seems to work.
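    Building on the recursive idea, here is a rough sketch of how the image paths themselves might be pulled out of a larger text. The function name extractImagePaths, the group names path, desc and link, and the sample text are all hypothetical and not from the post above. It also swaps (?R) for a named subpattern, because (?R) would recurse into the whole pattern, including the "Bild:" prefix that nested links do not have.

    ```php
    <?php
    // Hypothetical helper, not part of the original answer.
    function extractImagePaths(string $wikiText): array
    {
        // x modifier: whitespace and # comments inside the pattern are ignored.
        $pattern = '~
            \[\[ (?:Bild|Image|File|Datei) :                # namespace prefix
            (?<path> [^|\[\]]+ )                            # file name
            (?: \| (?<desc> (?: [^\[\]] | (?&link) )* ) )?  # optional description
            \]\]
            (?(DEFINE)
                # any balanced [[ ... ]] link, possibly nested
                (?<link> \[\[ (?: [^\[\]] | (?&link) )* \]\] )
            )
        ~x';

        if (!preg_match_all($pattern, $wikiText, $matches)) {
            return []; // no match, or a PCRE error
        }
        return array_map('trim', $matches['path']);
    }

    // Made-up sample text for illustration.
    $sample = 'Text [[Bild:nameOfImage.png|image Description with a [[new wiki link]] ]] '
            . 'more [[File:other.jpg]] and [[plain link]]';

    $paths = extractImagePaths($sample);
    print_r($paths); // nameOfImage.png and other.jpg
    ```

    Like the pattern above, this sketch assumes brackets in a description only appear as balanced "[[ .. ]]" pairs; a lone '[' or ']' would make the link fail to match.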

    This answer was selected by the asker as the best answer.