While learning web scraping, I needed to extract an image URL from a snippet of HTML.
Text to extract from:
t1 = """ <div class="thumb"><a href="/article/123954862" target="_blank"> <img src="//pic.qiushibaike.com/system/pictures/12395/123954862/medium/L62DIHT1AV2DKIUV.jpg" alt="糗事#123954862" class="illustration" width="100%" height="auto"> </a> </div> """
Regex:
ex = '<div class="thumb">.*?<img src="(.*?)" alt=.*?</div>'
Some online regex testers fail to match this pattern,
but with Python's re module the following code does match:
import re

t1 = """ <div class="thumb"><a href="/article/123954862" target="_blank"> <img src="//pic.qiushibaike.com/system/pictures/12395/123954862/medium/L62DIHT1AV2DKIUV.jpg" alt="糗事#123954862" class="illustration" width="100%" height="auto"> </a> </div> """
ex = '<div class="thumb">.*?<img src="(.*?)" alt=.*?</div>'
# re.S (re.DOTALL) makes "." match newline characters as well
img_src_list = re.findall(ex, t1, re.S)
Why can it be extracted this way? Is it because of the newline characters?
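
A minimal sketch of the likely cause, assuming the original HTML had line breaks between the tags (the `html` variable below is a hypothetical multi-line version of the snippet, not the exact text from the page). By default `.` does not match a newline, so `.*?` cannot span several lines; `re.S` (an alias for `re.DOTALL`) removes that restriction. The inline `(?s)` prefix is the same switch written inside the pattern itself.

```python
import re

# Assumption: a multi-line layout of the snippet, with a line break after each tag.
html = """
<div class="thumb">
<a href="/article/123954862" target="_blank">
<img src="//pic.qiushibaike.com/system/pictures/12395/123954862/medium/L62DIHT1AV2DKIUV.jpg" alt="糗事#123954862" class="illustration" width="100%" height="auto">
</a>
</div>
"""

ex = '<div class="thumb">.*?<img src="(.*?)" alt=.*?</div>'

# Default mode: "." stops at "\n", so the pattern cannot cross lines.
without_flag = re.findall(ex, html)

# DOTALL mode: "." also matches "\n", so the pattern can span the whole block.
with_flag = re.findall(ex, html, re.S)

# Same behaviour, requested with an inline flag at the start of the pattern.
with_inline_flag = re.findall('(?s)' + ex, html)

print(without_flag)      # []
print(with_flag)         # list containing the one image URL
print(with_inline_flag)  # same as with_flag
```

If this is what is happening, an online tester would probably need its equivalent option turned on (often labelled "s", "single line", or "dotall") to give the same result as the Python call with `re.S`.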