Python web scraping: confused about how .*? matches
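The page source looks like this
When I filter with r'<span class="title">(?P<name>.*?)</span>', re.S
I get
"肖申克的救赎"
" / The Shawshank Redemption"
these two strings.
But if I put '<div class="item">.*?' in front of the pattern,
the second string, " / The Shawshank Redemption", gets filtered out.
As I understand it, the .*? in <content 1>.*?<content 2> matches forward, from front to back, so why does a filter added at the front remove content that comes later?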
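To make the front-to-back behaviour of .*? concrete, here is a minimal sketch on a made-up string. The sample text and the `<a>` tags are assumptions for illustration only and are not taken from the page. It compares the lazy `.*?` with the greedy `.*` using the same delimiters.

```python
import re

# Hypothetical sample string, made up for illustration only.
sample = "<a>one</a><a>two</a>"

# Lazy: .*? grows one character at a time and stops at the first </a> it reaches.
lazy = re.findall(r"<a>(.*?)</a>", sample, re.S)

# Greedy: .* takes as much as it can, then backtracks to the last </a>.
greedy = re.findall(r"<a>(.*)</a>", sample, re.S)

print(lazy)
print(greedy)
```

In both cases the engine scans left to right; the only difference is how far the middle part is allowed to stretch before the closing delimiter must match.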
With div class="item" added in front of span class="title", filtering and printing the captured content:
import requests
import re
url = "https://movie.douban.com/top250"
# Browser-like User-Agent header sent with the request
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36 Edg/105.0.1343.42"
}
resp = requests.get(url, headers=headers)
web_code = resp.text
# re.S lets . match newlines, so .*? can span several lines of HTML
obj = re.compile(r'<div class="item">(?P<grb>.*?)<span class="title">(?P<name>.*?)</span>', re.S)
result = obj.finditer(web_code)
for i in result:
    print(i.group("grb"))
    print(i.group("name"))
<div class="pic">
<em class="">1</em>
<a href="https://movie.douban.com/subject/1292052/">
<img width="100" alt="肖申克的救赎" src="https://img2.doubanio.com/view/photo/s_ratio_poster/public/p480747492.jpg" class="">
</a>
</div>
<div class="info">
<div class="hd">
<a href="https://movie.douban.com/subject/1292052/" class="">
肖申克的救赎
# As you can see, the content captured by <div class="item">(?P<grb>.*?) does not include the string " / The Shawshank Redemption"
But if div class="item" is not added in front:
obj = re.compile(r'<span class="title">(?P<name>.*?)</span>', re.S)
result = obj.finditer(web_code)
for i in result:
    print(i.group("name"))
肖申克的救赎
/ The Shawshank Redemption
# This time the output also contains the extra string " / The Shawshank Redemption"
My question: in <div class="item">(?P<grb>.*?)<span class="title">(?P<name>.*?)</span>,
how does the <div class="item">(?P<grb>.*?) part manage to filter out the string
" / The Shawshank Redemption" that comes right after "肖申克的救赎"?
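For anyone who wants to reproduce this without requesting the site, here is a minimal offline sketch. The HTML in `sample_html` is a simplified, hand-written approximation of the page structure (an assumption, not a copy of the real source), and the second entry, "Movie B", is a made-up placeholder added only so that a second <div class="item"> exists. The sketch runs both patterns and prints where each match of the longer pattern starts.

```python
import re

# Simplified, hand-written stand-in for the page source (an assumption,
# not the real markup): two items, each containing two title spans.
sample_html = '''
<div class="item">
  <span class="title">肖申克的救赎</span>
  <span class="title"> / The Shawshank Redemption</span>
</div>
<div class="item">
  <span class="title">Movie B</span>
  <span class="title"> / Movie B (alt title)</span>
</div>
'''

span_only = re.compile(r'<span class="title">(?P<name>.*?)</span>', re.S)
with_item = re.compile(
    r'<div class="item">(?P<grb>.*?)<span class="title">(?P<name>.*?)</span>',
    re.S,
)

names_span_only = [m.group("name") for m in span_only.finditer(sample_html)]
names_with_item = [m.group("name") for m in with_item.finditer(sample_html)]

print(names_span_only)  # every title span, two per item
print(names_with_item)  # only the first title span of each item

# Record where each match of the longer pattern begins and what text is there.
starts = [m.start() for m in with_item.finditer(sample_html)]
for pos in starts:
    print(pos, repr(sample_html[pos:pos + 18]))
```

One possible reading of this output: finditer yields non-overlapping matches scanned left to right, and every match of the longer pattern has to begin at a <div class="item">. After the first title span is consumed, the search resumes after its </span>, and the next place a match can start is the next item's <div class="item">, so the second title span of the same item is never the start of a match.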