I'm writing a web scraper.
At first it fetched the data fine, with the following code:
# Fetch the thread-list information (imports shown for completeness)
import re
import requests
from bs4 import BeautifulSoup

def getData(baseurl):
    datalist = []
    # last page is 2985; only fetching the first page here
    for i in range(0, 1):
        # URL of one thread-list page
        url = baseurl + str(i * 50)
        soup = askURL(url)
        print(soup)
        # parse the entries one by one
        for item in soup.find_all('li', class_="j_thread_list clearfix thread_item_box"):
            # data for a single thread
            data = []
            item = str(item)
            # findReplyNum / findTitle / findLink are regex patterns defined elsewhere
            replyNum = re.findall(findReplyNum, item)[0]
            data.append(replyNum)  # reply count
            title = re.findall(findTitle, item)[0]
            data.append(title)  # thread title
            link = re.findall(findLink, item)[0]
            link = "https://tieba.baidu.com/" + link  # build the full link
            data.append(link)  # link
            datalist.append(data)
    return datalist
# Fetch the page content for a given URL
def askURL(url):
    html = requests.get(url, verify=False)
    soup = BeautifulSoup(html.content, 'html.parser')
    return soup
But then my IP apparently got blocked: after running it, the soup I get back is a "network error" page.
<!DOCTYPE html>
<html lang="zh-CN">
<head>
<meta charset="utf-8"/>
<title>百度安全验证</title>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="yes" name="apple-mobile-web-app-capable"/>
<meta content="black" name="apple-mobile-web-app-status-bar-style"/>
<meta content="width=device-width, user-scalable=no, initial-scale=1.0, minimum-scale=1.0, maximum-scale=1.0" name="viewport"/>
<meta content="telephone=no, email=no" name="format-detection"/>
<link href="https://www.baidu.com/favicon.ico" rel="shortcut icon" type="image/x-icon"/>
<link href="https://www.baidu.com/img/baidu.svg" mask="" rel="icon" sizes="any"/>
<meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
<meta content="upgrade-insecure-requests" http-equiv="Content-Security-Policy"/>
<link href="https://ppui-static-wap.cdn.bcebos.com/static/touch/css/api/mkdjump_aac6df1.css" rel="stylesheet">
</link></head>
<body>
<div class="timeout hide">
<div class="timeout-img"></div>
<div class="timeout-title">网络不给力,请稍后重试</div>
<button class="timeout-button" type="button">返回首页</button>
</div>
<div class="timeout-feedback hide">
<div class="timeout-feedback-icon"></div>
<p class="timeout-feedback-title">问题反馈</p>
</div>
<script src="https://wappass.baidu.com/static/machine/js/api/mkd.js"></script>
<script src="https://ppui-static-wap.cdn.bcebos.com/static/touch/js/mkdjump_db105ab.js"></script>
</body>
</html>
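When this happens it is easy to keep parsing the error page without noticing. A small guard (my own addition, not part of the original code; it just matches the `<title>` of the verification page above) could make the failure explicit:

```python
from bs4 import BeautifulSoup

def is_blocked(soup):
    """True if the response is Baidu's security-verification page."""
    title = soup.find('title')
    return title is not None and '百度安全验证' in title.get_text()
```

The result of askURL could be passed through this check, and the scraper could sleep and retry instead of silently returning the error page.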
So I faked a request header:
head = {
    "User-Agent": ...,
    "Cookie": ...,
}
html = requests.get(url, verify=False, headers=head)
Now soup does contain the full page content, but soup.find_all returns an empty list.
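One guess (not confirmed): with a desktop Cookie/User-Agent, Tieba sometimes ships the thread list inside HTML comments (`<!-- ... -->`), and find_all does not look inside comments, so it returns an empty list even though the markup is visibly there in the raw HTML. A sketch to test that theory by re-parsing the comment nodes:

```python
from bs4 import BeautifulSoup, Comment

def find_with_comments(soup, name, cls):
    """Like find_all, but also searches markup hidden inside HTML comments."""
    items = soup.find_all(name, class_=cls)
    for c in soup.find_all(string=lambda s: isinstance(s, Comment)):
        # re-parse each comment's text and search it as well
        items += BeautifulSoup(c, 'html.parser').find_all(name, class_=cls)
    return items
```

If this returns results while the plain find_all is empty, the thread list really is comment-wrapped, and the parsing loop can be pointed at these items instead.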
Does anyone know how to solve this? And why does soup.find_all work normally when I don't add headers?