A question from a recent experiment: when scraping a Baidu search results page for a keyword with the requests library, a request with the keyword built into the full URL succeeds, but passing the keyword via the params argument of get() returns Baidu's security-verification page and the scrape fails. What exactly causes this?
Relevant code (as text, not screenshots):
Version using the full URL, which scrapes successfully:
```python
# -*- coding: utf-8 -*-
import requests

if __name__ == "__main__":
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36'
    }
    keyword = input("enter a word:")
    url = 'https://www.baidu.com/s?' + 'wd=' + keyword
    response = requests.get(url=url, headers=headers)
    response.encoding = 'utf-8'
    page_text = response.text
    filename = 'python.html'
    with open(filename, 'w', encoding='utf-8') as fp:
        fp.write(page_text)
    print(filename, "saved successfully!!")
```
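As a side note, requests can show the exact URL it will build from a params dict without sending anything, via requests.Request(...).prepare(); this makes it easy to compare the two versions offline. The keyword 'python' below is just a placeholder:

```python
import requests

# Prepare (but do not send) a GET request to inspect the URL that
# requests assembles from the params dict.
req = requests.Request('GET', 'https://www.baidu.com/s',
                       params={'wd': 'python'}).prepare()
print(req.url)  # https://www.baidu.com/s?wd=python
```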
Version using the params argument, which fails:
```python
# -*- coding: utf-8 -*-
import requests

if __name__ == "__main__":
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36'
    }
    kw = input("enter a word:")
    param = {
        'param': kw
    }
    url = 'https://www.baidu.com/s?wd'
    response = requests.get(url=url, params=param, headers=headers)
    response.encoding = 'utf-8'
    page_text = response.text
    filename = 'python.html'
    with open(filename, 'w', encoding='utf-8') as fp:
        fp.write(page_text)
    print(filename, "saved successfully!!")
```
Failure result:
I have tried both with and without params, and the results differ. With Sogou, the request succeeds either way, with or without params. Adding the Accept header did not help either; the only difference is inside requests.get(), and I cannot pin down the cause.
I want to know what is actually going on here, and what logic Baidu's anti-scraping uses.
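One thing worth checking before blaming the anti-scraping logic alone: in the failing version the dict key is 'param' rather than 'wd', and the base URL already ends in '?wd', so the request that actually goes out differs from the one the working version sends. A sketch comparing the two constructed URLs (again without hitting Baidu), with a placeholder keyword:

```python
import requests

kw = 'python'  # placeholder keyword

# URL as built by the working version (keyword concatenated by hand)
working = requests.Request('GET', 'https://www.baidu.com/s?' + 'wd=' + kw).prepare()

# URL as built by the failing version ('param' key, base URL ending in '?wd')
failing = requests.Request('GET', 'https://www.baidu.com/s?wd',
                           params={'param': kw}).prepare()

print(working.url)  # https://www.baidu.com/s?wd=python
print(failing.url)  # https://www.baidu.com/s?wd&param=python
```

If the intent was to let requests build the query string, the equivalent of the working call would presumably be url = 'https://www.baidu.com/s' with param = {'wd': kw}; with the 'param' key, Baidu never receives a wd value at all, which by itself could explain being bounced to the verification page regardless of headers.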