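For a course assignment I need to scrape data from this URL (http://guba.hzzkzx.com/list,002603,f_1.html). The site seems to have an anti-scraping mechanism: instead of the original page content, the response is a small HTML page containing JavaScript that references the same URL. How does this site's anti-scraping mechanism work, and how can I get around it?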
Program source code:
import requests
from bs4 import BeautifulSoup
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Connection': 'keep-alive',
    'Cookie': '__guid=84635791.2115898957883613700.1616987444460.9778; monitor_count=1',
    'DNT': '1',
    'Host': 'guba.hzzkzx.com',
    'Referer': 'http://guba.hzzkzx.com/list,002603,f_1.html',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
}
def get_data(url):
    # Fetch the page and print the parsed HTML
    html = requests.get(url, headers=headers)
    soup = BeautifulSoup(html.text, 'lxml')
    print(soup.prettify())

if __name__ == '__main__':
    url = "http://guba.hzzkzx.com/list,002603,f_1.html"
    get_data(url)
Output:
<html>
<head>
<script type="text/javascript">
function f(){window.location.href="http://guba.hzzkzx.com/list,002603,f_1.html";}
</script>
</head>
<body onload="f()">
<img src="http://tieba.baidu.com/_PXCK_77735440797141500_1558696096.gif" style="display:none"/>
</body>
</html>
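
One plausible reading of this output, offered as a guess from the returned HTML rather than confirmed behavior of the site: the hidden image request sets a cookie, and the onload handler then reloads the same URL, which serves the real page once that cookie is present. Under that assumption, a minimal sketch is to fetch the page, request the hidden image with the same cookie-keeping session, and then retry. The names `parse_challenge`, `fetch_page`, and `_ChallengeParser` are hypothetical helpers made up for this illustration, and `session` is assumed to be any object whose `get(url)` returns something with a `.text` attribute, such as a `requests.Session`.

```python
import re
from html.parser import HTMLParser


class _ChallengeParser(HTMLParser):
    """Collects the src of every <img> and the text of every <script>."""

    def __init__(self):
        super().__init__()
        self.img_srcs = []
        self.scripts = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.img_srcs.append(src)
        elif tag == "script":
            self._in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        if self._in_script:
            self.scripts.append(data)


def parse_challenge(html):
    """Return (image_urls, redirect_url) if html looks like the redirect page, else None.

    Heuristic (an assumption): a page whose script assigns window.location.href
    is treated as the interstitial page rather than real content.
    """
    parser = _ChallengeParser()
    parser.feed(html)
    parser.close()
    match = re.search(r'window\.location\.href\s*=\s*"([^"]+)"', "".join(parser.scripts))
    if match is None:
        return None
    return parser.img_srcs, match.group(1)


def fetch_page(session, url, max_rounds=3):
    """Fetch url, working through the redirect page up to max_rounds times."""
    for _ in range(max_rounds):
        html = session.get(url).text
        challenge = parse_challenge(html)
        if challenge is None:
            return html  # no redirect script found, assume this is the real page
        image_urls, url = challenge
        for image_url in image_urls:
            # Assumed to be the request that makes the server set its cookie.
            session.get(image_url)
    raise RuntimeError("still receiving the redirect page after %d rounds" % max_rounds)
```

With requests, the intended call would be `fetch_page(requests.Session(), url)`, setting only a `User-Agent` via `session.headers.update(...)`. The hard-coded `Host` and `Cookie` headers from the original code are best left out here: a fixed `Host` would also be sent with the image request to the other domain, and a fixed `Cookie` header can get in the way of the cookies the session collects on its own. If the site checks something other than a cookie, this sketch will not help, and a browser-driving tool may be needed instead.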