Spider boy
2019-04-05 15:32  Why does my page source fail to parse after XPath processing in a Python crawler?
While scraping a novel site I can see the relevant values in the page's response, but extracting them fails.
The specific problem is as follows:
from lxml import etree
import requests

class Xiaoshuospider:
    def __init__(self):
        self.start_url = 'https://www.qiushuzw.com/t/38890/10253656.html'
        self.headers = {
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
            "Accept-Encoding": "gzip, deflate, br",
            "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
            "Cache-Control": "max-age=0",
            "Connection": "keep-alive",
            "Cookie": "BAIDU_SSP_lcr=https://www.80txt.com/txtml_38890.html; Hm_lvt_c0ce681e8e9cc7e226131131f59a202c=1554447305; Hm_lpvt_c0ce681e8e9cc7e226131131f59a202c=1554447305; UM_distinctid=169ec4788554ea-0eba8d0589d979-1a201708-15f900-169ec4788562c1; CNZZDATA1263995655=929605835-1554443240-https%253A%252F%252Fwww.80txt.com%252F%7C1554443240",
            "Host": "www.qiushuzw.com",
            "If-Modified-Since": "Thu, 31 Jan 2019 03:00:17 GMT",
            "If-None-Match": 'W/"5c5264c1 - 3f30"',
            "Referer": "https://www.80txt.com/txtml_38890.html",
            "Upgrade-Insecure-Requests": "1",
            "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36",
        }

    def parse(self):
        res = requests.get(self.start_url, headers=self.headers).content.decode()
        html = etree.HTML(res)
        content = html.xpath("div[@class='book_content']/text()")
        print(content)

    def run(self):
        self.parse()

if __name__ == '__main__':
    xiaoshuo = Xiaoshuospider()
    xiaoshuo.run()
- After applying this XPath rule I cannot find the corresponding novel text; the chapter content cannot be extracted with XPath at all.
Has anyone else run into this problem?
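A likely cause can be reproduced in isolation (this is a diagnostic sketch with a made-up HTML snippet, not the actual page): a relative XPath such as `div[@class='book_content']` only matches direct children of the context node, which for the element returned by `etree.HTML` is the document root, so it yields an empty list. Prefixing `//` searches the whole tree:

```python
from lxml import etree

# Hypothetical minimal page standing in for the real chapter HTML.
doc = etree.HTML(
    "<html><body><div class='book_content'>chapter text</div></body></html>"
)

# Relative path: looks only for <div> directly under the root <html>,
# whose children are <head>/<body>, so nothing matches.
print(doc.xpath("div[@class='book_content']/text()"))    # []

# Absolute descendant search: matches anywhere in the tree.
print(doc.xpath("//div[@class='book_content']/text()"))  # ['chapter text']
```

A second thing worth checking: the copied `If-Modified-Since`/`If-None-Match` headers turn the request into a conditional one, so the server may reply `304 Not Modified` with an empty body; dropping those two headers rules that out.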
0 answers