Symptoms and background: scraping a web page from PyCharm fails with an error.
Relevant code (please do not paste screenshots)
import requests
import requests.cookies
import json
import time
import pandas as pd
cookie_jar=requests.cookies.RequestsCookieJar()
with open("cookies.txt") as fin:  # cookies exported from the browser as a JSON list
    cookiejson=json.loads(fin.read())
for cookie in cookiejson:
    cookie_jar.set(
        name=cookie["name"],
        value=cookie["value"],
        domain=cookie["domain"],
        path=cookie["path"]
    )
htmls=[]
url="https://dict.youdao.com/webwordbook/wordlist?p={idx}&tags="
for idx in range(2):
    time.sleep(1)  # pause between requests
    print("**爬取数据:第%d页"%idx)  # "Scraping data: page %d"
    r=requests.get(url.format(idx=idx),cookies=cookie_jar)
    htmls.append(r.text)
df_list=[]
for html in htmls:
    df=pd.read_html(html)  # returns a list of DataFrames, one per <table>
    df_cont=df[1]
    df_cont.columns=df[0].columns  # was misspelled as .colums
    df_list.append(df_cont)
Run output and error message
C:\Users\Administrator\Desktop\test\Scripts\python.exe C:/Users/Administrator/Desktop/test/yuyue.py
**爬取数据:第0页
**爬取数据:第1页
Traceback (most recent call last):
  File "C:\Users\Administrator\Desktop\test\yuyue.py", line 25, in <module>
    df=pd.read_html(html)
  File "C:\Users\Administrator\Desktop\test\lib\site-packages\pandas\util\_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "C:\Users\Administrator\Desktop\test\lib\site-packages\pandas\io\html.py", line 1098, in read_html
    return _parse(
  File "C:\Users\Administrator\Desktop\test\lib\site-packages\pandas\io\html.py", line 902, in _parse
    parser = _parser_dispatch(flav)
  File "C:\Users\Administrator\Desktop\test\lib\site-packages\pandas\io\html.py", line 851, in _parser_dispatch
    raise ImportError("html5lib not found, please install it")
ImportError: html5lib not found, please install it
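Process finished with exit code 1
My reasoning and what I have tried: the problem appears to be on line 25, but I cannot tell what exactly is wrong.
What I want to achieve: scrape the page successfully.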
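A note on the line 25 hypothesis: the traceback ends in an ImportError raised inside pandas, which suggests that the statement on line 25 is fine in itself and that html5lib cannot be imported by the interpreter that runs yuyue.py. pd.read_html depends on optional parser packages (lxml, or bs4 together with html5lib). The sketch below uses only the standard library to report which of these are missing from the current interpreter. The helper name find_missing_parsers and the PARSER_PACKAGES mapping are hypothetical and not part of pandas.

```python
import importlib.util

# Import name -> pip package name for the optional parsers used by pandas.read_html.
# This mapping is an illustrative assumption, not something pandas exposes.
PARSER_PACKAGES = {
    "lxml": "lxml",
    "bs4": "beautifulsoup4",
    "html5lib": "html5lib",
}


def find_missing_parsers(modules=tuple(PARSER_PACKAGES)):
    """Return the module names that cannot be found by this interpreter."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing_parsers()
    if missing:
        # Print the pip package names so they can be installed into the same environment.
        packages = " ".join(PARSER_PACKAGES[name] for name in missing)
        print("Missing parser modules:", ", ".join(missing))
        print("Possible fix: pip install", packages)
    else:
        print("lxml, bs4 and html5lib are all importable.")
```

If html5lib or bs4 is reported missing, installing them into the same virtual environment that PyCharm uses would likely clear this particular ImportError. The script has to be run with that same interpreter for the result to be meaningful.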
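One more precondition for a successful scrape: pd.read_html raises ValueError when the HTML contains no table element, and the code above indexes df[0] and df[1], so it expects at least two tables per page. Whether the Youdao word list page actually returns tables for the given cookies is not known from this post. A minimal standard-library sketch for counting tables in each response before parsing it follows; TableCounter and count_tables are hypothetical helper names.

```python
from html.parser import HTMLParser


class TableCounter(HTMLParser):
    """Count <table> start tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lower case.
        if tag == "table":
            self.count += 1


def count_tables(html):
    """Return how many <table> elements start in the given HTML text."""
    counter = TableCounter()
    counter.feed(html)
    counter.close()
    return counter.count


if __name__ == "__main__":
    sample = "<html><body><table><tr><td>word</td></tr></table></body></html>"
    print(count_tables(sample))
```

In the scraping loop this could be called as count_tables(r.text). A result below 2 would hint that the response is something other than the expected word list, for example a login page served because the cookies were not accepted.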