mess_mr 2023-02-13 20:37 Acceptance rate: 0%
Views: 44

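Python crawler fails partway through

I am scraping a novel. The script gets through only about 10 chapters and then raises an error.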


import requests
from bs4 import BeautifulSoup

# Get the chapter links and titles
def get_novel_chaters():
    root_url = "http://www.qixivur.com/news/ts48.html"
    r = requests.get(root_url)
    r.encoding="utf-8"
    soup = BeautifulSoup(r.text,"html.parser")

    data = []
    for dd in soup.find_all("dd"):
        link = dd.find("a")
        if not link:
            continue
        data.append(("http://www.qixivur.com%s"%link['href'],link.get_text()))
        # print(link)
    return data
# Get the content behind a chapter link
def get_chapter_content(url):
    r = requests.get(url)
    r.encoding='utf-8'
    soup = BeautifulSoup(r.text, "html.parser")
    return soup.find('div',id="TextContent").get_text()

novel_chapters = get_novel_chaters()
total_cnt = len(novel_chapters)
idx = 0

for chapter in get_novel_chaters():
    # print(chapter)
    idx+=1
    print(idx,total_cnt)
    url,title = chapter
    with open("%s.txt"%title,"w",encoding="utf-8") as fout:
        fout.write(get_chapter_content(url))
1 1102
2 1102
3 1102
4 1102
5 1102
6 1102
7 1102
8 1102
9 1102
10 1102
Traceback (most recent call last):
  File "D:\Professional_documents\pythonProject\web_crawler\venv\lib\site-packages\requests\models.py", line 434, in prepare_url
    scheme, auth, host, port, path, query, fragment = parse_url(url)
  File "D:\Professional_documents\pythonProject\web_crawler\venv\lib\site-packages\urllib3\util\url.py", line 397, in parse_url
    return six.raise_from(LocationParseError(source_url), None)
  File "<string>", line 3, in raise_from
urllib3.exceptions.LocationParseError: Failed to parse: http://www.qixivur.comjavascript:;

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:/Professional_documents/pythonProject/web_crawler/爬小说/main.py", line 36, in <module>
    fout.write(get_chapter_content(url))
  File "D:/Professional_documents/pythonProject/web_crawler/爬小说/main.py", line 21, in get_chapter_content
    r = requests.get(url)
  File "D:\Professional_documents\pythonProject\web_crawler\venv\lib\site-packages\requests\api.py", line 73, in get
    return request("get", url, params=params, **kwargs)
  File "D:\Professional_documents\pythonProject\web_crawler\venv\lib\site-packages\requests\api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "D:\Professional_documents\pythonProject\web_crawler\venv\lib\site-packages\requests\sessions.py", line 573, in request
    prep = self.prepare_request(req)
  File "D:\Professional_documents\pythonProject\web_crawler\venv\lib\site-packages\requests\sessions.py", line 496, in prepare_request
    hooks=merge_hooks(request.hooks, self.hooks),
  File "D:\Professional_documents\pythonProject\web_crawler\venv\lib\site-packages\requests\models.py", line 368, in prepare
    self.prepare_url(url, params)
  File "D:\Professional_documents\pythonProject\web_crawler\venv\lib\site-packages\requests\models.py", line 436, in prepare_url
    raise InvalidURL(*e.args)
requests.exceptions.InvalidURL: Failed to parse: http://www.qixivur.comjavascript:;

Process finished with exit code 1
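
The last line of the traceback names the URL that failed: `http://www.qixivur.comjavascript:;`. That is the site prefix glued onto an href of `javascript:;`, so one of the `<dd>` entries apparently holds a placeholder link rather than a chapter page. Below is a minimal sketch, using only the standard library, of how such hrefs could be detected before they reach `requests.get`. The helper name `is_chapter_href` and the sample chapter path are assumptions made for illustration; they do not come from the original script.

```python
from urllib.parse import urljoin, urlparse

BASE = "http://www.qixivur.com"


def is_chapter_href(href):
    """Return True only if href resolves to an http(s) URL on BASE's host.

    Hypothetical helper: the name and the same-host rule are assumptions.
    """
    resolved = urlparse(urljoin(BASE + "/", href))
    return (resolved.scheme in ("http", "https")
            and resolved.netloc == urlparse(BASE).netloc)


# Plain string formatting, as in the script, reproduces the malformed URL.
print("http://www.qixivur.com%s" % "javascript:;")

# The placeholder link is rejected; a root-relative path is accepted.
print(is_chapter_href("javascript:;"))
print(is_chapter_href("/news/ts48/1.html"))  # hypothetical chapter path
```

In `get_novel_chaters`, a check like this could sit next to the existing `if not link: continue` guard, so that non-chapter links are skipped instead of being appended to `data`.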
2 answers

  • cjh4312 2023-02-13 21:49
    
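    import requests
    from lxml import etree

    url = 'http://www.qixivur.com/news/ts48.html'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
    }
    # Pass headers by keyword: the second positional argument of requests.get is params.
    resp = requests.get(url, headers=headers)
    html = etree.HTML(resp.text)
    # Select only the <a> elements in the chapter list so each title stays paired with its href.
    links = html.xpath('//*[@id="list-chapterAll"]/dl/dd/a')
    for a in links:
        href = a.get('href', '')
        title = ''.join(a.itertext()).strip()
        # Skip placeholder links such as "javascript:;", which caused the InvalidURL error.
        if not href.startswith('/'):
            continue
        resp = requests.get(f"http://www.qixivur.com{href}", headers=headers)
        html = etree.HTML(resp.text)
        data = html.xpath('//*[@id="TextContent"]/p//text()')
        s = '\n   '.join(str(j) for j in data)
        # The with block closes the file, so no explicit close() is needed.
        with open(f'e:/novel/{title}.txt', 'w', encoding='utf-8') as file:
            file.write(s)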
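
Both scripts put the chapter title straight into a file name. The traceback paths show the asker is on Windows, where file names cannot contain characters such as `?`, `:` or `*`, so a chapter title containing one of them would make `open` fail in a similar "stops partway" fashion. A small sketch of a sanitizer, assuming nothing about the real titles; `safe_filename` and its `fallback` parameter are hypothetical names introduced here:

```python
import re

# Characters that Windows does not allow in file names, plus control characters.
_INVALID_CHARS = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def safe_filename(title, fallback="untitled"):
    """Replace characters that are invalid in Windows file names with '_'.

    Hypothetical helper: the name and the fallback value are assumptions.
    """
    cleaned = _INVALID_CHARS.sub("_", title).strip(" .")
    return cleaned or fallback


print(safe_filename("Chapter 1: Who?"))
print(safe_filename("   "))
```

The result could then be used where the title is interpolated into the path, for example `open(f"{safe_filename(title)}.txt", "w", encoding="utf-8")`.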
    


