CZH1479196082
2022-06-18 14:23

Text scraper fails to scrape any text

I am scraping the text of a novel. The output files are created successfully, but they are empty.
The code is as follows:

import requests
from bs4 import BeautifulSoup

def geturl():
    # Fetch the table of contents and collect [chapter_url, chapter_name] pairs.
    url = "http://www.wuxia.net.cn/book/baidicheng.html"
    header = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:101.0) Gecko/20100101 Firefox/101.0"}
    req = requests.get(url=url, headers=header)
    req.encoding = "utf-8"
    html = req.text
    bes = BeautifulSoup(html, "lxml")
    texts = bes.find("div", id="main")
    chapters = texts.find_all("a")
    print(chapters)
    words = []
    for chapter in chapters:
        # Keep only the links that sit directly inside a <dd> element.
        if chapter.parent.name == "dd":
            name = chapter.string
            url1 = "http://www.wuxia.net.cn" + chapter.get("href")
            word = [url1, name]
            words.append(word)
    return words

if __name__ == '__main__':
    target = geturl()
    header = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:101.0) Gecko/20100101 Firefox/101.0"}
    for tar in target:
        # Download one chapter page and split its text into paragraphs.
        req = requests.get(url=tar[0], headers=header)
        req.encoding = 'utf-8'
        html = req.text
        bes = BeautifulSoup(html, 'lxml')
        texts = bes.find("div", id="container")
        texts_list = texts.text.split("\xa0" * 4)
        print(type(texts_list))
        with open("D:/储存库/代码空间/代码运行/" + tar[1] + ".txt", "w") as file:
            for line in texts_list:
                print(line + "\n")

How should I modify the code so that the novel text is saved to the corresponding files?
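
One likely cause, judging only from the code in the question: the innermost loop calls print(line + "\n"), which sends each paragraph to the console, so nothing is ever written to the file object opened by the with statement. The sketch below shows just the write step in isolation. The names save_chapter and out_dir are hypothetical helpers introduced for illustration, and passing encoding="utf-8" to open() is an assumption made so that Chinese text is written consistently regardless of the Windows default encoding.

```python
from pathlib import Path


def save_chapter(text, name, out_dir):
    """Split chapter text into paragraphs and write them to <out_dir>/<name>.txt.

    save_chapter and out_dir are illustrative names, not part of the original code.
    """
    # Paragraphs are separated by four non-breaking spaces, as in the question.
    paragraphs = [p.strip() for p in text.split("\xa0" * 4) if p.strip()]

    out_path = Path(out_dir) / f"{name}.txt"
    out_path.parent.mkdir(parents=True, exist_ok=True)

    # Write to the file object instead of printing to the console.
    with open(out_path, "w", encoding="utf-8") as file:
        for paragraph in paragraphs:
            file.write(paragraph + "\n")
    return out_path
```

In the original loop this would correspond to replacing print(line + "\n") with file.write(line + "\n"), or to calling save_chapter(texts.text, tar[1], "D:/储存库/代码空间/代码运行/") for each chapter. It may also be worth checking that bes.find(...) did not return None before reading .text, since the page layout is not confirmed here.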
