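While using httpx with asyncio for asynchronous programming, I ran into an error of unknown cause that prevents the source of the target page from being fetched. Details follow:
The code in question downloads images from wallhaven.cc. The image IDs are stored correctly in file.txt in the same directory. The program reads this file, builds the URL of each image's page from its ID, fetches the source of that page, finds the image element in it, and then downloads and saves the image.
The URLs of the image files themselves follow no regular pattern, but the URLs of the pages that contain them do, so I have to fetch the page first and then extract the image's address from it.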
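To make the two steps above concrete, here is a minimal offline sketch of building the page URL from an ID and collecting the `src` of every `<img>` tag. It uses the standard library's `html.parser` instead of the lxml XPath in my script so that it runs without extra packages. The helper names (`build_page_url`, `ImgSrcCollector`, `find_image_sources`) and the sample page are made up for illustration and are not the real page source.

```python
from html.parser import HTMLParser

# Same URL pattern as in my script: the page address is built from the image ID.
PAGE_URL_TEMPLATE = "https://wallhaven.cc/w/{image_id}"


def build_page_url(image_id):
    """Turn an image ID read from file.txt into the address of its page."""
    return PAGE_URL_TEMPLATE.format(image_id=image_id.strip())


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def find_image_sources(page_html):
    """Return the list of <img> src values found in the given HTML text."""
    collector = ImgSrcCollector()
    collector.feed(page_html)
    return collector.sources


# Made-up page source, only for illustration; the real page looks different.
SAMPLE_PAGE = """
<html><body>
  <img src="https://example.invalid/logo.png">
  <img src="https://example.invalid/avatar.png">
  <img id="wallpaper" src="https://example.invalid/full/sample.jpg">
</body></html>
"""

page_url = build_page_url("k9v3om")
sources = find_image_sources(SAMPLE_PAGE)
print(page_url)
print(sources[2])  # my script assumes the third <img> is the wallpaper
```

Taking the third `<img>` is only what my script assumes; it would break if the page layout changed.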
The test data is as follows:
2e31px
Errortest
k9v3om
3k62g3
8x967o
2k9lqy
j813pm
rrjvyq
7prdye
5gr1w5
qzlwk5
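For the ID 2e31px, a 404 status is the expected result; for some reason this image is probably no longer available on the site.
For Errortest, a test value that is treated as an image ID, an index out of range error is expected, because the URL generated from it leads to the site's error page, which contains no image file.
For all other IDs, the program should be able to reach a page containing an image through the generated URL, then find the image and download it into the same directory as the script.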
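The index out of range expectation for Errortest can be illustrated without any network access. The sketch below assumes that the error page yields fewer than three `<img>` addresses, so indexing position 2 fails. The list content and the helper `pick_image_url` are hypothetical and only show the mechanism.

```python
def pick_image_url(image_urls, position=2):
    """Return the entry at `position`, or None when the list is too short."""
    if position < len(image_urls):
        return image_urls[position]
    return None


# Hypothetical result for an error page: it has fewer than three <img> tags.
error_page_sources = ["https://example.invalid/logo.png"]

try:
    error_page_sources[2]  # the same kind of access as image_urls[2] in my script
except IndexError as exc:
    raw_error = str(exc)

safe_result = pick_image_url(error_page_sources)
print(raw_error)
print(safe_result)
```

A length check like the one in `pick_image_url` would let the script skip such pages instead of relying on the exception.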
However, the following code fails at runtime with an error whose cause I cannot determine:
import os
import random
import httpx
import asyncio
from lxml import html
from asyncio import Semaphore

semaphore = Semaphore(7)


async def record_error(url, error, error_file_path):
    with open(error_file_path, 'a') as error_file:
        error_file.write(f"{url}\n")
        error_file.write(f"{error}\n\n")


async def get_page(url, session):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:109.0) Gecko/20100101 Firefox/115.0',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
        'Connection': 'keep-alive'
    }
    try:
        async with await session.get(url, headers=headers, timeout=60) as res:
            print("DBTAG 5")
            # Removed the parentheses; changed to res.text
            return await res.text
    except Exception as e:
        print(f"Error occurred while getting the page: {e}")
        await record_error(url, str(e), error_file_path)
        return None


async def download(url, session, error_file_path):
    try:
        print(f"Downloading image: {url}...")
        print("Setting headers...")
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:109.0) Gecko/20100101 Firefox/115.0',
            'Host': 'w.wallhaven.cc',
            'Accept': 'image/avif,image/webp,*/*',
            'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
            'Referer': 'https://wallhaven.cc/'
        }
        print("session getting...")
        res = await session.get(url, headers=headers, timeout=60)
        assert res.status_code == 200
        res.raise_for_status()
        print(f"Succeeded in downloading {url}")
        return res.text
    except Exception as e:
        fail = os.path.basename(url)
        print(f"Download failed: {e}, {fail}")
        await record_error(url, str(e), error_file_path)
        return None


async def Cycle(session, line, error_file_path):
    try:
        url = f'https://wallhaven.cc/w/{line}'
        print("DBTAG 2")
        html_ = await get_page(url, session)
        print("DBTAG 3")
        print(url)
        print(html_)
        image_urls = html.fromstring(html_).xpath('//img/@src')
        image_elem = await download(image_urls[2], session, error_file_path)
        print("DBTAG 4")
        if image_elem:
            filename = os.path.basename(image_urls[2])
            with open(filename, 'wb') as f:
                f.write(image_elem.content)
    except Exception as e:
        print(f"Error occurred: {e}")
        fail = os.path.basename(url)
        print(f"Download failed: {fail}")
        await record_error(url, str(e), error_file_path)


if __name__ == "__main__":
    current_directory = os.getcwd()
    print("当前工作目录:", current_directory)
    file_path = os.path.join(current_directory, 'file.txt')
    with open('file.txt', 'r', buffering=20971520) as file:
        print("reading the file...")
        lines = file.read().splitlines()
    print("Initializing error saver.")
    error_file_path = os.path.join(current_directory, 'Error_Path.txt')
    with open(error_file_path, 'w') as error_file:
        error_file.write("Error Messages:\n")

    async def main():
        async with httpx.AsyncClient() as session:
            print("DBTAG 1")
            tasks = [Cycle(session, line, error_file_path) for line in lines]
            await asyncio.gather(*tasks)

    asyncio.run(main())
    print('Completed.')
The output of a run is as follows:
当前工作目录: I:\...(rest of the path omitted here)
reading the file...
Initializing error saver.
DBTAG 1
DBTAG 2
DBTAG 2
DBTAG 2
DBTAG 2
DBTAG 2
DBTAG 2
DBTAG 2
DBTAG 2
DBTAG 2
DBTAG 2
DBTAG 2
Error occurred while getting the page: __aexit__
DBTAG 3
https://wallhaven.cc/w/5gr1w5
None
Error occurred: expected string or bytes-like object
Download failed: 5gr1w5
Error occurred while getting the page: __aexit__
DBTAG 3
https://wallhaven.cc/w/rrjvyq
None
Error occurred: expected string or bytes-like object
Download failed: rrjvyq
Error occurred while getting the page: __aexit__
DBTAG 3
https://wallhaven.cc/w/Errortest
None
Error occurred: expected string or bytes-like object
Download failed: Errortest
Error occurred while getting the page: __aexit__
DBTAG 3
https://wallhaven.cc/w/qzlwk5
None
Error occurred: expected string or bytes-like object
Download failed: qzlwk5
Error occurred while getting the page: __aexit__
DBTAG 3
https://wallhaven.cc/w/2e31px
None
Error occurred: expected string or bytes-like object
Download failed: 2e31px
Error occurred while getting the page: __aexit__
DBTAG 3
https://wallhaven.cc/w/j813pm
None
Error occurred: expected string or bytes-like object
Download failed: j813pm
Error occurred while getting the page: __aexit__
DBTAG 3
https://wallhaven.cc/w/k9v3om
None
Error occurred: expected string or bytes-like object
Download failed: k9v3om
Error occurred while getting the page: __aexit__
DBTAG 3
https://wallhaven.cc/w/3k62g3
None
Error occurred: expected string or bytes-like object
Download failed: 3k62g3
Error occurred while getting the page: __aexit__
DBTAG 3
https://wallhaven.cc/w/8x967o
None
Error occurred: expected string or bytes-like object
Download failed: 8x967o
Error occurred while getting the page: __aexit__
DBTAG 3
https://wallhaven.cc/w/7prdye
None
Error occurred: expected string or bytes-like object
Download failed: 7prdye
Error occurred while getting the page: __aexit__
DBTAG 3
https://wallhaven.cc/w/2k9lqy
None
Error occurred: expected string or bytes-like object
Download failed: 2k9lqy
Completed.
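What is causing this error, and how can I fix it?
(I only have a middle-school education and am a beginner, so please explain in simple terms.)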
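One unconfirmed guess, in case it helps whoever answers: the `__aexit__` message might come from using `async with` on an object that is not an asynchronous context manager. The sketch below reproduces that kind of failure without httpx or any network access. `PlainResponse` is a hypothetical stand-in class, not the real httpx response, and the exact exception type and wording depend on the Python version.

```python
import asyncio


class PlainResponse:
    """Hypothetical stand-in: an object with no __aenter__/__aexit__ methods."""
    text = "<html></html>"


async def fetch_with_async_with():
    """Mimic `async with await session.get(...) as res:` on a plain object."""
    try:
        async with PlainResponse() as res:
            return res.text
    except (AttributeError, TypeError) as exc:
        # Older Python versions raise AttributeError naming a missing method;
        # newer ones raise TypeError about the missing protocol.
        return exc


async def fetch_plain():
    """Use the object directly, without `async with`."""
    res = PlainResponse()
    return res.text


error = asyncio.run(fetch_with_async_with())
page = asyncio.run(fetch_plain())
print(type(error).__name__, error)
print(page)
```

If this guess applies to my script, then using the result of `await session.get(...)` directly, as `fetch_plain` does with the stand-in, may be the direction to look at, but I have not verified that against httpx.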