qq_56038802 2022-04-04 17:42 · Acceptance rate: 25%
Views: 26

Async crawler exception handling: try-based timeout check goes wrong

When scraping proxy IPs from a proxy-listing site, I use try/except to check whether each IP times out. But after putting await in front of session.get(), the check doesn't seem to happen at all: with the timeout set to 3 s, execution apparently jumps straight into except, and every IP gets printed as unusable. Yet when I test the same IPs by hand they work, and the script finishes very quickly, so it clearly never waits to check for a timeout.

Code below:

import asyncio
import json
from bs4 import BeautifulSoup
import aiohttp
import aiofiles
import random

async def get_ip(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as f:
            a = await f.text()
            bsl = BeautifulSoup(a,'html.parser')
            bss = bsl.find('table',width="100%").select('tr')[1:]
            for list in bss:
                ip = list.select('tr td')[0].text
                port = list.select('tr td')[1].text
                proxies={
                    f'https':f'https://{ip}:{port}'
                }
                asyncio.gather(verify(proxies))


async def verify(proxies):
    async with aiohttp.ClientSession() as session:
        try:
            f = session.get('https://www.baidu.com',proxies=random.choice(proxies),async_timeout = 3)
            print('Usable proxy: {}'.format(proxies))
            await write_json(proxies)
        except:
            print('Unusable: {}'.format(proxies))



async def write_json(proxies):
    async with aiofiles.open('ip处理池.json','a') as f:
        await json.dump(proxies,f)


async def rea_json():
    async with aiofiles.open('ip处理池.json','r')as f:
        for i in f.readlines():
            content = json.loads(i.strip())
            print(content)


async def main():
    tasks = []
    for i in range(100):
        url = f'http://www.66ip.cn/{i}.html'
        tasks.append(asyncio.create_task(get_ip(url)))
    await asyncio.wait(tasks)



if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())




1 answer

  • ~白+黑 (Python rising-star creator) 2022-04-04 21:05
        f = session.get('https://www.baidu.com',proxies=random.choice(proxies),async_timeout = 3)  # random.choice throws an error right here
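    To see why, in isolation (a minimal sketch; the dict mirrors the proxies built in get_ip, and the address is a placeholder):

    import random

    proxies = {'https': 'https://1.2.3.4:8080'}  # placeholder, same shape as the dict from get_ip

    # random.choice expects a sequence and indexes it with a random integer;
    # on a dict that becomes proxies[0], which raises KeyError: 0 before
    # session.get() is even evaluated -- so control drops straight into the
    # bare except branch and every proxy is reported as unusable.
    random.choice(proxies)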
    
    
    async def get_ip(url):
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as f:
                a = await f.text()
                bsl = BeautifulSoup(a,'html.parser')
                bss = bsl.find('table',width="100%").select('tr')[1:]
                for list in bss:
                    ip = list.select('tr td')[0].text
                    port = list.select('tr td')[1].text
                    proxies={
                        f'https':f'https://{ip}:{port}'
                    }
                    asyncio.gather(verify(proxies))  # no need for concurrency here: each call is a single coroutine, plain await verify(proxies) would do

    async def verify(proxies):
        async with aiohttp.ClientSession() as session:
            try:
                # random.choice(proxies): the dict only ever holds one entry, so there is
                # nothing to pick at random; the call errors out at this point, and dict
                # is not a type random.choice supports in the first place.
                f = session.get('https://www.baidu.com',proxies=random.choice(proxies),async_timeout = 3)
                print('Usable proxy: {}'.format(proxies))
                await write_json(proxies)
            except:  # better to print the concrete error type for debugging: except Exception as e
                print('Unusable: {}'.format(proxies))
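
    Putting the fixes together, a minimal corrected sketch of verify and write_json. These are assumptions about the intent, not the original code: aiohttp takes a single proxy= URL string rather than a requests-style proxies= dict; it has no async_timeout argument (timeouts go through aiohttp.ClientTimeout); the request must actually be awaited for the 3 s limit to apply; aiohttp only accepts http:// proxy URLs, so the scheme is rewritten here; and aiofiles needs await f.write(...) instead of json.dump.

    import json

    import aiofiles
    import aiohttp


    async def verify(proxies):
        # aiohttp wants one proxy URL string, and only the http:// scheme is accepted
        proxy_url = next(iter(proxies.values())).replace('https://', 'http://', 1)
        timeout = aiohttp.ClientTimeout(total=3)  # the actual timeout API; async_timeout= does not exist
        async with aiohttp.ClientSession(timeout=timeout) as session:
            try:
                # async with actually awaits the request, so a dead proxy now fails
                # with asyncio.TimeoutError after ~3 s instead of instantly in random.choice
                async with session.get('https://www.baidu.com', proxy=proxy_url) as resp:
                    if resp.status == 200:
                        print('Usable proxy: {}'.format(proxies))
                        await write_json(proxies)
            except Exception as e:  # surface the concrete error while debugging
                print('Unusable: {} ({!r})'.format(proxies, e))


    async def write_json(proxies):
        # json.dump(obj, f) calls f.write() synchronously, which does not work with an
        # aiofiles handle; serialize first, then await the asynchronous write
        async with aiofiles.open('ip处理池.json', 'a') as f:
            await f.write(json.dumps(proxies) + '\n')

    In get_ip, plain await verify(proxies) then replaces the un-awaited asyncio.gather(verify(proxies)) call, as noted in the comment above.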
    


