Laqide 2023-02-28 18:58

Scrapy raises a character-encoding error after enabling a proxy

After configuring a proxy in Scrapy, the crawl fails with:

2023-02-28 18:52:18 [scrapy.core.scraper] ERROR: Error downloading <GET http://guba.eastmoney.com/list,300059_1.html>
Traceback (most recent call last):
  File "C:\Users\18310\AppData\Local\Programs\Python\Python310\lib\site-packages\twisted\internet\defer.py", line 1693, in _inlineCallbacks
    result = context.run(
  File "C:\Users\18310\AppData\Local\Programs\Python\Python310\lib\site-packages\twisted\python\failure.py", line 518, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "C:\Users\18310\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\core\downloader\middleware.py", line 52, in process_request
    return (yield download_func(request=request, spider=spider))
  File "C:\Users\18310\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\utils\defer.py", line 73, in mustbe_deferred
    result = f(*args, **kw)
  File "C:\Users\18310\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\core\downloader\handlers\__init__.py", line 79, in download_request
    return handler.download_request(request, spider)
  File "C:\Users\18310\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\core\downloader\handlers\http11.py", line 72, in download_request
    return agent.download_request(request)
  File "C:\Users\18310\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\core\downloader\handlers\http11.py", line 363, in download_request
    agent = self._get_agent(request, timeout)
  File "C:\Users\18310\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\core\downloader\handlers\http11.py", line 327, in _get_agent
    proxyScheme, proxyNetloc, proxyHost, proxyPort, proxyParams = _parse(proxy)
  File "C:\Users\18310\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\core\downloader\webclient.py", line 39, in _parse
    return _parsed_url_args(parsed)
  File "C:\Users\18310\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\core\downloader\webclient.py", line 20, in _parsed_url_args
    host = to_bytes(parsed.hostname, encoding="ascii")
  File "C:\Users\18310\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\utils\python.py", line 111, in to_bytes
    return text.encode(encoding, errors)
UnicodeEncodeError: 'ascii' codec can't encode character '\ufeff' in position 0: ordinal not in range(128)
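
The failing call is Scrapy's own to_bytes(parsed.hostname, encoding="ascii") on the proxy hostname. A minimal sketch that reproduces the error (the address is a hypothetical placeholder; the leading '\ufeff' is the character named in the message):

from scrapy.utils.python import to_bytes

# '\ufeff' is the Unicode byte-order mark (BOM); Scrapy encodes the
# proxy hostname as ASCII, so any non-ASCII character raises this error
to_bytes('\ufeff1.2.3.4', encoding='ascii')
# UnicodeEncodeError: 'ascii' codec can't encode character '\ufeff' in position 0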

My middleware is:

import random
import requests
from scrapy.downloadermiddlewares.httpproxy import HttpProxyMiddleware
# `settings` refers to the project's settings.py, which defines PROXY

class RandomProxyMiddleware(HttpProxyMiddleware):

    def __init__(self, auth_encoding='utf-8', proxy_list=None):
        # read the PROXY list from settings.py
        self.proxy = settings.PROXY

    def process_request(self, request, spider):
        # pick a random proxy
        proxy = random.choice(self.proxy)
        # use it only if it responds
        if self.check_proxy(proxy):
            print('Using proxy:', proxy)
            request.meta['proxy'] = proxy
        else:
            self.process_request(request, spider)

    def check_proxy(self, proxy):
        # probe the proxy with a 3-second timeout
        try:
            requests.get('https://www.eastmoney.com/', proxies={'http': proxy}, timeout=3)
            return True
        except:
            return False
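
Printing the repr() of each configured proxy shows whether an entry carries an invisible character (the URL in the comment is a placeholder):

for p in settings.PROXY:
    print(repr(p))  # a broken entry looks like '\ufeffhttp://1.2.3.4:8080'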

The proxy IPs themselves can reach https://www.eastmoney.com/ when tested directly. How can I resolve this error?


5 answers

  • 「已注销」 2023-02-28 19:08

    Based on GPT and my own reasoning: the error occurs because the proxy string contains a non-ASCII character. The traceback shows it is '\ufeff', the Unicode byte-order mark (BOM), sitting at position 0 of the proxy hostname, while Scrapy encodes that hostname with the 'ascii' codec. Since the 'ascii' codec is fixed inside Scrapy, switching the string's encoding elsewhere won't help; the fix is to strip the BOM from each proxy string before it is used.

    You can do this where the proxy list is read in, for example:

    import random
    import requests
    from scrapy.downloadermiddlewares.httpproxy import HttpProxyMiddleware

    class RandomProxyMiddleware(HttpProxyMiddleware):

        def __init__(self, auth_encoding='utf-8', proxy_list=None):
            # strip the BOM and surrounding whitespace from every entry
            # so Scrapy's ASCII encoding of the hostname succeeds
            self.proxy = [p.replace('\ufeff', '').strip() for p in settings.PROXY]

        def process_request(self, request, spider):
            # pick a random proxy
            proxy = random.choice(self.proxy)
            # use it only if it responds
            if self.check_proxy(proxy):
                print('Using proxy:', proxy)
                request.meta['proxy'] = proxy
            else:
                # retry with another random proxy (note: this recurses
                # indefinitely if every proxy in the list is down)
                self.process_request(request, spider)

        def check_proxy(self, proxy):
            # probe the proxy with a 3-second timeout
            try:
                requests.get('https://www.eastmoney.com/',
                             proxies={'http': proxy, 'https': proxy}, timeout=3)
                return True
            except requests.RequestException:
                return False


    In this example, the BOM and stray whitespace are stripped from each proxy string when the list is loaded, so parsed.hostname contains only ASCII characters and the UnicodeEncodeError no longer occurs.
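
    If PROXY is loaded from a text file (an assumption; the post doesn't show where the list comes from), the BOM usually originates in the file itself, e.g. one saved by Windows Notepad as "UTF-8". Opening the file with the 'utf-8-sig' codec strips a leading BOM at read time:

    # hypothetical proxies.txt, one proxy URL per line
    with open('proxies.txt', encoding='utf-8-sig') as f:  # 'utf-8-sig' drops the BOM
        PROXY = [line.strip() for line in f if line.strip()]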
