StaysOnEarth 2023-12-13 19:30 · Closed

(Tag: Python) TypeError: object of type 'NoneType' has no len() while web scraping

While using BeautifulSoup in Python I hit TypeError: object of type 'NoneType' has no len()

Requirement: for every URL in url_list, get the page's file size, content type, and number of outlinks.


```python
def getHTML(url, ua_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36', num_retries = 5):
    headers = {'User-Agent': ua_agent}
    request = urllib.request.Request(url=url, headers=headers)
    html = None
    try:
        response = urllib.request.urlopen(request)
        html = response.read().decode('utf-8')
    except urllib.error.URLError or urllib.error.HTTPError as e:
        if num_retries > 0:
            if hasattr(e,'code') and 500 <= e.code < 600:
                getHTML(url, ua_agent, num_retries - 1)
    return html
```

Printing html here returns None, which is what makes the BeautifulSoup call fail:

```python
def get_url_num(html):
    links = []
    soup = BeautifulSoup(html,'html.parser')
    url_list = soup.find_all('a')
    for link in url_list:
        link = link.get('href')
        if link.startswith('http'):
            links.append(link)
    url_num = len(links)
    return url_num
```

The full original code:

```python

import requests
import pandas as pd
import urllib.error
import urllib.request
import ssl
from bs4 import BeautifulSoup

ssl._create_default_https_context = ssl._create_unverified_context


def getHTML(url, ua_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36', num_retries = 5):
    headers = {'User-Agent': ua_agent}
    request = urllib.request.Request(url=url, headers=headers)
    html = None
    try:
        response = urllib.request.urlopen(request)
        html = response.read().decode('utf-8')
    except urllib.error.URLError or urllib.error.HTTPError as e:
        if num_retries > 0:
            if hasattr(e,'code') and 500 <= e.code < 600:
                getHTML(url, ua_agent, num_retries - 1)
    return html


def get_url_num(html):
    links = []
    soup = BeautifulSoup(html,'html.parser')
    url_list = soup.find_all('a')
    for link in url_list:
        link = link.get('href')
        if link.startswith('http'):
            links.append(link)
    url_num = len(links)
    return url_num


df = pd.read_csv('fetch_nytimes.csv')

url_list = []
for i in df['URL']:
    url_list.append(i)

print(url_list)

headers = {
    'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36'
}

size_list = []
type_list = []
outlinks_list = []

for url in url_list:
    try:
        response = requests.get(url, stream=True, headers=headers)
        # column 2
        file_size = response.headers['Content-Length']
        file_size = int(file_size)/1024/1024
        size_list.append(file_size)
        # column 4
        file_type = response.headers['Content-Type']
        type_list.append(file_type)
        # column 3
        html = getHTML(url)
        outlinks = get_url_num(html)
        outlinks_list.append(outlinks)
    except urllib.error.URLError or urllib.error.HTTPError as e:
        url_list.remove(url)

dict_visit = {}
dict_visit['URL'] = url_list
dict_visit['File_Size'] = size_list
dict_visit['Outlinks'] = outlinks_list
dict_visit['File_Type'] = type_list

df1 = pd.DataFrame(dict_visit)
df1.to_csv('visit_nytimes.csv', index=False)

```


17 answers

  • 专家-赤兔[在线] (full-stack creator) 2023-12-13 19:30
    Awarded a ¥0.60 bounty

    From the free WeChat mini-program 皆我百晓生:

    Problem analysis:
    In getHTML, exceptions are not handled correctly, so the function can return None. That None is then passed on to get_url_num, where BeautifulSoup(html, 'html.parser') raises TypeError: object of type 'NoneType' has no len().
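The error itself comes from BeautifulSoup calling len() on the markup it is given, so passing html=None fails the same way plain len(None) does:

```python
# BeautifulSoup calls len() on the markup it receives, so html=None
# produces the same message as calling len() on None directly:
try:
    len(None)
except TypeError as e:
    print(e)  # object of type 'NoneType' has no len()
```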

    Solution:
    In getHTML's except clause, catch the exception types as a tuple — `except A or B:` evaluates the `or` expression first and is not how multiple exceptions are caught — and return the result of the recursive retry so it propagates back to the caller instead of being discarded. Pass num_retries - 1 on each recursive call so the recursion terminates.

    Here is the corrected code:

    def getHTML(url, ua_agent='Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36', num_retries=5):
        headers = {'User-Agent': ua_agent}
        request = urllib.request.Request(url=url, headers=headers)
        html = None
        try:
            response = urllib.request.urlopen(request)
            html = response.read().decode('utf-8')
        except (urllib.error.URLError, urllib.error.HTTPError) as e:  # catch the exception types as a tuple
            if num_retries > 0:
                if hasattr(e, 'code') and 500 <= e.code < 600:
                    return getHTML(url, ua_agent, num_retries - 1)  # return the retry result, decrementing num_retries
        return html
    
    # With this fix getHTML returns the fetched HTML (or None only once retries are exhausted), resolving the TypeError
    
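Even with the fixed getHTML, it is worth making get_url_num defensive: getHTML can still return None once retries are exhausted, and `link.get('href')` returns None for `<a>` tags without an href attribute, which would crash `link.startswith('http')`. A minimal sketch:

```python
from bs4 import BeautifulSoup

def get_url_num(html):
    """Count absolute outlinks; tolerate html=None and <a> tags without href."""
    if html is None:  # getHTML may still return None after exhausting retries
        return 0
    soup = BeautifulSoup(html, 'html.parser')
    hrefs = (a.get('href') for a in soup.find_all('a'))
    # a.get('href') is None for anchors that have no href attribute
    return sum(1 for h in hrefs if h and h.startswith('http'))

print(get_url_num(None))  # 0
print(get_url_num('<a href="https://example.com">x</a><a>no href</a>'))  # 1
```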

    Hope this helps! If you have any questions, feel free to follow up.
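One more latent issue in the posted loop: `url_list.remove(url)` mutates the list while iterating over it, which silently skips the element after each removal and leaves url_list longer than size_list, so `pd.DataFrame(dict_visit)` would fail on mismatched column lengths (note also that `requests.get` raises its own exception types, not urllib's, so that except clause never fires for request failures). A sketch of the usual pattern, with a hypothetical `fetch` standing in for the per-URL work:

```python
def visit_all(urls, fetch):
    """Return (kept_urls, results); skip URLs whose fetch raises."""
    kept, results = [], []
    for url in urls:  # iterate the original list; never mutate it here
        try:
            results.append(fetch(url))
            kept.append(url)
        except Exception:
            continue  # drop failed URLs without disturbing the iteration
    return kept, results

def fetch(url):  # hypothetical stand-in for the requests/getHTML work
    if 'bad' in url:
        raise ValueError(url)
    return len(url)

print(visit_all(['a', 'bad1', 'bb'], fetch))  # (['a', 'bb'], [1, 2])
```

Building `dict_visit['URL']` from the returned kept list keeps all four columns the same length.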


Question timeline

  • Closed by the system on Dec 21
  • Question created on Dec 13
