How to fix "list index out of range" when batch scraping (the index itself is fine)

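Symptoms and background:
I am batch-downloading MP4 files from a website. While writing the code I added a print test everywhere a list is indexed, and every test looked normal: there was no case of list = [], and no case of list = [1, 2, 3] with a request for list[3]. But when I run the whole script it always fails with list index out of range, and at a different point each time: sometimes after saving three MP4 files, sometimes after seven, and once before even the first file was saved. The error is as follows (the message is identical every time, down to the letter):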

D:\python\python.exe C:/Users/PycharmProjects/尝试/临时试验.py
Traceback (most recent call last):
  File "C:/Users/PycharmProjects/尝试/临时试验.py", line 73, in <module>
    main(bv_id=bvid)
  File "C:/Users/PycharmProjects/尝试/临时试验.py", line 56, in main
    video_info = get_videoinfo(url=laosepi)
  File "C:/Users/PycharmProjects/尝试/临时试验.py", line 23, in get_videoinfo
    title = re.findall('<h1 id="video-title" title="(.*?)" class="video-title">', resp.text)[0].replace(' ', '')
IndexError: list index out of range

Process finished with exit code 1

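On the surface, the direct cause is the regex in the title = re.findall… line. But for every video that failed, I ran the same regex on that video alone and printed the title, and it was extracted without trouble. So, as one of the answerers below said, the root cause may lie in requests.get, which sometimes does not return the data properly. Based on past experience I suspected an anti-scraping mechanism triggered by frequent requests, so I imported the time module and added time.sleep(3) after resp = get_resp(url). Testing showed that the failures still occur at random points: the same title is sometimes extracted and sometimes not.
So I changed the code again so that any video whose title cannot be extracted is named 000.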

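The goal was to determine whether requests.get was failing to return the data properly. My reasoning:
If requests.get really had not returned the content properly, then title, audio_url and video_url should all be missing, because all three come from the same resp.text. But the result confused me even more: only the title is sometimes missing, while the other two values are extracted every time. Someone will surely say again that my title regex is wrong, so let me stress once more: the same title is sometimes extracted and sometimes not, so the regex syntax is not the problem. (The original post showed screenshots of two consecutive attempts here.)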
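
If the same URL can return a page that lacks the expected markup on one request and contains it on the next, as the observations above suggest, one option is to request the page again until it passes a check. The sketch below is only an illustration under that assumption, not the asker's code: `fetch_until_valid`, its parameters, and the sample strings are hypothetical names introduced here, and the lambda stands in for `get_resp(url).text` so that the example runs without network access.

```python
import time


def fetch_until_valid(fetch, is_valid, attempts=3, delay=1.0):
    """Call fetch() up to `attempts` times; return the first result that is_valid accepts.

    Returns None when every attempt is rejected.
    """
    for attempt in range(attempts):
        result = fetch()
        if is_valid(result):
            return result
        if attempt < attempts - 1:
            time.sleep(delay)  # short pause before asking again
    return None


# Invented responses: the first lacks the title markup, the second has it.
pages = iter([
    '<html>no title markup</html>',
    '<h1 id="video-title" title="Demo" class="video-title">',
])

html = fetch_until_valid(
    fetch=lambda: next(pages),                    # stand-in for get_resp(url).text
    is_valid=lambda text: 'id="video-title"' in text,
    delay=0,
)
print(html)
```

Passing the fetch and the check as callables keeps the retry logic independent of requests, so the same helper could wrap any page download.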

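Below is the full code I tried (to repeat: for each place where the run failed, I scraped that single video on its own and it worked, so my regex syntax is not the problem. Some parts are replaced with abcdefg so that the post could be published):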

import requests
import re
import json
import subprocess
import os
import time


def get_resp(url):
    headers = {
        'referer': 'abcdefg',
        'user-agent': 'abcdefg'
    }
    resp = requests.get(url=url, headers=headers)
    return resp


def get_videoinfo(url):
    resp = get_resp(url)
    title = re.findall('<h1 id="video-title" title="(.*?)" class="video-title">', resp.text)[0].replace(' ', '')
    video_data = re.findall('<script>window.__playinfo__=(.*?)</script>', resp.text)[0]

    json_data = json.loads(video_data)
    audio_url = json_data['data']['*****']['audio'][0]['****']
    video_url = json_data['data']['*****']['video'][0]['****']
    video_info = [title, audio_url, video_url]
    time.sleep(1)
    return video_info


def save(title, audio_url, video_url):
    audio_content = get_resp(url=audio_url).content
    video_content = get_resp(url=video_url).content
    with open('picture\\' + title + '.mp3', 'wb') as f:
        f.write(audio_content)
    with open('picture\\' + title + '.mp4', 'wb') as f:
        f.write(video_content)
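    print('Saved')
    # Merge the separately downloaded video and audio streams into one file with ffmpeg
    ffmpeg = f'ffmpeg -i picture\\{title}.mp4 -i picture\\{title}.mp3 -c:v copy -c:a aac -strict experimental picture\\{title}output.mp4'
    subprocess.run(ffmpeg, shell=True)
    # Delete the intermediate files once the merged output exists
    os.remove(f'picture\\{title}.mp4')
    os.remove(f'picture\\{title}.mp3')
    print(f'Video {n} merged successfully')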
    time.sleep(1)


def main(id):
    laosepi = f'abcdefg/{id}'
    video_info = get_videoinfo(url=laosepi)
    save(video_info[0], video_info[1], video_info[2])


# Entry point
if __name__ == '__main__':
    n=1
    for page in range(1,6):
        index_url = f'abcdefg{page}abcdefg'
        json_data = get_resp(url=index_url).json()
        id_list = [i['****'] for i in json_data['data']['list']['****']]
        for every_id in id_list:
            main(id=every_id)
            n+=1
            time.sleep(1)

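How can I actually solve this? If a method works, I will accept the answer as soon as I have tested it. (To repeat: first, for each place where the run failed, I scraped that single video on its own and it worked, which rules out an incorrect regex; second, all three values come from the resp.text of the same request, and the attempts above show that only the title is sometimes missing while the other two are always extracted, so I believe a failed requests response can be ruled out as well.)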
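
One way to test the last claim is to record, for each response, which patterns matched and what the `<h1>` tag actually looks like when the title pattern fails. The following is a minimal sketch: the two patterns are copied from the code above, while `diagnose`, `sample_a`, and `sample_b` are hypothetical names, and the sample HTML is invented for illustration rather than taken from the real site.

```python
import re

TITLE_PATTERN = r'<h1 id="video-title" title="(.*?)" class="video-title">'
PLAYINFO_PATTERN = r'<script>window.__playinfo__=(.*?)</script>'


def diagnose(html):
    """Report which patterns matched and list every opening <h1 ...> tag in the page."""
    return {
        'title_matched': re.search(TITLE_PATTERN, html) is not None,
        'playinfo_matched': re.search(PLAYINFO_PATTERN, html) is not None,
        'h1_tags': re.findall(r'<h1\b[^>]*>', html),
    }


# Two invented responses for the same page: same data, different <h1> attributes.
sample_a = ('<h1 id="video-title" title="Demo" class="video-title">Demo</h1>'
            '<script>window.__playinfo__={"data": {}}</script>')
sample_b = ('<h1 title="Demo" class="video-title tit">Demo</h1>'
            '<script>window.__playinfo__={"data": {}}</script>')

report_a = diagnose(sample_a)
report_b = diagnose(sample_b)
print(report_a)
print(report_b)
```

In the real script, `diagnose(resp.text)` could be called just before the failing line, and `resp.text` written to a file whenever `title_matched` is false, so that a failing page and a working page can be compared side by side.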

2 answers

一切因为有你 2022-05-25 22:28
The error is raised by this statement:

title = re.findall('<h1 id="video-title" title="(.*?)" class="video-title">', resp.text)[0].replace(' ', '')

That means your regex did not match. Split the statement and look at the intermediate result; title1 will certainly be an empty list:

title1 = re.findall('<h1 id="video-title" title="(.*?)" class="video-title">', resp.text)
print(title1)
if len(title1) > 0:
    title = title1[0].replace(' ', '')
else:
    print("No match")

This answer was selected by the asker as the best answer.
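
The check suggested in this answer can be wrapped in a small helper that returns a default value instead of raising IndexError. This is a minimal sketch: the pattern is the one from the question, while the function name `extract_title` and its `default` parameter are hypothetical, and '000' mirrors the placeholder name the asker mentions.

```python
import re

TITLE_PATTERN = r'<h1 id="video-title" title="(.*?)" class="video-title">'


def extract_title(html, default=None):
    """Return the first matched title with spaces removed, or `default` if nothing matches."""
    matches = re.findall(TITLE_PATTERN, html)
    if not matches:
        # Empty list: indexing [0] here is what raised IndexError in the question.
        return default
    return matches[0].replace(' ', '')


good_page = '<h1 id="video-title" title="My Video" class="video-title">'
bad_page = '<h1 class="video-title">My Video</h1>'

print(extract_title(good_page))
print(extract_title(bad_page, default='000'))
```

Returning a default keeps the batch loop running, though several videos that fall back to the same name would overwrite one another on disk, so a unique fallback such as the video id may be a safer choice.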

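Batch scraping raises "list index out of range": the root cause turned out not to be a regex syntax error. The same page can have several sets of CSS styles, and each request returns a random one of them, which is why the regex fails to match
2022-05-26 12:08

法学僧转行程序猿's blog: Batch scraping raises "list index out of range", and the root cause turned out not to be a regex syntax error. A page may have several sets of CSS, and each request made with requests returns a random one of them, so a single regex sometimes fails to extract the data, which produces an empty list and in turn the error list...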
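
If the page really is served in several layouts, as this post reports, one approach is to try a list of patterns in order and take the first that matches. The sketch below assumes this: only the first pattern comes from the question, the other two are hypothetical fallbacks, and the real site's alternative markup is not known here. `extract_title_any` and the sample layouts are likewise invented for illustration.

```python
import re

TITLE_PATTERNS = [
    # Pattern used in the question
    r'<h1 id="video-title" title="(.*?)" class="video-title">',
    # Hypothetical fallback: any <h1> that carries a title attribute
    r'<h1[^>]*\btitle="(.*?)"',
    # Hypothetical last resort: the page's <title> element
    r'<title[^>]*>(.*?)</title>',
]


def extract_title_any(html):
    """Try each pattern in order; return the first match with spaces removed, else None."""
    for pattern in TITLE_PATTERNS:
        match = re.search(pattern, html, re.S)
        if match:
            return match.group(1).replace(' ', '')
    return None


layout_a = '<h1 id="video-title" title="Demo Clip" class="video-title">Demo Clip</h1>'
layout_b = '<h1 title="Demo Clip" class="video-title tit">Demo Clip</h1>'
layout_c = '<head><title>Demo Clip</title></head>'

for page in (layout_a, layout_b, layout_c):
    print(extract_title_any(page))
```

An HTML parser would be less sensitive to attribute order than regular expressions; the regex list is kept here only because the question already uses re.findall.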

Question events

Question closed by the system: June 5
Answer accepted: May 28
Bounty of 5 yuan sponsored for the question: May 26
Question edited: May 26

批量爬取数据中报错list index out of range（索引本身没问题）怎么办

2条回答 默认 最新

问题事件

2条回答默认最新