A #python# question: how do I fix "list index out of range" in a web scraper?

How do I fix "list index out of range" in my scraper? Some of the li tags are empty.

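The error is "list index out of range".

The original page contains ads, so some li elements in the middle of the list have no data.

The code is as follows: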


import requests
from lxml import etree

url = 'https://cs.58.com/chuzu/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36 Edg/112.0.1722.58'
}

page_text = requests.get(url=url, headers=headers).text
tree = etree.HTML(page_text)
li_list = tree.xpath('//ul[@class="house-list"]/li')
fp = open('58.txt', 'w', encoding='utf-8')
i = 0
for li in li_list:
    # Every [0] / [1] below raises IndexError when its XPath matches nothing,
    # which is what happens on the ad <li> elements.
    title = li.xpath('./div[2]/h2/a/text()')[0]
    print(title)
    fp.write(title + ' ')
    room_type = li.xpath('./div[2]/p[@class="room"]/text()')[0]
    fp.write(room_type + ' ')
    print(room_type)
    location1 = li.xpath('./div[2]/p[@class="infor"]/a/text()')[0]
    fp.write(location1 + ' ')
    print(location1)
    location2 = li.xpath('./div[2]/p[@class="infor"]/a/text()')[1]
    fp.write(location2 + ' ')
    print(location2)
    position = li.xpath('./div[2]/p[@class="infor"]/text()')[0]
    fp.write(position + ' ')
    print(position)
    money = li.xpath('./div[3]/div[2]/b[@class="strongbox"]/text()')[0]
    fp.write(money + ' ')
    print(money)
    buy = li.xpath('./div[3]/div[2]/text()')[0]
    fp.write(buy + '\n')
    print(buy)
    i = i + 1
    print(i)
fp.close()

4 answers
GISer Liu 2024-02-12 21:05
This answer draws on GPT-3.5 and was written by the blogger GIS_Liu:

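Problem analysis:

The error message is "list index out of range", which usually means the code tried to read a list index that does not exist. In your code, the likely causes are:

1. When the page is parsed, some tags may be absent, so the XPath expression matches nothing and returns an empty list. Indexing that empty list with [0] raises the error.
2. Not every li tag contains the full set of fields. Some li tags (the ads, for example) lack certain elements, but the code indexes into the result anyway.
3. The location markup in some li tags may differ from the rest, so the location XPath returns fewer items than expected while the code reads both [0] and [1].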
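
The failure mode described above can be reproduced without any network access. This is only a minimal sketch: it assumes that an XPath query with no match returns an empty list, and the two lists below are hypothetical stand-ins for query results, not data from the real page.

```python
# Hypothetical stand-ins for what li.xpath(...) might return.
normal_item = ["Two-bedroom flat near the subway"]  # a regular listing
ad_item = []                                         # an ad: nothing matched


def describe(result):
    """Return the first match, or a message explaining why indexing failed."""
    try:
        return result[0]
    except IndexError as exc:
        return f"IndexError: {exc}"


print(describe(normal_item))
print(describe(ad_item))
```

The second call shows that the exception comes from indexing the empty list, not from the XPath query itself.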

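Solution:

1. Wrap the extraction for each li tag in a try-except block, so that one incomplete li does not terminate the whole program.
2. Before reading each field, check whether it exists. If it does not, assign a default value or skip the current iteration.
3. For the location fields, write different XPath expressions for the different li structures so the location is extracted correctly.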
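
The "check first, then fall back to a default" idea from the list above could look roughly like this. It is a sketch under assumptions: `pick` is a hypothetical helper name, and the `locations` list stands in for the result of the location XPath on an li that has only one link.

```python
def pick(items, index=0, default=""):
    """Return items[index] with whitespace stripped, or default if it is missing."""
    if 0 <= index < len(items):
        return items[index].strip()
    return default


# Hypothetical query result: only one location link was found in this <li>.
locations = ["Yuelu "]

location1 = pick(locations, 0)
location2 = pick(locations, 1, default="N/A")
print(location1, location2)
```

With a helper like this, each `li.xpath(...)[0]` in the scraper could become `pick(li.xpath(...))`, and an incomplete listing would be written with blank fields instead of raising.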

Here is the modified code:

import requests
from lxml import etree

url = 'https://cs.58.com/chuzu/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36 Edg/112.0.1722.58'
}

page_text = requests.get(url=url, headers=headers).text
tree = etree.HTML(page_text)
li_list = tree.xpath('//ul[@class="house-list"]/li')

# The with-block closes the file even if the loop exits early.
with open('58.txt', 'w', encoding='utf-8') as fp:
    for li in li_list:
        try:
            title = li.xpath('./div[2]/h2/a/text()')[0]
            room_type = li.xpath('./div[2]/p[@class="room"]/text()')[0]
            locations = li.xpath('./div[2]/p[@class="infor"]/a/text()')
            location1 = locations[0]
            location2 = locations[1]
            # strip() removes the surrounding whitespace
            position = li.xpath('./div[2]/p[@class="infor"]/text()')[0].strip()
            money = li.xpath('./div[3]/div[2]/b[@class="strongbox"]/text()')[0]
            buy = li.xpath('./div[3]/div[2]/text()')[0]
            fp.write(f"{title} {room_type} {location1} {location2} {position} {money} {buy}\n")
        except IndexError:
            # A field is missing (for example in an ad block): skip this li
            continue

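With this change, any li that is missing a field is skipped, so an empty value no longer aborts the run with an index-out-of-range error.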
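
One trade-off of wrapping the whole item in try/except is that a listing missing a single field is dropped entirely. A possible alternative, sketched here under assumptions, skips only the li elements that have no title (taken to be ads) and fills the other missing fields with an empty string. `parse_item`, `FIELDS`, and `FakeLi` are hypothetical names. `FakeLi` only imitates the `xpath()` method of an lxml element so that the sketch runs offline, and the XPath strings are copied from the question.

```python
FIELDS = {
    "title": './div[2]/h2/a/text()',
    "room_type": './div[2]/p[@class="room"]/text()',
    "money": './div[3]/div[2]/b[@class="strongbox"]/text()',
}


def parse_item(li):
    """Return a dict of fields for one <li>, or None if it has no title (likely an ad)."""
    if not li.xpath(FIELDS["title"]):
        return None
    row = {}
    for name, path in FIELDS.items():
        matches = li.xpath(path)
        # Use the first match if there is one, otherwise an empty default.
        row[name] = matches[0].strip() if matches else ""
    return row


class FakeLi:
    """Hypothetical stand-in for an lxml element: maps XPath strings to results."""

    def __init__(self, results):
        self._results = results

    def xpath(self, path):
        return self._results.get(path, [])


items = [
    FakeLi({FIELDS["title"]: [" Flat A "], FIELDS["money"]: ["1500"]}),
    FakeLi({}),  # an ad block: nothing matches
]
rows = [row for row in map(parse_item, items) if row is not None]
skipped = len(items) - len(rows)
print(rows, skipped)
```

Counting the skipped items also makes it easier to tell whether the scraper is dropping a few ads or most of the page.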
