m0_56062032 2024-03-22 19:03 Acceptance rate: 65.4%

Movie data processing based on statistical analysis

Running the code below raises the following error, a KeyError:


KeyError                                  Traceback (most recent call last)
<ipython-input-16-7df382df59c7> in <module>()
     20 #__________________
     21 # load the dataset
---> 22 credits = load_tmdb_credits('D:/Datamovies/tmdb_5000_movies.csv')
     23 credits.head()
KeyError: 'cast'

import json
import pandas as pd
#___________________________
def load_tmdb_movies(path):
    df = pd.read_csv(path)
    df['release_date'] = pd.to_datetime(df['release_date']).apply(lambda x: x.date())
    json_columns = ['genres', 'keywords', 'production_countries',
                    'production_companies', 'spoken_languages']
    for column in json_columns:
        df[column] = df[column].apply(json.loads)
    return df
#___________________________
def load_tmdb_credits(path):
    df = pd.read_csv(path)
    json_columns = ['cast', 'crew']
    for column in json_columns:
        df[column] = df[column].apply(json.loads)
    return df
#___________________
LOST_COLUMNS = [
    'actor_1_facebook_likes',
    'actor_2_facebook_likes',
    'actor_3_facebook_likes',
    'aspect_ratio',
    'cast_total_facebook_likes',
    'color',
    'content_rating',
    'director_facebook_likes',
    'facenumber_in_poster',
    'movie_facebook_likes',
    'movie_imdb_link',
    'num_critic_for_reviews',
    'num_user_for_reviews']
#____________________________________
TMDB_TO_IMDB_SIMPLE_EQUIVALENCIES = {
    'budget': 'budget',
    'genres': 'genres',
    'revenue': 'gross',
    'title': 'movie_title',
    'runtime': 'duration',
    'original_language': 'language',
    'keywords': 'plot_keywords',
    'vote_count': 'num_voted_users'}
#_____________________________________________________
IMDB_COLUMNS_TO_REMAP = {'imdb_score': 'vote_average'}
#_____________________________________________________
def safe_access(container, index_values):
    # return missing value rather than an error upon indexing/key failure
    result = container
    try:
        for idx in index_values:
            result = result[idx]
        return result
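    except (IndexError, KeyError):
        # `except IndexError or KeyError` would only catch IndexError;
        # float('nan') replaces `pd.np.nan`, which recent pandas releases no longer provide
        return float('nan')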
#_____________________________________________________
def get_director(crew_data):
    directors = [x['name'] for x in crew_data if x['job'] == 'Director']
    return safe_access(directors, [0])
#_____________________________________________________
def pipe_flatten_names(keywords):
    return '|'.join([x['name'] for x in keywords])
#_____________________________________________________
def convert_to_original_format(movies, credits):
    tmdb_movies = movies.copy()
    tmdb_movies.rename(columns=TMDB_TO_IMDB_SIMPLE_EQUIVALENCIES, inplace=True)
    tmdb_movies['title_year'] = pd.to_datetime(tmdb_movies['release_date']).apply(lambda x: x.year)
    # I'm assuming that the first production country is equivalent, but have not been able to validate this
    tmdb_movies['country'] = tmdb_movies['production_countries'].apply(lambda x: safe_access(x, [0, 'name']))
    tmdb_movies['language'] = tmdb_movies['spoken_languages'].apply(lambda x: safe_access(x, [0, 'name']))
    tmdb_movies['director_name'] = credits['crew'].apply(get_director)
    tmdb_movies['actor_1_name'] = credits['cast'].apply(lambda x: safe_access(x, [1, 'name']))
    tmdb_movies['actor_2_name'] = credits['cast'].apply(lambda x: safe_access(x, [2, 'name']))
    tmdb_movies['actor_3_name'] = credits['cast'].apply(lambda x: safe_access(x, [3, 'name']))
    tmdb_movies['genres'] = tmdb_movies['genres'].apply(pipe_flatten_names)
    tmdb_movies['plot_keywords'] = tmdb_movies['plot_keywords'].apply(pipe_flatten_names)
    return tmdb_movies
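A side note on the exception handling in `safe_access`, shown as a minimal sketch: Python evaluates `IndexError or KeyError` as an ordinary expression before matching, and since a class is truthy the result is just `IndexError`. The helper names `lookup_or` and `lookup_tuple` below are hypothetical and exist only for this illustration.

```python
def lookup_or(container, key):
    """Uses `A or B` in the except clause, which reduces to `except IndexError`."""
    try:
        return container[key]
    except IndexError or KeyError:
        return None


def lookup_tuple(container, key):
    """Uses a tuple, which catches either exception type."""
    try:
        return container[key]
    except (IndexError, KeyError):
        return None


# A missing list index raises IndexError, which both versions catch.
list_miss = lookup_or([], 0)

# A missing dict key raises KeyError, which escapes the `or` version.
try:
    lookup_or({}, "name")
    dict_miss_caught = True
except KeyError:
    dict_miss_caught = False

# The tuple version handles the missing key and returns None.
tuple_result = lookup_tuple({}, "name")
```

Writing the exception types as a tuple is the form that handles both the list-index and the dict-key failure.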


import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import math, nltk, warnings
from nltk.corpus import wordnet
from sklearn import linear_model
from sklearn.neighbors import NearestNeighbors
from fuzzywuzzy import fuzz
from wordcloud import WordCloud, STOPWORDS
plt.rcParams["patch.force_edgecolor"] = True
plt.style.use('fivethirtyeight')
mpl.rc('patch', edgecolor = 'dimgray', linewidth=1)
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "last_expr"
pd.options.display.max_columns = 50
%matplotlib inline
warnings.filterwarnings('ignore')
PS = nltk.stem.PorterStemmer()
#__________________
# load the dataset
credits = load_tmdb_credits('D:/Datamovies/tmdb_5000_movies.csv')
credits.head()

2 answers

  • 码农阿豪 (Quality creator in the Java field) 2024-03-22 19:03
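    Let A-Hao help you with this. This answer was written with reference to ChatGPT 3.5; if you still have questions, please leave a comment.
    The error is a KeyError. It is raised because load_tmdb_credits looks up the column 'cast' (listed in json_columns) and the loaded data has no column with that name. Either the file lacks the column or it is named differently. Note that the traceback shows load_tmdb_credits being called on tmdb_5000_movies.csv; if that is the movies file rather than the credits file, that would explain why 'cast' is missing. To make the cause visible, you can add a check to the function: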
    def load_tmdb_credits(path):
        df = pd.read_csv(path)
        json_columns = ['cast', 'crew']
        for column in json_columns:
            if column not in df.columns: # note the check here
                print(f"Error: {column} column not found in dataset.")
                return None
            df[column] = df[column].apply(json.loads)
        return df
    

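    With this change, before parsing the columns in json_columns, the function checks whether the DataFrame actually contains them. If one is missing, it prints an error message and returns None, so the problem is easy to spot when it occurs. Example: suppose we read a file mydata.csv that should contain two columns, 'A' and 'B', but column 'B' was accidentally deleted. Now we try to read the data: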

    import pandas as pd
    df = pd.read_csv('mydata.csv')
    print(df['B'])
    

    This raises a KeyError because column 'B' does not exist. Applying the same kind of check, the code becomes:

    import pandas as pd
    df = pd.read_csv('mydata.csv')
    if 'B' not in df.columns:
        print("Error: B column not found in dataset.")
    else:
        print(df['B'])
    

    This way we get an error message that tells us where the problem is.
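As a further sketch of the same idea, the header of a CSV file can be inspected before any column is parsed, and a missing column can be reported together with the columns that are actually present. The helper names `csv_columns` and `load_credits_checked` are hypothetical, and the two small in-memory tables only stand in for a credits-like and a movies-like file; they are not the real dataset.

```python
import io
import json

import pandas as pd


def csv_columns(path_or_buffer):
    """Return the column names of a CSV; nrows=0 reads the header only."""
    return list(pd.read_csv(path_or_buffer, nrows=0).columns)


def load_credits_checked(path_or_buffer, json_columns=("cast", "crew")):
    """Load a CSV and parse its JSON columns, failing early with a clear message."""
    df = pd.read_csv(path_or_buffer)
    missing = [c for c in json_columns if c not in df.columns]
    if missing:
        raise KeyError(
            f"columns {missing} not found; available columns: {list(df.columns)}"
        )
    for column in json_columns:
        df[column] = df[column].apply(json.loads)
    return df


# Illustrative stand-ins for the two kinds of file (made-up rows).
credits_csv = pd.DataFrame({
    "movie_id": [1],
    "title": ["Demo"],
    "cast": [json.dumps([{"name": "A"}])],
    "crew": [json.dumps([{"name": "B", "job": "Director"}])],
}).to_csv(index=False)
movies_csv = pd.DataFrame({
    "id": [1],
    "title": ["Demo"],
    "genres": ["[]"],
}).to_csv(index=False)

header = csv_columns(io.StringIO(movies_csv))
credits = load_credits_checked(io.StringIO(credits_csv))

try:
    load_credits_checked(io.StringIO(movies_csv))
    message = ""
except KeyError as err:
    message = str(err)
```

Raising instead of returning None keeps a later call such as `credits.head()` from failing with an unrelated error, and the message lists the columns found, which makes it easier to see when the wrong file was passed in.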
