qq_52091435 2024-03-09 15:14
Closed

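How to fix KeyError: '\x80'?

I am running a text-analysis script that I found online, in PyCharm. The script normalizes the company names in the data so that they can be sent to the Bing Web Search API. It fails at runtime with the following error: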

"D:\Software Download\Python 12\python.exe" "D:\Software Download\Python Codes\clean neame\patentsview_process_name.py" 
D:\Software Download\Python Codes\clean neame\patentsview_process_name.py:43: SyntaxWarning: invalid escape sequence '\S'
  '''
Traceback (most recent call last):
  File "D:\Software Download\Python Codes\clean neame\patentsview_process_name.py", line 181, in <module>
    newchar_list.append(dict_replace[char])
                        ~~~~~~~~~~~~^^^^^^
KeyError: '\x80'

Process finished with exit code 1

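I tried several fixes for the \x80 problem that I found online, but the same error is still raised. My own guess is that the cause lies in the .json file. The source code loads a .json file, but that file only contains a table of characters that the program reads in order to replace or remove them. It looks roughly like this: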

{"!": " ", "#": "#", "$": "s", "%": "%", "&": " & ", "'": "'", "(": " ", ")": " ", "*": "*"}

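I cannot tell what is going wrong and would appreciate any help. Thank you! The source code related to this problem is pasted below. Alternatively, all of the related code is available at https://github.com/danielm-github/patentsmatch_bingsearchapproach, which contains both the .json file and the source file, named clean_name/patentsview_process_name.py

Thanks again!!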


# -*- coding: utf-8 -*-
"""
Created on Sat Jul  6 22:04:33 2019

@author: Danqing Mei
"""

import pandas as pd
import re
import html
import json
import my_own_handy_functions as mf

rawassignee = pd.read_stata(r"D:\2 Project 2\Project 2 - Sample and Data\USPTO initial data\rawassignee_noquote.dta") # patent number-assignee name file from patentsview
rawassignee = rawassignee.loc[rawassignee['dummy_raw_org']==1]
rawassignee_nodup = rawassignee.drop_duplicates(['raw_assignee_organization'], keep='first', inplace=False)
list_raworg_nodup = list(rawassignee_nodup['raw_assignee_organization'])
list_patentid_nodup = list(rawassignee_nodup['patent_id'])

list_cleanorg = []
for i in range(0, len(list_raworg_nodup)):
    raw_name = list_raworg_nodup[i]
    # this unescape takes care most of the "&"
    clean_name = html.unescape(raw_name)
    # some exceptions below
    if '&Circlesolid;' not in clean_name and '&thgr;' not in clean_name and '&dgr;' not in clean_name:
        list_cleanorg.append(clean_name.lower())
    else:
        clean_name = clean_name.replace('&Circlesolid;', ' ')
        clean_name = clean_name.replace('&thgr;', 'o') #'\u03B8', actually should be 'o'
        clean_name = clean_name.replace('&dgr;', '-') # '\u03B4', actually should be '-'
        list_cleanorg.append(clean_name.lower())

# take care of char ";"
checkfmt = re.compile(r'\d+;') # at least one digit followed by a ";"
for i in range(0, len(list_cleanorg)):
    name = list_cleanorg[i]
    match = re.search(checkfmt, name, flags=0)
    if match:
        name = name.replace(match.group(0), '')
        list_cleanorg[i] = name

'''
# check special format about ;
# indeed first check all names containing ;
# then use the following regex

checkfmt = re.compile(r'(^|\S+);\S+') 
# begin of the string or any non-white space (one or more) + ; + any non-white space (one or more)
for i in range(0,len(list_cleanorg)):
    name = list_cleanorg[i]
    match = re.search(checkfmt, name)
    if match:
        print( name + ' ' + list_patentid_nodup[i])
'''

# These are stuff need to take care
err_fmt = ['f;vis', ';3m', ';bri', 'hô ;', 'sil;verbrook', 'el;ectronics', 'people;s', 's;p.a.', 'co;,']
crr_fmt = ['vis'  , '3m' , 'bri' , 'hô'  , 'silverbrook' , 'electronics' , 'people\'s','s.p.a.', 'co,' ]

for i in range(0, len(list_cleanorg)):
    name = list_cleanorg[i]
    for j in range(0, len(err_fmt)):
        err = err_fmt[j]
        crr = crr_fmt[j]
        if err in name:
            newname = name.replace(err, crr)
            list_cleanorg[i] = newname
        

post = r"( |\()a corp.*of.*$" # take care of "a corp... of..."
post_re = re.compile(post)
for i in range(0, len(list_cleanorg)):
    name = list_cleanorg[i]
    newname = post_re.sub('',name)
    list_cleanorg[i] = newname

'''
# get a dictionary of all char in the assignee names to check later

dict_clean_char = {}
for i in range(0,len(list_patentid_nodup)):
    name = list_cleanorg[i]
    for char in name:
        if char != " ":
            patent_id = list_patentid_nodup[i]
            if char not in dict_clean_char:
                dict_clean_char[char] = {patent_id:name}
            else:
                dict_clean_char[char].update({patent_id:name})
    if i % 10000 == 0:
        print(i)

with open('dict_clean_char.pickle', 'wb') as handle:
    pickle.dump(dict_clean_char, handle, protocol = pickle.HIGHEST_PROTOCOL)

with open('dict_clean_char.pickle', 'rb') as handle:
    dict_clean_char = pickle.load(handle)

list_char = list(dict_clean_char.keys())
list_char.sort()
'''

# dict_replace gives the correct char to replace the old one
with open('dict_char_replace.json', 'r') as f:
    dict_replace = json.load(f)

# change ., to space
for i in range(0, len(list_cleanorg)):
    name = list_cleanorg[i]
    if '.,' in name:
        newname = name.replace('.,', ' ')
        list_cleanorg[i] = newname

##### below to find x.x.x.x.x.x.x from 10 x(s) to 3 x(s) #####################
def find_pattern(name):
    for i in range(10,1,-1):
        temp_re = re.compile('\\b(\\w)' + i*'\\.(\\w)\\b')
        m = re.search(temp_re, name)
        if m:
            print(name)
            print(m.group(0))
            return m.group(0)

def fix_pattern(name, i): # i from 10 to 1
    temp_re = re.compile('\\b(\\w)' + i*'\\.(\\w)\\b') # means x.x.x... (from 11x to 2x)
    m = re.search(temp_re, name)
    if m:
        new_re = ''.join(ele for ele in ['\\' + str(j) for j in range(1, i+1+1)])
        # for example, when i = 5, new_re = r"\1\2\3\4\5\6"
        newname = temp_re.sub(new_re, name)
        return newname
    else:
        return name
        
n = 0
for i in range(0, len(list_cleanorg)):
    name = list_cleanorg[i]
    newname = list_cleanorg[i]
    for n_x in range(10, 0, -1):
        newname = fix_pattern(newname, n_x)
    if newname != name:
        n+=1        
        list_cleanorg[i] = newname
############################################################################
################ begin to take care of {} #################################
match_re = re.compile(r"{.*over.*\((.)\)}")

'''
check all these strange {  over ()} cases
for patentid, name in dict_clean_char[list_char[62]].items():
    m = re.search(match_re, name)
    if m:
        if m.group(1) == ' ':
            print(patentid)
            print(name)
            print(m.group(1))
'''

n=0
for i in range(0, len(list_cleanorg)):
    name = list_cleanorg[i]
    m = re.search(match_re, name)
    if m:
        if m.group(1) == ' ':
            replace_char = ''
        else:
            replace_char = m.group(1)
        newname = re.sub(match_re, replace_char, name)
        list_cleanorg[i] = newname
        n+=1
##########################################################################

##### clean every char to correct ones ##############################
list_cleanorg_afcharc = []
for i in range(0, len(list_cleanorg)):
    name = list_cleanorg[i]
    newchar_list = []
    for char in name:
        if char != ' ':
            newchar_list.append(dict_replace[char])
        else:
            newchar_list.append(' ')
    newname = ''.join(newchar for newchar in newchar_list)
    list_cleanorg_afcharc.append(newname)
######################################################
    
# process dot a bit more carefully because .com or .net cannot replace dot as space, dont have meaningful search results
dot2replace_re = re.compile(r"(\. )|\.$|^\.") # dot space or dot at the end of the string or dot at beg
for i in range(0, len(list_cleanorg_afcharc)):
    name = list_cleanorg_afcharc[i]
    newname = dot2replace_re.sub(' ', name)
    list_cleanorg_afcharc[i] = newname 

white0 = r" +" # >=1 whitespace 
white0_re = re.compile(white0)
for i in range(0, len(list_cleanorg_afcharc)):
    name = list_cleanorg_afcharc[i]
    newname = white0_re.sub(' ', name)
    list_cleanorg_afcharc[i] = newname

white1 = r"^ | $" # begin or end with whitespace
white1_re = re.compile(white1)
for i in range(0, len(list_cleanorg_afcharc)):
    name = list_cleanorg_afcharc[i]
    newname = white1_re.sub('',name)
    list_cleanorg_afcharc[i] = newname

# take care of u s, u s a
usa_re = re.compile(r"\b(u) \b(s) \b(a)\b")
us_re = re.compile(r"\b(u) \b(s)\b")
for i in range(0, len(list_cleanorg_afcharc)):
    name = list_cleanorg_afcharc[i]
    newname = usa_re.sub('usa', name)
    newname = us_re.sub('us', newname)
    list_cleanorg_afcharc[i] = newname

# take care of "a l'energie"
temp_re = re.compile(r"\ba *l'* *energie")
for i in range(0, len(list_cleanorg_afcharc)):
    name = list_cleanorg_afcharc[i]
    newname = temp_re.sub("a l'energie", name)
    list_cleanorg_afcharc[i] = newname

###############################################################
dict_raw2new = {}
for i in range(0, len(list_raworg_nodup)):
    rawname = list_raworg_nodup[i]
    newname = list_cleanorg_afcharc[i]
    dict_raw2new.update({rawname: newname})
mf.pickle_dump(dict_raw2new, 'dict_pv_raw2new')

dict_new2raw = {}
for i in range(0, len(list_raworg_nodup)):
    rawname = list_raworg_nodup[i]
    newname = list_cleanorg_afcharc[i]
    if newname not in dict_new2raw:
        dict_new2raw[newname] = {rawname}
    else:
        dict_new2raw[newname].update({rawname})    
mf.pickle_dump(dict_new2raw, 'dict_pv_new2raw')

4 answers


    Good afternoon, qq_52091435.
    This answer draws on ChatGPT-3.5.

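    The KeyError: '\x80' you are seeing means the code tried to look up a key that does not exist in a dictionary. Here, the exception is raised on this line:

    newchar_list.append(dict_replace[char])
    

    dict_replace is the dictionary loaded from 'dict_char_replace.json', and char is a single character taken from the text. When char is \x80, no matching key is found in dict_replace.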
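The failure can be reproduced in a few lines. This is a minimal sketch for illustration only: `dict_replace` below is a three-entry toy dictionary, not the real contents of dict_char_replace.json, and `describe_char` is a hypothetical helper. It also shows that '\x80' is the control character U+0080, which is invisible when printed and therefore easy to overlook.

```python
import unicodedata

# Toy stand-in for the dictionary loaded from dict_char_replace.json (assumption).
dict_replace = {"!": " ", "$": "s", "a": "a"}


def describe_char(ch):
    """Hypothetical helper: show a character's repr, code point and Unicode category."""
    return f"{ch!r} U+{ord(ch):04X} category={unicodedata.category(ch)}"


bad = None
try:
    for ch in "a\x80":
        _ = dict_replace[ch]  # same kind of lookup as dict_replace[char] in the script
except KeyError as exc:
    bad = exc.args[0]  # the key that was not found
    print("missing key:", describe_char(bad))
```

Printing the code point and category of the offending character is usually more informative than the bare traceback, because it tells you exactly which entry would have to exist in the JSON file.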

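    Ways to resolve it:

    1. Check the JSON file

      • Verify that 'dict_char_replace.json' contains a mapping for every character that can occur in the data, including \x80.
      • If \x80 should be mapped to another character or removed, add a matching entry to the JSON file, for example: "\u0080": "replacement_char"
    2. Handle unknown characters

      • Before looking a character up, check whether it is in the dictionary. If it is not, fall back to a default behaviour (for example, keep the original character or substitute a space):
        if char in dict_replace:
            newchar_list.append(dict_replace[char])
        else:
            # For characters missing from the dictionary you can skip them, substitute a fixed character, or log them
            newchar_list.append(char)  # default: keep the original character
        
    3. Encoding issues

      • Make sure the raw data and the code files use the same character encoding, so that special characters are not misread because of an encoding mismatch.
    4. Clean the input data

      • Before processing the text, consider preprocessing data that may contain non-standard characters, converting it to a character set whose mappings are already defined in dict_char_replace.json.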
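Points 1, 3 and 4 can be combined into a quick audit that runs before the replacement loop. The sketch below makes several assumptions: that the JSON file is stored as UTF-8 (so it is opened with an explicit `encoding="utf-8"` instead of the platform default), and that `load_char_map` and `find_unmapped_chars` are hypothetical helper names. The demo data is a toy stand-in written to a temporary file, not the real dict_char_replace.json. Whether an encoding mismatch is the actual cause in this case is not confirmed.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def load_char_map(path):
    """Load the replacement table, decoding the file as UTF-8 (assumed encoding)."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def find_unmapped_chars(names, char_map):
    """Count every non-space character that has no entry in char_map."""
    missing = Counter()
    for name in names:
        for ch in name:
            if ch != " " and ch not in char_map:
                missing[ch] += 1
    return missing


# Toy replacement table written to a temporary file for the demo.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo_char_replace.json"
    demo_path.write_text(json.dumps({"a": "a", "b": "b", "!": " "}), encoding="utf-8")
    demo_map = load_char_map(demo_path)

# Hypothetical names; two of them contain the control character U+0080.
missing = find_unmapped_chars(["ab!", "a\x80b", "\x80"], demo_map)
for ch, count in missing.most_common():
    print(repr(ch), hex(ord(ch)), count)
```

In the original script the same audit could be run as `find_unmapped_chars(list_cleanorg, dict_replace)` just before the character-cleaning loop, which lists every character that still needs an entry instead of stopping at the first one.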

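    In short, first check that the JSON file contains a mapping for the \x80 character, then add logic to the program to handle keys that are not found in the dictionary.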
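One way to fold that fallback logic into a single function is sketched below. The name `clean_chars`, its `default` and `unknown` parameters, and the toy mapping are all assumptions made for illustration and are not part of the original script. Unlike a variant that keeps the original character, this one substitutes a space for unmapped characters and records them so the JSON file can be extended later.

```python
def clean_chars(name, char_map, default=" ", unknown=None):
    """Map each non-space character through char_map.

    Characters without an entry are replaced by `default` and, if a set is
    passed as `unknown`, recorded there for later inspection.
    """
    out = []
    for ch in name:
        if ch == " ":
            out.append(" ")
        elif ch in char_map:
            out.append(char_map[ch])
        else:
            if unknown is not None:
                unknown.add(ch)
            out.append(default)
    return "".join(out)


# Toy mapping standing in for dict_replace (assumption).
toy_map = {"a": "a", "c": "c", "m": "m", "e": "e", "&": " & "}
unknown_chars = set()
cleaned = clean_chars("acme\x80 & a", toy_map, unknown=unknown_chars)
print(repr(cleaned), sorted(unknown_chars))
```

The extra spaces this produces are harmless in the original script, because a later step already collapses runs of spaces and trims the ends of each name.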

    This answer was selected by the asker as the best answer.

Question events

  • Closed by the system on March 17
  • Answer accepted on March 9
  • Question created on March 9
