qq_52091435 2022-03-06 21:38 Acceptance rate: 83.3%
Views: 13
Closed

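Using nltk to remove stopwords from all txt files in a folder

Hi everyone, I'd like to ask about using nltk to remove stopwords from every txt file in a folder. My code is below. It currently raises an error, and I'd like to know what is wrong and whether anything else in the code is incorrect.
My second question: after I delete some of the code it does run, but no stopwords are actually removed from the documents. My code is as follows: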

import os
import io
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

text_path = r'D:\1.1 SEC EDGAR年报源文件 (10Q_10KA_10QA)\1995_SEC EDGAR年报 (10K_10KSB_10K405)\QTR3'
text_list = os.listdir(text_path)
for path in text_list:
    with open(text_path + '\\' + path, 'r') as f:
        result = f.read()

# add your own stop words to the corpus
new_stopwords = stopwords.words('english')
new_stopwords.append(['ME', 'MY', 'MYSELF', 'WE', 'OUR', 'OURS', 'OURSELVES', 'YOU', 'YOUR', 'YOURS',
                      'YOURSELF', 'YOURSELVES', 'HE', 'HIM', 'HIS', 'HIMSELF', 'SHE', 'HER', 'HERS', 'HERSELF',
                      'BEEN', 'BEING', 'HAVE', 'HAS', 'HAD', 'HAVING', 'DO', 'DOES', 'DID', 'DOING', 'AN',
                      'THE', 'AND', 'BUT', 'IF', 'OR', 'BECAUSE', 'AS', 'UNTIL', 'WHILE', 'OF', 'AT', 'BY',
                      'FOR', 'WITH', 'ABOUT', 'BETWEEN', 'INTO', 'THROUGH', 'DURING', 'BEFORE',
                      'AFTER', 'ABOVE', 'BELOW', 'TO', 'FROM', 'UP', 'DOWN', 'IN', 'OUT', 'ON', 'OFF', 'OVER',
                      'UNDER', 'AGAIN', 'FURTHER', 'THEN', 'ONCE', 'HERE', 'THERE', 'WHEN', 'WHERE', 'WHY',
                      'HOW', 'ALL', 'ANY', 'BOTH', 'EACH', 'FEW', 'MORE', 'MOST', 'OTHER', 'SOME', 'SUCH',
                      'NO', 'NOR', 'NOT', 'ONLY', 'OWN', 'SAME', 'SO', 'THAN', 'TOO', 'VERY', 'CAN',
                      'JUST', 'SHOULD', 'NOW', 'AMONG'])

# Bring in the default English NLTK stop words
# stoplist = stopwords.words('english')

# Define additional stopwords in a string
# add additional stop words separated by spaces
additional_stopwords = """can ieee vol eta com may different less let raf cos will con real cat can't cant"""

# Split the additional stopwords string on each word and then add
# those words to the NLTK stopwords list
new_stopwords += additional_stopwords.split()

# change loop dir to the FULL path of where all your .txt files reside
# change save path to a dir where you want your new stop word removed txt files saved
loop_dir = r'D:\1.1 SEC EDGAR年报源文件 (10Q_10KA_10QA)\1995_SEC EDGAR年报 (10K_10KSB_10K405)\QTR3'
save_dir = r'D:\1.1 SEC EDGAR年报源文件 (10Q_10KA_10QA)\1995_SEC EDGAR年报 (10K_10KSB_10K405)\QTR3-1'

# Open a file and read it into memory
for txt in os.listdir(loop_dir):
    print(txt)
    file = open(loop_dir + txt)
    save_file = open(save_dir + txt, 'w')
    text = file.read()

    # Apply the stoplist to the text
    cleaned = [word for word in text.split() if word not in new_stopwords]

    save_file.writelines(["%s\n" % item for item in cleaned])


The error is:

Traceback (most recent call last):
  File "D:\7PycharmPythonProjects\PythonLessons\4 Stopwords_deletes 2.py", line 44, in <module>
    file = open(loop_dir + txt)
FileNotFoundError: [Errno 2] No such file or directory: 'D:\\1.1 SEC EDGAR年报源文件 (10Q_10KA_10QA)\\1995_SEC EDGAR年报 (10K_10KSB_10K405)\\QTR319950703_10-K_edgar_data_731190_0000731190-95-000011.txt'

How can I fix the problem in this code, and does my stopword-removal code have any other potential issues? Thanks, everyone!

1 answer

  • 陈年椰子 2022-03-07 08:23

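    It looks like this statement does not build a valid file path: loop_dir has no trailing separator, so the folder name and the file name run together (the traceback shows 'QTR319950703_10-K_...').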

    file = open(loop_dir + txt)
    

    Try changing it to this:

    file = open(loop_dir +"/"+ txt)
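A possible alternative to concatenating with "/" is os.path.join, which inserts the separator for the current platform. The sketch below is only an illustration and not part of the original answer: the helper name paired_paths and the temporary demo folders are hypothetical. It also creates the output folder first, on the assumption that save_dir may not exist yet.

```python
import os
import tempfile


def paired_paths(loop_dir, save_dir):
    """Yield (source_path, target_path) for every .txt file in loop_dir."""
    # Assumption: the output folder may be missing, so create it if needed.
    os.makedirs(save_dir, exist_ok=True)
    for name in sorted(os.listdir(loop_dir)):
        if name.lower().endswith('.txt'):
            yield os.path.join(loop_dir, name), os.path.join(save_dir, name)


# Demo on a throwaway folder instead of the real D:\ paths.
with tempfile.TemporaryDirectory() as root:
    src = os.path.join(root, 'QTR3')
    dst = os.path.join(root, 'QTR3-1')
    os.makedirs(src)
    with open(os.path.join(src, 'sample.txt'), 'w', encoding='utf-8') as fh:
        fh.write('THE BOARD OF DIRECTORS')
    for source, target in paired_paths(src, dst):
        print(source, '->', target)
```

In the question's script, the two yielded paths would take the place of loop_dir + txt and save_dir + txt.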
    
    The whole loop then becomes:

    for txt in os.listdir(loop_dir):
        print(txt)
        # Read the source file; "with" closes it automatically
        with open(loop_dir + "/" + txt) as file:
            text = file.read()
        # Apply the stoplist to the text
        cleaned = [word for word in text.split() if word not in new_stopwords]
        # Write the remaining words, one per line, to the output folder
        with open(save_dir + "/" + txt, 'w') as save_file:
            save_file.writelines(["%s\n" % item for item in cleaned])
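On the second question (the script runs but nothing is removed), two things in the question's code seem likely to matter. First, new_stopwords.append([...]) adds the whole uppercase list as a single nested element, so none of those words can ever match a token. Second, the membership test is case-sensitive, while the words returned by stopwords.words('english') are lowercase. The sketch below shows one way to handle both. It is an assumption-laden illustration, not the asker's or answerer's code: the helper names build_stopwords and remove_stopwords are hypothetical, and a three-word list stands in for the real NLTK list so the example runs without downloading corpus data.

```python
import string


def build_stopwords(base_words, extra_words=(), extra_text=""):
    """Merge several stopword sources into one lowercase set."""
    stop = {w.lower() for w in base_words}
    # update() adds each word individually, like list.extend (not append).
    stop.update(w.lower() for w in extra_words)
    stop.update(extra_text.lower().split())
    return stop


def remove_stopwords(text, stop):
    """Keep tokens whose lowercase, punctuation-stripped form is not a stopword."""
    kept = []
    for token in text.split():
        key = token.strip(string.punctuation).lower()
        if key not in stop:
            kept.append(token)
    return kept


# list.append adds its argument as ONE element, which is what happens
# to the uppercase list in the question.
nested = ['i', 'me']
nested.append(['WE', 'OUR'])
print(nested)  # ['i', 'me', ['WE', 'OUR']]

# Stand-in for stopwords.words('english'); the real list is much longer.
stop = build_stopwords(['the', 'of', 'and'], ['WE', 'OUR'], "can may")
print(remove_stopwords("We may review THE annual report of our company.", stop))
# ['review', 'annual', 'report', 'company.']
```

Using a set also makes each lookup faster than scanning a list, which may matter for long filings. Note that text.split() leaves punctuation attached to the kept tokens, as the final 'company.' shows.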
    
    
    
    This answer was selected by the asker as the best answer.

Question events

  • Closed by the system on March 15
  • Answer accepted on March 7
  • Question created on March 6
