qq_52091435 2022-03-06 21:38 Acceptance rate: 83.3%

Using nltk to remove stopwords from all txt documents in a folder

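Hello everyone. I'd like to ask about using nltk to remove stopwords from all the txt documents in a folder. My code is below. It currently raises an error, and I'd like to know what is wrong and whether anything else in the code is incorrect.
My second question: after I delete some of the code it runs, but no stopwords are actually removed from the documents. My code is as follows: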

import os
import io
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

text_path = r'D:\1.1 SEC EDGAR年报源文件 (10Q_10KA_10QA)\1995_SEC EDGAR年报 (10K_10KSB_10K405)\QTR3'
text_list = os.listdir(text_path)
for path in text_list:
    with open(text_path + '\\' + path, 'r') as f:
        result = f.read()

# add your own stop words to the corpus
new_stopwords = stopwords.words('english')
new_stopwords.append(['ME', 'MY', 'MYSELF', 'WE', 'OUR', 'OURS', 'OURSELVES', 'YOU', 'YOUR', 'YOURS',
                      'YOURSELF', 'YOURSELVES', 'HE', 'HIM', 'HIS', 'HIMSELF', 'SHE', 'HER', 'HERS', 'HERSELF',
                      'BEEN', 'BEING', 'HAVE', 'HAS', 'HAD', 'HAVING', 'DO', 'DOES', 'DID', 'DOING', 'AN',
                      'THE', 'AND', 'BUT', 'IF', 'OR', 'BECAUSE', 'AS', 'UNTIL', 'WHILE', 'OF', 'AT', 'BY',
                      'FOR', 'WITH', 'ABOUT', 'BETWEEN', 'INTO', 'THROUGH', 'DURING', 'BEFORE',
                      'AFTER', 'ABOVE', 'BELOW', 'TO', 'FROM', 'UP', 'DOWN', 'IN', 'OUT', 'ON', 'OFF', 'OVER',
                      'UNDER', 'AGAIN', 'FURTHER', 'THEN', 'ONCE', 'HERE', 'THERE', 'WHEN', 'WHERE', 'WHY',
                      'HOW', 'ALL', 'ANY', 'BOTH', 'EACH', 'FEW', 'MORE', 'MOST', 'OTHER', 'SOME', 'SUCH',
                      'NO', 'NOR', 'NOT', 'ONLY', 'OWN', 'SAME', 'SO', 'THAN', 'TOO', 'VERY', 'CAN',
                      'JUST', 'SHOULD', 'NOW', 'AMONG'])

# Bring in the default English NLTK stop words
# stoplist = stopwords.words('english')

# Define additional stopwords in a string
# add additional stop words separated by spaces
additional_stopwords = """can ieee vol eta com may different less let raf cos will con real cat can't cant"""

# Split the additional stopwords string on each word and then add
# those words to the NLTK stopwords list
new_stopwords += additional_stopwords.split()

# change loop dir to the FULL path of where all your .txt files reside
# change save path to a dir where you want your new stop word removed txt files saved
loop_dir = r'D:\1.1 SEC EDGAR年报源文件 (10Q_10KA_10QA)\1995_SEC EDGAR年报 (10K_10KSB_10K405)\QTR3'
save_dir = r'D:\1.1 SEC EDGAR年报源文件 (10Q_10KA_10QA)\1995_SEC EDGAR年报 (10K_10KSB_10K405)\QTR3-1'

# Open a file and read it into memory
for txt in os.listdir(loop_dir):
    print(txt)
    file = open(loop_dir + txt)
    save_file = open(save_dir + txt, 'w')
    text = file.read()

    # Apply the stoplist to the text
    cleaned = [word for word in text.split() if word not in new_stopwords]

    save_file.writelines(["%s\n" % item for item in cleaned])


The error is:

Traceback (most recent call last):
  File "D:\7PycharmPythonProjects\PythonLessons\4 Stopwords_deletes 2.py", line 44, in <module>
    file = open(loop_dir + txt)
FileNotFoundError: [Errno 2] No such file or directory: 'D:\\1.1 SEC EDGAR年报源文件 (10Q_10KA_10QA)\\1995_SEC EDGAR年报 (10K_10KSB_10K405)\\QTR319950703_10-K_edgar_data_731190_0000731190-95-000011.txt'

How can I fix the problem in this code, and are there any other potential issues with my stopword-removal code? Thank you all!

1 answer

  • 陈年椰子 2022-03-07 08:23

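    This statement probably isn't producing the correct file path: loop_dir does not end with a path separator, so the folder name and the file name are run together, as the path in the traceback shows.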

    file = open(loop_dir + txt)
    

    Try changing it to this:

    file = open(loop_dir +"/"+ txt)
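
As a hedged alternative to concatenating "/" by hand, the standard library's os.path.join inserts the separator for you. The sketch below only illustrates the difference between the two approaches. The helper name build_path, the folder "data/QTR3" and the file name "report.txt" are placeholders made up for this example, not values from the question.

```python
import os


def build_path(directory, filename):
    """Join a directory and a file name with the platform's separator."""
    return os.path.join(directory, filename)


# Hypothetical directory and file name, for illustration only.
loop_dir = os.path.join("data", "QTR3")
txt = "report.txt"

# Plain concatenation glues the two names together: "...QTR3report.txt"
broken = loop_dir + txt
# os.path.join puts a separator between folder and file name
fixed = build_path(loop_dir, txt)

print(broken)
print(fixed)
```

The same call could be used for the output side, for example os.path.join(save_dir, txt), provided the output folder already exists.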
    
    for txt in os.listdir(loop_dir):
        print(txt)
        # Read the source file; the "/" supplies the missing separator
        with open(loop_dir + "/" + txt) as file:
            text = file.read()
        # Apply the stoplist to the text
        cleaned = [word for word in text.split() if word not in new_stopwords]
        # Write one remaining word per line to the output folder
        with open(save_dir + "/" + txt, 'w') as save_file:
            save_file.writelines(["%s\n" % item for item in cleaned])
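
On the second question (the script runs but stopwords seem to stay in the files), two things in the posted code may be contributing. First, new_stopwords.append([...]) adds the whole uppercase list as a single element, so none of those words is individually in the stop list; list.extend adds them one by one. Second, the test "word not in new_stopwords" is case-sensitive and NLTK's English list is lowercase, so a token such as "The" or "the," would not match. The sketch below is one possible way to handle both, not a definitive fix. base_stopwords is a short stand-in for stopwords.words('english'), and remove_stopwords and sample are names invented for this illustration.

```python
import string

# Stand-in for stopwords.words('english'); the real NLTK list is longer.
base_stopwords = ["the", "and", "of", "to", "in"]

# list.append adds the whole list as ONE element ...
appended = list(base_stopwords)
appended.append(["ME", "MY"])
# ... whereas list.extend adds each word separately.
extended = list(base_stopwords)
extended.extend(["ME", "MY"])


def remove_stopwords(text, stop_words):
    """Drop stop words, ignoring case and surrounding punctuation."""
    # Lowercase once and use a set for fast membership tests
    stop_set = {w.lower() for w in stop_words}
    kept = []
    for token in text.split():
        # Compare on a normalized key, but keep the original token
        key = token.strip(string.punctuation).lower()
        # Tokens made only of punctuation have an empty key and are dropped
        if key and key not in stop_set:
            kept.append(token)
    return kept


sample = "The Company and MY subsidiaries, in 1995"
print(remove_stopwords(sample, extended))
```

Whether to lowercase the output as well, or to tokenize with word_tokenize instead of str.split, depends on what the cleaned files are used for; this sketch keeps the original tokens unchanged.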
    
    
    
    This answer was selected by the asker as the best answer.

Question events

  • Closed by the system on March 15
  • Answer accepted on March 7
  • Question created on March 6
