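Hi everyone, I have a question about using nltk to remove stop words from every .txt file in a folder. My code is below. It currently raises an error, and I would like to know what is causing it and whether anything else in the code is wrong.
My second question: after I delete some of the code the script runs, but no stop words are actually removed from the documents. Here is my code: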
import os
import io
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
text_path = r'D:\1.1 SEC EDGAR年报源文件 (10Q_10KA_10QA)\1995_SEC EDGAR年报 (10K_10KSB_10K405)\QTR3'
text_list = os.listdir(text_path)
for path in text_list:
    with open(text_path + '\\' + path, 'r') as f:
        result = f.read()
# add your own stop words to the corpus
new_stopwords = stopwords.words('english')
new_stopwords.append(['ME', 'MY', 'MYSELF', 'WE', 'OUR', 'OURS', 'OURSELVES', 'YOU', 'YOUR', 'YOURS',
                      'YOURSELF', 'YOURSELVES', 'HE', 'HIM', 'HIS', 'HIMSELF', 'SHE', 'HER', 'HERS', 'HERSELF',
                      'BEEN', 'BEING', 'HAVE', 'HAS', 'HAD', 'HAVING', 'DO', 'DOES', 'DID', 'DOING', 'AN',
                      'THE', 'AND', 'BUT', 'IF', 'OR', 'BECAUSE', 'AS', 'UNTIL', 'WHILE', 'OF', 'AT', 'BY',
                      'FOR', 'WITH', 'ABOUT', 'BETWEEN', 'INTO', 'THROUGH', 'DURING', 'BEFORE',
                      'AFTER', 'ABOVE', 'BELOW', 'TO', 'FROM', 'UP', 'DOWN', 'IN', 'OUT', 'ON', 'OFF', 'OVER',
                      'UNDER', 'AGAIN', 'FURTHER', 'THEN', 'ONCE', 'HERE', 'THERE', 'WHEN', 'WHERE', 'WHY',
                      'HOW', 'ALL', 'ANY', 'BOTH', 'EACH', 'FEW', 'MORE', 'MOST', 'OTHER', 'SOME', 'SUCH',
                      'NO', 'NOR', 'NOT', 'ONLY', 'OWN', 'SAME', 'SO', 'THAN', 'TOO', 'VERY', 'CAN',
                      'JUST', 'SHOULD', 'NOW', 'AMONG'])
# Bring in the default English NLTK stop words
# stoplist = stopwords.words('english')
# Define additional stopwords in a string
# add additional stop words separated by spaces
additional_stopwords = """can ieee vol eta com may different less let raf cos will con real cat can't cant"""
# Split the additional stopwords string on each word and then add
# those words to the NLTK stopwords list
new_stopwords += additional_stopwords.split()
# change loop dir to the FULL path of where all your .txt files reside
# change save path to a dir where you want your new stop word removed txt files saved
loop_dir = r'D:\1.1 SEC EDGAR年报源文件 (10Q_10KA_10QA)\1995_SEC EDGAR年报 (10K_10KSB_10K405)\QTR3'
save_dir = r'D:\1.1 SEC EDGAR年报源文件 (10Q_10KA_10QA)\1995_SEC EDGAR年报 (10K_10KSB_10K405)\QTR3-1'
# Open a file and read it into memory
for txt in os.listdir(loop_dir):
    print(txt)
    file = open(loop_dir + txt)
    save_file = open(save_dir + txt, 'w')
    text = file.read()
    # Apply the stoplist to the text
    cleaned = [word for word in text.split() if word not in new_stopwords]
    save_file.writelines(["%s\n" % item for item in cleaned])
The error is:
Traceback (most recent call last):
  File "D:\7PycharmPythonProjects\PythonLessons\4 Stopwords_deletes 2.py", line 44, in <module>
    file = open(loop_dir + txt)
FileNotFoundError: [Errno 2] No such file or directory: 'D:\\1.1 SEC EDGAR年报源文件 (10Q_10KA_10QA)\\1995_SEC EDGAR年报 (10K_10KSB_10K405)\\QTR319950703_10-K_edgar_data_731190_0000731190-95-000011.txt'
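Reading the message above: the path ends in `QTR319950703_10-K_...`, so the folder name `QTR3` and the file name have been glued together with no separator. That is what `loop_dir + txt` produces when `loop_dir` has no trailing backslash. A minimal sketch of one way to build the paths with `os.path.join` follows; the helper name `build_paths` and the short folder and file names in the usage lines are made up for illustration:

```python
import os


def build_paths(loop_dir, save_dir, file_name):
    """Return (input_path, output_path) for one file.

    os.path.join inserts the platform's separator between the folder
    and the file name, so nothing depends on a trailing slash.
    """
    return os.path.join(loop_dir, file_name), os.path.join(save_dir, file_name)


# Hypothetical short folder and file names, only to show the effect.
in_path, out_path = build_paths("QTR3", "QTR3-1", "report.txt")
print(in_path)   # folder and file name are now separated
print(out_path)
```

The output folder probably also needs to exist before `open(..., 'w')` is called; `os.makedirs(save_dir, exist_ok=True)` is one way to make sure of that.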
How can I fix this error, and are there any other potential problems in my stop-word removal code? Thanks, everyone!
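On the second point, two things in the code above would keep stop words from being removed, as far as I can tell. First, `new_stopwords.append([...])` adds the whole list as a single nested element rather than its individual words (`extend` or `+=` adds them one by one). Second, `word not in new_stopwords` is case-sensitive, while NLTK's English list is lowercase and the custom list is uppercase, so a word such as `The` matches neither. A rough sketch of a case-insensitive filter follows. The names `build_stopword_set`, `remove_stopwords`, and `clean_folder` are hypothetical, the short `base` list stands in for `stopwords.words('english')` so the snippet runs without NLTK data, and the `utf-8` encoding with `errors="ignore"` is an assumption about the files:

```python
import os


def build_stopword_set(base_words, extra_words=()):
    """Merge word lists into one lowercase set for case-insensitive lookup."""
    merged = set()
    for word in list(base_words) + list(extra_words):
        merged.add(word.lower())
    return merged


def remove_stopwords(text, stopword_set):
    """Return the whitespace-separated words of text that are not stop words.

    Punctuation attached to a word (e.g. "the,") is stripped only for the
    comparison; the kept words are returned unchanged.
    """
    kept = []
    for word in text.split():
        if word.strip('.,;:!?"()').lower() not in stopword_set:
            kept.append(word)
    return kept


def clean_folder(loop_dir, save_dir, stopword_set):
    """Write a stop-word-free copy of every .txt file in loop_dir to save_dir."""
    os.makedirs(save_dir, exist_ok=True)
    for name in os.listdir(loop_dir):
        if not name.lower().endswith(".txt"):
            continue
        in_path = os.path.join(loop_dir, name)
        out_path = os.path.join(save_dir, name)
        # Encoding is an assumption; adjust it to match the real files.
        with open(in_path, "r", encoding="utf-8", errors="ignore") as src:
            words = remove_stopwords(src.read(), stopword_set)
        with open(out_path, "w", encoding="utf-8") as dst:
            dst.write("\n".join(words))


# Stand-in for stopwords.words('english') so the sketch runs without NLTK data.
base = ["the", "and", "of"]
extra = ["WE", "OUR", "can't"]
stops = build_stopword_set(base, extra)
print(remove_stopwords("We filed THE annual report of the Company", stops))
```

With NLTK available, `build_stopword_set(stopwords.words('english'), custom_list)` would take the place of the stand-in list, where `custom_list` is the uppercase list from the question. Using `with` blocks also closes each file, which the original loop never does.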