wnalki 2018-04-01 11:58 Acceptance rate: 0%
Views: 1406
Closed

Encoding-format problem when processing a CSV file for LDA

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

# Load the documents; the text is in the "content" column
df = pd.read_csv("dataa.csv")
df.head()

# Build the term-count matrix used as input to LDA
n_features = 1000
tf_vectorizer = CountVectorizer(strip_accents='unicode', max_features=n_features, stop_words='english', max_df=0.5, min_df=10)
tf = tf_vectorizer.fit_transform(df.content)


ValueError Traceback (most recent call last)
in ()
1 n_features = 1000
2 tf_vectorizer = CountVectorizer(strip_accents ='unicode',max_features=n_features,stop_words='english',max_df = 0.5,min_df = 10)
----> 3 tf = tf_vectorizer.fit_transform(df.content)

/home/wanghan/.local/lib/python2.7/site-packages/sklearn/feature_extraction/text.pyc in fit_transform(self, raw_documents, y)
837
838 vocabulary, X = self._count_vocab(raw_documents,
--> 839 self.fixed_vocabulary_)
840
841 if self.binary:

/home/wanghan/.local/lib/python2.7/site-packages/sklearn/feature_extraction/text.pyc in _count_vocab(self, raw_documents, fixed_vocab)
760 for doc in raw_documents:
761 feature_counter = {}
--> 762 for feature in analyze(doc):
763 try:
764 feature_idx = vocabulary[feature]

/home/wanghan/.local/lib/python2.7/site-packages/sklearn/feature_extraction/text.pyc in <lambda>(doc)
239
240 return lambda doc: self._word_ngrams(
--> 241 tokenize(preprocess(self.decode(doc))), stop_words)
242
243 else:

/home/wanghan/.local/lib/python2.7/site-packages/sklearn/feature_extraction/text.pyc in decode(self, doc)
119
120 if doc is np.nan:
--> 121 raise ValueError("np.nan is an invalid document, expected byte or "
122 "unicode string.")
123

ValueError: np.nan is an invalid document, expected byte or unicode string.
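
The last line of the traceback points at a missing value rather than a character encoding: `decode` raises as soon as it meets a document that is `np.nan`, which is what pandas produces for an empty cell. Below is a minimal sketch of one way to check for and drop such rows before vectorizing. It assumes the empty cells are in the `content` column; the small DataFrame is a hypothetical stand-in for `dataa.csv`, and `max_df`/`min_df`/`max_features` are left out only because the toy data is too small for them.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for dataa.csv: the second row has no text,
# which is how an empty CSV cell looks after pd.read_csv (NaN).
df = pd.DataFrame({
    "content": ["the cat sat on the mat", np.nan, "the dog chased the cat"],
})

# Count how many documents are missing before vectorizing.
n_missing = int(df["content"].isnull().sum())

# Drop the rows without text. An alternative that keeps the row count
# unchanged would be df["content"].fillna("").
clean = df.dropna(subset=["content"])

tf_vectorizer = CountVectorizer(strip_accents="unicode", stop_words="english")
tf = tf_vectorizer.fit_transform(clean["content"].astype(str))
```

Dropping rows changes the row count, so if the topic assignments need to be joined back to the original table later, keeping `clean.index` around (or using `fillna("")` instead) may be the safer choice.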

2 answers

  • wnalki 2018-04-01 11:58

    ![Image description](https://img-ask.csdn.net/upload/201804/01/1522583843_473408.png)
