wnalki 2018-04-01 11:58 Acceptance rate: 0%
Views: 1406
Closed

Encoding format problem when processing a CSV file for LDA

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

df = pd.read_csv("dataa.csv")
df.head()

n_features = 1000
tf_vectorizer = CountVectorizer(strip_accents='unicode', max_features=n_features, stop_words='english', max_df=0.5, min_df=10)
tf = tf_vectorizer.fit_transform(df.content)


ValueError Traceback (most recent call last)
in ()
1 n_features = 1000
2 tf_vectorizer = CountVectorizer(strip_accents ='unicode',max_features=n_features,stop_words='english',max_df = 0.5,min_df = 10)
----> 3 tf = tf_vectorizer.fit_transform(df.content)

/home/wanghan/.local/lib/python2.7/site-packages/sklearn/feature_extraction/text.pyc in fit_transform(self, raw_documents, y)
837
838 vocabulary, X = self._count_vocab(raw_documents,
--> 839 self.fixed_vocabulary_)
840
841 if self.binary:

/home/wanghan/.local/lib/python2.7/site-packages/sklearn/feature_extraction/text.pyc in _count_vocab(self, raw_documents, fixed_vocab)
760 for doc in raw_documents:
761 feature_counter = {}
--> 762 for feature in analyze(doc):
763 try:
764 feature_idx = vocabulary[feature]

/home/wanghan/.local/lib/python2.7/site-packages/sklearn/feature_extraction/text.pyc in <lambda>(doc)
239
240 return lambda doc: self._word_ngrams(
--> 241 tokenize(preprocess(self.decode(doc))), stop_words)
242
243 else:

/home/wanghan/.local/lib/python2.7/site-packages/sklearn/feature_extraction/text.pyc in decode(self, doc)
119
120 if doc is np.nan:
--> 121 raise ValueError("np.nan is an invalid document, expected byte or "
122 "unicode string.")
123

ValueError: np.nan is an invalid document, expected byte or unicode string.
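
The message in the traceback points at missing values rather than at an encoding problem: at least one row of `df.content` appears to be `NaN` (for example an empty cell in the CSV), and `CountVectorizer` rejects it. A minimal sketch of one possible fix follows, assuming it is acceptable to drop rows without text. The helper `clean_documents` and the toy DataFrame are hypothetical stand-ins, since `dataa.csv` itself is not available, and `min_df`/`max_df` are left at their defaults only because the toy data is tiny.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer


def clean_documents(frame, column="content"):
    """Return the text column with missing rows dropped and values cast to str.

    Hypothetical helper, not part of pandas or scikit-learn.
    """
    docs = frame[column].dropna()
    return docs.astype(str)


# Toy stand-in for pd.read_csv("dataa.csv"); the second row mimics an empty cell.
df = pd.DataFrame({
    "content": [
        "topic models find themes in text",
        np.nan,
        "csv files may contain empty cells",
    ]
})

# Count the missing documents before vectorizing.
print(df["content"].isna().sum())

docs = clean_documents(df)

# min_df and max_df from the question are omitted here because the toy
# corpus has only two documents; restore them for the real data.
tf_vectorizer = CountVectorizer(strip_accents="unicode", stop_words="english")
tf = tf_vectorizer.fit_transform(docs)
print(tf.shape)
```

If the rows must stay aligned with the rest of the DataFrame, `df["content"].fillna("")` is an alternative to dropping them; the empty strings then become all-zero rows in the count matrix.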

2 answers

  • wnalki 2018-04-01 11:58
    ![Image description](https://img-ask.csdn.net/upload/201804/01/1522583843_473408.png)
