Question 1: In gensim, what does the corpus parameter mean when training an LDA model?
from gensim.models import LdaModel
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=20, passes=60)
Question 2: I have seen people also use tf-idf to initialize the corpus. Why use tf-idf?
from gensim import models
tfidf_model = models.TfidfModel(corpus)
corpus_tfidf = tfidf_model[corpus]
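For reference, here is a minimal pure-Python sketch of what the tf-idf step does to a bag-of-words corpus. It assumes the weighting that gensim's TfidfModel applies by default as far as I know (raw term count times log2(N / df), then L2 normalization per document, with zero weights dropped). The toy corpus and the helper name tfidf_transform are made up for illustration and are not part of gensim.

```python
import math
from collections import Counter

# Hypothetical toy corpus in bag-of-words form:
# one list of (token_id, count) pairs per document.
toy_corpus = [
    [(0, 2), (1, 1)],
    [(0, 1), (2, 3)],
    [(0, 1), (1, 1), (2, 1)],
]


def tfidf_transform(corpus):
    """Reweight (token_id, count) pairs by count * log2(N / df), then L2-normalize."""
    n_docs = len(corpus)
    # Document frequency: in how many documents each token id appears.
    doc_freq = Counter(token_id for doc in corpus for token_id, _ in doc)
    weighted_corpus = []
    for doc in corpus:
        weights = [
            (token_id, count * math.log2(n_docs / doc_freq[token_id]))
            for token_id, count in doc
        ]
        # Assumption: tokens whose weight is zero (present in every document) are dropped.
        weights = [(token_id, w) for token_id, w in weights if w > 0.0]
        norm = math.sqrt(sum(w * w for _, w in weights))
        weighted_corpus.append(
            [(token_id, w / norm) for token_id, w in weights] if norm > 0.0 else []
        )
    return weighted_corpus


toy_tfidf = tfidf_transform(toy_corpus)
for doc in toy_tfidf:
    print(doc)
```

Token 0 occurs in every document, so it gets zero weight and disappears. The remaining entries are real-valued weights rather than integer counts. Whether that helps LDA is a separate question, since LDA is formulated over word counts.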
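# id2word is a gensim corpora.Dictionary; string_list100 is a list of tokenized documents
corpus = [id2word.doc2bow(text) for text in string_list100]  # build a bag-of-words vector for each document
print(corpus[:1])
print([[(id2word[word_id], freq) for word_id, freq in cp] for cp in corpus[:1]])
The output looks like: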
[[(0, 2), (1, 2)]]
[[('一侧', 2), ('一端', 2)]]
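To make the corpus format concrete, below is a small pure-Python sketch that imitates what Dictionary.doc2bow produces. Each document becomes a list of (token_id, count) pairs, and the corpus passed to LdaModel is an iterable of such documents. The helper names build_vocab and to_bow and the sample tokens are hypothetical, and the real gensim Dictionary may assign ids in a different order.

```python
from collections import Counter

# Hypothetical tokenized documents (already segmented into words).
docs = [
    ["一侧", "一端", "一侧", "一端"],
    ["一端", "连接", "连接"],
]


def build_vocab(tokenized_docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for doc in tokenized_docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id


def to_bow(doc, token2id):
    """Convert one tokenized document into sorted (token_id, count) pairs."""
    counts = Counter(token2id[token] for token in doc if token in token2id)
    return sorted(counts.items())


token2id = build_vocab(docs)
id2token = {token_id: token for token, token_id in token2id.items()}
bow_corpus = [to_bow(doc, token2id) for doc in docs]

print(bow_corpus[:1])
print([[(id2token[word_id], freq) for word_id, freq in doc] for doc in bow_corpus[:1]])
```

So the corpus argument is just the documents with words replaced by integer ids and their counts. The id2word argument is what lets the model map those ids back to readable words when it prints topics.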