weixin_39789979
2020-12-26 07:48

GENSIM KeyedVectors and downloadable Models

It appears that when I download any model from the gensim downloader API, or save a Word2Vec model and reload it in KeyedVectors format, the vocab object stores a reverse index in the "count" variable. For example, with 10 words in the model, the first word has a count of 10 and an index of 0.

Using the following code:

import gensim.downloader as api
from fse.models import uSIF

word_vectors = api.load('glove-wiki-gigaword-100')
sif_model = uSIF(model=word_vectors)

word_vectors.wv.vocab shows the first word to be "the", with count = 400000 and index = 0. For each succeeding word in the model, the count goes down by 1 and the index goes up by 1.

Clearly this is not the frequency information.
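One way to confirm that pattern without scrolling through the vocabulary is a small check like the sketch below. The helper name `counts_look_like_ranks` is hypothetical (it is not part of gensim or fse), and it only assumes you can produce a list of counts ordered by vocabulary index.

```python
from typing import Sequence


def counts_look_like_ranks(counts: Sequence[int]) -> bool:
    """Return True if counts[i] == len(counts) - i for every index i.

    `counts` must be ordered by vocabulary index (index 0 first).
    """
    n = len(counts)
    return n > 0 and all(c == n - i for i, c in enumerate(counts))


# Placeholder counts as described above: 10 words, first count is 10.
placeholder = list(range(10, 0, -1))
# Made-up corpus frequencies, for contrast only.
real = [5021, 3310, 3309, 870, 12]

print(counts_look_like_ranks(placeholder))  # True
print(counts_look_like_ranks(real))         # False
```

Assuming the gensim 3.x attribute names used in this question, the input list could be built with something like `[word_vectors.vocab[w].count for w in word_vectors.index2word]`.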

I took this example from your Jupyter notebook, so I assume something has changed in the models themselves. Any guidance on this would be helpful. I CAN create my own word2vec models, and they have the frequency values as expected, and the precalculation works as expected.

Thanks for any thoughts or guidance on this. Perhaps it is normal that none of these models retain the word frequencies.

Thanks,

Michael Wade

This question originates from the open-source project: oborchers/Fast_Sentence_Embeddings


5 answers

  • weixin_39789979 2020-12-26 07:48

    I left out a key point: because of this, your tutorial.ipynb fails if you use uSIF instead of SIF. (See the error dump below.)

    in
    ----> 1 model.train(s)
          2

    ~/PycharmProjects/fse_test/venv/lib/python3.7/site-packages/fse/models/base_s2v.py in train(self, sentences, update, queue_factor, report_delay)
        640
        641         # Preform post-tain calls (i.e principal component removal)
    --> 642         self._post_train_calls()
        643
        644         self._log_train_end(eff_sentences=eff_sentences, eff_words=eff_words, overall_time=overall_time)

    ~/PycharmProjects/fse_test/venv/lib/python3.7/site-packages/fse/models/usif.py in _post_train_calls(self)
         79         """ Function calls to perform after training, such as computing eigenvectors """
         80         if self.components > 0:
    ---> 81             self.svd_res = compute_principal_components(self.sv.vectors, components=self.components)
         82             self.svd_weights = (self.svd_res[0] ** 2) / (self.svd_res[0] ** 2).sum().astype(REAL)
         83             remove_principal_components(self.sv.vectors, svd_res=self.svd_res, weights=self.svd_weights, inplace=True)

    ~/PycharmProjects/fse_test/venv/lib/python3.7/site-packages/fse/models/utils.py in compute_principal_components(vectors, components)
         32     start = time()
         33     svd = TruncatedSVD(n_components=components, n_iter=7, random_state=42, algorithm="randomized")
    ---> 34     svd.fit(vectors)
         35     elapsed = time()
         36     logger.info(f"computing {components} principal components took {int(elapsed-start)}s")

    ~/PycharmProjects/fse_test/venv/lib/python3.7/site-packages/sklearn/decomposition/truncated_svd.py in fit(self, X, y)
        139             Returns the transformer object.
        140         """
    --> 141         self.fit_transform(X)
        142         return self
        143

    ~/PycharmProjects/fse_test/venv/lib/python3.7/site-packages/sklearn/decomposition/truncated_svd.py in fit_transform(self, X, y)
        158         """
        159         X = check_array(X, accept_sparse=['csr', 'csc'],
    --> 160                         ensure_min_features=2)
        161         random_state = check_random_state(self.random_state)
        162

    ~/PycharmProjects/fse_test/venv/lib/python3.7/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
        540         if force_all_finite:
        541             _assert_all_finite(array,
    --> 542                                allow_nan=force_all_finite == 'allow-nan')
        543
        544     if ensure_min_samples > 0:

    ~/PycharmProjects/fse_test/venv/lib/python3.7/site-packages/sklearn/utils/validation.py in _assert_all_finite(X, allow_nan)
         54                 not allow_nan and not np.isfinite(X).all()):
         55             type_err = 'infinity' if allow_nan else 'NaN, infinity'
    ---> 56             raise ValueError(msg_err.format(type_err, X.dtype))
         57     # for object dtype data, we only check for NaNs (GH-13254)
         58     elif X.dtype == np.dtype('object') and not allow_nan:

    ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

  • weixin_39767887 2020-12-26 07:48

    Hi,

    About your first point: could you try again with the following argument: uSIF(model=word_vectors, lang_freq="en")? Pre-trained models often don't come with frequency information. lang_freq induces word-frequency information into a loaded model.

    I have to check the second posting though.

  • weixin_39789979 2020-12-26 07:48

    Thanks, I was able to work around this by using the pre-calculated English-language frequencies. I was just surprised that the tutorial failed.

  • weixin_39767887 2020-12-26 07:48

    Oh yes, you are right! I can replicate the error. Much appreciated. Will look into this.

  • weixin_39767887 2020-12-26 07:48

    I've implemented a fix for this. From now on, fse will notify you to use a model with valid word-frequency information. If you don't, fse will raise a runtime error telling you to infer the frequencies with the lang_freq argument. The tutorial now works as well. Pushed to the develop branch.
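
The behaviour discussed in this thread (refuse placeholder counts unless frequencies are induced from a language table) can be sketched roughly as below. This is an illustration only and not fse's actual code: `ensure_frequencies`, the `rel_freq` table, the `scale` factor and the numbers are all hypothetical, and how fse really maps `lang_freq` to counts is not shown in the thread.

```python
from typing import Dict, Optional


def ensure_frequencies(
    counts: Dict[str, int],
    rel_freq: Optional[Dict[str, float]] = None,
    scale: int = 1_000_000,
) -> Dict[str, int]:
    """Return usable counts, or raise if none can be obtained.

    counts:   word -> count as stored in the model (insertion order = index order).
    rel_freq: hypothetical word -> relative frequency table for the language.
    """
    n = len(counts)
    # Placeholder pattern from the question: count == vocab size - index.
    is_placeholder = n > 0 and all(
        c == n - i for i, c in enumerate(counts.values())
    )
    if not is_placeholder:
        return dict(counts)
    if rel_freq is None:
        raise RuntimeError(
            "model has no valid word-frequency information; "
            "supply a frequency table (cf. lang_freq)"
        )
    # Words missing from the table get a count of 1 so no count is zero.
    return {w: max(1, round(rel_freq.get(w, 0.0) * scale)) for w in counts}


placeholder = {"the": 3, "of": 2, "zymurgy": 1}
table = {"the": 0.05, "of": 0.03}  # made-up numbers, for illustration only
print(ensure_frequencies(placeholder, table))
# {'the': 50000, 'of': 30000, 'zymurgy': 1}
```

Calling `ensure_frequencies(placeholder)` without a table raises `RuntimeError`, which mirrors the idea of failing early instead of letting invalid weights reach the SVD step.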
