WordEmbeddingsKeyedVectors.add() doesn't clear `vectors_norm`, causing `IndexError` on later `most_similar()` · piskvorky/gensim#2532

17 comments · 0 reactions · 0 assignees · Python · 15,144 stars · 4,349 forks

Labels: Hacktoberfest, bug, difficulty easy, good first issue, impact MEDIUM, reach LOW

Description

As reported in a StackOverflow question/answer: https://stackoverflow.com/a/56641265/130288

An adapted version of the asker's minimal test case (which could become a unit test):

import numpy as np
from gensim.models.keyedvectors import WordEmbeddingsKeyedVectors

kv = WordEmbeddingsKeyedVectors(vector_size=3)
kv.add(entities=['a', 'b'],
       weights=[np.random.rand(3), np.random.rand(3)])
kv.most_similar('a')  # works

kv.add(entities=['c'], weights=[np.random.rand(3)])
kv.most_similar('c')  # fails with `IndexError`

Clearing the `vectors_norm` property (with either `del` or assignment to `None`) should be sufficient to trigger recalculation upon the next `most_similar()`.
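To illustrate the failure mode without depending on a particular gensim version, here is a minimal sketch of the same caching pattern in plain NumPy. `TinyKeyedVectors`, its `clear_cache_on_add` flag, and `init_sims()` as written here are hypothetical stand-ins, not gensim's actual implementation; the sketch only assumes that normalized vectors are cached once and then indexed by entity position.

```python
import numpy as np


class TinyKeyedVectors:
    """Hypothetical stand-in for a keyed-vector store (not gensim's class)."""

    def __init__(self, vector_size, clear_cache_on_add=True):
        self.index2entity = []
        self.vectors = np.zeros((0, vector_size))
        self.vectors_norm = None  # lazily built cache of unit-length rows
        self.clear_cache_on_add = clear_cache_on_add  # illustrative switch only

    def add(self, entities, weights):
        self.index2entity.extend(entities)
        self.vectors = np.vstack([self.vectors, np.asarray(weights)])
        if self.clear_cache_on_add:
            # The proposed fix: drop the stale cache so it is rebuilt on demand.
            self.vectors_norm = None

    def init_sims(self):
        # Rebuild the normalized copy only when no cache exists.
        if self.vectors_norm is None:
            norms = np.linalg.norm(self.vectors, axis=1, keepdims=True)
            self.vectors_norm = self.vectors / norms

    def most_similar(self, entity, topn=10):
        self.init_sims()
        idx = self.index2entity.index(entity)
        query = self.vectors_norm[idx]  # out of bounds if the cache is stale
        sims = self.vectors_norm @ query
        order = [i for i in np.argsort(-sims) if i != idx][:topn]
        return [(self.index2entity[i], float(sims[i])) for i in order]


rng = np.random.default_rng(0)

# Without invalidation, the cache keeps its old row count after add().
buggy = TinyKeyedVectors(3, clear_cache_on_add=False)
buggy.add(['a', 'b'], [rng.random(3), rng.random(3)])
buggy.most_similar('a')  # builds a 2-row cache
buggy.add(['c'], [rng.random(3)])
try:
    buggy.most_similar('c')
    stale_failed = False
except IndexError:
    stale_failed = True

# With invalidation, the cache is rebuilt with all three rows.
fixed = TinyKeyedVectors(3, clear_cache_on_add=True)
fixed.add(['a', 'b'], [rng.random(3), rng.random(3)])
fixed.most_similar('a')
fixed.add(['c'], [rng.random(3)])
neighbours = fixed.most_similar('c')
```

In this toy model the stale cache has two rows while `'c'` sits at index 2, which is where the `IndexError` comes from; resetting the cache to `None` in `add()` lets the next `most_similar()` rebuild it at the right size.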

Contributor guide

Tech stack: Python, NumPy
Domain: machine learning
Issue type: bug
Difficulty: 3
Estimated time: 1-3 hours
Activity status: stale
Clarity: clear
Prerequisites: Python, basic NumPy, familiarity with gensim
Beginner friendliness: 60
Research direction: See `gensim/models/keyedvectors.py`, specifically the `add()` and `most_similar()` methods. `add()` does not clear the cached `vectors_norm` property, so `most_similar()` fails with `IndexError` when called after new entities are added. The fix is to set `self.vectors_norm = None` inside `add()` after modifying the vectors. The provided test case can be used to verify the fix. No maintainer clarification is needed; the solution is straightforward.