random.RandomState with different versions of numpy has vastly different performance · piskvorky/gensim#2782

15 comments · 0 reactions · 0 assignees · Python · 15,144 stars · 4,349 forks

Labels: help wanted, impact MEDIUM, performance, reach HIGH

Description

The performance of random.RandomState in word2vec.py (version 3.8.0)

def seeded_vector(self, seed_string, vector_size):
    # A fresh RandomState is created and seeded for every seed string
    once = random.RandomState(self.hashfxn(seed_string) & 0xffffffff)
    return (once.rand(vector_size) - 0.5) / vector_size

seems to depend heavily on the installed version of numpy. With numpy 1.14.3, the following code

from numpy.random import RandomState as Ran
from time import time

t1 = time()
for i in range(100000):
    temp = Ran(hash(i) & 0xffffffff)  # construct and seed a new RandomState each iteration
t2 = time()
print(t2 - t1)  # elapsed wall-clock time in seconds

produced 0.28105926513671875 (seconds). Exactly the same code with numpy 1.18.1 produced

18.590345859527588

I noticed this while training a model with a vocabulary of millions of words. After unwittingly updating numpy (via an Anaconda update), build_vocab took significantly longer, and after some debugging I traced the slowdown to random.RandomState in the seeded_vector function. I know this is really a numpy issue, but the numpy documentation itself describes RandomState as legacy (https://docs.scipy.org/doc/numpy/reference/random/performance.html). Do you have any plans to move away from RandomState? Thanks!
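For illustration, here is a minimal sketch of what a Generator-based variant of seeded_vector might look like. It is not gensim's code: the standalone function signature, the hashfxn default, and the choice of PCG64 as the bit generator are all assumptions made for this example. It requires numpy 1.17 or later, where numpy.random.Generator was introduced.

```python
import numpy as np


def seeded_vector(seed_string, vector_size, hashfxn=hash):
    """Hypothetical Generator-based variant of seeded_vector (not gensim's code).

    Returns a small pseudo-random vector that is reproducible for a given
    seed_string within one Python process.
    """
    # Same 32-bit masking as the original snippet.
    seed = hashfxn(seed_string) & 0xffffffff
    # Assumption: PCG64 is used here; any numpy BitGenerator could be swapped in.
    rng = np.random.Generator(np.random.PCG64(seed))
    # Generator.random() plays the role of RandomState.rand(): uniform in [0, 1).
    return (rng.random(vector_size) - 0.5) / vector_size


if __name__ == "__main__":
    print(seeded_vector("example", 4))
```

Note that a different bit generator produces a different random stream than RandomState, so vectors created this way would not match those from the existing implementation for the same seed. Whether construction is actually faster should be measured on the numpy version in question rather than assumed.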

Contributor guide

Tech stack: python
Area: performance
Issue type: performance
Difficulty: 3
Estimated time: half day
Activity status: stale
Clarity: mostly clear
Prerequisites: basic Python, numpy basics
Beginner-friendliness: 30
Investigation approach: Investigate the performance difference of numpy.random.RandomState between numpy 1.14.3 and 1.18.1 in the seeded_vector function in gensim's word2vec.py. Consider replacing RandomState with numpy.random.Generator, which numpy recommends for new code, and test the change for both performance and correctness.
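One way to start the investigation is a small harness that times both approaches and checks that each is deterministic for a fixed seed. This is only a sketch: the helper names (legacy_vector, generator_vector, time_calls, check_deterministic) are hypothetical, PCG64 is an arbitrary choice of bit generator, and the timings depend entirely on the installed numpy version, so no expected numbers are given.

```python
import time

import numpy as np


def legacy_vector(seed, size):
    # Legacy path: a new RandomState is seeded for every call.
    return np.random.RandomState(seed).rand(size)


def generator_vector(seed, size):
    # Candidate path (assumption): a new Generator over PCG64 for every call.
    return np.random.Generator(np.random.PCG64(seed)).random(size)


def time_calls(fn, n=1000, size=100):
    """Return the seconds taken by n calls of fn, each with a distinct 32-bit seed."""
    start = time.perf_counter()
    for i in range(n):
        fn(i & 0xffffffff, size)
    return time.perf_counter() - start


def check_deterministic(fn, seed=12345, size=8):
    """Return True if the same seed yields the same values on repeated calls."""
    return np.array_equal(fn(seed, size), fn(seed, size))


if __name__ == "__main__":
    for name, fn in [("RandomState", legacy_vector), ("Generator", generator_vector)]:
        print(name, "seconds:", time_calls(fn), "deterministic:", check_deterministic(fn))
```

Running the same script under each numpy version of interest gives comparable numbers. Because the two paths draw from different random streams, determinism is checked per path rather than by comparing their outputs.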