piskvorky/gensim

random.RandomState with different versions of numpy has vastly different performance

Open

#2782 opened on April 3, 2020

Labels: help wanted, impact MEDIUM, performance, reach HIGH

Description

The performance of random.RandomState in word2vec.py (version 3.8.0)

def seeded_vector(self, seed_string, vector_size):
    once = random.RandomState(self.hashfxn(seed_string) & 0xffffffff)
    return (once.rand(vector_size) - 0.5) / vector_size

seemingly depends greatly on the version of numpy installed. With numpy = 1.14.3, the following code

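from time import time

from numpy.random import RandomState as Ran

t1 = time()
for i in range(100000):
    # Construct a fresh RandomState per seed, as seeded_vector does per word
    temp = Ran(hash(i) & 0xffffffff)
t2 = time()
print(t2 - t1)  # elapsed time in seconds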

produced 0.28105926513671875 (seconds). Exactly the same code with numpy = 1.18.1 produced

18.590345859527588

I noticed this because I was training a model with a vocabulary of millions of words. After unwittingly updating numpy (via an Anaconda update), build_vocab took significantly longer, and after some debugging I narrowed it down to random.RandomState in the seeded_vector function. I know this is really a numpy issue, but even the numpy docs mention that RandomState is legacy (https://docs.scipy.org/doc/numpy/reference/random/performance.html). So I wonder whether you have any plans to move away from RandomState? Thanks!
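One possible direction, sketched here only as an illustration and not as gensim's actual code: since the cost measured above comes from constructing one RandomState per word, a single generator could be created once and all initial vectors drawn in one call. The helper name init_vectors_bulk and its parameters are hypothetical; numpy.random.default_rng requires numpy 1.17 or later.

```python
import numpy as np


def init_vectors_bulk(vocab_size, vector_size, seed=1):
    """Hypothetical helper (not part of gensim): initialize all vectors at once.

    Uses one Generator for the whole vocabulary instead of one
    RandomState per word, so the construction cost is paid only once.
    """
    rng = np.random.default_rng(seed)  # new-style Generator, numpy >= 1.17
    # Same scaling as seeded_vector: uniform in [-0.5, 0.5), divided by vector_size
    return (rng.random((vocab_size, vector_size)) - 0.5) / vector_size


vectors = init_vectors_bulk(vocab_size=1000, vector_size=50, seed=42)
print(vectors.shape)
```

The trade-off is that a word's initial vector would then depend on its position in the vocabulary and on the global seed rather than on the word string alone, and the values would differ from those produced by RandomState, so results from earlier versions would not be reproduced exactly.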
