piskvorky/gensim

random.RandomState with different versions of numpy has vastly different performance

Open

#2782, created on April 3, 2020

Labels: help wanted, impact MEDIUM, performance, reach HIGH

Description

The performance of random.RandomState in word2vec.py (version 3.8.0)

def seeded_vector(self, seed_string, vector_size):
    once = random.RandomState(self.hashfxn(seed_string) & 0xffffffff)
    return (once.rand(vector_size) - 0.5) / vector_size

seems to depend heavily on the installed version of numpy. With numpy 1.14.3, the following code

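from numpy.random import RandomState as Ran
from time import time

t1 = time()
for i in range(100000):
    # Construct a freshly seeded RandomState on every iteration
    temp = Ran(hash(i) & 0xffffffff)
t2 = time()
print(t2 - t1)  # elapsed wall-clock time in seconds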

produced 0.28105926513671875 seconds. Exactly the same code with numpy 1.18.1 produced

18.590345859527588
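To make the comparison easier to repeat across environments, the loop above could be wrapped in a small helper that also reports the numpy version. This is only a sketch: the function name `time_randomstate_construction` is made up for illustration, and the absolute numbers will differ from machine to machine.

```python
import timeit

import numpy as np
from numpy.random import RandomState


def time_randomstate_construction(n=100000):
    """Return the seconds spent constructing n freshly seeded RandomState objects.

    Hypothetical helper for illustration; it mirrors the loop from the issue.
    """
    def build():
        for i in range(n):
            # Same seeding expression as in the benchmark above
            RandomState(hash(i) & 0xffffffff)

    # number=1: run the whole loop once and return its wall-clock duration
    return timeit.timeit(build, number=1)


if __name__ == "__main__":
    # A small n keeps this quick; pass n=100000 to match the figures above
    print("numpy", np.__version__, time_randomstate_construction(n=1000))
```

Running the same script under each numpy installation should give figures that can be compared directly.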

I noticed this because I was training a model with a vocabulary of millions of words. After unwittingly updating numpy (via an Anaconda update), build_vocab took significantly longer, and after some debugging I traced the slowdown to random.RandomState in the seeded_vector function. I know this is really a numpy issue, but even the numpy docs mention that RandomState is legacy (https://docs.scipy.org/doc/numpy/reference/random/performance.html). I therefore wonder whether you have any plans to move away from RandomState. Thanks!
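One possible direction, sketched here purely as an illustration and not as gensim's actual API, would be to seed a single generator once and draw all initial vectors in one call, so that no RandomState is constructed per word. The helper name `seeded_vectors_batch` and its parameters are hypothetical, and `numpy.random.default_rng` requires numpy 1.17 or later. Note the trade-off: with this approach a word's vector no longer depends only on its own seed string, as it does in seeded_vector.

```python
import numpy as np


def seeded_vectors_batch(vocab_size, vector_size, seed=1):
    """Hypothetical helper: initial vectors for a whole vocabulary at once.

    Uses one generator for all rows instead of one RandomState per word.
    Values are scaled the same way as in seeded_vector.
    """
    rng = np.random.default_rng(seed)  # assumption: numpy >= 1.17
    # Uniform samples in [0, 1), shifted and scaled to [-0.5, 0.5) / vector_size
    return (rng.random((vocab_size, vector_size)) - 0.5) / vector_size


vectors = seeded_vectors_batch(vocab_size=1000, vector_size=100)
print(vectors.shape)
```

The same seed reproduces the same matrix, but the result depends on vocabulary order, which may or may not be acceptable for a given use case.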
