random.RandomState with different versions of numpy has vastly different performance · piskvorky/gensim#2782

15 comments · 0 reactions · 0 assignees · Python · 15,144 stars · 4,349 forks

Labels: help wanted, impact MEDIUM, performance, reach HIGH

Description

The performance of random.RandomState in word2vec.py (version 3.8.0)

def seeded_vector(self, seed_string, vector_size):
    # Seed a fresh RandomState from the hash of the string, masked to 32 bits
    once = random.RandomState(self.hashfxn(seed_string) & 0xffffffff)
    return (once.rand(vector_size) - 0.5) / vector_size

seemingly depends greatly on the installed version of numpy. With numpy 1.14.3, the following code

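from numpy.random import RandomState as Ran
from time import time

t1 = time()
for i in range(100000):
    # Construct a new RandomState per iteration, as seeded_vector does per word
    temp = Ran(hash(i) & 0xffffffff)
t2 = time()
print(t2 - t1)  # elapsed wall-clock time in seconds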

produced 0.28105926513671875 (seconds), while exactly the same code with numpy 1.18.1 produced

18.590345859527588

I noticed this while training a model with a vocabulary of millions of words: after unwittingly updating numpy (via an Anaconda update), build_vocab took significantly longer, and after some debugging I narrowed the slowdown down to random.RandomState in the seeded_vector function. I know this is really a numpy issue, but the numpy documentation itself describes RandomState as legacy (https://docs.scipy.org/doc/numpy/reference/random/performance.html). Do you have any plans to move away from RandomState? Thanks!
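For illustration only, here is a minimal sketch of what a Generator-based variant of seeded_vector might look like. It is not gensim's code: the function name seeded_vector_generator and the hashfxn parameter are hypothetical, and it assumes numpy 1.17 or later, where numpy.random.default_rng is available. The values it returns differ from those of RandomState.rand for the same seed, and whether it is actually faster would have to be measured.

```python
import numpy as np


def seeded_vector_generator(seed_string, vector_size, hashfxn=hash):
    """Hypothetical Generator-based counterpart of seeded_vector (a sketch)."""
    # Mask the hash to 32 bits, as the original method does
    seed = hashfxn(seed_string) & 0xffffffff
    # default_rng builds a numpy.random.Generator seeded with `seed`
    rng = np.random.default_rng(seed)
    # Generator.random draws floats in [0, 1), like RandomState.rand
    return (rng.random(vector_size) - 0.5) / vector_size


vec = seeded_vector_generator("word", 100)
```

Within one Python process the same string yields the same vector, since the seed is derived from the hash. The built-in hash of a string varies between processes unless PYTHONHASHSEED is fixed, so a stable hash function would be needed for reproducibility across runs.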

Contributor guide

Tech stack: Python
Area: performance
Issue type: performance
Difficulty: 3
Estimated time: half day
Activity status: stale
Clarity: mostly clear
Prerequisites: basic Python, numpy basics
Beginner friendliness: 30
Research direction: Investigate the performance difference of numpy.random.RandomState between numpy versions 1.14.3 and 1.18.1 in the seeded_vector function in gensim's word2vec.py. Consider replacing RandomState with numpy.random.Generator, which is numpy's recommended approach, and test the change for both performance and correctness.
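One possible starting point for that investigation is a small harness that times generator construction under the installed numpy and checks that the output range is unchanged. This is only a sketch: the helper name time_constructions and the iteration count are assumptions, it requires numpy 1.17 or later for default_rng, and it makes no claim about which constructor is faster.

```python
import time

import numpy as np


def time_constructions(make_rng, n=10_000):
    """Return the seconds spent constructing n seeded generators."""
    start = time.perf_counter()
    for i in range(n):
        make_rng(hash(i) & 0xffffffff)
    return time.perf_counter() - start


# Time the legacy constructor and the Generator-based one on this numpy version
legacy_seconds = time_constructions(np.random.RandomState)
generator_seconds = time_constructions(np.random.default_rng)
print(np.__version__, legacy_seconds, generator_seconds)

# Correctness check: values must stay within +/- 0.5 / vector_size
vector_size = 100
vec = (np.random.default_rng(42).random(vector_size) - 0.5) / vector_size
in_range = bool(np.all(np.abs(vec) <= 0.5 / vector_size))
```

Running the same script in environments with different numpy versions would show whether the slowdown reported above reproduces, and how the two constructors compare in each.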