Speed up HashingEncoder with util.hash_pandas_object
#216 opened on Oct 11, 2019
Description
Extend HashingEncoder to work with util.hash_pandas_object as the hashing function.
Reasoning: Currently, HashingEncoder relies on hashlib. Hashlib is nice, however:
- hashlib works only value by value -> no vectorization
- md5, which is the default hashing function, is possibly overkill
On the other hand, pandas vectorizes the hashing and uses a relatively simple hashing function, which could be enough for our purposes.
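As a small illustration of the vectorized call (the column names and values below are made up for the example): on a DataFrame, pd.util.hash_pandas_object returns one uint64 per row, combining all columns, while on a Series it returns one uint64 per value. A per-value encoder would presumably need the per-column form.

```python
import pandas as pd

# Toy data; the column names and values are illustrative only.
X = pd.DataFrame({"color": ["red", "blue", "red"], "size": ["S", "M", "S"]})

# DataFrame input: one uint64 hash per row, combining all columns.
row_hashes = pd.util.hash_pandas_object(X, index=False)

# Series input: one uint64 hash per value of that column.
col_hashes = pd.util.hash_pandas_object(X["color"], index=False)

print(row_hashes.dtype, len(row_hashes))
print(col_hashes.iloc[0] == col_hashes.iloc[2])  # equal values hash equally
```

Passing index=False keeps the index out of the hash, so identical values produce identical hashes regardless of their position.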
Based on a trivial experiment, pandas hashing seems to be about 7x faster than md5 via hashlib on my laptop:
import hashlib
import time

import numpy as np
import pandas as pd


def hash_fn(x):
    # Hash every value of the row one by one with md5; the digest is discarded.
    for val in x.values:
        hasher = hashlib.new('md5')
        hasher.update(bytes(str(val), 'utf-8'))
        hasher.hexdigest()
    return None


X_pd = pd.DataFrame(np.random.randint(0, 10, (1000, 10)))

# Vectorized pandas hashing
start = time.time()
pd.util.hash_pandas_object(X_pd)
end = time.time()
print(end - start)  # 0.0090 seconds

# Value-by-value md5 via hashlib
start = time.time()
X_pd.apply(hash_fn, axis=1)
end = time.time()
print(end - start)  # 0.0643 seconds
Please first check the paper "Feature Hashing for Large Scale Multitask Learning" and the util.hash_pandas_object source code to make sure that what is proposed here is a valid approach.
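To make the proposal concrete, here is a minimal sketch of how a hashing-trick encoding could be built on top of pd.util.hash_pandas_object. It is not the HashingEncoder implementation: the function name hash_encode, the n_components parameter, and the col_i output names are assumptions made for this example, and the resulting buckets will differ from the ones md5 produces.

```python
import numpy as np
import pandas as pd


def hash_encode(X: pd.DataFrame, n_components: int = 8) -> pd.DataFrame:
    """Hypothetical helper: count-based hashing trick using pandas hashing."""
    counts = np.zeros((len(X), n_components), dtype=np.int64)
    rows = np.arange(len(X))
    for col in X.columns:
        # Cast to str first, mirroring the str(val) step of the hashlib version.
        hashed = pd.util.hash_pandas_object(X[col].astype(str), index=False)
        # Reduce each uint64 hash to a bucket index in [0, n_components).
        buckets = (hashed.to_numpy() % np.uint64(n_components)).astype(np.intp)
        # Every row index occurs exactly once, so this increment is safe.
        counts[rows, buckets] += 1
    names = [f"col_{i}" for i in range(n_components)]
    return pd.DataFrame(counts, index=X.index, columns=names)


X = pd.DataFrame({"a": ["x", "y", "x"], "b": [1, 2, 1]})
encoded = hash_encode(X, n_components=4)
print(encoded)
```

Each input column is hashed in a single vectorized call, and every row ends up with one count per input column spread over n_components buckets. Whether this collision behaviour is acceptable is exactly what the check against the paper should confirm.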
PS: For the implementation, it could be wise to skip the multiprocessing in transform() and use the single-threaded _transform() instead. It could also be wise to pass max_process=1 when initializing HashingEncoder.