Speed up HashingEncoder with util.hash_pandas_object · scikit-learn-contrib/category_encoders#216

1 comment · 0 reactions · 0 assignees · Python · 2,322 stars · 397 forks

Labels: enhancement, help wanted

Description

Extend HashingEncoder to work with util.hash_pandas_object as the hashing function.

Reasoning: Currently, HashingEncoder relies on hashlib. Hashlib is nice, however:

- hashlib works only value by value, so there is no vectorization
- md5, which is the default hashing function, is possibly overkill

On the other hand, pandas vectorizes the hashing and uses a relatively simple hash function, which could be enough for our purposes.
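To make the idea concrete, here is a minimal sketch of how the vectorized pandas hash could feed the hashing trick. It assumes that each column is hashed on its own with `index=False` and that the hash is reduced modulo the number of output components; `hash_encode` is a hypothetical helper written for this illustration, not existing category_encoders code, and its output would not match what the md5-based encoder produces.

```python
import numpy as np
import pandas as pd


def hash_encode(X, n_components=8):
    """Hypothetical vectorized hashing trick built on hash_pandas_object."""
    counts = np.zeros((len(X), n_components), dtype=np.int64)
    rows = np.arange(len(X))
    for col in X.columns:
        # index=False hashes only the values, so equal values get equal hashes
        hashed = pd.util.hash_pandas_object(X[col], index=False).to_numpy()
        # Map each uint64 hash to one of the n_components buckets
        buckets = (hashed % np.uint64(n_components)).astype(np.intp)
        # Count one hit per (row, bucket) pair for this column
        np.add.at(counts, (rows, buckets), 1)
    names = [f"col_{i}" for i in range(n_components)]
    return pd.DataFrame(counts, columns=names, index=X.index)


X = pd.DataFrame({"a": ["x", "y", "x"], "b": [1, 2, 1]})
encoded = hash_encode(X, n_components=8)
print(encoded)
```

Each row of the result sums to the number of input columns, because every value contributes exactly one count to one bucket.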

Based on a trivial experiment, pandas hashing seems to be about 7x faster than md5 via hashlib on my laptop:

import hashlib
import time
import pandas as pd
import numpy as np

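def hash_fn(x):
    # Hash each value of the row separately with md5, value by value;
    # the digests are computed and then discarded.
    for val in x.values:
        hasher = hashlib.new('md5')
        hasher.update(bytes(str(val), 'utf-8'))
        hasher.hexdigest()
    return None

# 1000 rows x 10 columns of random integers in [0, 10)
X_pd = pd.DataFrame(np.random.randint(0, 10, (1000, 10)))

# Vectorized pandas hashing: returns one uint64 hash per row
start = time.time()
pd.util.hash_pandas_object(X_pd)
end = time.time()
print(end - start)  # 0.0090 seconds

# Row-by-row md5 hashing via hashlib
start = time.time()
X_pd.apply(hash_fn, axis=1)
end = time.time()
print(end - start)  # 0.0643 seconds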
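The experiment above hashes whole rows with pandas but single values with md5, and the md5 timing also includes the overhead of `DataFrame.apply`. A possibly more like-for-like variant is sketched below: both sides hash every value, and `timeit` repeats the measurement to reduce noise. The helper names (`md5_per_value`, `pandas_per_column`, `best_time`) are made up for this sketch, and the timings will differ from machine to machine.

```python
import hashlib
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.integers(0, 10, size=(1000, 10)))


def md5_per_value(frame):
    # One md5 digest per cell, computed value by value
    for row in frame.to_numpy():
        for val in row:
            hashlib.md5(str(val).encode("utf-8")).hexdigest()


def pandas_per_column(frame):
    # One vectorized call per column, producing one hash per cell
    for col in frame.columns:
        pd.util.hash_pandas_object(frame[col], index=False)


def best_time(fn, repeat=5, number=3):
    # Best average time per call over several repeats
    return min(timeit.repeat(lambda: fn(X), repeat=repeat, number=number)) / number


md5_seconds = best_time(md5_per_value)
pandas_seconds = best_time(pandas_per_column)
print(f"md5 per value:     {md5_seconds:.4f} s")
print(f"pandas per column: {pandas_seconds:.4f} s")
```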

Please first check the paper "Feature Hashing for Large Scale Multitask Learning" and the util.hash_pandas_object source code to make sure that what is proposed here is a valid approach.

PS: For the implementation, it could be wise to skip the multiprocessing in transform() and use the single-threaded _transform() instead. It could also be wise to pass max_process=1 when initializing HashingEncoder.

Contributor guide

Tech stack: Python, pandas, scikit-learn
Domain: data, machine learning
Issue type: performance
Difficulty: 3
Estimated time: half a day
Activity status: stale
Clarity: clear
Prerequisites: Python, pandas, hashlib, category_encoders
Beginner friendliness: 40
Research direction: Review the paper "Feature Hashing for Large Scale Multitask Learning" and the source code of pandas.util.hash_pandas_object. Implement the change in HashingEncoder to use hash_pandas_object instead of hashlib.md5, and consider skipping multiprocessing in transform() by setting max_process=1. The primary file to modify is likely category_encoders/hashing.py.