Speed up HashingEncoder with util.hash_pandas_object · scikit-learn-contrib/category

Repository metrics

Stars: 2,322
PR merge metrics: average merge time 4d 10h; 2 merged PRs in 30d

Description

Extend HashingEncoder to work with util.hash_pandas_object as the hashing function.

Reasoning: Currently, HashingEncoder relies on hashlib. Hashlib is nice, however:

hashlib works only value by value, so there is no vectorization
md5, which is the default hashing function, is possibly overkill

On the other hand, pandas vectorizes the hashing and uses a relatively simple hash function, which could be enough for our purposes.
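As a small illustration of the vectorized call (a sketch, not code from the issue): pd.util.hash_pandas_object takes a whole Series and returns one uint64 hash per element. Passing index=False is an assumption about what an encoder would want, so that the hash depends only on the values and not on the row labels. The column name and sample values are made up.

```python
import pandas as pd

# Hypothetical categorical column, for illustration only.
s = pd.Series(["red", "green", "red", "blue"], name="color")

# One call hashes the whole column; index=False keeps row labels out of the hash.
hashes = pd.util.hash_pandas_object(s, index=False)

print(hashes.dtype)  # uint64
# Equal values produce equal hashes, regardless of their position.
print(hashes.iloc[0] == hashes.iloc[2])
```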

Based on a trivial experiment, pandas hashing seems to be about 7x faster than md5 via hashlib on my laptop:

import hashlib
import time
import pandas as pd
import numpy as np

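def hash_fn(x):
    # Hash every value in the row one at a time, as hashlib requires.
    for val in x.values:
        hasher = hashlib.new('md5')
        hasher.update(bytes(str(val), 'utf-8'))
        hasher.hexdigest()
    return None

# 1000 rows x 10 columns of small integers
X_pd = pd.DataFrame(np.random.randint(0, 10, (1000, 10)))

# Vectorized: a single call returns one hash per row.
start = time.time()
pd.util.hash_pandas_object(X_pd)
end = time.time()
print(end - start)  # 0.0090 seconds

# Value by value: md5 via hashlib, applied row by row.
start = time.time()
X_pd.apply(hash_fn, axis=1)
end = time.time()
print(end - start)  # 0.0643 seconds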

Please first check the paper Feature Hashing for Large Scale Multitask Learning and the util.hash_pandas_object source code to make sure that what is proposed here is a valid approach.

PS: For the implementation, it could be wise to skip the multiprocessing in transform() and call the single-threaded _transform() instead. It could also be wise to pass max_process=1 when initializing HashingEncoder.
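The idea could be sketched roughly as follows. This is a minimal, single-threaded illustration under stated assumptions, not the HashingEncoder implementation: hash_encode is a hypothetical helper, the col_0 ... col_{n-1} output names and the choice of index=False are assumptions, and the bucket assignments would generally differ from the md5-based ones because the hash function is different.

```python
import numpy as np
import pandas as pd


def hash_encode(X: pd.DataFrame, n_components: int = 8) -> pd.DataFrame:
    """Hypothetical helper: hashing-trick encoding with vectorized pandas hashing."""
    out = np.zeros((len(X), n_components), dtype=np.int64)
    rows = np.arange(len(X))
    for _, column in X.items():
        # Hash the whole column at once; index=False so only the values matter.
        hashed = pd.util.hash_pandas_object(column, index=False).to_numpy()
        # Map each 64-bit hash to one of n_components buckets.
        buckets = (hashed % np.uint64(n_components)).astype(np.intp)
        # Each row appears once per column, so a fancy-index increment is safe.
        out[rows, buckets] += 1
    names = [f"col_{i}" for i in range(n_components)]
    return pd.DataFrame(out, columns=names, index=X.index)


# Made-up sample data, for illustration only.
X = pd.DataFrame({"color": ["red", "green", "red"], "size": ["S", "M", "S"]})
encoded = hash_encode(X, n_components=8)
print(encoded)
```

Each output row sums to the number of input columns, since every value lands in exactly one bucket. Rows with identical values get identical encodings.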

Contributor guide

Research direction: Read the paper and the pandas.util.hash_pandas_object source, then modify HashingEncoder to use hash_pandas_object instead of hashlib. Consider simplifying transform() by removing multiprocessing.
Tech stack: python
Domain: data
Issue type: Performance
Difficulty: 3
Estimated time: Half day
Activity status: Active
Clarity: Clear
Prerequisites: Python, pandas, category_encoders
Newbie friendliness: 55
