scikit-learn-contrib/category_encoders

Speed up HashingEncoder with util.hash_pandas_object

Open

#216 opened on Oct 11, 2019

Labels: enhancement, help wanted

Description

Extend HashingEncoder to work with util.hash_pandas_object as the hashing function.

Reasoning: Currently, HashingEncoder relies on hashlib. hashlib is nice, but:

  1. hashlib hashes one value at a time, so there is no vectorization
  2. md5, which is the default hashing function, is possibly overkill

On the other hand, pandas vectorizes the hashing and uses a relatively simple hashing function, which could be enough for our purposes.

Based on a trivial experiment, pandas hashing seems to be about 7x faster than md5 via hashlib on my laptop:

import hashlib
import time
import pandas as pd
import numpy as np

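def hash_fn(x):
    # Baseline: md5-hash the values of one row, one value at a time.
    for val in x.values:
        hasher = hashlib.new('md5')
        hasher.update(bytes(str(val), 'utf-8'))
        hasher.hexdigest()  # digest is discarded; only the timing matters
    return None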

X_pd = pd.DataFrame(np.random.randint(0, 10, (1000, 10)))

start = time.time()
pd.util.hash_pandas_object(X_pd)  # vectorized; returns one hash per row
end = time.time()
print(end - start)  # 0.0090 seconds

start = time.time()
X_pd.apply(hash_fn, axis=1)
end = time.time()
print(end - start)  # 0.0643 seconds
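
One caveat on the comparison above: when called on a DataFrame, hash_pandas_object returns a single combined hash per row, whereas the hashlib loop hashes every cell. A minimal sketch of hashing column by column, which gives one hash per value and is probably closer to what an encoder would need (the variable names are illustrative only):

```python
import numpy as np
import pandas as pd

X = pd.DataFrame(np.random.randint(0, 10, (1000, 10)))

# One combined hash per row (what the timing above measures).
row_hashes = pd.util.hash_pandas_object(X, index=False)

# One hash per cell: hash each column separately, still vectorized.
cell_hashes = X.apply(lambda col: pd.util.hash_pandas_object(col, index=False))

print(row_hashes.shape)   # (1000,)
print(cell_hashes.shape)  # (1000, 10)
```

Passing index=False keeps the index out of the hash, so equal values in a column map to equal hashes.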

Please first check the paper "Feature Hashing for Large Scale Multitask Learning" and the source code of util.hash_pandas_object to make sure that the approach proposed here is valid.
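
To make the proposal concrete, here is a rough sketch of how the hashing trick could be built on hash_pandas_object. It is not the HashingEncoder implementation: hash_encode is a hypothetical helper, and the choices to hash the string form of each value, to count hits per bucket, and to name the outputs col_0, col_1, ... are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def hash_encode(X: pd.DataFrame, n_components: int = 8) -> pd.DataFrame:
    """Hypothetical helper: hashing-trick encoding via pandas' vectorized hash."""
    n_rows = len(X)
    out = np.zeros((n_rows, n_components), dtype=np.int64)
    rows = np.arange(n_rows)
    for col in X.columns:
        # One vectorized call per column; index=False hashes the values only.
        hashes = pd.util.hash_pandas_object(X[col].astype(str), index=False)
        # Map each uint64 hash to a bucket in [0, n_components).
        buckets = (hashes.to_numpy() % np.uint64(n_components)).astype(np.intp)
        # Each value increments the counter of the bucket it falls into.
        out[rows, buckets] += 1
    names = [f"col_{i}" for i in range(n_components)]
    return pd.DataFrame(out, columns=names, index=X.index)


X = pd.DataFrame({"city": ["Oslo", "Rome", "Oslo", "Lima"],
                  "tier": ["a", "b", "a", "a"]})
encoded = hash_encode(X, n_components=8)
print(encoded)
```

Each row's counts sum to the number of input columns, and identical input rows get identical encodings. Whether the collision behaviour of this hash is acceptable is exactly what the check against the paper should confirm.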

PS: For the implementation, it could be wise to skip the multiprocessing in transform() and use the single-threaded _transform() instead. It could also be wise to pass max_process=1 when initializing HashingEncoder.
