scikit-learn-contrib/category_encoders

Poor performance of OneHotEncoder for category_encoders version >=2.0.0

Open

#362 opened on July 12, 2022

Labels: enhancement, good first issue

Description

Expected Behavior

Similar memory usage across category_encoders versions, or lower memory usage in newer versions.

Actual Behavior

In our experiments, OneHotEncoder uses roughly three times as much memory with category_encoders 2.0.0 and later (896 MB) as with 1.3.0 (288 MB).

Memory(MB) Version
896 2.3.0
896 2.2.2
896 2.1.0
896 2.0.0
288 1.3.0
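
To put the table in relative terms, the short sketch below computes each version's memory usage as a multiple of the 1.3.0 figure. The numbers are copied from the table above; the variable names are only illustrative.

```python
# Memory usage in MB per category_encoders version, copied from the table above.
memory_mb = {
    "2.3.0": 896,
    "2.2.2": 896,
    "2.1.0": 896,
    "2.0.0": 896,
    "1.3.0": 288,
}

# Use 1.3.0 as the baseline and express every version relative to it.
baseline = memory_mb["1.3.0"]
ratios = {version: usage / baseline for version, usage in memory_mb.items()}

for version, ratio in ratios.items():
    print(f"{version}: {ratio:.2f}x the memory usage of 1.3.0")
```

Every 2.x release in the table lands at about 3.11 times the 1.3.0 figure.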

Steps to Reproduce the Problem

Step 1: Download the train and test files of the dataset above (63 MB).

Step 2: Install category_encoders.

pip install category_encoders==#version#

Step 3: For each category_encoders version, run the script below and record the memory usage.

import pandas as pd
import category_encoders as ce
import tracemalloc

# Load the data and drop the identifier column.
df_train = pd.read_csv("train.csv")
df_test = pd.read_csv("test.csv")
df_train.drop("id", axis=1, inplace=True)
df_test.drop("id", axis=1, inplace=True)
cat_labels = [f"cat{i}" for i in range(10)]

# Trace the allocations made while fitting and transforming.
tracemalloc.start()
onehot_encoder = ce.one_hot.OneHotEncoder()
onehot_encoder.fit(pd.concat([df_train[cat_labels], df_test[cat_labels]], axis=0))
train_ohe = onehot_encoder.transform(df_train[cat_labels])
test_ohe = onehot_encoder.transform(df_test[cat_labels])

# tracemalloc reports bytes; convert to MB before printing.
current3, peak3 = tracemalloc.get_traced_memory()
print(f"OneHotEncoder memory usage is {current3 / 1024 / 1024:.1f} MB; peak memory was {peak3 / 1024 / 1024:.1f} MB")
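
When repeating the measurement for several versions, it may help to wrap the tracemalloc calls in a small helper so that every run starts and stops tracing in the same way. The following is a minimal sketch using only the standard library. `measure_peak_mb` and `build_table` are hypothetical names introduced here, and `build_table` is just a stand-in workload, not the encoder itself.

```python
import tracemalloc


def measure_peak_mb(func, *args, **kwargs):
    """Run func and return (result, current_mb, peak_mb) as traced by tracemalloc."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        # Both values are in bytes and cover allocations made since start().
        current, peak = tracemalloc.get_traced_memory()
    finally:
        # Always stop tracing so a later measurement starts from zero.
        tracemalloc.stop()
    bytes_per_mb = 1024 * 1024  # same conversion as the script above
    return result, current / bytes_per_mb, peak / bytes_per_mb


def build_table(n_rows, n_cols):
    # Stand-in workload (hypothetical): a dense table of zeros.
    return [[0] * n_cols for _ in range(n_rows)]


if __name__ == "__main__":
    table, current_mb, peak_mb = measure_peak_mb(build_table, 1000, 100)
    print(f"current: {current_mb:.2f} MB; peak: {peak_mb:.2f} MB")
```

In the reproduction script, the function passed in could be one that fits the encoder and transforms both frames. Because the result is still referenced when the memory is read, the "current" figure includes the size of the returned object.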

Specifications

  • Version: 2.3.0, 2.2.2, 2.1.0, 2.0.0, 1.3.0
  • Platform: Ubuntu 16.04
  • OS: Ubuntu
  • CPU: Intel(R) Core(TM) i9-9900K CPU
  • GPU: TITAN V
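
For anyone re-running the comparison, the environment details can be collected programmatically instead of typed by hand. This is a rough sketch using only the standard library. `package_version` and `collect_environment` are hypothetical helper names, and the default package list is an assumption based on the imports in the script above.

```python
import platform
import sys
from importlib import metadata


def package_version(name):
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def collect_environment(packages=("category_encoders", "pandas", "numpy")):
    """Gather interpreter, platform, and package versions into a dict."""
    info = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "processor": platform.processor(),
    }
    for name in packages:
        info[name] = package_version(name)
    return info


if __name__ == "__main__":
    for key, value in collect_environment().items():
        print(f"{key}: {value}")
```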
