scikit-learn-contrib/category_encoders
Poor performance of OneHotEncoder for category_encoders version >=2.0.0
Open
#362 opened on July 12, 2022
enhancement, good first issue
Description
Expected Behavior
Similar memory usage for the different category_encoders versions or better performance for higher category_encoders versions.
Actual Behavior
According to the experiment results, memory usage with category_encoders versions 2.0.0 and later is roughly three times that of version 1.3.0 (896 MB versus 288 MB).
| Memory (MB) | Version |
|---|---|
| 896 | 2.3.0 |
| 896 | 2.2.2 |
| 896 | 2.1.0 |
| 896 | 2.0.0 |
| 288 | 1.3.0 |
Steps to Reproduce the Problem
Step 1: download the train and test datasets mentioned above (63 MB)
Step 2: install category_encoders
pip install category_encoders==#version#
Step 3: run the following script under each category_encoders version and record the memory usage
import numpy as np
import pandas as pd
import category_encoders as ce
import tracemalloc
df_train = pd.read_csv("train.csv")
df_test = pd.read_csv("test.csv")
df_train.drop("id", axis=1, inplace=True)
df_test.drop("id", axis=1, inplace=True)
cat_labels = [f"cat{i}" for i in range(10)]
tracemalloc.start()
onehot_encoder = ce.one_hot.OneHotEncoder()
onehot_encoder.fit(pd.concat([df_train[cat_labels], df_test[cat_labels]], axis=0))
train_ohe = onehot_encoder.transform(df_train[cat_labels])
test_ohe = onehot_encoder.transform(df_test[cat_labels])
current3, peak3 = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"OneHotEncoder memory usage is {current3 / 1024 / 1024:.1f} MB; peak memory was {peak3 / 1024 / 1024:.1f} MB")
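For anyone repeating the measurement across versions, the tracemalloc pattern used in the script can be wrapped in a small helper so every run is measured the same way. This is only a sketch: `measure_peak_mb` and the stand-in workload `build_list` are hypothetical names, not part of category_encoders, and the stand-in would be replaced by the `fit`/`transform` calls from the script.

```python
import tracemalloc


def measure_peak_mb(func, *args, **kwargs):
    """Run func under tracemalloc and return (result, current_mb, peak_mb)."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        # Read the counters while the result is still referenced.
        current, peak = tracemalloc.get_traced_memory()
    finally:
        # Always stop tracing so later measurements start from zero.
        tracemalloc.stop()
    mb = 1024 * 1024
    return result, current / mb, peak / mb


def build_list(n):
    """Hypothetical stand-in workload; swap in the encoder calls here."""
    return [str(i) for i in range(n)]


result, current_mb, peak_mb = measure_peak_mb(build_list, 100_000)
print(f"current: {current_mb:.1f} MB; peak: {peak_mb:.1f} MB")
```

Note that tracemalloc only counts allocations made through Python's memory allocators, so the figures may differ from the process-level memory reported by the operating system.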
Specifications
- Version: 2.3.0, 2.2.2, 2.1.0, 2.0.0, 1.3.0
- Platform: Ubuntu 16.04
- OS : Ubuntu
- CPU : Intel(R) Core(TM) i9-9900K CPU
- GPU : TITAN V