BUG: StandardScaler partial_fit overflows · scikit-learn/scikit-learn#5602

10 comments · 0 reactions · 0 assignees · Python · 27,020 forks

Labels: Bug, Moderate, help wanted, module:preprocessing

Repository metrics

Stars: 66,084
PR merge metrics: average time to merge 10 days; 90 PRs merged in the last 30 days

Description

The recent implementation of partial_fit for StandardScaler can overflow. One use case is transforming an indefinitely long stream of data, which is problematic with the current implementation. The reason is that, to compute the running mean, we keep track of the sum of all samples seen so far.
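To make the mechanism concrete, here is a minimal NumPy-only sketch that does not call scikit-learn at all. The helper names `sum_based_mean` and `incremental_mean` are hypothetical, and the sample magnitude is an assumption chosen to mirror the report. It contrasts a running mean that stores the sample sum with one that stores only the current mean and the sample count.

```python
import numpy as np

# Assumed magnitude, mirroring the report: each sample is about 1.8e301.
value = np.finfo(np.float64).max / 1e7
batch = np.full(500_000, value)
n_repeats = 100


def sum_based_mean(batch, n_repeats):
    """Running mean that stores the sum of every sample seen so far."""
    total, count = 0.0, 0
    with np.errstate(over="ignore"):  # silence the overflow RuntimeWarning
        for _ in range(n_repeats):
            total = total + batch.sum()  # keeps growing, eventually inf
            count += batch.size
    return total / count


def incremental_mean(batch, n_repeats):
    """Running mean that stores only the current mean and the sample count."""
    mean, count = 0.0, 0
    for _ in range(n_repeats):
        count += batch.size
        # Move the mean toward the batch mean, weighted by the batch's share.
        mean = mean + (batch.mean() - mean) * (batch.size / count)
    return mean


naive = sum_based_mean(batch, n_repeats)
stable = incremental_mean(batch, n_repeats)
print(naive, stable)
```

With these assumed inputs the sum-based version ends at `inf`, while the incremental version stays finite and close to `value`. The incremental form still computes a per-batch mean, so a single batch whose own sum overflows would need further care.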

Here is the code to reproduce the behavior. Simulating a long stream of data would take a long time; instead, I use samples with a very large norm, but the effect is the same. The same batch is presented to the transformer many times, so the mean should stay the same.

from sklearn.preprocessing import StandardScaler
import numpy as np

rng = np.random.RandomState(0)

def gen_1d_uniform_batch(min_, max_, n):
    # Draw n samples from U(min_, max_) as a single-feature column.
    return rng.uniform(min_, max_, size=(n, 1))

max_f = np.finfo(np.float64).max / 1e5
min_f = max_f / 1e2
stream_dim = 100      # number of times the batch is presented
batch_dim = 500000    # samples per batch
print("mean overflow: batch vs online on %d repetitions" % stream_dim)

# min_ and max_ are both min_f here, so every sample equals min_f.
X = gen_1d_uniform_batch(min_=min_f, max_=min_f, n=batch_dim)

# Batch fit: the mean is computed in one pass and stays finite.
scaler = StandardScaler(with_std=False).fit(X)
print(scaler.mean_)
# Output: [  1.79769313e+301]

# Online fit: the accumulated sample sum overflows after enough repetitions.
iscaler = StandardScaler(with_std=False)
batch = gen_1d_uniform_batch(min_=min_f, max_=min_f, n=batch_dim)
for _ in range(stream_dim):
    iscaler = iscaler.partial_fit(batch)
# Output:
# RuntimeWarning: overflow encountered in add
#   updated_mean = (last_sum + new_sum) / updated_sample_count

print(iscaler.mean_)
# Output: [ inf]

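Contributor guide

Research direction: investigate numerically stable algorithms for online mean and variance computation, such as Welford's algorithm, to prevent overflow in StandardScaler's partial_fit method.
Tech stack: Python
Domain: machine learning
Issue type: bug
Difficulty: 3
Estimated time: 1-2 days
Activity status: active
Clarity: clear
Prerequisites: Python, NumPy, scikit-learn
Beginner friendliness: 65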
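
As a starting point for the research direction named in the guide, the following is a rough sketch of a batch-wise variant of Welford's update (the pairwise combination formula usually attributed to Chan et al.). The class `RunningMoments` is a hypothetical helper written for illustration; it is not scikit-learn's implementation and not necessarily the fix the maintainers would choose. It stores the running mean and the sum of squared deviations instead of the raw sample sum.

```python
import numpy as np


class RunningMoments:
    """Hypothetical helper: per-feature running mean and variance over batches."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def update(self, batch):
        batch = np.asarray(batch, dtype=np.float64)
        n_b = batch.shape[0]
        if n_b == 0:
            return self
        mean_b = batch.mean(axis=0)
        m2_b = ((batch - mean_b) ** 2).sum(axis=0)
        total = self.count + n_b
        delta = mean_b - self.mean
        # Combine the statistics of the samples seen so far with the new batch.
        self.mean = self.mean + delta * (n_b / total)
        self.m2 = self.m2 + m2_b + delta ** 2 * (self.count * n_b / total)
        self.count = total
        return self

    @property
    def var(self):
        # Population variance (ddof=0).
        return self.m2 / self.count


# Usage: stream a small random matrix in chunks and compare with one-pass results.
rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 3))
rm = RunningMoments()
for chunk in np.array_split(X, 7):
    rm.update(chunk)
print(rm.mean, rm.var)
```

This form avoids accumulating an unbounded sum for the mean. The squared deviations can still overflow when the data spread itself is extreme, so it is a sketch of the idea rather than a complete answer to the issue.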
