Description
The recent implementation of partial_fit for StandardScaler can overflow. One use case for partial_fit is transforming an indefinitely long stream of data, but that is problematic with the current implementation: to compute the running mean, we keep track of the sample sum, and that sum keeps growing as more samples arrive.
Here is the code to reproduce the behavior. Simulating a long stream of data would take a long time; instead, I use samples with a very large norm, which has the same effect. The same batch is presented to the transformer many times, so the mean should stay the same.
from sklearn.preprocessing import StandardScaler
import numpy as np
rng = np.random.RandomState(0)
def gen_1d_uniform_batch(min_, max_, n):
    return rng.uniform(min_, max_, size=(n, 1))
max_f = np.finfo(np.float64).max / 1e5
min_f = max_f / 1e2
stream_dim = 100
batch_dim = 500000
print("mean overflow: batch vs online on %d repetitions" % stream_dim)
X = gen_1d_uniform_batch(min_=min_f, max_=min_f, n=batch_dim)
scaler = StandardScaler(with_std=False).fit(X)
print(scaler.mean_)
[ 1.79769313e+301]
iscaler = StandardScaler(with_std=False)
batch = gen_1d_uniform_batch(min_=min_f, max_=min_f, n=batch_dim)
for _ in range(stream_dim):
    iscaler = iscaler.partial_fit(batch)
RuntimeWarning: overflow encountered in add
updated_mean = (last_sum + new_sum) / updated_sample_count
print(iscaler.mean_)
[ inf]
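For comparison, here is a minimal NumPy-only sketch of the two ways to update a running mean. It is not the scikit-learn code: the function names update_mean_by_sum and update_mean_incrementally are hypothetical, and the incremental form is just one possible alternative, not a proposed patch. The first function mirrors the expression shown in the warning above; the second moves the previous mean toward the batch mean and never stores the total sum.

```python
import numpy as np


def update_mean_by_sum(last_mean, last_count, batch):
    # Rebuild the total sum from the previous mean, then add the new batch.
    # The total sum grows with every batch and can exceed the float64 range.
    last_sum = last_mean * last_count
    new_sum = batch.sum(axis=0)
    updated_count = last_count + batch.shape[0]
    return (last_sum + new_sum) / updated_count, updated_count


def update_mean_incrementally(last_mean, last_count, batch):
    # Hypothetical alternative: shift the previous mean toward the batch
    # mean, weighted by the batch's share of all samples seen so far.
    # Only per-batch quantities are computed, never the total sum.
    updated_count = last_count + batch.shape[0]
    batch_mean = batch.mean(axis=0)
    weight = batch.shape[0] / updated_count
    return last_mean + (batch_mean - last_mean) * weight, updated_count


# Same magnitudes as in the report; a constant batch keeps the run deterministic.
value = np.finfo(np.float64).max / 1e7
batch = np.full((500000, 1), value)

sum_mean, sum_count = np.zeros(1), 0
inc_mean, inc_count = np.zeros(1), 0
with np.errstate(over="ignore"):  # silence the expected overflow warning
    for _ in range(100):
        sum_mean, sum_count = update_mean_by_sum(sum_mean, sum_count, batch)
        inc_mean, inc_count = update_mean_incrementally(inc_mean, inc_count, batch)

print("sum-based mean:  ", sum_mean)
print("incremental mean:", inc_mean)
```

Note that the incremental variant still sums within a single batch when it calls batch.mean, so it only helps when each individual batch sum fits in float64, as it does here.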