scikit-learn/scikit-learn

Single linkage option in Agglomerative causes MemoryError for very large numbers

Open

#17,960 opened on Jul 20, 2020

View on GitHub
 (6 comments) (0 reactions) (0 assignees)Python (66,084 stars) (27,020 forks)batch import
Bughelp wantedmodule:cluster

Description

Describe the bug

Using the single linkage option in Agglomerative clustering results in a MemoryError when the input data contains very large values, likely due to numeric overflows.

Steps/Code to Reproduce

Example: Running the code below with half of the sample data does not produce a MemoryError (although the same numpy RuntimeWarning is being thrown). However, if the data contains as many very large numbers as in my sample below my machine with 16Gb RAM runs out of memory and I get a MemoryError. To me this seems a bit disproportionate as the sample data is not very large.

from sklearn.cluster import AgglomerativeClustering

X = [[1.30830774e+307, 6.02217328e+307],
     [1.54166067e+308, 1.75812744e+308],
     [5.57938866e+307, 4.13840113e+307],
     [1.36302835e+308, 1.07968131e+308],
     [1.58772669e+308, 1.19380571e+307],
     [2.20362426e+307, 1.58814671e+308],
     [1.06216028e+308, 1.14258583e+308],
     [7.18031911e+307, 1.69661213e+308],
     [7.91182553e+307, 5.12892426e+307],
     [5.58470885e+307, 9.13566765e+306],
     [1.22366243e+308, 8.29427922e+307],
     [4.39205961e+306, 1.26048413e+308],
     [4.61599953e+306, 7.24075646e+307],
     [1.66596896e+308, 1.65498552e+308],
     [4.73958815e+307, 7.66412710e+307],
     [1.57013390e+308, 1.03527051e+308],
     [1.28464631e+308, 1.68216358e+308],
     [1.07121506e+308, 3.11489418e+307],
     [2.97524276e+307, 1.59238260e+306],
     [1.24529964e+308, 5.18478922e+306],
     [7.55088957e+307, 7.08726240e+307],
     [8.38161061e+307, 1.50363727e+307],
     [1.11502738e+308, 3.77564517e+307],
     [1.33977566e+308, 4.21606136e+307],
     [3.18410347e+306, 1.41182675e+308],
     [1.37949806e+307, 1.33091975e+308],
     [8.84125355e+307, 8.83334687e+307],
     [4.16683700e+307, 4.38042780e+307],
     [6.33989592e+307, 1.13127438e+308],
     [3.30270525e+307, 5.82271903e+307],
     [1.73464450e+308, 1.57922601e+308],
     [8.50840567e+307, 8.34750980e+307],
     [1.06389805e+308, 7.01361194e+307],
     [1.20971776e+308, 1.42173002e+308],
     [1.91271271e+307, 7.63302091e+307],
     [1.46375293e+308, 3.27352625e+307],
     [5.64255485e+307, 1.62562194e+308],
     [5.96180877e+307, 4.31806469e+307],
     [1.18773200e+308, 4.91146568e+307],
     [1.19298612e+307, 1.54110558e+308],
     [3.51302340e+307, 5.60375139e+307],
     [5.14281492e+307, 9.76302343e+307],
     [1.77408023e+308, 1.65760854e+308],
     [1.46544741e+308, 1.47445640e+308],
     [1.76109403e+308, 1.30923042e+308],
     [1.64984146e+308, 2.58303609e+307],
     [1.61558758e+307, 1.03985868e+308],
     [1.37977676e+308, 4.90921157e+307],
     [1.01105745e+308, 1.57678709e+307],
     [1.24672794e+308, 5.96657664e+307]]

clusterer = AgglomerativeClustering(linkage='single')
clusterer.fit(X)

Expected Results

No error is thrown or ValueError that specifies the range of allowed values. GaussianMixture clustering is handling the same data like this:

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

Actual Results

C:\Program Files\Python37\lib\site-packages\numpy\core\fromnumeric.py:90: RuntimeWarning: overflow encountered in reduce
  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)

Traceback (most recent call last):
  File "C:/Users/thaar/PycharmProjects/sklearn-dev/agglomerative_test.py", line 58, in <module>
    clusterer.fit(X)
  File "C:\Program Files\Python37\lib\site-packages\sklearn\cluster\_agglomerative.py", line 898, in fit
    self.n_leaves_)
  File "C:\Program Files\Python37\lib\site-packages\sklearn\cluster\_agglomerative.py", line 674, in _hc_cut
    label[_hierarchical._hc_get_descendent(-node, children, n_leaves)] = i
  File "sklearn\cluster\_hierarchical_fast.pyx", line 96, in sklearn.cluster._hierarchical_fast._hc_get_descendent
MemoryError

Versions

System: python: 3.7.3 (v3.7.3:ef4ec6ed12, Mar 25 2019, 22:22:05) [MSC v.1916 64 bit (AMD64)] executable: C:\Users\thaar\PycharmProjects\sklearn-dev\venv\Scripts\python.exe machine: Windows-10-10.0.18362-SP0

Python dependencies: pip: 20.1.1 setuptools: 49.2.0 sklearn: 0.23.1 numpy: 1.18.4 scipy: 1.4.1 Cython: None pandas: 1.0.3 matplotlib: 3.2.1 joblib: 0.14.1 threadpoolctl: 2.0.0

Built with OpenMP: True

Contributor guide