scipy/scipy

ENH: sparse matrix assigning should be optimized rather than using densified matrixes

Open

#16,601 创建于 2022年7月14日

在 GitHub 查看
 (11 评论) (0 反应) (0 负责人)Python (11,968 star) (4,939 fork)batch import
enhancementgood first issuescipy.sparse

描述

Is your feature request related to a problem? Please describe.

According to the document of LIL sparse matrix, it is good at slicing and is proper to be used to construct sparse matrix. And I have found several packages using LIL sparse matrix to build an adjacent matrix like microsoft/recommenders.

However, the assigning of a sparse matrix to the slice of an LIL matrix will result in memory error, because the assigned matrix will be densified during the process. The call stack is like this below:

Traceback (most recent call last):
  File "test_scipy.py", line 22, in <module>
    mat_lil[1:mini_m+1, 10:mini_n+10] = mini_lil
  File "${PYTHON_ROOT}/lib/python3.7/site-packages/scipy/sparse/lil.py", line 346, in __setitem__
    IndexMixin.__setitem__(self, key, x)
  File "${PYTHON_ROOT}/lib/python3.7/site-packages/scipy/sparse/_index.py", line 103, in __setitem__
    self._set_arrayXarray_sparse(i, j, x)
  File "${PYTHON_ROOT}/lib/python3.7/site-packages/scipy/sparse/lil.py", line 332, in _set_arrayXarray_sparse
    x = np.asarray(x.toarray(), dtype=self.dtype)
  File "${PYTHON_ROOT}/lib/python3.7/site-packages/scipy/sparse/coo.py", line 323, in toarray
    B = self._process_toarray_args(order, out)
  File "${PYTHON_ROOT}/lib/python3.7/site-packages/scipy/sparse/base.py", line 1186, in _process_toarray_args
    return np.zeros(self.shape, dtype=self.dtype, order=order)
numpy.core._exceptions.MemoryError: Unable to allocate 186. GiB for an array with shape (5000, 10000000) and data type float32

This problem can be easily regenerated using the following code:

import scipy.sparse as sp
import numpy as np

rows = [0, 0, 4, 7]
cols = [1, 0, 3, 3]
vals = [2, 1, 3, 9]
m, n = int(1E4), int(1E8)
mat = sp.csr_matrix((vals, (rows, vals)), shape=(m, n), dtype=np.float32)

mini_rows = [0, 0, 4, 7]
mini_cols = [1, 0, 3, 3]
mini_vals = [2, 1, 3, 9]
mini_m, mini_n = int(5E3), int(1E7)
mini_mat = sp.csr_matrix((mini_vals, (mini_rows, mini_cols)), shape=(mini_m, mini_n), dtype=np.float32)

mat_lil = mat.tolil()
mini_lil = mini_mat.tolil()

mat_lil[1:mini_m+1, 10:mini_n+10] = mini_lil

From the code of scipy mentioned in the call stack, I noticed that the assignment of a sparse matrix will be degraded to dense matrix assignment. For a large sparse matrix, that may request an unbearable amount of memory.

For the example above, assigning LIL matrix to a slice of an LIL matrix should be optimized and efficiently processed using its sparse nature, rather than converting to a dense matrix.

I know using the sparse nature to assign maybe has to consider all of the possible pairs of the types of the assigning matrix and the assignee. But I think it is worthy as dense matrixes are unbearable for most of the cases in many actual scenarios.

Describe the solution you'd like.

The best solution would be implementation of all the possible pairs of the types of the assigning matrix and the assignee with sparse natures considered. For example, if an LIL matrix is assigned to the slice of another LIL matrix, the representation of each row of the assignee can be inserted or modified directly.

Describe alternatives you've considered.

If it is impossible to implement all of the possible pairs, maybe implementation for a specific sparse matrix type (LIL for example) and the corresponding documents would be enough.

Additional context (e.g. screenshots, GIFs)

No response

贡献者指南