ENH: sparse matrix assigning should be optimized rather than using densified matrixes · scipy/scipy#16601

(11 评论) (0 反应) (0 负责人)Python (11,968 star) (4,939 fork)batch import

enhancementgood first issuescipy.sparse

描述

Is your feature request related to a problem? Please describe.

According to the document of LIL sparse matrix, it is good at slicing and is proper to be used to construct sparse matrix. And I have found several packages using LIL sparse matrix to build an adjacent matrix like microsoft/recommenders.

However, the assigning of a sparse matrix to the slice of an LIL matrix will result in memory error, because the assigned matrix will be densified during the process. The call stack is like this below:

Traceback (most recent call last):
  File "test_scipy.py", line 22, in <module>
    mat_lil[1:mini_m+1, 10:mini_n+10] = mini_lil
  File "${PYTHON_ROOT}/lib/python3.7/site-packages/scipy/sparse/lil.py", line 346, in __setitem__
    IndexMixin.__setitem__(self, key, x)
  File "${PYTHON_ROOT}/lib/python3.7/site-packages/scipy/sparse/_index.py", line 103, in __setitem__
    self._set_arrayXarray_sparse(i, j, x)
  File "${PYTHON_ROOT}/lib/python3.7/site-packages/scipy/sparse/lil.py", line 332, in _set_arrayXarray_sparse
    x = np.asarray(x.toarray(), dtype=self.dtype)
  File "${PYTHON_ROOT}/lib/python3.7/site-packages/scipy/sparse/coo.py", line 323, in toarray
    B = self._process_toarray_args(order, out)
  File "${PYTHON_ROOT}/lib/python3.7/site-packages/scipy/sparse/base.py", line 1186, in _process_toarray_args
    return np.zeros(self.shape, dtype=self.dtype, order=order)
numpy.core._exceptions.MemoryError: Unable to allocate 186. GiB for an array with shape (5000, 10000000) and data type float32

This problem can be easily regenerated using the following code:

import scipy.sparse as sp
import numpy as np

rows = [0, 0, 4, 7]
cols = [1, 0, 3, 3]
vals = [2, 1, 3, 9]
m, n = int(1E4), int(1E8)
mat = sp.csr_matrix((vals, (rows, vals)), shape=(m, n), dtype=np.float32)

mini_rows = [0, 0, 4, 7]
mini_cols = [1, 0, 3, 3]
mini_vals = [2, 1, 3, 9]
mini_m, mini_n = int(5E3), int(1E7)
mini_mat = sp.csr_matrix((mini_vals, (mini_rows, mini_cols)), shape=(mini_m, mini_n), dtype=np.float32)

mat_lil = mat.tolil()
mini_lil = mini_mat.tolil()

mat_lil[1:mini_m+1, 10:mini_n+10] = mini_lil

From the code of scipy mentioned in the call stack, I noticed that the assignment of a sparse matrix will be degraded to dense matrix assignment. For a large sparse matrix, that may request an unbearable amount of memory.

For the example above, assigning LIL matrix to a slice of an LIL matrix should be optimized and efficiently processed using its sparse nature, rather than converting to a dense matrix.

I know using the sparse nature to assign maybe has to consider all of the possible pairs of the types of the assigning matrix and the assignee. But I think it is worthy as dense matrixes are unbearable for most of the cases in many actual scenarios.

Describe the solution you'd like.

The best solution would be implementation of all the possible pairs of the types of the assigning matrix and the assignee with sparse natures considered. For example, if an LIL matrix is assigned to the slice of another LIL matrix, the representation of each row of the assignee can be inserted or modified directly.

Describe alternatives you've considered.

If it is impossible to implement all of the possible pairs, maybe implementation for a specific sparse matrix type (LIL for example) and the corresponding documents would be enough.

Additional context (e.g. screenshots, GIFs)

No response

贡献者指南

技术栈: python
领域: performancebackend
议题类型: feature
难度: 3
预计时间: 3-5 days
活动状态: stale
清晰度: mostly clear
前置要求: Basic understanding of sparse matricesFamiliarity with SciPy internalsPython
新手友好度: 40
研究方向: First, reproduce the memory issue using the provided code to understand the current behavior. Then, examine the files `scipy/sparse/lil.py` and `scipy/sparse/ index.py`, specifically the ` set arrayXarray sparse` method in lil.py and the ` setitem ` method in index.py. The goal is to implement direct row level manipulation for LIL to LIL assignment without converting to dense. Review the SciPy contribution guide and sparse matrix development guidelines. Look at any existing discussions in the issue comments for partial solutions or suggestions.