CUSPARSE: Float16 sparse arrays fail on arithmetic of opposite transpositions · JuliaGPU/CUDA.jl#2284


Labels: bug, cuda libraries, help wanted

Description

Basic arithmetic on transposed sparse Float16 matrices, including addition, fails with an InvalidIRError while compiling the broadcast kernel. This happens when computing expressions of the form A + B', even when both operands have the same element type and the same positions of non-zero elements.

This happens specifically when one side remains in the original order, while the other side is transposed. It is also specific to Float16 elements, as using Float32 or Float64 instead causes no such issues.

CUSPARSE.CuSparseMatrixCSR{T} types are generally usable when T is Float32 or Float64, including for the operation above. The different behavior for Float16 therefore qualifies as a bug.

Example

The following code:

using SparseArrays
using CUDA

T = Float16
A = CUSPARSE.CuSparseMatrixCSR{T}(sparse(T[0 1 0; 1 0 0; 0 0 1]))
B = CUSPARSE.CuSparseMatrixCSR{T}(sparse(T[0 1 0; 1 0 0; 0 0 1]))

CUDA.@allowscalar begin
    @show A + B
    @show A' + B'
    @show A + B'
end

throws an error while evaluating the last @show invocation. The relevant output is:

A + B = Float16[0.0 2.0 0.0; 2.0 0.0 0.0; 0.0 0.0 2.0]
A' + B' = Float16[0.0 2.0 0.0; 2.0 0.0 0.0; 0.0 0.0 2.0]
ERROR: a exception was thrown during kernel execution.
(...)
ERROR: LoadError: InvalidIRError: compiling MethodInstance for CUDA.CUSPARSE.sparse_to_dense_broadcast_kernel(::Type{CUDA.CUSPARSE.CuSparseMatrixCSR{Float16, Int32}}, ::typeof(+), ::CuDeviceMatrix{Float16, 1}, ::CUDA.CUSPARSE.CuSparseDeviceMatrixCSR{Float16, Int32, 1}, ::LinearAlgebra.Adjoint{Float16, CUDA.CUSPARSE.CuSparseDeviceMatrixCSR{Float16, Int32, 1}}) resulted in invalid LLVM IR
Reason: unsupported call to an unknown function (call to julia.new_gc_frame)
Stacktrace:
 [1] ntuple
   @ ./ntuple.jl:49
 [2] sparse_to_dense_broadcast_kernel
   @ ~/.julia/packages/CUDA/htRwP/lib/cusparse/broadcast.jl:429

Computing A' + B yields the same error. Setting T = Float32 or T = Float64 makes the snippet execute properly.

Similar behavior can be observed when the same matrix is used on both sides, i.e. when computing A + A'. This breaks more common operations, such as LinearAlgebra.issymmetric.
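
The single-matrix case can be checked with a short script. This is only a sketch under the assumptions of this report (CUDA.jl 5.2.0 and a functional GPU); it reuses the constructor from the example above and catches any exception so that both checks run. No particular output is claimed here, since it depends on the installed versions.

```julia
using LinearAlgebra
using SparseArrays
using CUDA

# Same 3x3 symmetric test matrix as in the example above.
A = CUSPARSE.CuSparseMatrixCSR{Float16}(sparse(Float16[0 1 0; 1 0 0; 0 0 1]))

# Each check is wrapped in a closure so that one failure does not hide the other.
checks = (
    ("A + A'", () -> A + A'),
    ("issymmetric(A)", () -> issymmetric(A)),
)

CUDA.@allowscalar begin
    for (label, f) in checks
        try
            println(label, " = ", f())
        catch err
            # Per this report, an InvalidIRError is expected for Float16.
            println(label, " failed with ", typeof(err))
        end
    end
end
```

Changing Float16 to Float32 in the constructor gives a baseline to compare against.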

Expected behavior

The expected output, analogous to other types, would be:

A + B = Float16[0.0 2.0 0.0; 2.0 0.0 0.0; 0.0 0.0 2.0]
A' + B' = Float16[0.0 2.0 0.0; 2.0 0.0 0.0; 0.0 0.0 2.0]
A + B' = Float16[0.0 2.0 0.0; 2.0 0.0 0.0; 0.0 0.0 2.0]
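
As a cross-check of the expected values, the same three sums can be computed on the CPU with SparseArrays alone. This is a reference sketch, not part of CUDA.jl; the name `expected` is introduced here for illustration only.

```julia
using SparseArrays
using LinearAlgebra

T = Float16
A = sparse(T[0 1 0; 1 0 0; 0 0 1])
B = sparse(T[0 1 0; 1 0 0; 0 0 1])

# Dense reference result; the input is symmetric, so all three sums agree.
expected = T[0 2 0; 2 0 0; 0 0 2]

@assert Matrix(A + B) == expected
@assert Matrix(A' + B') == expected
@assert Matrix(A + B') == expected   # the combination that fails on the GPU
@assert issymmetric(A)
```

All assertions pass on the CPU for Float16, which is the behavior the GPU types would be expected to match.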

Version info

Details on Julia:

Julia Version 1.10.2
Commit bd47eca2c8a (2024-03-01 10:14 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 16 × Intel(R) Core(TM) i9-9960X CPU @ 3.10GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, skylake-avx512)
Threads: 1 default, 0 interactive, 1 GC (on 16 virtual cores)

Details on CUDA:

CUDA runtime 12.3, artifact installation
CUDA driver 12.2
NVIDIA driver 535.104.12

CUDA libraries: 
- CUBLAS: 12.3.4
- CURAND: 10.3.4
- CUFFT: 11.0.12
- CUSOLVER: 11.5.4
- CUSPARSE: 12.2.0
- CUPTI: 21.0.0
- NVML: 12.0.0+535.104.12

Julia packages: 
- CUDA: 5.2.0
- CUDA_Driver_jll: 0.7.0+1
- CUDA_Runtime_jll: 0.11.1+0

Toolchain:
- Julia: 1.10.2
- LLVM: 15.0.7

2 devices:
  0: NVIDIA GeForce RTX 4090 (sm_89, 23.643 GiB / 23.988 GiB available)
  1: NVIDIA GeForce RTX 4090 (sm_89, 23.642 GiB / 23.988 GiB available)

Contributor guide

Tech stack: none
Area: backend
Issue type: bug
Difficulty: 5
Estimated time: over 1 week
Activity: fresh
Clarity: clear
Prerequisites: Julia programming, CUDA basics, CUSPARSE API, LLVM IR understanding
Beginner friendliness: 10
Investigation approach: Investigate sparse_to_dense_broadcast_kernel in lib/cusparse/broadcast.jl at line 429. The kernel fails to compile when Float16 is combined with operands of opposite transposition. This may involve examining LLVM IR generation or missing Float16 support in a called function. Check whether the issue persists with the latest CUDA.jl version and consider potential workarounds.