Support CuDeviceArray construction with non-Int dims · JuliaGPU/CUDA.jl#2846


Labels: cuda kernels, good first issue

Description

Bug description

When allocating shared memory inside a kernel, the type of the dimension arguments, and not just their value, determines whether compilation succeeds or fails with an InvalidIRError.
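To make the distinction concrete, here is a small CPU-only illustration (no GPU required) of how two dimensions with the same value yield different `Val` types. It only shows that the compiler sees two distinct type parameters. It does not by itself establish why one of them fails to compile.

```julia
# 0x1 is a UInt8 literal; it compares equal to the Int 1 ...
same_value = (0x1 == 1)

# ... but the two values are not identical, so they produce
# different type parameters when wrapped in Val.
identical   = (0x1 === 1)
t_uint8     = typeof(Val(0x1))   # Val{0x01}
t_int       = typeof(Val(1))     # Val{1}
same_types  = (t_uint8 === t_int)
```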

Minimal working example

The following snippet reproduces the issue.

import CUDA, KernelAbstractions as KA

KA.@kernel function example(::Val{shmem_dims}) where {shmem_dims}
    shmem = KA.@localmem UInt32 shmem_dims
    KA.@index(Global, Linear) == 1 && KA.@print(size(shmem), "\n")
end

instant = example(CUDA.CUDABackend())
instant(Val(1); ndrange = 1); CUDA.synchronize()   # Int dimension: prints (1,)
instant(Val(0x1); ndrange = 1); CUDA.synchronize() # UInt8 dimension: throws InvalidIRError

[[deps.CUDA]]
version = "5.8.3" # issue also present on the development branch at commit d6ad9c3

[[deps.GPUArrays]]
version = "11.2.3"

[[deps.GPUArraysCore]]
version = "0.2.0"

[[deps.GPUCompiler]]
version = "1.6.1"

[[deps.LLVM]]
version = "9.4.2"

Expected behaviour

The allocation of shared memory should only depend on the integral value(s) of the dimension argument(s) and not their underlying type.
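Until the library handles this itself, a caller-side workaround is to convert the dimensions to `Int` before wrapping them in `Val`. The sketch below is a minimal illustration of that idea. The helper name `int_dims` is hypothetical and is not part of CUDA.jl or KernelAbstractions.jl.

```julia
# Hypothetical helper, not part of CUDA.jl or KernelAbstractions.jl:
# convert every dimension to Int before it becomes a type parameter.
int_dims(d::Integer) = Int(d)
int_dims(dims::Tuple{Vararg{Integer}}) = map(Int, dims)

# Both spellings now produce the same singleton type, Val{1}.
v_int   = Val(int_dims(1))
v_uint8 = Val(int_dims(0x1))
```

With the kernel from the minimal working example, the failing call would become `instant(Val(int_dims(0x1)); ndrange = 1)`. Because the argument is then `Val{1}`, it takes the same compilation path as the working `Val(1)` call.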

Version information

Julia:

Julia Version 1.11.6
Commit 9615af0f269 (2025-07-09 12:58 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 12 × Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, skylake)
Threads: 1 default, 0 interactive, 1 GC (on 12 virtual cores)

CUDA:

CUDA toolchain: 
- runtime 12.9, artifact installation
- driver 535.247.1 for 12.2
- compiler 12.9

CUDA libraries: 
- CUBLAS: 12.9.1
- CURAND: 10.3.10
- CUFFT: 11.4.1
- CUSOLVER: 11.7.5
- CUSPARSE: 12.5.10
- CUPTI: 2025.2.1 (API 12.9.1)
- NVML: 12.0.0+535.247.1

Julia packages: 
- CUDA: 5.8.3
- CUDA_Driver_jll: 13.0.0+0
- CUDA_Compiler_jll: 0.2.0+1
- CUDA_Runtime_jll: 0.19.0+0

Toolchain:
- Julia: 1.11.6
- LLVM: 16.0.6

1 device:
  0: NVIDIA GeForce RTX 2070 (sm_75, 7.677 GiB / 8.000 GiB available)

Trace

The minimal working example above produces the following stack trace with show(err).

1-element ExceptionStack:
InvalidIRError: compiling MethodInstance for gpu_example(::KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicSize, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}}}, ::Val{0x01}) resulted in invalid LLVM IR
Reason: unsupported call to an unknown function (call to jl_f_apply_type)
Stacktrace:
 [1] Val
   @ ./essentials.jl:1037
 [2] CuStaticSharedArray
   @ ~/.julia/dev/CUDA/src/device/intrinsics/shared_memory.jl:18
 [3] CuStaticSharedArray
   @ ~/.julia/dev/CUDA/src/device/intrinsics/shared_memory.jl:21
 [4] #SharedMemory
   @ ~/.julia/dev/CUDA/src/CUDAKernels.jl:199
 [5] macro expansion
   @ ~/.julia/packages/KernelAbstractions/lGrz7/src/KernelAbstractions.jl:242
 [6] macro expansion
   @ ./REPL[2]:2
 [7] gpu_example
   @ ~/.julia/packages/KernelAbstractions/lGrz7/src/macros.jl:324
 [8] gpu_example
   @ ./none:0
Reason: unsupported call to an unknown function (call to ijl_new_structv)
Stacktrace:
 [1] Val
   @ ./essentials.jl:1035
 [2] Val
   @ ./essentials.jl:1037
 [3] CuStaticSharedArray
   @ ~/.julia/dev/CUDA/src/device/intrinsics/shared_memory.jl:18
 [4] CuStaticSharedArray
   @ ~/.julia/dev/CUDA/src/device/intrinsics/shared_memory.jl:21
 [5] #SharedMemory
   @ ~/.julia/dev/CUDA/src/CUDAKernels.jl:199
 [6] macro expansion
   @ ~/.julia/packages/KernelAbstractions/lGrz7/src/KernelAbstractions.jl:242
 [7] macro expansion
   @ ./REPL[2]:2
 [8] gpu_example
   @ ~/.julia/packages/KernelAbstractions/lGrz7/src/macros.jl:324
 [9] gpu_example
   @ ./none:0
Reason: unsupported dynamic function invocation (call to emit_shmem)
Stacktrace:
 [1] CuStaticSharedArray
   @ ~/.julia/dev/CUDA/src/device/intrinsics/shared_memory.jl:18
 [2] CuStaticSharedArray
   @ ~/.julia/dev/CUDA/src/device/intrinsics/shared_memory.jl:21
 [3] #SharedMemory
   @ ~/.julia/dev/CUDA/src/CUDAKernels.jl:199
 [4] macro expansion
   @ ~/.julia/packages/KernelAbstractions/lGrz7/src/KernelAbstractions.jl:242
 [5] macro expansion
   @ ./REPL[2]:2
 [6] gpu_example
   @ ~/.julia/packages/KernelAbstractions/lGrz7/src/macros.jl:324
 [7] gpu_example
   @ ./none:0
Reason: unsupported dynamic function invocation (call to CUDA.CuDeviceVector{UInt32, 3})
Stacktrace:
 [1] CuStaticSharedArray
   @ ~/.julia/dev/CUDA/src/device/intrinsics/shared_memory.jl:19
 [2] CuStaticSharedArray
   @ ~/.julia/dev/CUDA/src/device/intrinsics/shared_memory.jl:21
 [3] #SharedMemory
   @ ~/.julia/dev/CUDA/src/CUDAKernels.jl:199
 [4] macro expansion
   @ ~/.julia/packages/KernelAbstractions/lGrz7/src/KernelAbstractions.jl:242
 [5] macro expansion
   @ ./REPL[2]:2
 [6] gpu_example
   @ ~/.julia/packages/KernelAbstractions/lGrz7/src/macros.jl:324
 [7] gpu_example
   @ ./none:0
Reason: unsupported dynamic function invocation (call to CUDA.CuDeviceVector{UInt32, 3})
Stacktrace:
 [1] CuDeviceArray
   @ ~/.julia/dev/CUDA/src/device/array.jl:31
 [2] CuStaticSharedArray
   @ ~/.julia/dev/CUDA/src/device/intrinsics/shared_memory.jl:19
 [3] CuStaticSharedArray
   @ ~/.julia/dev/CUDA/src/device/intrinsics/shared_memory.jl:21
 [4] #SharedMemory
   @ ~/.julia/dev/CUDA/src/CUDAKernels.jl:199
 [5] macro expansion
   @ ~/.julia/packages/KernelAbstractions/lGrz7/src/KernelAbstractions.jl:242
 [6] macro expansion
   @ ./REPL[2]:2
 [7] gpu_example
   @ ~/.julia/packages/KernelAbstractions/lGrz7/src/macros.jl:324
 [8] gpu_example
   @ ./none:0
Hint: catch this exception as `err` and call `code_typed(err; interactive = true)` to introspect the erroneous code with Cthulhu.jl
Stacktrace:
  [1] check_ir(job::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, args::LLVM.Module)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/Ecaql/src/validation.jl:167
  [2] macro expansion
    @ ~/.julia/packages/GPUCompiler/Ecaql/src/driver.jl:385 [inlined]
  [3] macro expansion
    @ ~/.julia/packages/Tracy/slmNc/src/tracepoint.jl:163 [inlined]
  [4] macro expansion
    @ ~/.julia/packages/GPUCompiler/Ecaql/src/driver.jl:384 [inlined]
  [5] emit_llvm(job::GPUCompiler.CompilerJob; kwargs::@Kwargs{})
    @ GPUCompiler ~/.julia/packages/GPUCompiler/Ecaql/src/utils.jl:116
  [6] emit_llvm(job::GPUCompiler.CompilerJob)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/Ecaql/src/utils.jl:114
  [7] compile_unhooked(output::Symbol, job::GPUCompiler.CompilerJob; kwargs::@Kwargs{})
    @ GPUCompiler ~/.julia/packages/GPUCompiler/Ecaql/src/driver.jl:95
  [8] compile_unhooked
    @ ~/.julia/packages/GPUCompiler/Ecaql/src/driver.jl:80 [inlined]
  [9] compile(target::Symbol, job::GPUCompiler.CompilerJob; kwargs::@Kwargs{})
    @ GPUCompiler ~/.julia/packages/GPUCompiler/Ecaql/src/driver.jl:67
 [10] compile
    @ ~/.julia/packages/GPUCompiler/Ecaql/src/driver.jl:55 [inlined]
 [11] #1182
    @ ~/.julia/dev/CUDA/src/compiler/compilation.jl:250 [inlined]
 [12] JuliaContext(f::CUDA.var"#1182#1185"{GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}}; kwargs::@Kwargs{})
    @ GPUCompiler ~/.julia/packages/GPUCompiler/Ecaql/src/driver.jl:34
 [13] JuliaContext(f::Function)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/Ecaql/src/driver.jl:25
 [14] compile(job::GPUCompiler.CompilerJob)
    @ CUDA ~/.julia/dev/CUDA/src/compiler/compilation.jl:249
 [15] actual_compilation(cache::Dict{Any, CUDA.CuFunction}, src::Core.MethodInstance, world::UInt64, cfg::GPUCompiler.CompilerConfig{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, compiler::typeof(CUDA.compile), linker::typeof(CUDA.link))
    @ GPUCompiler ~/.julia/packages/GPUCompiler/Ecaql/src/execution.jl:245
 [16] cached_compilation(cache::Dict{Any, CUDA.CuFunction}, src::Core.MethodInstance, cfg::GPUCompiler.CompilerConfig{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, compiler::Function, linker::Function)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/Ecaql/src/execution.jl:159
 [17] macro expansion
    @ ~/.julia/dev/CUDA/src/compiler/execution.jl:373 [inlined]
 [18] macro expansion
    @ ./lock.jl:273 [inlined]
 [19] cufunction(f::typeof(gpu_example), tt::Type{Tuple{KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicSize, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}}}, Val{0x01}}}; kwargs::@Kwargs{always_inline::Bool, maxthreads::Nothing})
    @ CUDA ~/.julia/dev/CUDA/src/compiler/execution.jl:368
 [20] cufunction
    @ ~/.julia/dev/CUDA/src/compiler/execution.jl:365 [inlined]
 [21] macro expansion
    @ ~/.julia/dev/CUDA/src/compiler/execution.jl:112 [inlined]
 [22] (::KernelAbstractions.Kernel{CUDA.CUDAKernels.CUDABackend, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicSize, typeof(gpu_example)})(args::Val{0x01}; ndrange::Int64, workgroupsize::Nothing)
    @ CUDA.CUDAKernels ~/.julia/dev/CUDA/src/CUDAKernels.jl:124
 [23] top-level scope
    @ REPL[5]:1

Contributor guide

Tech stack: none
Area: backend
Issue type: bug
Difficulty: 3
Estimated time: 1-3 hours
Activity: fresh
Clarity: clear
Prerequisites: Julia programming, CUDA.jl basics, KernelAbstractions.jl usage
Beginner-friendliness: 60
Investigation approach: The bug occurs when non-Int dimension types (e.g., UInt8) are passed to CuStaticSharedArray, which results in invalid LLVM IR. The error trace points to two files: `src/device/intrinsics/shared_memory.jl` (lines 18-21, in CuStaticSharedArray) and `src/device/array.jl` (line 31, in CuDeviceArray). According to the trace, CuStaticSharedArray calls `Val` and `emit_shmem` at line 18 and the CuDeviceArray constructor at line 19, and each of these calls is reported as unsupported or dynamic when the dimensions are not `Int`. The fix should convert the dimension arguments to `Int` before they are used to construct the array type. Review the current implementation in those files and consider adding that conversion. The existing tests in `test/device/shared_memory.jl` should be expanded to cover non-Int dimensions.
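The conversion suggested above can be sketched on the CPU with a stand-in type. `MockDeviceArray` below is purely illustrative and is not the real `CUDA.CuDeviceArray`; the actual constructors in `src/device/array.jl` may be structured differently. The point is only that an outer constructor can accept any `Integer` dimensions and store them as `Int`.

```julia
# Stand-in for a device array type; NOT the real CUDA.CuDeviceArray.
struct MockDeviceArray{T,N}
    dims::Dims{N}   # Dims{N} is an alias for NTuple{N,Int}
end

# Outer constructors: accept dimensions of any Integer type, infer N from
# the tuple length, and convert each dimension to Int before construction.
MockDeviceArray{T}(dims::NTuple{N,Integer}) where {T,N} =
    MockDeviceArray{T,N}(map(Int, dims))
MockDeviceArray{T}(dims::Integer...) where {T} = MockDeviceArray{T}(dims)

Base.size(a::MockDeviceArray) = a.dims

a = MockDeviceArray{UInt32}(0x1)               # UInt8 dimension
b = MockDeviceArray{UInt32}((0x2, Int32(3)))   # mixed Integer types
```

Here `a` and `b` have the types `MockDeviceArray{UInt32,1}` and `MockDeviceArray{UInt32,2}` regardless of the integer type used for the dimensions, which is the behaviour the issue asks for.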