JuliaGPU/CUDA.jl

Support CuDeviceArray construction with non-Int dims

Open

#2846 opened on August 13, 2025

Labels: cuda kernels, good first issue

Bug description

When allocating shared memory inside a kernel, the type of the dimension arguments, not just their value, determines whether compilation succeeds or fails with an InvalidIRError.

Minimal working example

The following snippet reproduces the issue: the same kernel compiles with an Int dimension and fails with a UInt8 dimension.

import CUDA, KernelAbstractions as KA

KA.@kernel function example(::Val{shmem_dims}) where {shmem_dims}
    shmem = KA.@localmem UInt32 shmem_dims
    KA.@index(Global, Linear) == 1 && KA.@print(size(shmem), "\n")
end

instant = example(CUDA.CUDABackend())
instant(Val(1); ndrange = 1); CUDA.synchronize(); # Output: (1,)
instant(Val(0x1); ndrange = 1); CUDA.synchronize(); # Output: InvalidIRError

[[deps.CUDA]]
version = "5.8.3" # issue also present on the development branch as of commit d6ad9c3

[[deps.GPUArrays]]
version = "11.2.3"

[[deps.GPUArraysCore]]
version = "0.2.0"

[[deps.GPUCompiler]]
version = "1.6.1"

[[deps.LLVM]]
version = "9.4.2"

Expected behaviour

Shared memory allocation should depend only on the integer value(s) of the dimension argument(s), not on their type.
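
Until non-Int dimensions are supported, one possible workaround is to convert the dimensions to Int on the host before wrapping them in Val. The sketch below is an assumption, not a fix in CUDA.jl: the helper name `int_dims` is hypothetical and is not part of CUDA.jl or KernelAbstractions. It relies only on the observation from the example above that Int-typed dimensions compile.

```julia
# Hypothetical helper (not part of CUDA.jl or KernelAbstractions):
# convert every dimension to Int so that the Val type parameter
# carries Int values regardless of the caller's integer type.
int_dims(dim::Integer) = Int(dim)
int_dims(dims::Tuple{Vararg{Integer}}) = map(Int, dims)

# Both spellings now yield the same Val type, so a kernel taking
# ::Val{shmem_dims} is only ever specialised on Int dimensions.
@assert Val(int_dims(0x1)) === Val(1)
@assert Val(int_dims((0x2, Int32(3)))) === Val((2, 3))
```

With the example kernel, the failing call would then be written as `instant(Val(int_dims(0x1)); ndrange = 1)`, which passes the same `Val{1}` as the working call.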

Version information

Julia:

Julia Version 1.11.6
Commit 9615af0f269 (2025-07-09 12:58 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 12 × Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, skylake)
Threads: 1 default, 0 interactive, 1 GC (on 12 virtual cores)

CUDA:

CUDA toolchain: 
- runtime 12.9, artifact installation
- driver 535.247.1 for 12.2
- compiler 12.9

CUDA libraries: 
- CUBLAS: 12.9.1
- CURAND: 10.3.10
- CUFFT: 11.4.1
- CUSOLVER: 11.7.5
- CUSPARSE: 12.5.10
- CUPTI: 2025.2.1 (API 12.9.1)
- NVML: 12.0.0+535.247.1

Julia packages: 
- CUDA: 5.8.3
- CUDA_Driver_jll: 13.0.0+0
- CUDA_Compiler_jll: 0.2.0+1
- CUDA_Runtime_jll: 0.19.0+0

Toolchain:
- Julia: 1.11.6
- LLVM: 16.0.6

1 device:
  0: NVIDIA GeForce RTX 2070 (sm_75, 7.677 GiB / 8.000 GiB available)

Trace

Running the minimal working example above produces the following stack trace, printed with show(err).

1-element ExceptionStack:
InvalidIRError: compiling MethodInstance for gpu_example(::KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicSize, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}}}, ::Val{0x01}) resulted in invalid LLVM IR
Reason: unsupported call to an unknown function (call to jl_f_apply_type)
Stacktrace:
 [1] Val
   @ ./essentials.jl:1037
 [2] CuStaticSharedArray
   @ ~/.julia/dev/CUDA/src/device/intrinsics/shared_memory.jl:18
 [3] CuStaticSharedArray
   @ ~/.julia/dev/CUDA/src/device/intrinsics/shared_memory.jl:21
 [4] #SharedMemory
   @ ~/.julia/dev/CUDA/src/CUDAKernels.jl:199
 [5] macro expansion
   @ ~/.julia/packages/KernelAbstractions/lGrz7/src/KernelAbstractions.jl:242
 [6] macro expansion
   @ ./REPL[2]:2
 [7] gpu_example
   @ ~/.julia/packages/KernelAbstractions/lGrz7/src/macros.jl:324
 [8] gpu_example
   @ ./none:0
Reason: unsupported call to an unknown function (call to ijl_new_structv)
Stacktrace:
 [1] Val
   @ ./essentials.jl:1035
 [2] Val
   @ ./essentials.jl:1037
 [3] CuStaticSharedArray
   @ ~/.julia/dev/CUDA/src/device/intrinsics/shared_memory.jl:18
 [4] CuStaticSharedArray
   @ ~/.julia/dev/CUDA/src/device/intrinsics/shared_memory.jl:21
 [5] #SharedMemory
   @ ~/.julia/dev/CUDA/src/CUDAKernels.jl:199
 [6] macro expansion
   @ ~/.julia/packages/KernelAbstractions/lGrz7/src/KernelAbstractions.jl:242
 [7] macro expansion
   @ ./REPL[2]:2
 [8] gpu_example
   @ ~/.julia/packages/KernelAbstractions/lGrz7/src/macros.jl:324
 [9] gpu_example
   @ ./none:0
Reason: unsupported dynamic function invocation (call to emit_shmem)
Stacktrace:
 [1] CuStaticSharedArray
   @ ~/.julia/dev/CUDA/src/device/intrinsics/shared_memory.jl:18
 [2] CuStaticSharedArray
   @ ~/.julia/dev/CUDA/src/device/intrinsics/shared_memory.jl:21
 [3] #SharedMemory
   @ ~/.julia/dev/CUDA/src/CUDAKernels.jl:199
 [4] macro expansion
   @ ~/.julia/packages/KernelAbstractions/lGrz7/src/KernelAbstractions.jl:242
 [5] macro expansion
   @ ./REPL[2]:2
 [6] gpu_example
   @ ~/.julia/packages/KernelAbstractions/lGrz7/src/macros.jl:324
 [7] gpu_example
   @ ./none:0
Reason: unsupported dynamic function invocation (call to CUDA.CuDeviceVector{UInt32, 3})
Stacktrace:
 [1] CuStaticSharedArray
   @ ~/.julia/dev/CUDA/src/device/intrinsics/shared_memory.jl:19
 [2] CuStaticSharedArray
   @ ~/.julia/dev/CUDA/src/device/intrinsics/shared_memory.jl:21
 [3] #SharedMemory
   @ ~/.julia/dev/CUDA/src/CUDAKernels.jl:199
 [4] macro expansion
   @ ~/.julia/packages/KernelAbstractions/lGrz7/src/KernelAbstractions.jl:242
 [5] macro expansion
   @ ./REPL[2]:2
 [6] gpu_example
   @ ~/.julia/packages/KernelAbstractions/lGrz7/src/macros.jl:324
 [7] gpu_example
   @ ./none:0
Reason: unsupported dynamic function invocation (call to CUDA.CuDeviceVector{UInt32, 3})
Stacktrace:
 [1] CuDeviceArray
   @ ~/.julia/dev/CUDA/src/device/array.jl:31
 [2] CuStaticSharedArray
   @ ~/.julia/dev/CUDA/src/device/intrinsics/shared_memory.jl:19
 [3] CuStaticSharedArray
   @ ~/.julia/dev/CUDA/src/device/intrinsics/shared_memory.jl:21
 [4] #SharedMemory
   @ ~/.julia/dev/CUDA/src/CUDAKernels.jl:199
 [5] macro expansion
   @ ~/.julia/packages/KernelAbstractions/lGrz7/src/KernelAbstractions.jl:242
 [6] macro expansion
   @ ./REPL[2]:2
 [7] gpu_example
   @ ~/.julia/packages/KernelAbstractions/lGrz7/src/macros.jl:324
 [8] gpu_example
   @ ./none:0
Hint: catch this exception as `err` and call `code_typed(err; interactive = true)` to introspect the erroneous code with Cthulhu.jl
Stacktrace:
  [1] check_ir(job::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, args::LLVM.Module)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/Ecaql/src/validation.jl:167
  [2] macro expansion
    @ ~/.julia/packages/GPUCompiler/Ecaql/src/driver.jl:385 [inlined]
  [3] macro expansion
    @ ~/.julia/packages/Tracy/slmNc/src/tracepoint.jl:163 [inlined]
  [4] macro expansion
    @ ~/.julia/packages/GPUCompiler/Ecaql/src/driver.jl:384 [inlined]
  [5] emit_llvm(job::GPUCompiler.CompilerJob; kwargs::@Kwargs{})
    @ GPUCompiler ~/.julia/packages/GPUCompiler/Ecaql/src/utils.jl:116
  [6] emit_llvm(job::GPUCompiler.CompilerJob)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/Ecaql/src/utils.jl:114
  [7] compile_unhooked(output::Symbol, job::GPUCompiler.CompilerJob; kwargs::@Kwargs{})
    @ GPUCompiler ~/.julia/packages/GPUCompiler/Ecaql/src/driver.jl:95
  [8] compile_unhooked
    @ ~/.julia/packages/GPUCompiler/Ecaql/src/driver.jl:80 [inlined]
  [9] compile(target::Symbol, job::GPUCompiler.CompilerJob; kwargs::@Kwargs{})
    @ GPUCompiler ~/.julia/packages/GPUCompiler/Ecaql/src/driver.jl:67
 [10] compile
    @ ~/.julia/packages/GPUCompiler/Ecaql/src/driver.jl:55 [inlined]
 [11] #1182
    @ ~/.julia/dev/CUDA/src/compiler/compilation.jl:250 [inlined]
 [12] JuliaContext(f::CUDA.var"#1182#1185"{GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}}; kwargs::@Kwargs{})
    @ GPUCompiler ~/.julia/packages/GPUCompiler/Ecaql/src/driver.jl:34
 [13] JuliaContext(f::Function)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/Ecaql/src/driver.jl:25
 [14] compile(job::GPUCompiler.CompilerJob)
    @ CUDA ~/.julia/dev/CUDA/src/compiler/compilation.jl:249
 [15] actual_compilation(cache::Dict{Any, CUDA.CuFunction}, src::Core.MethodInstance, world::UInt64, cfg::GPUCompiler.CompilerConfig{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, compiler::typeof(CUDA.compile), linker::typeof(CUDA.link))
    @ GPUCompiler ~/.julia/packages/GPUCompiler/Ecaql/src/execution.jl:245
 [16] cached_compilation(cache::Dict{Any, CUDA.CuFunction}, src::Core.MethodInstance, cfg::GPUCompiler.CompilerConfig{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, compiler::Function, linker::Function)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/Ecaql/src/execution.jl:159
 [17] macro expansion
    @ ~/.julia/dev/CUDA/src/compiler/execution.jl:373 [inlined]
 [18] macro expansion
    @ ./lock.jl:273 [inlined]
 [19] cufunction(f::typeof(gpu_example), tt::Type{Tuple{KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicSize, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}}}, Val{0x01}}}; kwargs::@Kwargs{always_inline::Bool, maxthreads::Nothing})
    @ CUDA ~/.julia/dev/CUDA/src/compiler/execution.jl:368
 [20] cufunction
    @ ~/.julia/dev/CUDA/src/compiler/execution.jl:365 [inlined]
 [21] macro expansion
    @ ~/.julia/dev/CUDA/src/compiler/execution.jl:112 [inlined]
 [22] (::KernelAbstractions.Kernel{CUDA.CUDAKernels.CUDABackend, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicSize, typeof(gpu_example)})(args::Val{0x01}; ndrange::Int64, workgroupsize::Nothing)
    @ CUDA.CUDAKernels ~/.julia/dev/CUDA/src/CUDAKernels.jl:124
 [23] top-level scope
    @ REPL[5]:1
