Bug description
When allocating shared memory inside a kernel, the type of the dimension values, not just their integer value, determines whether compilation succeeds or fails with an InvalidIRError. In the example below, a dimension of 1 (Int64) compiles, while 0x1 (UInt8) does not.
Minimal working example
The following snippet reproduces the issue.
import CUDA, KernelAbstractions as KA

KA.@kernel function example(::Val{shmem_dims}) where {shmem_dims}
    shmem = KA.@localmem UInt32 shmem_dims
    KA.@index(Global, Linear) == 1 && KA.@print(size(shmem), "\n")
end

instant = example(CUDA.CUDABackend())
instant(Val(1); ndrange = 1); CUDA.synchronize();   # Output: (1,)
instant(Val(0x1); ndrange = 1); CUDA.synchronize(); # Output: InvalidIRError
[[deps.CUDA]]
version = "5.8.3" # issue also present on the development branch at commit d6ad9c3
[[deps.GPUArrays]]
version = "11.2.3"
[[deps.GPUArraysCore]]
version = "0.2.0"
[[deps.GPUCompiler]]
version = "1.6.1"
[[deps.LLVM]]
version = "9.4.2"
Expected behaviour
The allocation of shared memory should only depend on the integral value(s) of the dimension argument(s) and not their underlying type.
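Until the allocation is type-independent, one possible workaround is to convert the dimensions to Int on the host before wrapping them in Val. Val(1) and Val(0x1) are distinct types, and only the former is known to compile here. The sketch below is plain Julia and needs no GPU; normalize_dims is a hypothetical helper name, not part of CUDA.jl or KernelAbstractions.jl.

```julia
# Hypothetical helper (not part of CUDA.jl or KernelAbstractions.jl):
# convert integer dimensions to Int so that Val always carries Int parameters.
normalize_dims(d::Integer) = Int(d)
normalize_dims(d::Tuple{Vararg{Integer}}) = map(Int, d)

# Val(1) and Val(0x1) are instances of different types, Val{1} and Val{0x01}.
@assert typeof(Val(1)) != typeof(Val(0x1))

# After normalization, both spellings yield the same type.
@assert Val(normalize_dims(0x1)) === Val(1)
@assert Val(normalize_dims((0x2, UInt16(3)))) === Val((2, 3))
```

With the reproducer above, the failing call would become instant(Val(normalize_dims(0x1)); ndrange = 1), which is the same call as the working Val(1) case.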
Version information
Julia:
Julia Version 1.11.6
Commit 9615af0f269 (2025-07-09 12:58 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 12 × Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz
WORD_SIZE: 64
LLVM: libLLVM-16.0.6 (ORCJIT, skylake)
Threads: 1 default, 0 interactive, 1 GC (on 12 virtual cores)
CUDA:
CUDA toolchain:
- runtime 12.9, artifact installation
- driver 535.247.1 for 12.2
- compiler 12.9
CUDA libraries:
- CUBLAS: 12.9.1
- CURAND: 10.3.10
- CUFFT: 11.4.1
- CUSOLVER: 11.7.5
- CUSPARSE: 12.5.10
- CUPTI: 2025.2.1 (API 12.9.1)
- NVML: 12.0.0+535.247.1
Julia packages:
- CUDA: 5.8.3
- CUDA_Driver_jll: 13.0.0+0
- CUDA_Compiler_jll: 0.2.0+1
- CUDA_Runtime_jll: 0.19.0+0
Toolchain:
- Julia: 1.11.6
- LLVM: 16.0.6
1 device:
0: NVIDIA GeForce RTX 2070 (sm_75, 7.677 GiB / 8.000 GiB available)
Trace
The minimal working example above produces the following stack trace when the exception is printed with show(err).
1-element ExceptionStack:
InvalidIRError: compiling MethodInstance for gpu_example(::KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicSize, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}}}, ::Val{0x01}) resulted in invalid LLVM IR
Reason: unsupported call to an unknown function (call to jl_f_apply_type)
Stacktrace:
[1] Val
@ ./essentials.jl:1037
[2] CuStaticSharedArray
@ ~/.julia/dev/CUDA/src/device/intrinsics/shared_memory.jl:18
[3] CuStaticSharedArray
@ ~/.julia/dev/CUDA/src/device/intrinsics/shared_memory.jl:21
[4] #SharedMemory
@ ~/.julia/dev/CUDA/src/CUDAKernels.jl:199
[5] macro expansion
@ ~/.julia/packages/KernelAbstractions/lGrz7/src/KernelAbstractions.jl:242
[6] macro expansion
@ ./REPL[2]:2
[7] gpu_example
@ ~/.julia/packages/KernelAbstractions/lGrz7/src/macros.jl:324
[8] gpu_example
@ ./none:0
Reason: unsupported call to an unknown function (call to ijl_new_structv)
Stacktrace:
[1] Val
@ ./essentials.jl:1035
[2] Val
@ ./essentials.jl:1037
[3] CuStaticSharedArray
@ ~/.julia/dev/CUDA/src/device/intrinsics/shared_memory.jl:18
[4] CuStaticSharedArray
@ ~/.julia/dev/CUDA/src/device/intrinsics/shared_memory.jl:21
[5] #SharedMemory
@ ~/.julia/dev/CUDA/src/CUDAKernels.jl:199
[6] macro expansion
@ ~/.julia/packages/KernelAbstractions/lGrz7/src/KernelAbstractions.jl:242
[7] macro expansion
@ ./REPL[2]:2
[8] gpu_example
@ ~/.julia/packages/KernelAbstractions/lGrz7/src/macros.jl:324
[9] gpu_example
@ ./none:0
Reason: unsupported dynamic function invocation (call to emit_shmem)
Stacktrace:
[1] CuStaticSharedArray
@ ~/.julia/dev/CUDA/src/device/intrinsics/shared_memory.jl:18
[2] CuStaticSharedArray
@ ~/.julia/dev/CUDA/src/device/intrinsics/shared_memory.jl:21
[3] #SharedMemory
@ ~/.julia/dev/CUDA/src/CUDAKernels.jl:199
[4] macro expansion
@ ~/.julia/packages/KernelAbstractions/lGrz7/src/KernelAbstractions.jl:242
[5] macro expansion
@ ./REPL[2]:2
[6] gpu_example
@ ~/.julia/packages/KernelAbstractions/lGrz7/src/macros.jl:324
[7] gpu_example
@ ./none:0
Reason: unsupported dynamic function invocation (call to CUDA.CuDeviceVector{UInt32, 3})
Stacktrace:
[1] CuStaticSharedArray
@ ~/.julia/dev/CUDA/src/device/intrinsics/shared_memory.jl:19
[2] CuStaticSharedArray
@ ~/.julia/dev/CUDA/src/device/intrinsics/shared_memory.jl:21
[3] #SharedMemory
@ ~/.julia/dev/CUDA/src/CUDAKernels.jl:199
[4] macro expansion
@ ~/.julia/packages/KernelAbstractions/lGrz7/src/KernelAbstractions.jl:242
[5] macro expansion
@ ./REPL[2]:2
[6] gpu_example
@ ~/.julia/packages/KernelAbstractions/lGrz7/src/macros.jl:324
[7] gpu_example
@ ./none:0
Reason: unsupported dynamic function invocation (call to CUDA.CuDeviceVector{UInt32, 3})
Stacktrace:
[1] CuDeviceArray
@ ~/.julia/dev/CUDA/src/device/array.jl:31
[2] CuStaticSharedArray
@ ~/.julia/dev/CUDA/src/device/intrinsics/shared_memory.jl:19
[3] CuStaticSharedArray
@ ~/.julia/dev/CUDA/src/device/intrinsics/shared_memory.jl:21
[4] #SharedMemory
@ ~/.julia/dev/CUDA/src/CUDAKernels.jl:199
[5] macro expansion
@ ~/.julia/packages/KernelAbstractions/lGrz7/src/KernelAbstractions.jl:242
[6] macro expansion
@ ./REPL[2]:2
[7] gpu_example
@ ~/.julia/packages/KernelAbstractions/lGrz7/src/macros.jl:324
[8] gpu_example
@ ./none:0
Hint: catch this exception as `err` and call `code_typed(err; interactive = true)` to introspect the erroneous code with Cthulhu.jl
Stacktrace:
[1] check_ir(job::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, args::LLVM.Module)
@ GPUCompiler ~/.julia/packages/GPUCompiler/Ecaql/src/validation.jl:167
[2] macro expansion
@ ~/.julia/packages/GPUCompiler/Ecaql/src/driver.jl:385 [inlined]
[3] macro expansion
@ ~/.julia/packages/Tracy/slmNc/src/tracepoint.jl:163 [inlined]
[4] macro expansion
@ ~/.julia/packages/GPUCompiler/Ecaql/src/driver.jl:384 [inlined]
[5] emit_llvm(job::GPUCompiler.CompilerJob; kwargs::@Kwargs{})
@ GPUCompiler ~/.julia/packages/GPUCompiler/Ecaql/src/utils.jl:116
[6] emit_llvm(job::GPUCompiler.CompilerJob)
@ GPUCompiler ~/.julia/packages/GPUCompiler/Ecaql/src/utils.jl:114
[7] compile_unhooked(output::Symbol, job::GPUCompiler.CompilerJob; kwargs::@Kwargs{})
@ GPUCompiler ~/.julia/packages/GPUCompiler/Ecaql/src/driver.jl:95
[8] compile_unhooked
@ ~/.julia/packages/GPUCompiler/Ecaql/src/driver.jl:80 [inlined]
[9] compile(target::Symbol, job::GPUCompiler.CompilerJob; kwargs::@Kwargs{})
@ GPUCompiler ~/.julia/packages/GPUCompiler/Ecaql/src/driver.jl:67
[10] compile
@ ~/.julia/packages/GPUCompiler/Ecaql/src/driver.jl:55 [inlined]
[11] #1182
@ ~/.julia/dev/CUDA/src/compiler/compilation.jl:250 [inlined]
[12] JuliaContext(f::CUDA.var"#1182#1185"{GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}}; kwargs::@Kwargs{})
@ GPUCompiler ~/.julia/packages/GPUCompiler/Ecaql/src/driver.jl:34
[13] JuliaContext(f::Function)
@ GPUCompiler ~/.julia/packages/GPUCompiler/Ecaql/src/driver.jl:25
[14] compile(job::GPUCompiler.CompilerJob)
@ CUDA ~/.julia/dev/CUDA/src/compiler/compilation.jl:249
[15] actual_compilation(cache::Dict{Any, CUDA.CuFunction}, src::Core.MethodInstance, world::UInt64, cfg::GPUCompiler.CompilerConfig{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, compiler::typeof(CUDA.compile), linker::typeof(CUDA.link))
@ GPUCompiler ~/.julia/packages/GPUCompiler/Ecaql/src/execution.jl:245
[16] cached_compilation(cache::Dict{Any, CUDA.CuFunction}, src::Core.MethodInstance, cfg::GPUCompiler.CompilerConfig{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, compiler::Function, linker::Function)
@ GPUCompiler ~/.julia/packages/GPUCompiler/Ecaql/src/execution.jl:159
[17] macro expansion
@ ~/.julia/dev/CUDA/src/compiler/execution.jl:373 [inlined]
[18] macro expansion
@ ./lock.jl:273 [inlined]
[19] cufunction(f::typeof(gpu_example), tt::Type{Tuple{KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicSize, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}}}, Val{0x01}}}; kwargs::@Kwargs{always_inline::Bool, maxthreads::Nothing})
@ CUDA ~/.julia/dev/CUDA/src/compiler/execution.jl:368
[20] cufunction
@ ~/.julia/dev/CUDA/src/compiler/execution.jl:365 [inlined]
[21] macro expansion
@ ~/.julia/dev/CUDA/src/compiler/execution.jl:112 [inlined]
[22] (::KernelAbstractions.Kernel{CUDA.CUDAKernels.CUDABackend, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicSize, typeof(gpu_example)})(args::Val{0x01}; ndrange::Int64, workgroupsize::Nothing)
@ CUDA.CUDAKernels ~/.julia/dev/CUDA/src/CUDAKernels.jl:124
[23] top-level scope
@ REPL[5]:1