JuliaGPU/CUDA.jl

Weird bug with matrix multiplication on Array view

Open

#2,821 opened on Jul 24, 2025

Labels: cuda libraries, help wanted

Description

Describe the bug

Matrix multiplication with an array view returns the wrong result for the majority of element-type combinations, with no clear pattern that I can see. The examples below show one correct and three incorrect results.

To reproduce

The Minimal Working Example (MWE) for this bug:


julia> z = CUDA.rand(2,2,1)
2×2×1 CuArray{Float32, 3, CUDA.DeviceMemory}:
[:, :, 1] =
 0.615746  0.619146
 0.585899  0.0667595

julia> zf = Float64.(z)
2×2×1 CuArray{Float64, 3, CUDA.DeviceMemory}:
[:, :, 1] =
 0.615746  0.619146
 0.585899  0.0667595

julia> rot = CuArray(([1.0f0 0.0f0; 0.0f0 1.0f0]))
2×2 CuArray{Float32, 2, CUDA.DeviceMemory}:
 1.0  0.0
 0.0  1.0

julia> rotf = Float64.(rot)
2×2 CuArray{Float64, 2, CUDA.DeviceMemory}:
 1.0  0.0
 0.0  1.0

julia> view(z,1,:,:)
2×1 view(::CuArray{Float32, 3, CUDA.DeviceMemory}, 1, :, :) with eltype Float32:
 0.61574584
 0.61914593

julia> view(zf,1,:,:)
2×1 view(::CuArray{Float64, 3, CUDA.DeviceMemory}, 1, :, :) with eltype Float64:
 0.6157458424568176
 0.619145929813385

# wrong result (second entry is z[2,1,1] instead of z[1,2,1])
julia> rot * view(z,1,:,:)
2×1 CuArray{Float32, 2, CUDA.DeviceMemory}:
 0.61574584
 0.5858987

# right result
julia> rotf * view(z,1,:,:)
2×1 CuArray{Float64, 2, CUDA.DeviceMemory}:
 0.6157458424568176
 0.619145929813385

# wrong result (second entry is zf[2,1,1] instead of zf[1,2,1])
julia> rot * view(zf,1,:,:)
2×1 CuArray{Float64, 2, CUDA.DeviceMemory}:
 0.6157458424568176
 0.5858986973762512

# wrong result (second entry is zf[2,1,1] instead of zf[1,2,1])
julia> rotf * view(zf,1,:,:)
2×1 CuArray{Float64, 2, CUDA.DeviceMemory}:
 0.6157458424568176
 0.5858986973762512

I am using CUDA.jl v5.8.2 ([052768ef]).
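One possible reading of the wrong outputs, which is not confirmed anywhere in this report: the second entry of each wrong result equals the element that sits next to the first one in the parent array's memory, as if the view's stride were ignored and the data were read contiguously. The sketch below is CPU-only and uses a plain `Array` with the values copied from the transcript above as printed (rounded); it only illustrates the memory layout and does not exercise CUDA.jl or CUBLAS. The variable names are illustrative.

```julia
# CPU-only sketch; values are copied from the report as printed (rounded).
# Column-major order: z[1,1,1], z[2,1,1], z[1,2,1], z[2,2,1]
z = reshape(Float32[0.615746, 0.585899, 0.619146, 0.0667595], 2, 2, 1)

# Same view as in the report: 2×1, covering z[1,1,1] and z[1,2,1].
v = view(z, 1, :, :)

# The view steps through the parent's memory two elements at a time.
view_strides = strides(v)

# What the view actually contains (matches the "right result" in the report).
strided_read = collect(vec(v))

# What a contiguous read from the same starting offset would return
# (matches the "wrong result" outputs in the report).
contiguous_read = vec(z)[1:2]

println("strides of view: ", view_strides)
println("strided read:    ", strided_read)
println("contiguous read: ", contiguous_read)
```

If this reading holds, multiplying by a materialized slice such as `z[1, :, :]` instead of the view might sidestep the problem, but that is an untested assumption and not something verified in this report.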

Version info

Details on Julia:

julia> versioninfo()

Julia Version 1.11.6
Commit 9615af0f269 (2025-07-09 12:58 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 96 × Intel(R) Xeon(R) Platinum 8268 CPU @ 2.90GHz
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, cascadelake)
Threads: 1 default, 0 interactive, 1 GC (on 96 virtual cores)
Environment:
  JULIA_PKG_USE_CLI_GIT = true
  JULIA_DEPOT_PATH = /scratch/bt62/sn8885/.julia

Details on CUDA:

julia> CUDA.versioninfo()
CUDA runtime 12.9, artifact installation
CUDA driver 12.9
NVIDIA driver 570.124.6

CUDA libraries: 
- CUBLAS: 12.9.1
- CURAND: 10.3.10
- CUFFT: 11.4.1
- CUSOLVER: 11.7.5
- CUSPARSE: 12.5.10
- CUPTI: 2025.2.1 (API 28.0.0)
- NVML: 12.0.0+570.124.6

Julia packages: 
- CUDA: 5.8.2
- CUDA_Driver_jll: 0.13.1+0
- CUDA_Runtime_jll: 0.17.1+0

Toolchain:
- Julia: 1.11.6
- LLVM: 16.0.6

1 device:
  0: Tesla V100-SXM2-32GB (sm_70, 31.352 GiB / 32.000 GiB available)
