[Bug] RuntimeError: CUDA error: an illegal instruction was encountered · RVC-Project/Retrieval-based-Voice-Conversion-WebUI#215

7 comments · 0 reactions · 0 assignees · Python · 2,849 forks

Labels: bug, help wanted, question

Repository metrics

Stars: 18,427
PR merge metrics: no merged PRs in the last 30 days

Description

Environment: Ubuntu, 2x 3060 GPUs (12 GB each), latest version of RVC (commit c4a1810).

During training, after a few epochs complete, a CUDA error is thrown:


INFO:user-test-3:====> Epoch: 1
INFO:user-test-3:Train Epoch: 2 [11%]
INFO:user-test-3:[200, 9.99875e-05]
INFO:user-test-3:loss_disc=3.124, loss_gen=2.644, loss_fm=8.702,loss_mel=19.773, loss_kl=1.555
INFO:user-test-3:====> Epoch: 2
INFO:user-test-3:Train Epoch: 3 [22%]
INFO:user-test-3:[400, 9.99750015625e-05]
INFO:user-test-3:loss_disc=3.009, loss_gen=2.687, loss_fm=8.580,loss_mel=19.066, loss_kl=1.653
INFO:user-test-3:====> Epoch: 3
INFO:user-test-3:Train Epoch: 4 [33%]
INFO:user-test-3:[600, 9.996250468730469e-05]
INFO:user-test-3:loss_disc=3.033, loss_gen=2.489, loss_fm=7.798,loss_mel=18.964, loss_kl=1.770
INFO:user-test-3:====> Epoch: 4
INFO:user-test-3:Train Epoch: 5 [44%]
INFO:user-test-3:[800, 9.995000937421877e-05]
INFO:user-test-3:loss_disc=2.957, loss_gen=2.745, loss_fm=7.675,loss_mel=18.730, loss_kl=1.756
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: an illegal instruction was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f37fab9e4d7 in /home/user/rvc-test/Retrieval-based-Voice-Conversion-WebUI/venv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f37fab6836b in /home/user/rvc-test/Retrieval-based-Voice-Conversion-WebUI/venv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f38008b6fa8 in /home/user/rvc-test/Retrieval-based-Voice-Conversion-WebUI/venv/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0xdf9d4e (0x7f378a7f9d4e in /home/user/rvc-test/Retrieval-based-Voice-Conversion-WebUI/venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x4ccea6 (0x7f37c90ccea6 in /home/user/rvc-test/Retrieval-based-Voice-Conversion-WebUI/venv/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0x3ee77 (0x7f37fab83e77 in /home/user/rvc-test/Retrieval-based-Voice-Conversion-WebUI/venv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #6: c10::TensorImpl::~TensorImpl() + 0x1be (0x7f37fab7c69e in /home/user/rvc-test/Retrieval-based-Voice-Conversion-WebUI/venv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f37fab7c7b9 in /home/user/rvc-test/Retrieval-based-Voice-Conversion-WebUI/venv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #8: <unknown function> + 0x752458 (0x7f37c9352458 in /home/user/rvc-test/Retrieval-based-Voice-Conversion-WebUI/venv/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #9: THPVariable_subclass_dealloc(_object*) + 0x305 (0x7f37c93527e5 in /home/user/rvc-test/Retrieval-based-Voice-Conversion-WebUI/venv/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0x12c1dc (0x55db69db61dc in /home/user/rvc-test/Retrieval-based-Voice-Conversion-WebUI/venv/bin/python)
frame #11: <unknown function> + 0x154b6f (0x55db69ddeb6f in /home/user/rvc-test/Retrieval-based-Voice-Conversion-WebUI/venv/bin/python)
frame #12: <unknown function> + 0x167367 (0x55db69df1367 in /home/user/rvc-test/Retrieval-based-Voice-Conversion-WebUI/venv/bin/python)
frame #13: <unknown function> + 0x167394 (0x55db69df1394 in /home/user/rvc-test/Retrieval-based-Voice-Conversion-WebUI/venv/bin/python)
frame #14: <unknown function> + 0x167394 (0x55db69df1394 in /home/user/rvc-test/Retrieval-based-Voice-Conversion-WebUI/venv/bin/python)
frame #15: <unknown function> + 0x171a2c (0x55db69dfba2c in /home/user/rvc-test/Retrieval-based-Voice-Conversion-WebUI/venv/bin/python)
frame #16: <unknown function> + 0x132719 (0x55db69dbc719 in /home/user/rvc-test/Retrieval-based-Voice-Conversion-WebUI/venv/bin/python)
frame #17: <unknown function> + 0x272015 (0x55db69efc015 in /home/user/rvc-test/Retrieval-based-Voice-Conversion-WebUI/venv/bin/python)
frame #18: _PyEval_EvalFrameDefault + 0x5ae7 (0x55db69dd79e7 in /home/user/rvc-test/Retrieval-based-Voice-Conversion-WebUI/venv/bin/python)
frame #19: _PyFunction_Vectorcall + 0x79 (0x55db69de7ff9 in /home/user/rvc-test/Retrieval-based-Voice-Conversion-WebUI/venv/bin/python)
frame #20: _PyEval_EvalFrameDefault + 0x8c2 (0x55db69dd27c2 in /home/user/rvc-test/Retrieval-based-Voice-Conversion-WebUI/venv/bin/python)
frame #21: _PyFunction_Vectorcall + 0x79 (0x55db69de7ff9 in /home/user/rvc-test/Retrieval-based-Voice-Conversion-WebUI/venv/bin/python)
frame #22: _PyEval_EvalFrameDefault + 0x6d0 (0x55db69dd25d0 in /home/user/rvc-test/Retrieval-based-Voice-Conversion-WebUI/venv/bin/python)
frame #23: _PyFunction_Vectorcall + 0x79 (0x55db69de7ff9 in /home/user/rvc-test/Retrieval-based-Voice-Conversion-WebUI/venv/bin/python)
frame #24: _PyEval_EvalFrameDefault + 0x197b (0x55db69dd387b in /home/user/rvc-test/Retrieval-based-Voice-Conversion-WebUI/venv/bin/python)
frame #25: <unknown function> + 0x144cb4 (0x55db69dcecb4 in /home/user/rvc-test/Retrieval-based-Voice-Conversion-WebUI/venv/bin/python)
frame #26: PyEval_EvalCode + 0x86 (0x55db69ebb266 in /home/user/rvc-test/Retrieval-based-Voice-Conversion-WebUI/venv/bin/python)
frame #27: <unknown function> + 0x25d497 (0x55db69ee7497 in /home/user/rvc-test/Retrieval-based-Voice-Conversion-WebUI/venv/bin/python)
frame #28: <unknown function> + 0x25645e (0x55db69ee045e in /home/user/rvc-test/Retrieval-based-Voice-Conversion-WebUI/venv/bin/python)
frame #29: PyRun_StringFlags + 0x81 (0x55db69ed8a71 in /home/user/rvc-test/Retrieval-based-Voice-Conversion-WebUI/venv/bin/python)
frame #30: PyRun_SimpleStringFlags + 0x3c (0x55db69ed894c in /home/user/rvc-test/Retrieval-based-Voice-Conversion-WebUI/venv/bin/python)
frame #31: Py_RunMain + 0x377 (0x55db69ed7ae7 in /home/user/rvc-test/Retrieval-based-Voice-Conversion-WebUI/venv/bin/python)
frame #32: Py_BytesMain + 0x2b (0x55db69eaf38b in /home/user/rvc-test/Retrieval-based-Voice-Conversion-WebUI/venv/bin/python)
frame #33: <unknown function> + 0x23510 (0x7f38a3823510 in /lib/x86_64-linux-gnu/libc.so.6)
frame #34: __libc_start_main + 0x89 (0x7f38a38235c9 in /lib/x86_64-linux-gnu/libc.so.6)
frame #35: _start + 0x25 (0x55db69eaf285 in /home/user/rvc-test/Retrieval-based-Voice-Conversion-WebUI/venv/bin/python)

Traceback (most recent call last):
  File "/home/user/rvc-test/Retrieval-based-Voice-Conversion-WebUI/train_nsf_sim_cache_sid_load_pretrain.py", line 534, in <module>
    main()
  File "/home/user/rvc-test/Retrieval-based-Voice-Conversion-WebUI/train_nsf_sim_cache_sid_load_pretrain.py", line 50, in main
    mp.spawn(
  File "/home/user/rvc-test/Retrieval-based-Voice-Conversion-WebUI/venv/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 239, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/user/rvc-test/Retrieval-based-Voice-Conversion-WebUI/venv/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 197, in start_processes
    while not context.join():
  File "/home/user/rvc-test/Retrieval-based-Voice-Conversion-WebUI/venv/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/user/rvc-test/Retrieval-based-Voice-Conversion-WebUI/venv/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/home/user/rvc-test/Retrieval-based-Voice-Conversion-WebUI/train_nsf_sim_cache_sid_load_pretrain.py", line 202, in run
    train_and_evaluate(
  File "/home/user/rvc-test/Retrieval-based-Voice-Conversion-WebUI/train_nsf_sim_cache_sid_load_pretrain.py", line 389, in train_and_evaluate
    wave = commons.slice_segments(
  File "/home/user/rvc-test/Retrieval-based-Voice-Conversion-WebUI/infer_pack/commons.py", line 49, in slice_segments
    ret[i] = x[i, :, idx_str:idx_end]
RuntimeError: CUDA error: an illegal instruction was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Contributor guide

Research direction: Run training with CUDA_LAUNCH_BLOCKING=1 to get the precise location of the CUDA error. Check GPU memory usage, reduce the batch size, update PyTorch, and verify GPU stability (e.g., run a stress test). Try to reproduce on a single GPU.
Tech stack: python, pytorch
Domain: machine learning, ai
Issue type: Bug
Difficulty: 3
Estimated time: Half day
Activity status: Active
Clarity: Clear
Prerequisites: python, pytorch, cuda basics
Newbie friendliness: 40
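
The first two steps of the research direction (synchronous CUDA launches and a single-GPU reproduction) can be sketched as a small launcher helper. This is only a sketch: `debug_env` and `debug_command` are hypothetical helper names, not part of RVC. `CUDA_LAUNCH_BLOCKING` and `CUDA_VISIBLE_DEVICES` are standard CUDA environment variables, and they have to be set before the training process initializes CUDA. The arguments the training script expects are not shown in the issue, so they are left to the caller.

```python
import os
import sys


def debug_env(base_env=None, gpu_index=0):
    """Return a copy of the environment set up for synchronous, single-GPU debugging.

    Hypothetical helper for illustration; not part of the RVC code base.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Make kernel launches synchronous so the error is reported at the failing call
    # instead of at some later, unrelated API call.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    # Expose only one GPU to the process. Rerun with the other index to check
    # whether the failure follows one specific card.
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


def debug_command(script, script_args=()):
    """Build the command line that reruns `script` with the current interpreter."""
    return [sys.executable, script, *script_args]
```

One possible use is `subprocess.run(debug_command("train_nsf_sim_cache_sid_load_pretrain.py", usual_args), env=debug_env(gpu_index=0))`, where `usual_args` stands for whatever arguments the training run is normally started with. If the crash reproduces on only one of the two indices, that points toward a hardware or stability problem on that card rather than a bug in the training code.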
