Speed up RNNT model inference using TRT · NVIDIA-NeMo/NeMo#14531

1 comment · 0 reactions · 1 assignee · Python · 3,421 forks

Labels: ASR, community-request, help wanted, waiting-on-customer

Repository metrics

Stars: 17,298
PR merge metrics: average merge time 12 days; 49 PRs merged in the last 30 days

Description

Hi,

I previously trained an RNNT model and now want to accelerate it by converting it to TensorRT. I’ve exported the model to ONNX and have encoder.onnx and decoder.onnx.

I’m using the TensorRT 25.03 Docker image and trtexec to convert the models. The decoder works fine with --fp16, but when I use --fp16 for the encoder, some outputs return NaN and the results are incorrect.

Has anyone encountered this issue, or does anyone know how to fix it?

Are there any methods to accelerate RNNT model inference?
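
For readers wondering how an FP16 build can return NaN at all, here is a minimal sketch of one common mechanism. It is an illustration only, not a diagnosis of this particular encoder. FP16 tops out at 65504, so a large intermediate value (for example an attention score) overflows to infinity, and the usual max-subtracted softmax then computes inf - inf, which is NaN. The helper names `to_fp16` and `softmax_fp16` are made up for this example, and FP16 rounding is emulated with Python's `struct` half-precision format rather than with TensorRT.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest FP16 value, overflowing to +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses values beyond the FP16 range; hardware would give inf.
        return math.copysign(math.inf, x)


def softmax_fp16(scores):
    """Max-subtracted softmax with every intermediate rounded to FP16."""
    scores = [to_fp16(s) for s in scores]
    top = max(scores)
    exps = [to_fp16(math.exp(to_fp16(s - top))) for s in scores]
    total = to_fp16(sum(exps))
    return [to_fp16(e / total) for e in exps]


# Scores inside the FP16 range behave normally.
print(softmax_fp16([10.0, 5.0]))

# 70000 exceeds the FP16 maximum (65504), becomes inf, and inf - inf is NaN.
print(softmax_fp16([70000.0, 5.0]))
```

If something like this is happening in the encoder, the usual remedy is to keep the overflowing layers in FP32 while the rest of the network runs in FP16.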

Contributor guide

Research direction: Investigate why FP16 inference produces NaN outputs for the encoder model. Check the TensorRT build logs for warnings, verify that each operator supports FP16, and consider INT8 quantization or trtexec's --best flag, which enables all precisions. Also examine the ONNX export for ops that may be problematic in FP16.
Tech stack: Python
Domain: AI, performance
Issue type: Bug
Difficulty: 3
Estimated time: 1-2 days
Activity status: Active
Clarity: Clear
Prerequisites: Python, TensorRT, ONNX
Newbie friendliness: 40
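
One way to act on the research direction above is to rebuild the encoder in FP16 while pinning suspect layers to FP32. The sketch below only assembles a trtexec command line and does not run it. The flags `--precisionConstraints`, `--layerPrecisions`, and `--layerOutputTypes` are taken from the trtexec help text as I understand it, so please confirm them with `trtexec --help` inside your container. The layer names and the `encoder.plan` output file are hypothetical placeholders; the real layer names have to come from your own build log or ONNX graph.

```python
import shlex


def build_trtexec_cmd(onnx_path, engine_path, fp32_layers=()):
    """Assemble a trtexec command: FP16 by default, listed layers kept in FP32."""
    cmd = [
        "trtexec",
        f"--onnx={onnx_path}",
        f"--saveEngine={engine_path}",
        "--fp16",
    ]
    if fp32_layers:
        # Spec format assumed here: "layerName:precision" pairs joined by commas.
        spec = ",".join(f"{name}:fp32" for name in fp32_layers)
        cmd += [
            "--precisionConstraints=obey",
            f"--layerPrecisions={spec}",
            f"--layerOutputTypes={spec}",
        ]
    return cmd


# Hypothetical layer names, for illustration only.
suspect_layers = [
    "/layers.0/self_attn/Softmax",
    "/layers.0/norm/ReduceMean",
]
cmd = build_trtexec_cmd("encoder.onnx", "encoder.plan", suspect_layers)
print(shlex.join(cmd))
```

A practical way to find the layers to pin is to start with a large set in FP32, confirm the NaNs disappear, then shrink the set until they come back. Layer names containing a colon or comma would collide with the spec syntax assumed here and would need separate handling.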
