h100: Worse output & 20x slower inference? · yl4579/StyleTTS2#89


help wanted


We're testing finetuning on an h100 and 4090, here are the results:

Almost identical finetune, but the h100 output is SIGNIFICANTLY worse. It isn't a config issue, and we've replicated it twice with LJSpeech as well.

The 4090 is also faster during training and considerably faster during inference, almost 20x faster than the h100:

Screenshot_2023-11-26_at_5 01 16_PM

h100:

Screenshot_2023-11-24_at_3 46 08_PM

And during training, one epoch took the 4090 about 3 minutes, while the h100 took 4.12 minutes.
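For anyone trying to reproduce the timing gap, a minimal sketch of a timing helper is below. It is not part of the StyleTTS2 repo; `benchmark` and `_sync` are hypothetical names. The idea is to warm up first and to call `torch.cuda.synchronize()` before reading the clock, since CUDA kernels launch asynchronously and unsynchronized timings can be misleading. It falls back to plain CPU timing if PyTorch is not installed.

```python
import statistics
import time

try:
    import torch  # optional; without it the helper only measures CPU wall-clock time
except ImportError:
    torch = None


def _sync():
    """Block until queued CUDA work finishes, if a CUDA device is available."""
    if torch is not None and torch.cuda.is_available():
        torch.cuda.synchronize()


def benchmark(fn, warmup=3, iters=10):
    """Return (median, min, max) wall-clock seconds per call of fn()."""
    # Warm-up calls absorb one-time costs such as cuDNN autotuning and lazy init.
    for _ in range(warmup):
        fn()
    _sync()

    times = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        _sync()  # make sure the GPU has actually finished before stopping the clock
        times.append(time.perf_counter() - start)
    return statistics.median(times), min(times), max(times)
```

Wrapping the same inference call (same text, same checkpoint, same number of diffusion steps) in `benchmark(lambda: ...)` on both cards should give numbers that are easier to compare than a single run.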

Does anyone know what could be going on here? Never seen an issue like this on an h100 before with a diffusion-like model. Thanks

Research direction: Compare GPU configurations, CUDA versions, and PyTorch builds between the H100 and the 4090. Profile GPU utilization and memory bandwidth using nvidia-smi and the PyTorch profiler. Check whether the model is using Tensor Cores, or whether a misconfiguration such as the data type (FP16 vs FP32) is causing the slowdown.
Tech stack: Python, PyTorch
Domain: machine learning, AI
Issue type: Bug
Difficulty: 3
Estimated time: 3-5 days
Activity status: Active
Clarity: Mostly clear
Prerequisites: Python, PyTorch
Newbie friendliness: 30
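
As a starting point for the comparison suggested in the research direction, here is a rough sketch that dumps the relevant PyTorch/CUDA settings so the two machines can be diffed side by side. `environment_report` is a hypothetical helper, not something from the repo. The `native_kernels` check is only a heuristic and an assumption on my part: if the device's compute capability (sm_90 for the H100, sm_89 for the 4090) is missing from the architectures the installed PyTorch build was compiled for, kernels may be running through a slower fallback path.

```python
import platform


def environment_report():
    """Collect PyTorch/CUDA build and runtime settings into a plain dict."""
    report = {"python": platform.python_version()}
    try:
        import torch
    except ImportError:
        report["torch"] = None  # PyTorch not installed on this machine
        return report

    report["torch"] = torch.__version__
    report["cuda_build"] = torch.version.cuda          # CUDA version PyTorch was built with
    report["cudnn"] = torch.backends.cudnn.version()   # None if cuDNN is unavailable
    report["tf32_matmul"] = torch.backends.cuda.matmul.allow_tf32
    report["tf32_cudnn"] = torch.backends.cudnn.allow_tf32
    report["cudnn_benchmark"] = torch.backends.cudnn.benchmark

    if torch.cuda.is_available():
        report["device"] = torch.cuda.get_device_name(0)
        major, minor = torch.cuda.get_device_capability(0)
        report["capability"] = f"sm_{major}{minor}"
        report["arch_list"] = torch.cuda.get_arch_list()
        # Heuristic only: True if this build ships kernels for this exact architecture.
        report["native_kernels"] = report["capability"] in report["arch_list"]
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

Running this on both boxes and comparing the output line by line should show quickly whether the two setups differ in PyTorch build, CUDA version, or TF32/cuDNN flags before digging into the profiler.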