Description
We're testing finetuning on an H100 and a 4090. Here are the results:
4090: https://voca.ro/11mtxzLHzzih
H100: https://voca.ro/15QldVjuG7nu
The finetunes are almost identical, but the H100 output is SIGNIFICANTLY worse. It isn't a config issue, and we've replicated it twice with LJSpeech as well.
The 4090 is also faster during training and considerably faster during inference, almost 20x faster than the H100.
And during training, one epoch took the 4090 about 3 minutes, while the H100 took 4.12 minutes.
Does anyone know what could be going on here? Never seen an issue like this on an H100 before with a diffusion-like model. Thanks
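
In case it helps anyone narrow this down, here is a rough sketch for dumping the environment on each machine and diffing the two dumps. It assumes the finetune runs on PyTorch (not stated above), and the helper names `environment_fingerprint` and `diff_fingerprints` are made up for illustration. The fields it records (library versions, TF32 flags, cuDNN benchmark mode) are only the usual things to rule out when two GPUs disagree, not a confirmed cause.

```python
import json
import platform


def environment_fingerprint():
    """Collect settings that commonly differ between two GPU machines."""
    info = {
        "python": platform.python_version(),
        "platform": platform.platform(),
    }
    try:
        import torch  # assumption: the training stack is PyTorch
    except ImportError:
        info["torch"] = None
        return info

    info["torch"] = torch.__version__
    info["cuda_build"] = torch.version.cuda
    info["cudnn"] = torch.backends.cudnn.version()
    # TF32 and cuDNN autotuning change numerics and speed per GPU generation.
    info["tf32_matmul"] = torch.backends.cuda.matmul.allow_tf32
    info["tf32_cudnn"] = torch.backends.cudnn.allow_tf32
    info["cudnn_benchmark"] = torch.backends.cudnn.benchmark
    if torch.cuda.is_available():
        info["gpu"] = torch.cuda.get_device_name(0)
        info["compute_capability"] = list(torch.cuda.get_device_capability(0))
    return info


def diff_fingerprints(a, b):
    """Return {key: (value_a, value_b)} for every key whose values differ."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


if __name__ == "__main__":
    # Run on each machine, save the JSON, then compare the two files.
    print(json.dumps(environment_fingerprint(), indent=2))
```

Running the script on both boxes and passing the two loaded JSON dicts to `diff_fingerprints` should show only the GPU name and compute capability if the software stacks really match. Anything else in the diff is worth a look.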