Performance issues compared to Pytorch · huggingface/candle#1139

Repository metrics

Stars: 19,476
PR merge metrics: average merge time 14d 18h; 27 PRs merged in the last 30 days

Description

Hello. I mentioned this in the Discord and worked with a member to make sure I wasn't doing anything dumb. I tested a release build of my candle code with cuDNN enabled against equivalent PyTorch code, and candle is about 4x slower.

I have attached the code I used for the comparison. It contains both the original Python/PyTorch implementation of the RealESRGAN RRDBNet arch and my candle implementation.

I'm limited on time, or I would have set up a proper repo for this with a script/program that runs both tests automatically, but this is the best I can do at the moment. To use either script, you'll probably have to adjust the paths in it to match the location of the model and your test images (I did not include test images). I recommend trying ~10 smallish images (128x for example).
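For anyone who wants to automate the comparison, here is a minimal sketch of a timing harness that keeps warm-up runs out of the measurement. It uses only the Rust standard library; `bench` and `median` are hypothetical helper names, not part of candle or of the attached code. It also assumes the closure you pass in does not return until the GPU work has finished (for example because it copies the output back to the host). Otherwise asynchronous kernel launches can make the timings look shorter than they are. On the PyTorch side, calling `torch.cuda.synchronize()` before reading the clock serves the same purpose.

```rust
use std::time::{Duration, Instant};

/// Runs `f` `warmup` times without timing, then `iters` times with timing.
/// Returns one wall-clock duration per timed call.
/// Assumption: `f` blocks until all of its work (including GPU work) is done.
pub fn bench<F: FnMut()>(warmup: usize, iters: usize, mut f: F) -> Vec<Duration> {
    // Warm-up calls absorb one-time costs such as kernel loading or autotuning.
    for _ in 0..warmup {
        f();
    }
    let mut samples = Vec::with_capacity(iters);
    for _ in 0..iters {
        let start = Instant::now();
        f();
        samples.push(start.elapsed());
    }
    samples
}

/// Returns the median sample (the upper median for an even count),
/// or `None` if there are no samples.
pub fn median(samples: &[Duration]) -> Option<Duration> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort();
    Some(sorted[sorted.len() / 2])
}
```

A call such as `bench(1, 10, || { /* run the model on one image */ })` followed by `median(&samples)` gives one steady-state number per framework, which is less sensitive to a single slow run than a mean.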

Context from discord: https://discord.com/channels/879548962464493619/1136218819447238726/1164985040854339736

Code: rust_candle_test.zip

And the model (had to upload to drive) https://drive.google.com/file/d/1AyvArWkR3qonMV2pBtk3zDkct0yJrh5Z/view?usp=sharing

For reference, here are the results from my benchmark:

PyTorch:

Model took 323.999ms // First run, takes a long time
Saved 00000.png
Model took 35.4905ms // Second run, and subsequent runs after, take significantly less
Saved 00001.png

Candle:

Model took 262.1319ms // First run, takes a while but less time than torch
Saved 00000.png
Model took 124.97ms // Second run, and subsequent runs after, takes less but still far more than pytorch
Saved 00001.png

Please let me know if you need or want any more information. Candle is a very interesting project and seems very promising. It just doesn't currently seem to be as optimized as PyTorch.

Contributor guide

Research direction: Compare the candle and PyTorch implementations to identify performance bottlenecks. Focus on kernel launch overhead, memory allocation patterns, and operation fusion. Check if the cuDNN graph is being used effectively in candle.
Tech stack: Rust
Domain: machine learning, AI
Issue type: Performance
Difficulty: 3
Estimated time: Half day
Activity status: Active
Clarity: Clear
Prerequisites: Rust, PyTorch, Deep Learning
Newbie friendliness: 40
