Extensive benchmarking of reasoning models including variance · sgl-project/sglang#3725

(3 comments) (0 reactions) (1 assignee) Python (6,216 forks)

good first issue

Repository metrics

Stars: (28,442 stars)
PR merge metrics: (Avg merge 2d 1h) (1,000 merged PRs in 30d)

Description

In their R1 repo, the DeepSeek team recommends estimating pass@1 by asking the same question multiple times. We implemented that in our reasoning benchmark. In addition to the averaged accuracy, we also report the average standard error as a measure of uncertainty. Ideally that would include some plots.
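As a rough illustration of the metrics described above, here is a minimal sketch. It assumes each question's trials are stored as a list of booleans and that the standard error is the binomial standard error of each per-question mean, averaged over questions. The function name `pass_at_1_with_uncertainty` is hypothetical, and the merged benchmark code may define these quantities differently.

```python
import math
from statistics import mean


def pass_at_1_with_uncertainty(results):
    """Estimate pass@1 and its average standard error.

    results: one list of booleans per question, one entry per trial
    (assumed layout, not taken from the benchmark code).
    """
    accuracies = []
    std_errors = []
    for trials in results:
        n = len(trials)
        p = sum(trials) / n  # fraction of correct trials for this question
        accuracies.append(p)
        # Binomial standard error of the per-question mean (an assumption).
        std_errors.append(math.sqrt(p * (1 - p) / n))
    return mean(accuracies), mean(std_errors)


# Three questions, four trials each (made-up outcomes for illustration).
example = [
    [True, True, True, True],
    [True, False, True, False],
    [False, False, False, False],
]
accuracy, avg_std_error = pass_at_1_with_uncertainty(example)
print(f"pass@1 = {accuracy:.3f}, average standard error = {avg_std_error:.3f}")
```

Questions that are always or never solved contribute a standard error of zero under this definition, so the average is driven by the questions the model solves only some of the time.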

Now we want to run experiments to see how the results change under:

repeated experiments with the same hyperparameters
an increased number of trials
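The two experiments could be organised roughly as in the sketch below. It does not call the real benchmark: `simulate_run` is a hypothetical stand-in that draws each trial at random from made-up per-question solve probabilities, and `spread_across_repeats` is likewise an illustrative name. In practice, each call would be replaced by an actual benchmark run with fixed hyperparameters.

```python
import random
from statistics import mean, stdev

NUM_QUESTIONS = 30  # AIME 2024 has 30 problems


def simulate_run(num_trials, seed, solve_probs):
    """Hypothetical stand-in for one benchmark run; returns averaged accuracy."""
    rng = random.Random(seed)
    per_question = []
    for p in solve_probs:
        # Each trial is an independent draw with success probability p.
        correct = sum(rng.random() < p for _ in range(num_trials))
        per_question.append(correct / num_trials)
    return mean(per_question)


def spread_across_repeats(trial_counts, num_repeats, solve_probs):
    """Repeat the same configuration and record mean and spread of accuracy."""
    summary = {}
    for n in trial_counts:
        accuracies = [
            simulate_run(n, seed=1000 * n + r, solve_probs=solve_probs)
            for r in range(num_repeats)
        ]
        summary[n] = (mean(accuracies), stdev(accuracies))
    return summary


# Made-up per-question solve probabilities, fixed across all runs.
difficulty_rng = random.Random(0)
solve_probs = [difficulty_rng.uniform(0.2, 0.8) for _ in range(NUM_QUESTIONS)]

summary = spread_across_repeats([1, 4, 16, 64], num_repeats=20, solve_probs=solve_probs)
for n, (avg, sd) in summary.items():
    print(f"trials={n:3d}  mean accuracy={avg:.3f}  std across repeats={sd:.3f}")
```

The standard deviation across repeats is the run-to-run variability that the first experiment measures, and comparing it across trial counts addresses the second experiment.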

I think AIME 2024 is well suited to this experiment because LIMO is quite large, so experiments with a large number of trials would take a long time to run on it.

Please see the recently merged branch, which adds uncertainty measurement for reasoning models' answers, for more details and a detailed explanation of the metrics.

Feel free to reach out to me if you have further questions.

Contributor guide

Research direction: Review the recently merged PR (#3677) and the existing reasoning benchmark code to understand how to add variance metrics and plotting for multiple trials.
Tech stack: python
Domain: machine learning, backend
Issue type: Feature
Difficulty: 3
Estimated time: 1-2 days
Activity status: Active
Clarity: Clear
Prerequisites: Git, Python, Basic statistics
Newbie friendliness: 30
