Extensive benchmarking of reasoning models including variance · sgl-project/sglang#3725

(3 comments) (0 reactions) (1 assignee) Python (6,216 forks)

good first issue

Repository metrics

Stars: (28,442 stars)
PR merge metrics: (average time to merge: 2 days 1 hour) (1,000 PRs merged in the last 30 days)

Description

In their R1 repo, the DeepSeek team recommends estimating PASS@1 by asking the same question multiple times. We implemented that in our reasoning benchmark. In addition to the averaged accuracy, we also report the average standard error as a measure of uncertainty. Ideally this would also include some plots.
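As a rough illustration of the two numbers mentioned above, here is a minimal sketch. It assumes each question has a list of 0/1 outcomes (one per trial), that the per-question standard error is the sample standard deviation divided by the square root of the number of trials, and that the reported value is the mean over questions. The function name `pass_at_1_with_uncertainty` is hypothetical, and the benchmark in the repo may define the metric differently.

```python
import math
from statistics import mean, stdev


def pass_at_1_with_uncertainty(results):
    """Estimate PASS@1 and an averaged standard error.

    `results` is a list with one entry per question; each entry is a list
    of 0/1 outcomes, one per trial (an assumed layout for this sketch).
    """
    per_question_acc = []
    per_question_se = []
    for trials in results:
        n = len(trials)
        # Fraction of trials that produced a correct answer.
        per_question_acc.append(sum(trials) / n)
        # Sample standard deviation needs at least two trials.
        sd = stdev(trials) if n > 1 else 0.0
        # Standard error of the per-question pass rate.
        per_question_se.append(sd / math.sqrt(n))
    return mean(per_question_acc), mean(per_question_se)


if __name__ == "__main__":
    # Three questions, four trials each (made-up outcomes).
    outcomes = [[1, 1, 1, 1], [1, 0, 1, 0], [0, 0, 0, 0]]
    acc, se = pass_at_1_with_uncertainty(outcomes)
    print(f"PASS@1 estimate: {acc:.3f}, average standard error: {se:.3f}")
```

Questions that are always or never solved contribute zero standard error here, so the average is driven by the questions the model answers inconsistently.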

Now we want to run experiments to see how the results change under

repeated experiments with the same hyperparameters
an increased number of trials
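To get a feel for what these two experiments might show before spending GPU time, one option is a small simulation. The sketch below is not the benchmark itself: it assumes each question has a fixed, hypothetical success probability, draws independent trials, and measures how much the averaged accuracy varies across repeated runs for different trial counts. The names `simulate_accuracy` and `run_to_run_spread` are made up for this example.

```python
import random
from statistics import mean, stdev


def simulate_accuracy(probs, n_trials, rng):
    """One simulated run: mean over questions of the per-question pass rate."""
    rates = []
    for p in probs:
        passes = sum(rng.random() < p for _ in range(n_trials))
        rates.append(passes / n_trials)
    return mean(rates)


def run_to_run_spread(probs, n_trials, n_repeats=200, seed=0):
    """Repeat the same simulated run and return (mean, std) of its accuracy."""
    rng = random.Random(seed)
    accs = [simulate_accuracy(probs, n_trials, rng) for _ in range(n_repeats)]
    return mean(accs), stdev(accs)


if __name__ == "__main__":
    # 30 questions (the size of AIME 2024), each with an assumed 50% pass chance.
    probs = [0.5] * 30
    for n_trials in (1, 4, 16, 64):
        avg, spread = run_to_run_spread(probs, n_trials)
        print(f"trials={n_trials:3d}  mean acc={avg:.3f}  run-to-run std={spread:.3f}")
```

Under these assumptions the run-to-run spread shrinks roughly with the square root of the number of trials. The real experiments would show whether the benchmark behaves the same way.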

I think AIME 2024 is suited to this experiment because LIMO is quite large, and running experiments with a large number of trials on it would take a long time.

Please see the recently merged branch that adds uncertainty measurement for reasoning model answers for more details and a detailed explanation of the metrics.

Feel free to reach out to me if you have further questions.

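Contributor guide

Research direction: Review the recently merged PR (#3677) and the existing reasoning benchmark code to understand how to add variance metrics and plots for multiple trials.
Tech stack: python
Domain: machine learning, backend
Issue type: Feature
Difficulty: 3
Estimated time: 1-2 days
Activity status: Active
Clarity: Clear
Prerequisites: Git, Python, basic statistics
Beginner friendliness: 30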
