Extensive benchmarking of reasoning models including variance · sgl-project/sglang#3725

(3 comments) (0 reactions) (1 assignee) Python (6216 forks)

good first issue

Repository metrics

Stars: (28,442 stars)
PR merge metrics: (Average merge time 2d 1h) (1000 PRs merged in 30 days)

Description

In their R1 repo, the DeepSeek team recommends estimating pass@1 by asking the same question multiple times. We implemented that in our reasoning benchmark. In addition to the averaged accuracy, we also report the average standard error as a measure of uncertainty. Ideally, this would also include some plots.
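As a rough illustration of the two metrics mentioned above, here is a minimal sketch. It assumes each question is asked the same number of times and that the per-question standard error is the standard error of a Bernoulli mean, averaged over questions. The function name and the input layout are hypothetical, and the merged benchmark may define the metrics differently.

```python
import math


def pass_at_1_with_uncertainty(correct):
    """Estimate pass@1 and an average standard error.

    correct[q][t] is True if trial t of question q was answered correctly.
    This layout is an assumption made for the sketch.
    """
    per_question_acc = []
    per_question_se = []
    for trials in correct:
        n = len(trials)
        # Fraction of correct trials: the per-question pass@1 estimate.
        p = sum(trials) / n
        per_question_acc.append(p)
        # Standard error of a Bernoulli mean (assumed definition).
        per_question_se.append(math.sqrt(p * (1 - p) / n))
    accuracy = sum(per_question_acc) / len(per_question_acc)
    mean_se = sum(per_question_se) / len(per_question_se)
    return accuracy, mean_se


# Toy input for illustration only, not real model output:
# 3 questions, 4 trials each.
toy = [
    [True, True, True, True],
    [True, False, True, False],
    [False, False, False, False],
]
accuracy, mean_se = pass_at_1_with_uncertainty(toy)
print(f"pass@1 = {accuracy:.3f}, mean standard error = {mean_se:.3f}")
```

Questions that are always right or always wrong contribute a standard error of zero under this definition, so the average is driven by the questions the model answers inconsistently.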

Next, we want to run experiments to see how the results change under:

repeated experiments with the same hyperparameters
an increased number of trials
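The two experiments could be organized along the following lines. This is only a sketch: `run_once` is a hypothetical callable standing in for one full benchmark run that returns the averaged accuracy, and `simulated_run` is a synthetic stand-in that draws random outcomes so the harness can be tried without a model. Its question count and success probability are placeholders, not measured values.

```python
import random
import statistics


def spread_over_repeats(run_once, trial_counts, repeats):
    """For each number of trials, repeat the run and summarize the accuracies.

    run_once(num_trials=..., seed=...) must return one accuracy in [0, 1].
    Returns {num_trials: (mean accuracy, standard deviation across repeats)}.
    """
    summary = {}
    for n in trial_counts:
        accs = [run_once(num_trials=n, seed=r) for r in range(repeats)]
        summary[n] = (statistics.mean(accs), statistics.stdev(accs))
    return summary


def simulated_run(num_trials, seed, num_questions=30, p_correct=0.5):
    """Synthetic stand-in for a benchmark run (placeholder values)."""
    # Mix the trial count into the seed so runs with different
    # trial counts do not share the same random stream.
    rng = random.Random(seed * 1000 + num_trials)
    total = num_questions * num_trials
    hits = sum(rng.random() < p_correct for _ in range(total))
    return hits / total


summary = spread_over_repeats(simulated_run, trial_counts=[1, 4, 16, 64], repeats=10)
for n, (mean_acc, std_acc) in summary.items():
    print(f"trials={n:3d}  mean accuracy={mean_acc:.3f}  std across repeats={std_acc:.3f}")
```

Replacing `simulated_run` with a wrapper around the real benchmark would give the spread across repeated runs at each trial count, which is also the data a plot would need.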

I think AIME 2024 is well suited to this experiment because LIMO is quite large, and experiments with a large number of trials would take a long time to run.

Please see the recently merged branch, which adds a measurement of uncertainty for reasoning models' answers, for more details and a detailed explanation of the metrics.

Feel free to reach out to me if you have further questions.

Contributor guide

Research direction: Review the recently merged PR (#3677) and the existing reasoning benchmark code to understand how to add variance metrics and plots for multiple trials.
Tech stack: Python
Domain: machine learning, backend
Issue type: Feature
Difficulty: 3
Estimated time: 1-2 days
Activity status: Active
Clarity: Clear
Prerequisites: Git, Python, basic statistics
Beginner-friendly: 30
