🏆 CRUXEval-X Leaderboard 🏆
CRUXEval-X evaluates the execution and reasoning ability of LLMs.
📝 Notes
- Thank you for your interest in CRUXEval-X. We warmly welcome researchers to submit additional benchmarking results; we believe collaborative efforts can significantly advance the study of Large Language Models and software engineering.
- Models are ranked according to pass@1 using greedy decoding (see the sketch after these notes).
- It is the model providers' responsibility to avoid data contamination. Models trained on closed data may be affected by contamination.
- "Size" here is the amount of activated model weight during inference.
🤗 Acknowledgement
Thanks to EvalPlus for sharing the leaderboard template. In addition to the CRUXEval-X leaderboard, it is recommended to assess LLM coding ability comprehensively through a diverse set of benchmarks and leaderboards, such as:
- EvalPlus Leaderboard
- JavaBench
- BigCodeBench
- Big Code Models Leaderboard
- Chatbot Arena Leaderboard
- CrossCodeEval
- ClassEval
- CRUXEval
- Code Lingua
- Evo-Eval
- EffiBench
- HumanEval.jl - Julia version of HumanEval with EvalPlus test cases
- LiveCodeBench
- MHPP
- NaturalCodeBench
- RepoBench
- SWE-bench
- TabbyML Leaderboard
- TestEval