🏆 CRUXEval-X Leaderboard 🏆
CRUXEval-X evaluates the execution and reasoning ability of LLMs.
📝 Notes
- Thank you for your interest in CRUXEval-X. We warmly welcome researchers to submit additional benchmarking results; we believe collaborative efforts can significantly advance the study of Large Language Models and software engineering.
- Models are ranked according to pass@1 using greedy decoding (see the sketch after these notes).
- It is the model providers' responsibility to avoid data contamination. Models trained on closed data may be affected by contamination.
- "Size" here is the amount of activated model weight during inference.
🤗 Acknowledgement
Thanks to EvalPlus for sharing the leaderboard template. In addition to the CRUXEval-X leaderboard, it is recommended to assess LLM coding ability comprehensively through a diverse set of benchmarks and leaderboards, such as:
- EvalPlus Leaderboard
- JavaBench
- BigCodeBench
- Big Code Models Leaderboard
- Chatbot Arena Leaderboard
- CrossCodeEval
- ClassEval
- CRUXEval
- Code Lingua
- Evo-Eval
- EffiBench
- HumanEval.jl - Julia version of HumanEval with EvalPlus test cases
- LiveCodeBench
- MHPP
- NaturalCodeBench
- RepoBench
- SWE-bench
- TabbyML Leaderboard
- TestEval