🏆 CRUXEval-X Leaderboard 🏆

CRUXEval-X evaluates the code execution and reasoning abilities of LLMs.


📝 Notes

  1. Thank you for your interest in CRUXEval-X. We warmly welcome researchers to submit additional benchmarking results, as we believe that collaborative efforts can significantly advance the study of Large Language Models and software engineering.
  2. Models are ranked according to pass@1 using greedy decoding (a minimal computation sketch follows these notes).
  3. Model providers are responsible for avoiding data contamination; models trained on closed (non-public) data may be affected by contamination.
  4. "Size" here refers to the number of model parameters activated during inference.

🤗 Acknowledgement

Thanks to EvalPlus for sharing the leaderboard template. In addition to the CRUXEval-X leaderboard, we recommend building a comprehensive understanding of LLM coding ability through a diverse set of benchmarks and leaderboards, such as: