AgentCoMa is an Agentic Commonsense and Math benchmark in which each compositional task requires both commonsense and mathematical reasoning to be solved. The tasks are set in real-world scenarios: housework, web shopping, science experiments, smart assistant, and travel agent. The benchmark is designed to test the mixed-type compositional reasoning abilities of LLMs. Contemporary LLMs perform well on commonsense and math reasoning in isolation, but they are far less effective at solving AgentCoMa tasks that require their composition.
We measure the compositionality gap on AgentCoMa, i.e., the difference between the proportion of samples where all individual reasoning steps are answered correctly in isolation and the accuracy on the compositional tasks, and find it to be substantial across many recently released LLMs, including reasoning models. Further details on how and why LLMs often struggle on AgentCoMa can be found in our paper.
We invite researchers to evaluate their frameworks on AgentCoMa and self-report their scores for inclusion in our leaderboard. For detailed evaluation instructions, see this README.
All steps correct = Percentage of samples where all individual reasoning steps are solved correctly in isolation.
Composition = Accuracy on the compositional questions.
Gap = Compositionality gap (the difference between 'All steps correct' and 'Composition').
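For clarity, the sketch below shows one way these three leaderboard numbers could be computed from per-sample results. It is a minimal illustration, not the official evaluation script, and it assumes two hypothetical boolean lists (`steps_correct_in_isolation` and `composition_correct`, one entry per benchmark sample).

```python
# Minimal sketch (not the official evaluation code): computing the three
# leaderboard metrics from hypothetical per-sample correctness flags.

def leaderboard_metrics(steps_correct_in_isolation, composition_correct):
    """Compute 'All steps correct', 'Composition', and 'Gap' as percentages.

    steps_correct_in_isolation: list[bool], True if every individual reasoning
        step of a sample was answered correctly when asked in isolation.
    composition_correct: list[bool], True if the full compositional question
        was answered correctly.
    """
    assert len(steps_correct_in_isolation) == len(composition_correct)
    n = len(composition_correct)

    all_steps = 100.0 * sum(steps_correct_in_isolation) / n   # 'All steps correct'
    composition = 100.0 * sum(composition_correct) / n        # 'Composition'
    gap = all_steps - composition                              # 'Gap'
    return all_steps, composition, gap


# Example with made-up numbers: all steps solved for 90% of samples,
# compositions solved for 60%, giving a 30-point compositionality gap.
if __name__ == "__main__":
    steps = [True] * 90 + [False] * 10
    comps = [True] * 60 + [False] * 40
    print(leaderboard_metrics(steps, comps))  # (90.0, 60.0, 30.0)
```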
Citation
@misc{alazraki2025agentcomacompositionalbenchmarkmixing,
  title={AgentCoMa: A Compositional Benchmark Mixing Commonsense and Mathematical Reasoning in Real-World Scenarios},
  author={Lisa Alazraki and Lihu Chen and Ana Brassard and Joe Stacey and Hossein A. Rahmani and Marek Rei},
  year={2025},
  eprint={2508.19988},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2508.19988},
}