AgentCoMa is an Agentic Commonsense and Math benchmark in which each compositional task requires both commonsense and mathematical reasoning to be solved. The tasks are set in real-world scenarios: housework, web shopping, science experiments, smart assistant, and travel agent. The benchmark is designed to test the mixed-type compositional reasoning abilities of LLMs. Contemporary LLMs perform well on commonsense and math reasoning in isolation, but they are far less effective at solving AgentCoMa tasks that require their composition.
We measure the compositionality gap on AgentCoMa, i.e., the difference between the proportion of samples where all individual reasoning steps are answered correctly in isolation and the accuracy on the compositional tasks, and find it to be substantial across many recently released LLMs, including reasoning models. Further details on how and why LLMs often struggle on AgentCoMa can be found in our paper.
We invite researchers to evaluate their frameworks on AgentCoMa and self-report their scores for inclusion in our leaderboard. For detailed evaluation instructions, see this README.
All steps correct = Percentage of samples where all individual reasoning steps are solved correctly in isolation.
Composition = Accuracy on the compositional questions.
Gap = Compositionality gap (the difference between 'All steps correct' and 'Composition').
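For clarity, the sketch below shows one way these three leaderboard numbers could be computed from per-sample results. It is a minimal illustration, not the official evaluation script, and it assumes two hypothetical boolean lists (`steps_correct_in_isolation` and `composition_correct`, one entry per benchmark sample).

```python
# Minimal sketch (not the official evaluation code): computing the three
# leaderboard metrics from hypothetical per-sample correctness flags.

def leaderboard_metrics(steps_correct_in_isolation, composition_correct):
    """Compute 'All steps correct', 'Composition', and 'Gap' as percentages.

    steps_correct_in_isolation: list[bool], True if every individual reasoning
        step of a sample was answered correctly when asked in isolation.
    composition_correct: list[bool], True if the full compositional question
        was answered correctly.
    """
    assert len(steps_correct_in_isolation) == len(composition_correct)
    n = len(composition_correct)

    all_steps = 100.0 * sum(steps_correct_in_isolation) / n   # 'All steps correct'
    composition = 100.0 * sum(composition_correct) / n        # 'Composition'
    gap = all_steps - composition                              # 'Gap'
    return all_steps, composition, gap


# Example with made-up numbers: all steps solved for 90% of samples,
# compositions solved for 60%, giving a 30-point compositionality gap.
if __name__ == "__main__":
    steps = [True] * 90 + [False] * 10
    comps = [True] * 60 + [False] * 40
    print(leaderboard_metrics(steps, comps))  # (90.0, 60.0, 30.0)
```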
Citation
@misc{alazraki2025agentcomacompositionalbenchmarkmixing,
  title={AgentCoMa: A Compositional Benchmark Mixing Commonsense and Mathematical Reasoning in Real-World Scenarios},
  author={Lisa Alazraki and Lihu Chen and Ana Brassard and Joe Stacey and Hossein A. Rahmani and Marek Rei},
  year={2025},
  eprint={2508.19988},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2508.19988},
}