AgentCoMa

A Compositional Benchmark Mixing Commonsense and Mathematical Reasoning in Real-World Scenarios
Lisa Alazraki, Lihu Chen, Ana Brassard, Joe Stacey, Hossein A. Rahmani, Marek Rei
Download the data from Hugging Face (requires an HF login and sharing your email and username). Evaluated on AgentCoMa in your paper? Self-report your results to be added to the leaderboard.

AgentCoMa is an Agentic Commonsense and Math benchmark where each compositional task requires both commonsense and mathematical reasoning to be solved. The tasks are set in real-world scenarios: house working, web shopping, science experiments, smart assistant and travel agent. The benchmark is designed to test the mixed-type compositional reasoning abilities of LLMs. Contemporary LLMs perform well on commonsense and math reasoning in isolation, but are far less effective at solving AgentCoMa tasks that require their composition.

We measure the compositionality gap on AgentCoMa, i.e., the difference between the accuracy on the compositional tasks and the proportion of samples where all individual reasoning steps are answered correctly in isolation. We find this gap to be substantial across many recently released LLMs, including reasoning models. Further details on how and why LLMs often struggle on AgentCoMa can be found in our paper.
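As a rough formalization of the definition above, the gap can be written as follows. The symbols are introduced here for illustration and are not taken from the paper: $N$ is the number of samples, $c_i \in \{0,1\}$ indicates whether compositional question $i$ is answered correctly, and $s_{i,k} \in \{0,1\}$ indicates whether reasoning step $k$ of sample $i$ is answered correctly in isolation. The sign convention (composition minus all-steps) is inferred from the leaderboard, where gaps are reported as negative numbers.

```latex
\[
\mathrm{Acc}_{\text{comp}} = \frac{1}{N}\sum_{i=1}^{N} c_i,
\qquad
\mathrm{Acc}_{\text{steps}} = \frac{1}{N}\sum_{i=1}^{N} \prod_{k=1}^{K_i} s_{i,k},
\qquad
\mathrm{Gap} = \mathrm{Acc}_{\text{comp}} - \mathrm{Acc}_{\text{steps}}
\]
```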

We invite researchers to evaluate their frameworks on AgentCoMa and self-report their scores for inclusion in our leaderboard. For detailed evaluation instructions, see this README.

Leaderboard

All steps correct = percentage of samples where all individual reasoning steps are solved correctly in isolation. Composition = accuracy on the compositional questions. Gap = compositionality gap, reported as 'Composition' minus 'All steps correct' (a negative value means the model does worse on the composed question than on its individual steps).

| Paper | Base model | All steps correct | Composition | Gap | Reported on |
| --- | --- | --- | --- | --- | --- |
| AgentCoMa: A Compositional Benchmark Mixing Commonsense and Mathematical Reasoning in Real-World Scenarios | Phi4 Mini 3.8B IT | 66.1% | 35.6% | -30.5% | 2025-08-10 |
| AgentCoMa: A Compositional Benchmark Mixing Commonsense and Mathematical Reasoning in Real-World Scenarios | Llama3.1 8B Instruct | 68.3% | 33.3% | -35.0% | 2025-08-10 |
| AgentCoMa: A Compositional Benchmark Mixing Commonsense and Mathematical Reasoning in Real-World Scenarios | Qwen3 14B | 88.9% | 60.6% | -28.6% | 2025-08-10 |
| AgentCoMa: A Compositional Benchmark Mixing Commonsense and Mathematical Reasoning in Real-World Scenarios | Phi3.5 MoE 42B IT | 87.2% | 61.7% | -25.5% | 2025-08-10 |
| AgentCoMa: A Compositional Benchmark Mixing Commonsense and Mathematical Reasoning in Real-World Scenarios | Llama3.3 70B Instruct | 90.0% | 73.3% | -16.7% | 2025-08-10 |
| AgentCoMa: A Compositional Benchmark Mixing Commonsense and Mathematical Reasoning in Real-World Scenarios | Mixtral MoE 141B | 90.6% | 66.1% | -24.5% | 2025-08-10 |
| AgentCoMa: A Compositional Benchmark Mixing Commonsense and Mathematical Reasoning in Real-World Scenarios | Art 3B | 57.8% | 33.9% | -23.9% | 2025-08-10 |
| AgentCoMa: A Compositional Benchmark Mixing Commonsense and Mathematical Reasoning in Real-World Scenarios | DeepSeekR1 8B | 72.2% | 34.4% | -37.8% | 2025-08-10 |
| AgentCoMa: A Compositional Benchmark Mixing Commonsense and Mathematical Reasoning in Real-World Scenarios | Phi4 Reasoning 14.7B | 91.7% | 62.2% | -29.5% | 2025-08-10 |
| AgentCoMa: A Compositional Benchmark Mixing Commonsense and Mathematical Reasoning in Real-World Scenarios | DeepSeekR1 32B | 90.0% | 60.0% | -30.0% | 2025-08-10 |
| AgentCoMa: A Compositional Benchmark Mixing Commonsense and Mathematical Reasoning in Real-World Scenarios | Reflection 70B | 82.2% | 65.6% | -16.6% | 2025-08-10 |
| AgentCoMa: A Compositional Benchmark Mixing Commonsense and Mathematical Reasoning in Real-World Scenarios | GeneralReasoner 4B | 73.9% | 36.7% | -37.2% | 2025-08-10 |
| AgentCoMa: A Compositional Benchmark Mixing Commonsense and Mathematical Reasoning in Real-World Scenarios | SimpleRL 8B | 56.7% | 25.0% | -31.7% | 2025-08-10 |
| AgentCoMa: A Compositional Benchmark Mixing Commonsense and Mathematical Reasoning in Real-World Scenarios | GeneralReasoner 14B | 80.0% | 46.1% | -33.9% | 2025-08-10 |
| AgentCoMa: A Compositional Benchmark Mixing Commonsense and Mathematical Reasoning in Real-World Scenarios | SimpleRL 32B | 93.9% | 66.7% | -27.2% | 2025-08-10 |
| AgentCoMa: A Compositional Benchmark Mixing Commonsense and Mathematical Reasoning in Real-World Scenarios | QVQ 72B | 87.8% | 56.7% | -31.1% | 2025-08-10 |
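For anyone preparing a self-reported entry, the three leaderboard columns could be computed from per-sample results along these lines. This is a minimal sketch, not the official evaluation script: `SampleResult` and `leaderboard_metrics` are hypothetical names, and it assumes you have already judged each isolated step and each compositional answer as correct or incorrect (see the README for the actual evaluation procedure).

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass
class SampleResult:
    """Hypothetical per-sample record (not part of the AgentCoMa release)."""
    steps_correct: Sequence[bool]   # one flag per reasoning step asked in isolation
    composition_correct: bool       # flag for the full compositional question


def leaderboard_metrics(results: Sequence[SampleResult]) -> Dict[str, float]:
    """Return 'All steps correct', 'Composition' and 'Gap' as percentages."""
    if not results:
        raise ValueError("need at least one sample")
    n = len(results)
    # A sample counts only if every one of its isolated steps is correct.
    all_steps = sum(all(r.steps_correct) for r in results) / n
    composition = sum(r.composition_correct for r in results) / n
    return {
        "all_steps_correct": 100 * all_steps,
        "composition": 100 * composition,
        # Sign convention assumed from the leaderboard: composition minus all-steps.
        "gap": 100 * (composition - all_steps),
    }


# Toy data for illustration only; these are not benchmark results.
toy = [
    SampleResult([True, True], True),
    SampleResult([True, True], False),
    SampleResult([True, False], False),
    SampleResult([True, True], False),
]
print(leaderboard_metrics(toy))
```

On the toy data, three of four samples have all steps correct but only one compositional answer is right, so the gap comes out at -50 percentage points.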

Citation


@misc{alazraki2025agentcomacompositionalbenchmarkmixing,
      title={AgentCoMa: A Compositional Benchmark Mixing Commonsense and Mathematical Reasoning in Real-World Scenarios}, 
      author={Lisa Alazraki and Lihu Chen and Ana Brassard and Joe Stacey and Hossein A. Rahmani and Marek Rei},
      year={2025},
      eprint={2508.19988},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2508.19988}, 
}
  