
MCPMark: New Standard for Evaluating AI Agents

If you, like me, wonder how well LLMs work with MCP and how reliably they carry out the tasks you assign, a new study called MCPMark is about exactly that. The research pits our illusions about artificial intelligence against harsh reality.

Why Existing Tests Don’t Work

Imagine evaluating someone's ability to work as a programmer by only asking them to read documentation. Absurd, right? Yet that is exactly how most existing benchmarks for AI agents work.

Researchers from the National University of Singapore, EvalSys, and other organizations noticed a critical problem: modern tests for evaluating AI agents’ work with Model Context Protocol (MCP) remain narrow and unrealistic. They either focus on tasks that only require reading information or offer interactions with minimal depth.

It’s like testing driving skills by having a person only sit in the passenger seat and describe what they see through the window.

What is MCPMark and Why It’s a Game Changer

MCPMark is not just another benchmark. It’s a stress test designed to evaluate AI agents in conditions as close to real work as possible.


Key Features of the Benchmark:

127 high-quality tasks created collaboratively by experts and AI agents. Each task:

  • Starts with a carefully selected initial state (database template, GitHub repository with history)
  • Requires performing diverse CRUD operations (Create, Read, Update, Delete)
  • Includes programmatic result verification — no subjective assessments

Five different testing environments:

  • Notion — document and database management
  • GitHub — repository operations, PRs, issues, CI/CD
  • Filesystem — working with files and directories
  • PostgreSQL — relational database operations
  • Playwright — browser automation and web interaction
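
To give a sense of what "working with an MCP server" means in practice, here is a minimal Python sketch using the official MCP Python SDK client. The filesystem server command and the list_directory tool name are assumptions based on the reference filesystem server, not details taken from MCPMark.

```python
# Minimal MCP client sketch (assumed setup, not MCPMark's harness): connect to a
# filesystem MCP server over stdio, list its tools, and call one of them.
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

# Assumption: the reference filesystem server is launched via npx and exposes /tmp.
SERVER = StdioServerParameters(
    command="npx",
    args=["-y", "@modelcontextprotocol/server-filesystem", "/tmp"],
)


async def main() -> None:
    async with stdio_client(SERVER) as (read_stream, write_stream):
        async with ClientSession(read_stream, write_stream) as session:
            await session.initialize()
            tools = await session.list_tools()
            print("available tools:", [t.name for t in tools.tools])
            # Tool name and arguments are assumptions about the filesystem server.
            result = await session.call_tool("list_directory", {"path": "/tmp"})
            print(result.content)


if __name__ == "__main__":
    asyncio.run(main())
```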

Programmatic verification works much like a compilation-based reward signal in GRPO: each task ships with a verification script that automatically checks the result, making the evaluation as objective as possible.
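
To make "programmatic verification" concrete, here is a minimal sketch of what a per-task checker could look like, assuming a hypothetical filesystem task whose expected output is a contacts_report.json file. The file name and schema are illustrative, not taken from the benchmark.

```python
# Hypothetical per-task verification script (illustrative; not MCPMark's actual code).
# The harness runs it against the final environment state and records pass/fail.
import json
import sys
from pathlib import Path

EXPECTED_REPORT = Path("workspace/contacts_report.json")  # assumed output location
REQUIRED_FIELDS = {"name", "email", "source_file"}        # assumed record schema


def verify() -> bool:
    """Return True only if the agent left the environment in the expected state."""
    if not EXPECTED_REPORT.exists():
        return False
    try:
        records = json.loads(EXPECTED_REPORT.read_text())
    except json.JSONDecodeError:
        return False
    # Every extracted contact must be a dict containing the required fields.
    return bool(records) and all(
        isinstance(r, dict) and REQUIRED_FIELDS <= r.keys() for r in records
    )


if __name__ == "__main__":
    sys.exit(0 if verify() else 1)
```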

Sobering Results

Top Models Fail

GPT-5-medium — the best of the tested models — successfully completed tasks on the first attempt in only 52.56% of cases. This means that in almost half the situations, the flagship model failed the assignment.

But the true depth of the problem shows up in the pass^4 metric, which counts a task as solved only if the model succeeds in all four independent runs. Here GPT-5-medium drops to 33.86%: the best model solves a task consistently across four runs only about a third of the time.
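
For clarity on how these two metrics relate, here is a tiny sketch of how pass@1 and pass^4 could be computed from a tasks-by-runs matrix of booleans. This reflects my reading of the metrics, not the paper's reference implementation, and the numbers are made up.

```python
# Sketch: pass@1 vs pass^4 from per-run outcomes (toy data, assumed layout).
from statistics import mean

# results[t][r] is True if run r of task t passed the task's verification script.
results = [
    [True, True, False, True],     # solved in 3 of 4 runs
    [False, False, False, False],  # never solved
    [True, True, True, True],      # solved in every run
]

pass_at_1 = mean(mean(runs) for runs in results)  # average single-run success rate
pass_hat_4 = mean(all(runs) for runs in results)  # fraction of tasks solved in all 4 runs

print(f"pass@1 = {pass_at_1:.2%}, pass^4 = {pass_hat_4:.2%}")
```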

Comparative Leaderboard:

MCPMark Results

Results from other strong models look even more dismal: Claude-Sonnet-4 and o3 score below 30% on pass@1 and below 15% on pass^4.

It’s a Marathon, Not a Sprint: New Standard of Complexity

Tasks in MCPMark are not quick, single-step commands. The research shows that to solve one task, models require on average:

  • 16.2 execution steps
  • 17.4 external tool calls

This is radically different from previous tests where agents managed in 3-7 steps.

As the authors note:

“These metrics significantly exceed those of previous MCP benchmarks, underscoring MCPMark’s nature as a true stress test for AI agents.”

This indicates a transition from simple “question-answer” style queries to tasks requiring planning, adaptation, and real-time error correction — skills that have remained predominantly human.

More Than Just Reading: AI Must Create, Modify, and Delete

The key difference between MCPMark and its predecessors is the nature of the tasks.

Old tests focused on read-heavy tasks or those with limited interaction depth. The new benchmark requires agents to perform the full spectrum of CRUD operations:

  • Create — creating new records, files, PRs
  • Read — reading and analyzing information
  • Update — updating existing data
  • Delete — removing outdated information

Examples of Real Tasks from MCPMark:

Filesystem — “Contact Information”: Extract contact information from various file formats on the desktop and perform analysis of collected relationship data.

GitHub — “Linting CI Workflow”: Configure ESLint workflow to ensure code quality for all PRs with proper CI integration. Includes creating a configuration branch, setting up ESLint, creating workflow, and fixing linting errors.

Notion — “Toronto Guide”: Change all pink-colored elements (tags in databases and callout background colors) to other colors on the “Toronto Guide” page.

Playwright — “Cloudflare Turnstile Challenge”: Use Playwright MCP tools to pass Cloudflare Turnstile authentication.

PostgreSQL — “Employee Project Tracking”: Build a tracking system with tables for projects, assignments, milestones, and performance indices with foreign keys and initial data.
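
To give a flavor of the PostgreSQL task, below is a sketch of a schema such a tracking system might require. The table and column names are my assumptions rather than the benchmark's actual specification.

```python
# Hypothetical DDL for an "Employee Project Tracking" schema (illustrative only).
# An agent would issue these statements through the PostgreSQL MCP server; here
# they are just printed so the sketch stays self-contained and runnable.
SCHEMA_STATEMENTS = [
    """CREATE TABLE projects (
        project_id SERIAL PRIMARY KEY,
        name       TEXT NOT NULL,
        start_date DATE NOT NULL
    )""",
    """CREATE TABLE assignments (
        assignment_id SERIAL PRIMARY KEY,
        project_id    INTEGER NOT NULL REFERENCES projects(project_id),
        employee_name TEXT NOT NULL
    )""",
    """CREATE TABLE milestones (
        milestone_id SERIAL PRIMARY KEY,
        project_id   INTEGER NOT NULL REFERENCES projects(project_id),
        title        TEXT NOT NULL,
        due_date     DATE
    )""",
    """CREATE TABLE performance_indices (
        index_id   SERIAL PRIMARY KEY,
        project_id INTEGER NOT NULL REFERENCES projects(project_id),
        score      NUMERIC(5, 2) NOT NULL
    )""",
]

if __name__ == "__main__":
    for statement in SCHEMA_STATEMENTS:
        print(statement + ";\n")
```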

This capability is not just a technical detail. It’s fundamental to creating truly “general agents” capable of actively and meaningfully interacting with external systems rather than passively retrieving information.

Anatomy of Failure: Why Models Don’t Succeed

Researchers conducted a detailed analysis of failure causes and identified interesting patterns.

Implicit vs Explicit Failures

Most errors (roughly 50-80%, depending on the model) are implicit failures. The model completes the task without encountering explicit errors, but the result doesn't meet the requirements. This indicates problems with:

  • Reasoning and planning
  • Context understanding
  • Tool usage

Explicit failures include:

  • Context window overflow (especially in GPT-5-high)
  • Exceeding move limit (typical for Kimi-K2-instruct)
  • Premature stopping
  • Incorrect tool calls (about 10% in Gemini-2.5-flash)

More Moves ≠ Better Result

An interesting finding: more successful models get by with fewer, better-targeted tool calls rather than relying on blind trial and error.

For example, Kimi-K2-instruct often enters “overcalling” mode, exceeding 30 moves with decreasing probability of success — the model gets stuck in a loop without effective information retrieval.

Meanwhile, GPT-5-medium achieves the highest result while maintaining a reasonable move budget, demonstrating that success comes from efficient decision-making, not exhaustive tool calls.

Gap Between Local and Remote Services

Performance varies significantly depending on the type of MCP environment.

Local Services (easier):

  • PostgreSQL: GPT-5-medium achieves 76.19% pass@1
  • Filesystem: 57.50% pass@1
  • Playwright: 43.00% pass@1

Remote Services (harder):

  • Notion: most models below 25% pass@1
  • GitHub: similarly low results

Why such a gap? Local services are easier to simulate, and more training data exists for them. Remote API services require authentic interaction traces, which are expensive to collect at scale.

This suggests: data remains key to improving MCP usage.

Reasoning Effort: Does Thinking Longer Help?

Researchers tested how increasing “reasoning effort” affects results.

Findings by Model:

GPT-5: raising reasoning effort from low to medium increases pass@1 from 46.85% to 52.56%.

GPT-5-mini: even stronger relative gain — from 8.27% to 30.32% between low and high.

GPT-5-nano: shows only marginal changes around 4-6%, suggesting that models of this scale don’t have sufficient capacity to utilize additional reasoning tokens.

Claude-Sonnet-4: remains stable around 27-28%, regardless of reasoning level.

Findings by Service:

Remote services benefit the most:

  • GitHub: performance nearly doubles from 27.17% to 50.00% between low and high effort for GPT-5
  • Notion: growth from 36.61% to 44.64%

Local services remain stable:

  • PostgreSQL: 72-76%
  • Filesystem: variations less than 5 percentage points

Interpretation: Remote services typically have limited representation in training data due to rate and access limitations. Reasoning helps bridge this gap, allowing models to extrapolate to unseen cases.

This aligns with recent discussions (Yao et al., 2023b; Yao, 2025) that emphasize “language generalizes through reasoning in agents.”

Cost ≠ Quality

MCPMark Costs

Another surprising finding: more expensive runs don’t lead to higher accuracy.

Some of the most expensive runs achieve lower pass@1, while several cheaper runs achieve stronger results. Cost varies widely, even when the number of moves is similar.

Conclusion: Cost alone is not an indicator of better outcomes.

What This Means for the Future of AI Agents

MCPMark results serve as an important reminder: despite striking progress in language capabilities, we’re still in the early stages of creating truly autonomous and reliable AI agents.

Three Critical Directions for Future Progress:

1. From Reactive Tool Use to Complex Reasoning

Agents must evolve from simply reacting to queries toward more sophisticated reasoning. The analysis shows that success depends on making fewer but smarter decisions, not on more attempts. Reasoning can provide better generalization in agents.

2. Long-term Task Execution Requires Contextual Efficiency

The problem is not just the model's context window, but the agent's ability to manage a constantly growing history; one possible approach is sketched after the list below. This requires:

  • Better summarization strategies
  • More concrete tool outputs
  • Efficient memory management
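
As one illustration of the first point, here is a minimal sketch of a history-compaction strategy for an agent loop. The token heuristic, thresholds, and summary format are all assumptions, not something the paper prescribes.

```python
# Sketch of a history-compaction step for an agent loop (assumed strategy, not
# part of MCPMark). Old tool outputs are collapsed into a short summary so the
# growing transcript stays within the model's context budget.
from dataclasses import dataclass


@dataclass
class Message:
    role: str      # "user", "assistant", or "tool"
    content: str


def rough_token_count(messages: list[Message]) -> int:
    # Crude heuristic: roughly 4 characters per token.
    return sum(len(m.content) for m in messages) // 4


def compact_history(
    messages: list[Message], budget: int = 8000, keep_last: int = 6
) -> list[Message]:
    """If the transcript exceeds the budget, summarize everything but the tail."""
    if rough_token_count(messages) <= budget:
        return messages
    head, tail = messages[:-keep_last], messages[-keep_last:]
    # In a real agent this summary would come from an LLM call; here we only
    # record which tools were used, as a stand-in.
    tool_names = [m.content.split("(")[0] for m in head if m.role == "tool"]
    summary = Message(
        "assistant",
        f"[summary] earlier steps used tools: {', '.join(tool_names) or 'none'}",
    )
    return [summary, *tail]
```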

3. Execution Stability is Critical

The observed inconsistency across multiple runs underscores a fundamental unreliability that can only be addressed by building agents with the following properties; one illustrative pattern is sketched after the list:

  • Robust error handling
  • Self-correction capabilities
  • Deterministic behavior
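
As a sketch of what robust error handling with self-correction could look like in code, here is one assumed pattern: feed the tool error back to the model so it can propose corrected arguments instead of repeating the same call. The function names are hypothetical.

```python
# Sketch of a self-correcting tool-call wrapper (an assumed pattern, not from
# the paper): on failure, the error text is fed back so the model can adjust
# its next call instead of repeating the same mistake.
from typing import Any, Callable


def call_with_feedback(
    call_tool: Callable[[str, dict[str, Any]], Any],   # hypothetical tool executor
    propose_fix: Callable[[str], dict[str, Any]],      # hypothetical LLM repair step
    tool: str,
    args: dict[str, Any],
    max_retries: int = 2,
) -> Any:
    for attempt in range(max_retries + 1):
        try:
            return call_tool(tool, args)
        except Exception as error:  # in practice, catch the client's specific errors
            if attempt == max_retries:
                raise
            # Ask the model for corrected arguments, given the error message.
            args = propose_fix(f"{tool} failed with: {error}")
```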

Limitations and Future Directions

The study authors honestly acknowledge limitations:

Scaling complexity: Even with agents’ help, creating each sample remains time-consuming. Each task takes 3-5 hours of focused expert effort.

Complexity gradient: The steep difficulty of many tasks limits the benchmark’s usefulness for evaluating and developing smaller, more efficient models.

Determinism: All tasks have clear success criteria. In the real world, user intent is often ambiguous, and agents need to be able to ask clarifying questions.

Future work should focus on:

  • Introducing a more granular complexity gradient
  • Semi-automated task generation
  • Tasks with ambiguous intent
  • Expansion to a wider variety of MCP servers

How Tasks Were Created: Human-AI Collaboration in Action

The MCPMark creation process deserves special attention, as it demonstrates the future of benchmark development.

Four-stage Pipeline:

I. Exploration: Expert and task-creation agent collaboratively explore the environment, guided by high-level instructions based on expertise and real experience.

II. Evolvement: Agent proposes a new instruction or refines an existing one, adding complexity. Expert ensures the task remains practical, verifiable, and sufficiently challenging.

III. Verification: Agent creates a programmatic verification script. Expert executes the task with an execution agent, then the script is run and iteratively improved until fully matching the instruction.

IV. Iteration: Steps II and III are repeated to gradually increase complexity while maintaining automatic verifiability and realism.

The project involved 10 experts with diverse backgrounds:

  • PhD students in computer science
  • Front-end designers
  • Full-stack & AI infra engineers
  • AI investors

Conclusion: Reality vs Hype

MCPMark is not just another academic benchmark. It’s a mirror reflecting the true state of AI agents in 2025.

Key Takeaways:

  • Even the best models handle realistic tasks in less than 53% of cases
  • The gap between pass@1 and pass^4 reveals a critical stability problem
  • Remote services remain significantly harder than local ones
  • More moves don't equal better results; the quality of reasoning matters
  • Reasoning effort helps, but it is not a universal solution

Being able to speak eloquently is one thing, but flawlessly executing complex, multi-step tasks in a real digital environment is quite another.

The new benchmark clearly shows where the current boundaries of capabilities lie. And these boundaries are much closer than we’d like to think.


Links:


If you liked the article, subscribe to my Telegram channel: https://t.me/renat_alimbekov

