
MCPMark: New Standard for Evaluating AI Agents

If you, like me, wonder how well LLMs work with MCP and how reliably they carry out the tasks you assign, a new study called MCPMark is about exactly that. The research pits our illusions about artificial intelligence against harsh reality.

Why Existing Tests Don’t Work

Imagine evaluating someone's ability to work as a programmer by only asking them to read documentation. Absurd, right? Yet that is exactly how most existing benchmarks for AI agents work.

Researchers from the National University of Singapore, EvalSys, and other organizations noticed a critical problem: modern tests for evaluating AI agents’ work with Model Context Protocol (MCP) remain narrow and unrealistic. They either focus on tasks that only require reading information or offer interactions with minimal depth.

It’s like testing driving skills by having a person only sit in the passenger seat and describe what they see through the window.

What is MCPMark and Why It’s a Game Changer

MCPMark is not just another benchmark. It’s a stress test designed to evaluate AI agents in conditions as close to real work as possible.


Key Features of the Benchmark:

127 high-quality tasks created collaboratively by experts and AI agents. Each task:

  • Starts with a carefully selected initial state (database template, GitHub repository with history)
  • Requires performing diverse CRUD operations (Create, Read, Update, Delete)
  • Includes programmatic result verification — no subjective assessments

Five different testing environments:

  • Notion — document and database management
  • GitHub — repository operations, PRs, issues, CI/CD
  • Filesystem — working with files and directories
  • PostgreSQL — relational database operations
  • Playwright — browser automation and web interaction
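
To give a sense of what "working with an MCP server" means in practice, here is a minimal Python sketch using the official MCP Python SDK client. The filesystem server command and the list_directory tool name are assumptions based on the reference filesystem server, not details taken from MCPMark.

```python
# Minimal MCP client sketch (assumed setup, not MCPMark's harness): connect to a
# filesystem MCP server over stdio, list its tools, and call one of them.
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

# Assumption: the reference filesystem server is launched via npx and exposes /tmp.
SERVER = StdioServerParameters(
    command="npx",
    args=["-y", "@modelcontextprotocol/server-filesystem", "/tmp"],
)


async def main() -> None:
    async with stdio_client(SERVER) as (read_stream, write_stream):
        async with ClientSession(read_stream, write_stream) as session:
            await session.initialize()
            tools = await session.list_tools()
            print("available tools:", [t.name for t in tools.tools])
            # Tool name and arguments are assumptions about the filesystem server.
            result = await session.call_tool("list_directory", {"path": "/tmp"})
            print(result.content)


if __name__ == "__main__":
    asyncio.run(main())
```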

Programmatic verification works much like a compilation-based reward signal in GRPO: each task ships with a verification script that automatically checks the result, making the evaluation as objective as possible.
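
To make "programmatic verification" concrete, here is a minimal sketch of what a per-task checker could look like, assuming a hypothetical filesystem task whose expected output is a contacts_report.json file. The file name and schema are illustrative, not taken from the benchmark.

```python
# Hypothetical per-task verification script (illustrative; not MCPMark's actual code).
# The harness runs it against the final environment state and records pass/fail.
import json
import sys
from pathlib import Path

EXPECTED_REPORT = Path("workspace/contacts_report.json")  # assumed output location
REQUIRED_FIELDS = {"name", "email", "source_file"}        # assumed record schema


def verify() -> bool:
    """Return True only if the agent left the environment in the expected state."""
    if not EXPECTED_REPORT.exists():
        return False
    try:
        records = json.loads(EXPECTED_REPORT.read_text())
    except json.JSONDecodeError:
        return False
    # Every extracted contact must be a dict containing the required fields.
    return bool(records) and all(
        isinstance(r, dict) and REQUIRED_FIELDS <= r.keys() for r in records
    )


if __name__ == "__main__":
    sys.exit(0 if verify() else 1)
```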

Sobering Results

Top Models Fail

GPT-5-medium — the best of the tested models — successfully completed tasks on the first attempt in only 52.56% of cases. This means that in almost half the situations, the flagship model failed the assignment.

But the true depth of the problem shows up in the pass^4 metric, which counts a task as solved only if the model succeeds in all four independent runs. Here GPT-5-medium drops to 33.86%: the best model solves a task consistently across four runs only about a third of the time.
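
For clarity on how these two metrics relate, here is a tiny sketch of how pass@1 and pass^4 could be computed from a tasks-by-runs matrix of booleans. This reflects my reading of the metrics, not the paper's reference implementation, and the numbers are made up.

```python
# Sketch: pass@1 vs pass^4 from per-run outcomes (toy data, assumed layout).
from statistics import mean

# results[t][r] is True if run r of task t passed the task's verification script.
results = [
    [True, True, False, True],     # solved in 3 of 4 runs
    [False, False, False, False],  # never solved
    [True, True, True, True],      # solved in every run
]

pass_at_1 = mean(mean(runs) for runs in results)  # average single-run success rate
pass_hat_4 = mean(all(runs) for runs in results)  # fraction of tasks solved in all 4 runs

print(f"pass@1 = {pass_at_1:.2%}, pass^4 = {pass_hat_4:.2%}")
```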

Comparative Leaderboard:

MCPMark Results

Results from other strong models look even more dismal: Claude-Sonnet-4 and o3 score below 30% on pass@1 and below 15% on pass^4.

It’s a Marathon, Not a Sprint: New Standard of Complexity

Tasks in MCPMark are not quick, single-step commands. The research shows that to solve one task, models require on average:

  • 16.2 execution steps
  • 17.4 external tool calls

This is radically different from previous tests where agents managed in 3-7 steps.

As the authors note:

“These metrics significantly exceed those of previous MCP benchmarks, underscoring MCPMark’s nature as a true stress test for AI agents.”

This indicates a transition from simple “question-answer” style queries to tasks requiring planning, adaptation, and real-time error correction — skills that have remained predominantly human.

More Than Just Reading: AI Must Create, Modify, and Delete

The key difference between MCPMark and its predecessors is the nature of the tasks.

Old tests focused on read-heavy tasks or those with limited interaction depth. The new benchmark requires agents to perform the full spectrum of CRUD operations:

  • Create — creating new records, files, PRs
  • Read — reading and analyzing information
  • Update — updating existing data
  • Delete — removing outdated information

Examples of Real Tasks from MCPMark:

Filesystem — “Contact Information”: Extract contact information from various file formats on the desktop and perform analysis of collected relationship data.

GitHub — “Linting CI Workflow”: Configure ESLint workflow to ensure code quality for all PRs with proper CI integration. Includes creating a configuration branch, setting up ESLint, creating workflow, and fixing linting errors.

Notion — “Toronto Guide”: Change all pink-colored elements (tags in databases and callout background colors) to other colors on the “Toronto Guide” page.

Playwright — “Cloudflare Turnstile Challenge”: Use Playwright MCP tools to pass Cloudflare Turnstile authentication.

PostgreSQL — “Employee Project Tracking”: Build a tracking system with tables for projects, assignments, milestones, and performance indices with foreign keys and initial data.
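
To give a flavor of the PostgreSQL task, below is a sketch of a schema such a tracking system might require. The table and column names are my assumptions rather than the benchmark's actual specification.

```python
# Hypothetical DDL for an "Employee Project Tracking" schema (illustrative only).
# An agent would issue these statements through the PostgreSQL MCP server; here
# they are just printed so the sketch stays self-contained and runnable.
SCHEMA_STATEMENTS = [
    """CREATE TABLE projects (
        project_id SERIAL PRIMARY KEY,
        name       TEXT NOT NULL,
        start_date DATE NOT NULL
    )""",
    """CREATE TABLE assignments (
        assignment_id SERIAL PRIMARY KEY,
        project_id    INTEGER NOT NULL REFERENCES projects(project_id),
        employee_name TEXT NOT NULL
    )""",
    """CREATE TABLE milestones (
        milestone_id SERIAL PRIMARY KEY,
        project_id   INTEGER NOT NULL REFERENCES projects(project_id),
        title        TEXT NOT NULL,
        due_date     DATE
    )""",
    """CREATE TABLE performance_indices (
        index_id   SERIAL PRIMARY KEY,
        project_id INTEGER NOT NULL REFERENCES projects(project_id),
        score      NUMERIC(5, 2) NOT NULL
    )""",
]

if __name__ == "__main__":
    for statement in SCHEMA_STATEMENTS:
        print(statement + ";\n")
```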

This capability is not just a technical detail. It’s fundamental to creating truly “general agents” capable of actively and meaningfully interacting with external systems rather than passively retrieving information.

Anatomy of Failure: Why Models Don’t Succeed

Researchers conducted a detailed analysis of failure causes and identified interesting patterns.

Implicit vs Explicit Failures

Most errors (roughly 50-80%, depending on the model) are implicit failures. The model completes the task without encountering explicit errors, but the result doesn't meet the requirements. This indicates problems with:

  • Reasoning and planning
  • Context understanding
  • Tool usage

Explicit failures include:

  • Context window overflow (especially in GPT-5-high)
  • Exceeding move limit (typical for Kimi-K2-instruct)
  • Premature stopping
  • Incorrect tool calls (about 10% in Gemini-2.5-flash)

More Moves ≠ Better Result

An interesting finding: more successful models get by with fewer, better-targeted tool calls rather than relying on blind trial and error.

For example, Kimi-K2-instruct often enters “overcalling” mode, exceeding 30 moves with decreasing probability of success — the model gets stuck in a loop without effective information retrieval.

Meanwhile, GPT-5-medium achieves the highest result while maintaining a reasonable move budget, demonstrating that success comes from efficient decision-making, not exhaustive tool calls.

Gap Between Local and Remote Services

Performance varies significantly depending on the type of MCP environment.

Local Services (easier):

  • PostgreSQL: GPT-5-medium achieves 76.19% pass@1
  • Filesystem: 57.50% pass@1
  • Playwright: 43.00% pass@1

Remote Services (harder):

  • Notion: most models below 25% pass@1
  • GitHub: similarly low results

Why such a gap? Local services are easier to simulate, and more training data exists for them. Remote API services require authentic interaction traces, which are expensive to collect at scale.

This suggests: data remains key to improving MCP usage.

Reasoning Effort: Does Thinking Longer Help?

Researchers tested how increasing “reasoning effort” affects results.

Findings by Model:

GPT-5: raising reasoning effort from low to medium increases pass@1 from 46.85% to 52.56%.

GPT-5-mini: even stronger relative gain — from 8.27% to 30.32% between low and high.

GPT-5-nano: shows only marginal changes around 4-6%, suggesting that models of this scale don’t have sufficient capacity to utilize additional reasoning tokens.

Claude-Sonnet-4: remains stable around 27-28%, regardless of reasoning level.

Findings by Service:

Remote services benefit the most:

  • GitHub: performance nearly doubles from 27.17% to 50.00% between low and high effort for GPT-5
  • Notion: growth from 36.61% to 44.64%

Local services remain stable:

  • PostgreSQL: 72-76%
  • Filesystem: variations less than 5 percentage points

Interpretation: Remote services typically have limited representation in training data due to rate and access limitations. Reasoning helps bridge this gap, allowing models to extrapolate to unseen cases.

This aligns with recent discussions (Yao et al., 2023b; Yao, 2025) that emphasize “language generalizes through reasoning in agents.”

Cost ≠ Quality

MCPMark Costs

Another surprising finding: more expensive runs don’t lead to higher accuracy.

Some of the most expensive runs achieve lower pass@1, while several cheaper runs achieve stronger results. Cost varies widely, even when the number of moves is similar.

Conclusion: Cost alone is not an indicator of better outcomes.

What This Means for the Future of AI Agents

MCPMark results serve as an important reminder: despite striking progress in language capabilities, we’re still in the early stages of creating truly autonomous and reliable AI agents.

Three Critical Directions for Future Progress:

1. From Reactive Tool Use to Complex Reasoning

Agents must evolve from simply reacting to queries toward more sophisticated reasoning. The analysis shows that success depends on making fewer but smarter decisions, not on more attempts. Reasoning can provide better generalization in agents.

2. Long-term Task Execution Requires Contextual Efficiency

The problem is not just the model's context window, but the agent's ability to manage a constantly growing history; one possible approach is sketched after the list below. This requires:

  • Better summarization strategies
  • More concrete tool outputs
  • Efficient memory management
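
As one illustration of the first point, here is a minimal sketch of a history-compaction strategy for an agent loop. The token heuristic, thresholds, and summary format are all assumptions, not something the paper prescribes.

```python
# Sketch of a history-compaction step for an agent loop (assumed strategy, not
# part of MCPMark). Old tool outputs are collapsed into a short summary so the
# growing transcript stays within the model's context budget.
from dataclasses import dataclass


@dataclass
class Message:
    role: str      # "user", "assistant", or "tool"
    content: str


def rough_token_count(messages: list[Message]) -> int:
    # Crude heuristic: roughly 4 characters per token.
    return sum(len(m.content) for m in messages) // 4


def compact_history(
    messages: list[Message], budget: int = 8000, keep_last: int = 6
) -> list[Message]:
    """If the transcript exceeds the budget, summarize everything but the tail."""
    if rough_token_count(messages) <= budget:
        return messages
    head, tail = messages[:-keep_last], messages[-keep_last:]
    # In a real agent this summary would come from an LLM call; here we only
    # record which tools were used, as a stand-in.
    tool_names = [m.content.split("(")[0] for m in head if m.role == "tool"]
    summary = Message(
        "assistant",
        f"[summary] earlier steps used tools: {', '.join(tool_names) or 'none'}",
    )
    return [summary, *tail]
```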

3. Execution Stability is Critical

The observed inconsistency across multiple runs underscores a fundamental unreliability that can only be addressed by building agents with the following properties; one illustrative pattern is sketched after the list:

  • Robust error handling
  • Self-correction capabilities
  • Deterministic behavior
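
As a sketch of what robust error handling with self-correction could look like in code, here is one assumed pattern: feed the tool error back to the model so it can propose corrected arguments instead of repeating the same call. The function names are hypothetical.

```python
# Sketch of a self-correcting tool-call wrapper (an assumed pattern, not from
# the paper): on failure, the error text is fed back so the model can adjust
# its next call instead of repeating the same mistake.
from typing import Any, Callable


def call_with_feedback(
    call_tool: Callable[[str, dict[str, Any]], Any],   # hypothetical tool executor
    propose_fix: Callable[[str], dict[str, Any]],      # hypothetical LLM repair step
    tool: str,
    args: dict[str, Any],
    max_retries: int = 2,
) -> Any:
    for attempt in range(max_retries + 1):
        try:
            return call_tool(tool, args)
        except Exception as error:  # in practice, catch the client's specific errors
            if attempt == max_retries:
                raise
            # Ask the model for corrected arguments, given the error message.
            args = propose_fix(f"{tool} failed with: {error}")
```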

Limitations and Future Directions

The study authors honestly acknowledge limitations:

Scaling complexity: Even with agents’ help, creating each sample remains time-consuming. Each task takes 3-5 hours of focused expert effort.

Complexity gradient: The steep difficulty of many tasks limits the benchmark’s usefulness for evaluating and developing smaller, more efficient models.

Determinism: All tasks have clear success criteria. In the real world, user intent is often ambiguous, and agents need to be able to ask clarifying questions.

Future work should focus on:

  • Introducing a more granular complexity gradient
  • Semi-automated task generation
  • Tasks with ambiguous intent
  • Expansion to a wider variety of MCP servers

How Tasks Were Created: Human-AI Collaboration in Action

The MCPMark creation process deserves special attention, as it demonstrates the future of benchmark development.

Four-stage Pipeline:

I. Exploration: Expert and task-creation agent collaboratively explore the environment, guided by high-level instructions based on expertise and real experience.

II. Evolvement: Agent proposes a new instruction or refines an existing one, adding complexity. Expert ensures the task remains practical, verifiable, and sufficiently challenging.

III. Verification: Agent creates a programmatic verification script. Expert executes the task with an execution agent, then the script is run and iteratively improved until fully matching the instruction.

IV. Iteration: Steps II and III are repeated to gradually increase complexity while maintaining automatic verifiability and realism.

The project involved 10 experts with diverse backgrounds:

  • PhD students in computer science
  • Front-end designers
  • Full-stack & AI infra engineers
  • AI investors

Conclusion: Reality vs Hype

MCPMark is not just another academic benchmark. It’s a mirror reflecting the true state of AI agents in 2025.

Key Takeaways:

  • Even the best models handle realistic tasks in less than 53% of cases
  • The gap between pass@1 and pass^4 reveals a critical stability problem
  • Remote services remain significantly harder than local ones
  • More moves don't equal better results; the quality of reasoning matters
  • Reasoning effort helps, but it is not a universal solution

Being able to speak eloquently is one thing, but flawlessly executing complex, multi-step tasks in a real digital environment is quite another.

The new benchmark clearly shows where the current boundaries of capabilities lie. And these boundaries are much closer than we’d like to think.


Links:


If you liked the article, subscribe to my Telegram channel: https://t.me/renat_alimbekov

