Real-world AI applications increasingly rely on AI agents: systems that can reason, plan, and act. From personal health coaches to coding assistants, the industry is moving away from one-shot question answering and toward ongoing, multi-step interactions. Researchers have long used established metrics to optimize the accuracy of traditional machine learning models, but agents introduce a new layer of complexity.
Agents must navigate sustained, multi-step interactions rather than isolated predictions, and a single error can cascade through an entire workflow. This shift forces us to look beyond standard accuracy and ask how these systems should actually be designed to perform at their best. Practitioners often fall back on heuristics, such as the belief that adding more specialized agents will reliably produce better outcomes. For example, “More Agents Is All You Need” reported that LLM performance scales with agent count, while collaborative scaling research found that multi-agent collaboration “…often surpasses each individual through collective reasoning.”
In our recent paper, “Towards a Science of Scaling Agent Systems,” we question this assumption. From a large-scale controlled evaluation of 180 agent configurations, we derive the first quantitative scaling principles for agent systems, showing that the “more agents” approach often hits a ceiling and can even degrade performance when it is not aligned with the specific properties of the task.
Defining “agentic” evaluation
To understand how agents scale, we first defined what makes a task “agentic.” Traditional static benchmarks measure a model’s knowledge, but they don’t capture the complexities of deployment. We argue that agentic tasks require three specific properties (illustrated in the sketch after this list):
- Sustained, multi-step interaction with an external environment.
- Iterative information gathering under partial observability.
- Adaptive strategy refinement based on environmental feedback.
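To make these three properties concrete, here is a minimal sketch of an agent loop in Python. All names (`Environment`, `Agent`, the placeholder planning logic) are hypothetical illustrations, not part of the evaluation harness described in the paper.

```python
# Minimal sketch of an "agentic" task loop (illustrative; all names are hypothetical).
from dataclasses import dataclass, field

@dataclass
class Environment:
    """Toy environment that exposes only a partial view of its state."""
    state: dict

    def observe(self) -> dict:
        # Partial observability: the agent never sees the full state at once.
        return {k: v for k, v in self.state.items() if k.startswith("visible_")}

    def step(self, action: str) -> str:
        # Multi-step interaction: each action returns feedback, not a final answer.
        return f"feedback for {action!r}"

@dataclass
class Agent:
    memory: list = field(default_factory=list)

    def plan(self, observation: dict) -> str:
        # Placeholder for an LLM call that chooses the next action.
        return "inspect_next_item"

    def refine(self, feedback: str) -> None:
        # Adaptive strategy refinement: feedback updates the agent's working memory.
        self.memory.append(feedback)

def run_episode(agent: Agent, env: Environment, max_steps: int = 10) -> list:
    for _ in range(max_steps):       # sustained, multi-step interaction
        obs = env.observe()          # iterative information gathering
        action = agent.plan(obs)
        feedback = env.step(action)
        agent.refine(feedback)       # adapt based on environmental feedback
    return agent.memory
```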
We evaluated five canonical architectures, a single-agent system (SAS) and four multi-agent variants (independent, centralized, decentralized, and hybrid), across four diverse benchmarks: Finance-Agent (financial reasoning), BrowseComp-Plus (web navigation), PlanCraft (planning), and Workbench (tool use). The architectures are defined as follows (see the sketch after this list):
- Single-Agent (SAS): A solitary agent executing all reasoning and acting steps sequentially with a unified memory stream.
- Independent: Multiple agents working in parallel on sub-tasks without communicating, aggregating results only at the end.
- Centralized: A “hub-and-spoke” model where a central orchestrator delegates tasks to workers and synthesizes their outputs.
- Decentralized: A peer-to-peer mesh where agents communicate directly with one another to share information and reach consensus.
- Hybrid: Combines hierarchical oversight with peer-to-peer coordination, balancing central control against flexible execution.
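A compact way to see how these architectures differ is to write down who is allowed to talk to whom. The sketch below encodes each topology as a set of communication edges over a few worker agents plus an optional orchestrator; it is our illustration, not code from the paper.

```python
# Sketch: the five architectures expressed as communication topologies (illustrative).
from itertools import combinations

def topology(architecture: str, n_workers: int = 3) -> set[tuple[str, str]]:
    """Return the set of undirected communication edges for each architecture."""
    workers = [f"agent_{i}" for i in range(n_workers)]
    if architecture == "single":
        return set()  # one agent, unified memory, no coordination edges
    if architecture == "independent":
        return set()  # parallel workers, results aggregated only at the end
    if architecture == "centralized":
        # Hub-and-spoke: every worker talks only to the orchestrator.
        return {("orchestrator", w) for w in workers}
    if architecture == "decentralized":
        # Peer-to-peer mesh: every pair of workers can communicate directly.
        return set(combinations(workers, 2))
    if architecture == "hybrid":
        # Hierarchical oversight plus peer-to-peer coordination.
        return {("orchestrator", w) for w in workers} | set(combinations(workers, 2))
    raise ValueError(f"unknown architecture: {architecture}")

for arch in ["single", "independent", "centralized", "decentralized", "hybrid"]:
    print(arch, len(topology(arch)), "communication edges")
```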
Results: the myth of “more agents”
To quantify how model capability affects agent performance, we evaluated our architectures across three prominent model families: OpenAI GPT, Google Gemini, and Anthropic Claude. The findings reveal a complicated relationship between model capability and coordination strategy. Multi-agent systems are not a one-size-fits-all solution; depending on the configuration, they can either significantly improve performance or unexpectedly degrade it, as shown in the figure below. And while performance generally improves with more capable models, stronger models alone are not a universal fix.
The results below compare the five architectures across applications such as web browsing and financial analysis. The box plots show the accuracy distribution for each method, and the percentages show how much better or worse multi-agent teams perform relative to the single-agent baseline. The data shows that while adding agents can substantially improve parallelizable tasks, it frequently yields diminishing returns, or outright performance drops, in sequential workflows.
The alignment principle
On parallelizable tasks like financial reasoning, where multiple agents can simultaneously analyze revenue trends, cost structures, and market comparisons, centralized coordination improved performance by 80.9% over a single agent. Agents worked more efficiently when they could break a complex problem into smaller sub-tasks.
The sequential penalty
In contrast, on tasks requiring strict sequential reasoning, such as planning in PlanCraft, every multi-agent variant we tested decreased performance by 39-70%. In these scenarios, communication overhead fragmented the reasoning process, leaving an inadequate “cognitive budget” for the task itself.
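One way to build intuition for this “cognitive budget” effect is a back-of-the-envelope token accounting: under a fixed budget, every coordination message crowds out task reasoning. The numbers and the functional form below are illustrative assumptions, not measurements from the paper.

```python
# Illustrative token accounting for the "cognitive budget" (all numbers are assumed).
def reasoning_budget(total_tokens: int, n_agents: int, rounds: int,
                     tokens_per_message: int) -> int:
    """Tokens left for task reasoning after paying the coordination overhead."""
    if n_agents <= 1:
        return total_tokens  # a single agent spends everything on the task itself
    # Assume each round every agent sends one status/handoff message.
    overhead = n_agents * rounds * tokens_per_message
    return max(total_tokens - overhead, 0)

budget = 20_000
for n in (1, 2, 4, 8):
    left = reasoning_budget(budget, n_agents=n, rounds=10, tokens_per_message=400)
    print(f"{n} agent(s): {left:>6} tokens left for sequential reasoning")
```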
The tool-use bottleneck
We also identified a “tool-coordination trade-off”: as tasks require more tools (for example, a coding agent with access to more than 16 tools), the “tax” of coordinating multiple agents rises disproportionately.
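A rough way to express this trade-off is a cost model in which coordination overhead grows with both the number of agent pairs and the number of tools whose state they must keep in sync. The functional form below is our assumption for illustration, not the relationship fitted in the paper.

```python
# Illustrative cost model for the tool-coordination trade-off (functional form assumed).
def coordination_tax(n_agents: int, n_tools: int, base_cost: float = 1.0) -> float:
    """Toy model: the cost of staying consistent grows with agent pairs and tool count."""
    if n_agents <= 1:
        return 0.0
    pairs = n_agents * (n_agents - 1) / 2  # every pair must reconcile tool state
    return base_cost * pairs * n_tools

for tools in (4, 8, 16, 32):
    print(f"{tools:>2} tools: single-agent tax = 0, 4-agent tax = {coordination_tax(4, tools):.0f}")
```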
Architecture as a safety feature
Perhaps most important for real-world deployment, we found a relationship between architecture and reliability. We measured error amplification: the rate at which a mistake made by one agent propagates to the final result.
We found that independent multi-agent systems (agents working in parallel without talking) amplified errors by 17.2x. Without a mechanism to check each other’s work, errors cascaded unchecked. Centralized systems (with an orchestrator) contained this amplification to just 4.4x. The orchestrator effectively prevents errors from spreading by acting as a “validation bottleneck.”
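As a toy illustration of why an orchestrator acts as a “validation bottleneck,” the Monte Carlo sketch below compares how often any agent’s mistake survives to the final answer when there is no cross-checking versus when an orchestrator catches a fraction of errors. The per-step error rate and the catch rate are assumptions chosen for illustration; they are not the paper’s measurements and do not reproduce the 17.2x and 4.4x figures.

```python
# Toy Monte Carlo of error propagation in independent vs. centralized systems.
# The per-agent error rate and the orchestrator's catch rate are illustrative assumptions.
import random

def final_error_rate(n_agents: int, p_error: float, p_catch: float,
                     trials: int = 100_000) -> float:
    """Fraction of runs whose final answer is wrong."""
    failures = 0
    for _ in range(trials):
        agent_errors = [random.random() < p_error for _ in range(n_agents)]
        wrong = any(agent_errors)
        # An orchestrator reviews each wrong contribution and catches some of them.
        if wrong and p_catch > 0:
            wrong = any(err and random.random() > p_catch for err in agent_errors)
        failures += wrong
    return failures / trials

random.seed(0)
base = final_error_rate(n_agents=1, p_error=0.02, p_catch=0.0)        # single agent
independent = final_error_rate(n_agents=4, p_error=0.02, p_catch=0.0)  # no validation
centralized = final_error_rate(n_agents=4, p_error=0.02, p_catch=0.7)  # orchestrator checks
print(f"amplification, independent: {independent / base:.1f}x")
print(f"amplification, centralized: {centralized / base:.1f}x")
```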
A predictive agent design model
Going beyond retrospective analysis, we developed a predictive model (R² = 0.513) that uses measurable task properties, such as tool count and decomposability, to predict which architecture will perform best. The model correctly identifies the most effective coordination strategy for 87% of unseen task configurations.
This suggests that agent scaling is becoming a new science. Instead of guessing whether to use a swarm of agents or a single powerful model, developers can now look at the properties of their task, specifically its sequential dependencies and tool density, to make principled engineering decisions.
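Here is a minimal sketch of what such a predictor could look like, assuming a small table of task features (tool count, decomposability, sequential dependency) labeled with the best-performing architecture. The feature names, training rows, and choice of classifier are illustrative; the paper’s actual model and features may differ.

```python
# Sketch: predicting the best coordination architecture from task properties.
# Feature names, training rows, and the classifier choice are illustrative assumptions.
from sklearn.tree import DecisionTreeClassifier

# Each row: [tool_count, decomposability (0-1), sequential_dependency (0-1)]
X = [
    [4, 0.9, 0.1],   # highly parallelizable, few tools
    [6, 0.8, 0.2],
    [3, 0.2, 0.9],   # strictly sequential planning
    [5, 0.3, 0.8],
    [20, 0.6, 0.4],  # tool-heavy workflow
    [24, 0.5, 0.5],
]
y = ["centralized", "centralized", "single", "single", "single", "single"]

clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

new_task = [[8, 0.85, 0.15]]   # e.g., a decomposable financial-analysis task
print(clf.predict(new_task))   # -> likely "centralized" under this toy data
```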
Conclusion
As foundation models like Gemini continue to advance, our research suggests that smarter models don’t replace the need for multi-agent systems; they accelerate it, but only when the architecture is right. By moving from heuristics to quantitative principles, we can build the next generation of AI agents that are not just more numerous, but smarter, safer, and more effective.