Building Better AI Benchmarks: How Many Human Raters Do You Really Need?

Creating reliable AI benchmarks isn’t just about better models—it’s about better evaluation.

At the heart of this challenge lies a deceptively simple question:

How many human raters are enough to trust your results?

Too few, and you miss the nuance of human opinion.
Too many, and costs spiral out of control.

This research introduces a practical framework to strike that balance—helping teams build AI benchmarks that are both reproducible and realistic, without wasting resources.

The Core Problem: Humans Don’t Always Agree

Reproducibility in machine learning means running the same experiment and getting the same result.

But there’s a catch.

Most evaluation datasets rely on human judgments—and humans don’t think like machines. They bring different perspectives, biases, and interpretations.

That means:

  • Two teams can evaluate the same model
  • Using the same setup
  • And still get different results

Why? Because human disagreement is often ignored.

The Big Trade-Off: Breadth vs Depth

When collecting human ratings, researchers face a key decision:

  • Breadth (the “forest”) → Many items, few raters per item
  • Depth (the “tree”) → Fewer items, many raters per item

Think of it like reviewing a restaurant:

  • Ask 1,000 people to each try a different dish → a broad overview of the menu
  • Ask 20 people to each try the same 50 dishes → deeper insight into those dishes

Historically, AI evaluation has favored breadth—usually 1 to 5 raters per item.

But that assumption may be flawed.
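
To make the trade-off concrete, here is a minimal sketch (illustrative Python, with made-up numbers, not taken from the study) of how a fixed annotation budget can be split between items and raters per item:

```python
# Illustrative only: ways to split a fixed annotation budget between
# N (items) and K (raters per item). Numbers and names are made up.
BUDGET = 1_000  # total individual ratings we can afford

def plans(budget: int, rater_counts=(1, 3, 5, 10, 20, 50)):
    """For each rater count K, how many items N fit in the budget?"""
    return [(budget // k, k) for k in rater_counts]

for n_items, k_raters in plans(BUDGET):
    style = "breadth (forest)" if k_raters <= 3 else "depth (tree)"
    print(f"N={n_items:4d} items x K={k_raters:2d} raters -> {style}")
```

Every row costs the same; the question is which of these splits actually yields trustworthy, repeatable benchmark numbers.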

The Experiment: Simulating the Perfect Balance

To find the optimal mix, researchers built a large-scale simulator using real-world datasets involving subjective tasks like:

  • Toxicity detection
  • Hate speech classification
  • Conversational safety

They tested thousands of combinations by adjusting:

  • N (Scale): Number of items (100 → 50,000)
  • K (Crowd): Raters per item (1 → 500)

The goal? Identify which setups produce statistically reliable, reproducible results.
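
The study's simulator works with real rating data; the toy version below (synthetic data, illustrative function names) shows the basic idea: repeatedly subsample N items and K raters per item, recompute the benchmark score, and measure how much it moves between runs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a densely rated dataset: 5,000 items, 200 raters,
# binary labels, with each item having its own underlying "positive" rate.
item_rate = rng.uniform(0.1, 0.9, size=(5_000, 1))
ratings = rng.binomial(n=1, p=item_rate, size=(5_000, 200))

def run_benchmark(n_items: int, k_raters: int) -> float:
    """One simulated evaluation: sample N items and K raters per item,
    aggregate each item by majority vote, and report the positive rate."""
    items = rng.choice(ratings.shape[0], size=n_items, replace=False)
    majorities = []
    for i in items:
        raters = rng.choice(ratings.shape[1], size=k_raters, replace=False)
        majorities.append(ratings[i, raters].mean() >= 0.5)
    return float(np.mean(majorities))

def run_to_run_spread(n_items: int, k_raters: int, trials: int = 50) -> float:
    """Lower spread across repeated runs = a more reproducible setup."""
    return float(np.std([run_benchmark(n_items, k_raters) for _ in range(trials)]))

# Same 1,000-rating budget, split three different ways:
for n_items, k_raters in [(1_000, 1), (200, 5), (50, 20)]:
    print(f"N={n_items}, K={k_raters}: spread={run_to_run_spread(n_items, k_raters):.4f}")
```

The run-to-run spread is the reproducibility signal: for a given metric and budget, the split that keeps it smallest is the one you can trust to replicate.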

What Data Did They Use?

The study pulled from diverse, human-labeled datasets:

  • Toxicity dataset: 100K+ comments rated by 17K+ people
  • DICES dataset: Chatbot conversations evaluated across 16 safety dimensions
  • D3CODE: Cross-cultural offensiveness data from 21 countries
  • Jobs dataset: Tweets labeled across multiple job-related perspectives

They also tested messy, real-world scenarios—like skewed data (e.g., mostly spam) and multiple label categories.

Key Findings: Rethinking Old Assumptions

1. 3–5 Raters Isn’t Enough

The industry standard of using a handful of raters per item falls short.

It fails to capture:

  • The diversity of human opinion
  • The uncertainty in subjective tasks

In many cases, 10+ raters per item are needed for reliable results.
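
One intuition for this (basic sampling statistics, not a figure from the study): an item's observed share of, say, "toxic" votes is only pinned down to about sqrt(p(1-p)/K), so a handful of raters leaves plenty of room for the majority label to flip on contested items.

```python
import math

def rating_standard_error(p: float, k_raters: int) -> float:
    """Standard error of an item's observed positive-rating share, if each
    rater independently labels it positive with probability p."""
    return math.sqrt(p * (1 - p) / k_raters)

# A genuinely contested item (40% of raters would call it toxic):
for k in (3, 5, 10, 20, 50):
    print(f"K={k:2d} raters -> +/- {rating_standard_error(0.4, k):.2f}")
# With K=3 the estimate easily swings across the 0.5 majority threshold;
# only around K=10-20 does it start to stabilize.
```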

2. Your Goal Determines Your Strategy

There’s no universal “best” setup—it depends on what you’re measuring.

  • If you want accuracy (majority vote):
    Go broad. Adding more items matters more than adding more raters per item.
  • If you want nuance (the range of opinions):
    Go deep. More raters per item are essential to capture disagreement (see the sketch after this list).

In short:

  • Breadth finds the average
  • Depth reveals the complexity
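
As a small illustration with made-up labels, the same ten ratings answer the two questions very differently: a majority vote gives you one clean label, while the label distribution (and its entropy) shows how contested the item really is.

```python
from collections import Counter
import math

def majority_vote(labels):
    """Breadth-style summary: the single most common label."""
    return Counter(labels).most_common(1)[0][0]

def label_distribution(labels):
    """Depth-style summary: share of raters behind each label."""
    counts = Counter(labels)
    return {label: count / len(labels) for label, count in counts.items()}

def entropy(distribution):
    """How contested the item is (for two labels: 0 = unanimity, 1 = coin flip)."""
    return -sum(p * math.log2(p) for p in distribution.values() if p > 0)

ratings = ["toxic", "toxic", "not_toxic", "toxic", "not_toxic",
           "toxic", "not_toxic", "toxic", "not_toxic", "toxic"]

print(majority_vote(ratings))                          # 'toxic'
print(label_distribution(ratings))                     # {'toxic': 0.6, 'not_toxic': 0.4}
print(round(entropy(label_distribution(ratings)), 2))  # 0.97 -> highly contested
```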

3. You Don’t Need a Huge Budget

Here’s the good news:

You can achieve strong reproducibility with around 1,000 total annotations—if you choose the right balance.

But if you get that balance wrong, even a larger budget won’t save you from unreliable results.

Why This Matters for the Future of AI

For years, AI evaluation has relied on a flawed assumption:

Every question has one correct answer.

That might work for objective tasks—but it breaks down in subjective domains like:

  • Harmful content detection
  • Social behavior analysis
  • Ethical decision-making

In these areas, disagreement isn’t noise—it’s valuable signal.

A Shift in Mindset

Instead of forcing a single “truth,” better benchmarks should:

  • Capture multiple perspectives
  • Reflect real-world disagreement
  • Measure both consensus and variation

This means moving beyond the “forest” approach and embracing the depth of the “tree.”

Final Takeaway

There’s no one-size-fits-all answer to how many raters you need.

But one thing is clear:

If you ignore human disagreement, your benchmark isn’t truly reliable.

The future of AI evaluation lies in designing systems that balance cost, scale, and human nuance—turning disagreement from a problem into a powerful insight.