
Self-play changed AI forever — most people missed it

There’s something that bugs me. Everybody talks about ChatGPT. Everybody has heard of AlphaGo. But the training method that is probably even bigger than either of them — self-play — is almost invisible in the public conversation. It’s like knowing about the iPhone but not knowing about touchscreens.

Self-play is the way you train an AI system by having it compete against copies of itself over and over again, millions of times. It figures out strategies that no human ever thought of. It doesn’t require labeled data. It doesn’t require expert demonstrations. It doesn’t require hand-holding. It simply requires an objective function and lots of compute.
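
To make that concrete, here is a toy-sized sketch of the loop in Python. The game (a tiny all-pay bidding duel), the numbers and the update rule are all invented for this post, not taken from any production system: the agent repeatedly plays a frozen snapshot of its previous self and nudges its strategy toward whatever scored well against that copy.

import random

# Toy example for illustration only: an all-pay bidding duel, not any real training system.
BIDS = range(6)   # possible bids: 0 through 5
PRIZE = 6         # value of winning the duel

def payoff(my_bid, opp_bid):
    # both players pay their bid; the higher bid takes the prize, ties split it
    win = PRIZE if my_bid > opp_bid else PRIZE / 2 if my_bid == opp_bid else 0
    return win - my_bid

def sample(strategy):
    # draw a bid from an unnormalized mixed strategy
    return random.choices(list(BIDS), weights=strategy)[0]

def self_play(generations=200, games_per_generation=2000, lr=0.1):
    strategy = [1.0] * len(BIDS)              # start with a uniform strategy
    for _ in range(generations):
        frozen = strategy[:]                  # the opponent is a frozen copy of the previous self
        score = [0.0] * len(BIDS)
        count = [1e-9] * len(BIDS)
        for _ in range(games_per_generation):
            mine, theirs = sample(strategy), sample(frozen)
            score[mine] += payoff(mine, theirs)
            count[mine] += 1
        for b in BIDS:                        # shift weight toward bids that did well against the copy
            strategy[b] = max(1e-6, strategy[b] + lr * score[b] / count[b])
    total = sum(strategy)
    return [round(w / total, 3) for w in strategy]

print(self_play())   # a mixed strategy discovered with no labels and no demonstrations

No labels, no demonstrations: an objective function and sheer repetition do all the work.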

Now self-play is training robots to walk, teaching autonomous cars to handle aggressive drivers, helping chemists discover new compounds, and powering the adversarial testing systems that protect critical infrastructure. Let me walk through how we got here.

Learning from humans isn’t ideal

Traditionally, if you want to train an AI system, you show it many examples, each labeled by a human. The AI then goes through the examples and looks for patterns. This works great for classification — identifying images, recognizing speech, detecting spam. But there’s a hard ceiling. When you train on human examples, the AI inherits the biases, blind spots and limitations of the humans who provided the labels. It’s very hard for it to get meaningfully better than its teachers, because human behavior sets the effective upper bound.

Self-play eliminates that ceiling. In each iteration, the system trains against previous versions of itself. It wanders into areas of strategy space that no human would explore — and finds things that work.

AlphaGo’s famous “Move 37” against Lee Sedol in 2016 is the classic example. Commentators on the live broadcast thought it was a mistake. It wasn’t. It was a move born from millions of self-play games that no human player or coach would have suggested, and it proved decisive in that game.

When it clicked for me

In January 2017, I watched Carnegie Mellon’s Libratus compete against four top professional poker players over 120,000 hands of heads-up no-limit hold’em. Libratus finished up $1.77 million in chips. All four pros finished in the red.

What really caught my attention wasn’t the result. It was what the researchers said afterward. Libratus developed a strategy that diverged dramatically from anything the human experts would have recommended. The AI found approaches that looked wrong to trained human eyes but were demonstrably superior.

This is precisely the point. Self-play doesn’t find human-optimal strategies. It finds actual-optimal strategies. Those two things are not the same.

Where self-play lives now

Since 2017, I’ve watched self-play move rapidly beyond board games and poker into real applications.

1. Autonomous vehicles

Waymo and many others use self-play to train AI that handles worst-case driving scenarios. One network acts as the “adversary driver” — cutting cars off, braking suddenly, running red lights. Another network learns to respond. Through self-play, both become progressively more sophisticated, creating robustness that scripted tests cannot achieve, because scripts cannot surprise themselves.
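
The shape of that pattern fits in a few lines. The sketch below is illustrative only, not Waymo’s stack: the aggression levels, the crash rule and the penalty weights are made up, and the update rule is the same nudge-toward-what-worked loop as the earlier bidding toy. What matters is that there are two different roles, each adapting to the other’s current behavior.

import random

# Toy example for illustration only: an invented cut-in scenario, not a real driving simulator.
LEVELS = range(5)   # 0 = mild behavior, 4 = extreme behavior

def outcome(aggression, margin):
    # made-up crash rule: a crash happens when the responder's margin doesn't cover the cut-in
    crash = margin <= aggression
    adversary_reward = (1.0 if crash else 0.0) - 0.2 * aggression   # find failures, but stay plausible
    responder_reward = (-1.0 if crash else 0.0) - 0.1 * margin      # avoid crashes without crawling
    return adversary_reward, responder_reward

def pick(strategy):
    return random.choices(list(LEVELS), weights=strategy)[0]

def co_train(rounds=300, episodes=2000, lr=0.05):
    adversary = [1.0] * len(LEVELS)   # mixed strategy over how aggressively to cut in
    responder = [1.0] * len(LEVELS)   # mixed strategy over how much safety margin to keep
    for _ in range(rounds):
        adv_score, adv_count = [0.0] * len(LEVELS), [1e-9] * len(LEVELS)
        res_score, res_count = [0.0] * len(LEVELS), [1e-9] * len(LEVELS)
        for _ in range(episodes):
            a, m = pick(adversary), pick(responder)
            ra, rm = outcome(a, m)
            adv_score[a] += ra
            adv_count[a] += 1
            res_score[m] += rm
            res_count[m] += 1
        for i in LEVELS:
            # each side shifts toward whatever worked against the other's current play
            adversary[i] = max(1e-6, adversary[i] + lr * adv_score[i] / adv_count[i])
            responder[i] = max(1e-6, responder[i] + lr * res_score[i] / res_count[i])
    return adversary, responder

print(co_train())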

2. Drug discovery

Generative chemistry systems run self-play between a molecule generator and an evaluator. The generator proposes candidates, the evaluator scores them, both improve through competition. One pharma company I spoke with said their platform evaluates approximately 10 million molecular candidates every week. Prior to self-play, they managed about 50,000.

3. Cybersecurity red-teaming

An “attacker” model tries to penetrate defenses. A “defender” model tries to stop it. Both evolve through interaction. Organizations using these systems report attack patterns that would likely slip past human penetration testers working from known threat vectors. Security researchers call them “alien tactics” — legitimate attacks, but radically different from anything in existing threat databases.

4. Financial trading

Quant firms use self-play in simulated markets. The AI learns that other participants react to its orders and adjusts accordingly. A researcher at one firm told me that conventional quantitative trading is akin to playing chess against a random number generator, whereas self-play-trained trading is akin to playing chess against Magnus Carlsen. Different game entirely.

5. Negotiation

Buyer and seller agents negotiate against each other, both improving simultaneously. Self-play-trained negotiation systems outperform rule-based approaches by large margins — and often beat skilled human negotiators — primarily because they avoid developing systematic patterns that are easy to exploit.

Ten years of lessons

Ten years of research have clarified what works and what doesn’t:

1. Diverse opponents beat a single strong opponent

Early self-play systems suffered from “strategy collapse” — both sides converged on a narrow approach that failed against anything different. Modern approaches maintain populations of diverse agents, training against hundreds of different opponents rather than a single, increasingly strong one.
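
Here is the difference in sketch form, reusing payoff() and sample() from the bidding toy near the top of the post. Instead of always facing its latest snapshot, the learner draws each opponent from a pool of past selves. Real population-based training curates deliberately diverse agents; keeping recent snapshots is only the simplest stand-in for that idea.

import random

# Toy example: reuses BIDS, payoff() and sample() from the earlier bidding-duel sketch.
def population_self_play(generations=200, games_per_generation=2000, lr=0.1, pool_size=10):
    strategy = [1.0] * len(BIDS)
    pool = [strategy[:]]                        # a population of past selves
    for _ in range(generations):
        score = [0.0] * len(BIDS)
        count = [1e-9] * len(BIDS)
        for _ in range(games_per_generation):
            opponent = random.choice(pool)      # sample a diverse opponent, not just the newest copy
            mine, theirs = sample(strategy), sample(opponent)
            score[mine] += payoff(mine, theirs)
            count[mine] += 1
        for b in BIDS:
            strategy[b] = max(1e-6, strategy[b] + lr * score[b] / count[b])
        pool.append(strategy[:])
        pool = pool[-pool_size:]                # cap the pool size
    total = sum(strategy)
    return [round(w / total, 3) for w in strategy]

print(population_self_play())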

2. Regret minimization beats reward maximization

Maximizing reward creates predictable patterns. Your opponent recognizes what you do and adapts. Regret minimization — asking “how much would I regret each alternative?” rather than “what maximizes my expected gain?” — produces strategies resistant to exploitation. This concept came from Counterfactual Regret Minimization research and is now standard practice.
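
Here is a toy version of the idea, in the spirit of regret matching (the building block underneath CFR) rather than the full algorithm. Two copies of one agent play rock-paper-scissors; each tracks how much it regrets not having played each alternative, and each plays in proportion to its accumulated positive regret instead of chasing the single highest-reward action.

import random

# Toy example: regret matching in self-play on rock-paper-scissors, far simpler than Libratus-class systems.
ACTIONS = 3                                      # rock, paper, scissors
PAYOFF = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]    # row = my action, column = opponent's action

def strategy_from_regret(regret):
    positive = [max(r, 0.0) for r in regret]
    total = sum(positive)
    return [p / total for p in positive] if total > 0 else [1.0 / ACTIONS] * ACTIONS

def regret_matching_self_play(iterations=100_000):
    regret = [[0.0] * ACTIONS for _ in range(2)]
    strategy_sum = [[0.0] * ACTIONS for _ in range(2)]
    for _ in range(iterations):
        strategies = [strategy_from_regret(regret[p]) for p in range(2)]
        moves = [random.choices(range(ACTIONS), weights=s)[0] for s in strategies]
        for p in range(2):
            mine, theirs = moves[p], moves[1 - p]
            actual = PAYOFF[mine][theirs]
            for a in range(ACTIONS):
                # "how much would I regret not having played a instead?"
                regret[p][a] += PAYOFF[a][theirs] - actual
                strategy_sum[p][a] += strategies[p][a]
    # the average strategy, not the last one, is what approaches the unexploitable mix
    return [[round(s / iterations, 3) for s in strategy_sum[p]] for p in range(2)]

print(regret_matching_self_play())   # both players drift toward roughly [1/3, 1/3, 1/3]

The average strategy is the output that matters: for rock-paper-scissors it converges toward the one-third mix that no opponent can exploit.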

3. Real-time adaptation beats static deployment

The best systems don’t train offline and deploy a fixed strategy. Like Pluribus (the 2019 poker bot from Carnegie Mellon and Facebook AI Research), they solve subproblems during execution and adapt to the specific conditions of the moment. Pre-computing a response for every possible input is computationally intractable in complex domains. Solving the immediate subproblem in real time is scalable and produces better outcomes.
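
A deliberately over-simplified sketch of the contrast, reusing BIDS and payoff() from the first toy; real subgame re-solving, as in Pluribus, is far more sophisticated than this. Instead of shipping a precomputed response table, the agent solves only the decision in front of it, against an estimate of how the opponent has actually been behaving recently.

from collections import Counter

# Toy example: reuses BIDS and payoff() from the earlier bidding-duel sketch.
def act_online(recent_opponent_bids):
    # estimate the opponent's current behavior from a sliding window of observations
    counts = Counter(recent_opponent_bids)
    total = len(recent_opponent_bids)
    belief = {bid: counts.get(bid, 0) / total for bid in BIDS}

    # solve just this decision: pick the bid with the best expected payoff right now
    def expected_value(my_bid):
        return sum(prob * payoff(my_bid, opp_bid) for opp_bid, prob in belief.items())

    return max(BIDS, key=expected_value)

# an opponent who has been bidding low lately invites a modest, not maximal, bid
print(act_online([0, 1, 1, 0, 2, 1]))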

Honest limitations

Self-play requires a clear objective function — the system must know what “winning” means. It requires a simulatable environment to iterate cheaply. It needs sufficient compute, which is still expensive for complex problems. It struggles with open-ended tasks where success isn’t easily measured, and with environments that change faster than the training loop.

It can also produce solutions that are optimal but incomprehensible — effective, yet nearly impossible for humans to analyze or verify.

However, within its domain — strategic decision-making under uncertainty with adaptive adversaries — nothing comes close.

My prediction

Self-play will be to the 2030s what deep learning was to the 2010s — the default training paradigm for any AI system operating in adversarial environments. We’ll look back and wonder why we spent so long trying to teach AI from human examples when we could have just let it teach itself.

The technique that started in university game theory labs is becoming foundational infrastructure. And we’re still in the early chapters.
