Factcheck: Expired
Correct when formed: through early 2025, Multilingual v2 plus the Flash/Turbo line had no rival on the combination of naturalness, cloning fidelity, and language breadth. MiniMax ended that in May 2025. Speech-02-HD became the first model to pass both ElevenLabs and OpenAI on the Artificial Analysis Speech Arena, Elo ~1164 against 1116 for Multilingual v2, a result MiniMax claims directly in its arXiv paper (2505.07916). Inworld took the top slot in November 2025, and the June 2026 standings put Eleven v3 fourth.
The business and the benchmark diverged. ElevenLabs closed 2025 somewhere between $330M and $350M ARR (CNBC and Sacra disagree on the figure) and raised $500M at $11B in February 2026 with Sequoia leading, money aimed at expansion into video via the LTX partnership and into agents. The quality moat is gone; the product moat of voice library, dubbing, agents, and enterprise compliance is intact.
One caveat on the instrument: arena Elo measures blind preference on short prompts and underweights cloning fidelity, long-form stability, and multilingual depth, dimensions where ElevenLabs plausibly still leads. Even granting that, fourth place is not peerless.
| # | model | lab | elo |
|---|---|---|---|
| 1 | Realtime TTS 1.5 Max | Inworld | 1206 |
| 2 | Gemini 3.1 Flash TTS | Google | 1205 |
| 3 | StepAudio 2.5 | StepFun | 1188 |
| 4 | Eleven v3 | ElevenLabs | 1180 |
| 5 | Speech 2.8 HD | MiniMax | 1163 |
For scale: Multilingual v2 held the field uncontested at 1116 in January 2025. The strongest open model today is Fish Audio S2 at ~1128.
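To read the Elo gaps above as head-to-head preference odds, here is a minimal sketch. It assumes the arena uses the conventional logistic Elo curve with a 400-point scale, which the text does not confirm; `expected_score` is an illustrative name, not anything published by Artificial Analysis.

```python
def expected_score(elo_a: float, elo_b: float) -> float:
    """Probability that model A is preferred over model B in one blind vote.

    Assumption: conventional logistic Elo with a 400-point scale.
    """
    return 1.0 / (1.0 + 10 ** ((elo_b - elo_a) / 400.0))


# Ratings quoted in the text above.
speech_02_hd, multilingual_v2 = 1164, 1116
inworld_top, eleven_v3 = 1206, 1180

print(f"Speech-02-HD vs Multilingual v2: {expected_score(speech_02_hd, multilingual_v2):.1%}")
print(f"Realtime TTS 1.5 Max vs Eleven v3: {expected_score(inworld_top, eleven_v3):.1%}")
```

Under that assumption a 48-point gap is roughly a 57/43 preference split and a 26-point gap roughly 54/46: enough to lose the top slot, not a rout.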
Factcheck: Confirmed
The bounty was real and it paid out. Fish Audio S2 is the strongest open model at ~1128, Kokoro's 82M parameters run on consumer hardware, and Resemble's Chatterbox beat ElevenLabs in blind preference tests at a vendor-reported 63.75% win rate. The perceptual gap to ElevenLabs mostly closed; open weights now rival the old king on naturalness, and self-hosting them is production-credible.
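For a sense of what a 63.75% win rate means on a rating scale, the sketch below inverts the conventional logistic Elo curve. Two assumptions, both loud: the 400-point scale, and that ties are excluded from the reported rate. The result is a translation of a vendor-reported number from a separate test, so it is not comparable to the arena ratings; `implied_elo_gap` is an illustrative name.

```python
import math


def implied_elo_gap(win_rate: float) -> float:
    """Elo-point gap implied by a head-to-head win rate.

    Assumption: conventional logistic Elo with a 400-point scale, ties excluded.
    """
    if not 0.0 < win_rate < 1.0:
        raise ValueError("win_rate must be strictly between 0 and 1")
    return 400.0 * math.log10(win_rate / (1.0 - win_rate))


# Vendor-reported Chatterbox win rate quoted in the text above.
chatterbox_gap = implied_elo_gap(0.6375)
print(f"Implied gap: about {chatterbox_gap:.0f} Elo points")
```

On those assumptions the figure corresponds to a gap of roughly 98 points, which is why a vendor-run preference test and an independent arena should not be read on the same axis.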
Factcheck: Split
Half right, for an unexpected reason. The gap to ElevenLabs did mostly close, so the hedge was too pessimistic on perception. What stayed out of reach was the frontier: every model that actually passed ElevenLabs is closed, namely MiniMax, Inworld, Google, and StepFun. Open weights caught the old king and missed the new leaders, who now split between US and Chinese labs. ByteDance's Seed/Doubao line and Alibaba's CosyVoice and Qwen audio stack fill out the Chinese cluster, with Alibaba's Fun-Realtime posting the lowest ASR word error rate at 1.8%.
And the evidence behind the hedge was the wrong instrument entirely. That's the next highlight.
Factcheck: True, wrong instrument
The ear-test is right and the inference is wrong. Veo 3 (Google) and Sora 2 (OpenAI) generate audio jointly with video in a single pass, optimizing audio-visual synchronization, lip-sync, and ambient plausibility rather than peak vocal fidelity. Four mechanisms cap the speech quality.
The slop is therefore a property of audio as a byproduct of video, not of multimodal models in general. The proof is Gemini 3.1 Flash TTS sitting second on the arena: when a multimodal lab makes audio the product instead of the side effect, it matches the specialists. Google's own creator guidance treats Veo dialogue as a draft layer to replace with dedicated TTS, which is also the right way to read Sora 2's audio. The realtime axis has the same shape.
A cascaded ASR-LLM-TTS pipeline stacks best-in-class parts but discards prosody at every text boundary and runs near 510ms best-case (Cartesia, State of Voice) against ~230ms human turn-taking. Native speech-to-speech, meaning Moshi's dual-stream design, GPT-Realtime-2 with its 128K context, or Gemini native audio, collapses the stack and keeps the paralinguistics at the cost of control precision. Convergence as of mid-2026 is hybrid: omni models for conversation, understanding, and synchronized media; specialists wherever vocal fidelity is the product, as in audiobooks, dubbing, and branded voices.
Factcheck: Confirmed
Right posture, wrong mechanism. Google ran the conservative book: SynthID on every output, native audio red-teamed and gated before release, likeness held tighter than peers. In practice it is watermark-first conservatism across modalities rather than a discrete voice holdout, and Google still shipped TTS strong enough to sit second on the arena.
Factcheck: Confirmed
And now hardening into statute:
Payment processors and app stores enforce beneath the statutes, and most of the enforcement weight lands on the watermark regime.
SynthID went from 10B watermarked pieces at the May 2025 Detector launch to over 100B by May 2026, with Nvidia, Kakao, and ElevenLabs adopting it. The convergence date is May 19, 2026: OpenAI joined the C2PA steering committee and committed to embedding SynthID alongside its content credentials, and Google added native C2PA/SynthID verification to Search and Chrome.
The regime is real and leaky. C2PA metadata strips on re-encode, invisible watermarks degrade under hostile editing, and a February 2025 paper ("On the Difficulty of Constructing a Robust and Publicly-Detectable Watermark") argues no scheme achieves unforgeability, robustness, and public detectability simultaneously. The dual-layer bet is that each scheme covers the other's failure mode. The sane design target is audit trail and deterrence rather than proof.
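The dual-layer logic can be sketched as a small decision table. Everything here is hypothetical: the two booleans stand in for whatever a C2PA manifest validator and a SynthID-style detector would return, and none of the names correspond to a real library. The point it illustrates is the design target above: the output is an audit-trail label, and the absence of both signals proves nothing.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    BOTH_LAYERS = "manifest and watermark agree"
    MANIFEST_ONLY = "manifest present, watermark missing or degraded"
    WATERMARK_ONLY = "watermark survived, manifest stripped"
    NO_SIGNAL = "no provenance signal (not evidence of human origin)"


@dataclass(frozen=True)
class ProvenanceSignals:
    # Hypothetical inputs: results of two independent checks run upstream.
    manifest_valid: bool       # e.g. signed metadata still attached and intact
    watermark_detected: bool   # e.g. invisible watermark detector fired


def classify(signals: ProvenanceSignals) -> Verdict:
    """Combine two leaky signals so each covers the other's failure mode."""
    if signals.manifest_valid and signals.watermark_detected:
        return Verdict.BOTH_LAYERS
    if signals.manifest_valid:
        return Verdict.MANIFEST_ONLY
    if signals.watermark_detected:
        # The re-encode case: metadata is gone but the watermark remains.
        return Verdict.WATERMARK_ONLY
    return Verdict.NO_SIGNAL


# A re-encoded file that lost its metadata still carries one signal.
print(classify(ProvenanceSignals(manifest_valid=False, watermark_detected=True)).value)
```

Note that `NO_SIGNAL` is deliberately not labeled "authentic": with both layers strippable under hostile editing, a clean result supports deterrence and logging, not proof.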
Factcheck: Confirmed
Grok Imagine's spicy mode shipped in August 2025 and produced uncensored topless video of real people immediately; The Verge's Jess Weatherbed reported it generated topless Taylor Swift on her first use, unprompted. Rolling Stone documented hardcore video by October 2025.
The January 2026 mass-undressing wave (Reuters: millions of sexualized images, some appearing to involve minors) brought EU DSA proceedings, an Ofcom investigation under the Online Safety Act, a California AG probe, and national bans in Indonesia and Malaysia. xAI restricted real-person undressing and put image editing behind the paywall but never shut spicy mode down. Musk's March 2026 standard: if an R-rated movie allows it, Grok Imagine allows it.
Factcheck: Falsified
Broken in August 2025, and xAI is the counterexample: actual nudity at launch, hardcore by October. The line held everywhere else, and OpenAI drew it exactly where the consensus assumed everyone would. Meta had its celebrity-chatbot embarrassments and shipped no frontier nudity; Anthropic stays out of media generation entirely. (What about the open source?)
There always was one, and it predates the frontier's whole debate. Open weights get uncensored the moment they ship: the Stable Diffusion lineage and its fine-tune culture made explicit generation a community default years before Grok, with raw demand doing the work that clout did for voice. The "nobody at the frontier" line was true only because frontier means the institutions, not the capability; below the institutions, content policy is a fork away. That is also why xAI crossing it mattered: it moved explicit generation from anonymous forks to a flagship product with a login page, which is the thing regulators can actually reach.
Altman announced erotica for verified adults on October 14, 2025 under the "treat adults like adults" framing, slipped the launch to Q1, then paused it indefinitely (FT via TechCrunch, March 26, 2026). The internal build, "Citron mode," reportedly could not cleanly scope out bestiality and incest references, and a January advisory session warned the company risked shipping a "sexy suicide coach." Even as announced it was text-only.
Sora 2's real moderation fight was likeness rather than sex: unauthorized Bryan Cranston, Robin Williams, and MLK generations within days of the September 30 launch, the King estate forcing a pause, then the October 20 joint statement with SAG-AFTRA and the move from opt-out to opt-in likeness.
Factcheck: Confirmed
The revealed lines, lab by lab: