ElevenLabs unveils Scribe v2, claims record-low speech-to-text error rate

ElevenLabs' new speech-to-text model, Scribe v2, is billed as making very few mistakes, even with difficult accents and noisy backgrounds. The upgrade transcribes more than 90 languages and is cheaper than before, costing less than $1 per hour of audio. It also adds features such as keyterm prompting, speaker labeling, and detection of non-speech sounds like laughter or applause. Analysts say this could let teams produce subtitles and meeting notes with very little manual correction, which may prompt developers and large companies to switch in order to save time and money.

ElevenLabs has launched Scribe v2, a speech-to-text model engineered for a record-low word error rate across more than 90 languages. The company is positioning this upgrade as the industry's most accurate transcription tool, capable of handling noisy environments and diverse accents with unprecedented precision.

If the company's claims hold, the update marks a significant advance in automated transcription, pushing performance toward single-digit error rates for global accents and challenging audio formats. That would strengthen the case for transcription as a reliable tool in professional, enterprise-grade applications.

Accuracy Claims and Industry Benchmarks

ElevenLabs says Scribe v2 is meant to set a new industry standard for transcription accuracy, with a word error rate (WER) significantly lower than previous versions, including on difficult audio and across multiple languages.

The company has not yet released a detailed numeric breakdown, but it reports that Scribe v2 significantly improves stability on long-form audio and in noisy conditions. According to internal benchmarks, it surpasses Scribe v1, which achieved 96.7% English accuracy (a 3.3% WER), by further reducing errors on accented and multi-speaker audio.
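
For readers unfamiliar with the metric, the relationship between "accuracy" and word error rate quoted above can be sketched as follows. This is a minimal illustration of the standard word-level edit-distance definition of WER, not ElevenLabs' benchmark code; the function names are made up for this example, and the text normalization (lowercasing and whitespace splitting) is a simplifying assumption.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / words in reference."""
    # Assumption: simple normalization; real benchmarks also handle punctuation.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


def accuracy_from_wer(wer: float) -> float:
    """The article's convention: accuracy is simply 1 - WER."""
    return 1.0 - wer


if __name__ == "__main__":
    wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
    print(f"WER: {wer:.3f}, accuracy: {accuracy_from_wer(wer):.3f}")
```

Under this convention a 3.3% WER corresponds to 96.7% accuracy. Because insertions count as errors, WER can exceed 100% on very poor transcripts, so the "accuracy" reading is only meaningful at low error rates.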

Workflow Integrations and Pricing

Scribe v2 is available both as a batch transcription API for processing large media libraries and as a low-latency Scribe v2 Realtime model for live applications. The realtime variant is already integrated into ElevenLabs Agents, the company's conversational AI platform, according to an update on 6 January 2025 (https://elevenlabs.io/blog/scribe-v2-realtime-in-elevenlabs-agents).

Pricing remains competitive. Public entry rates for batch jobs are listed at around $0.40 per audio hour, while business-tier API usage can fall to between $0.04 and $0.07 per hour at scale. The Scribe v2 Realtime model starts at $0.48 per hour, decreasing to $0.28 per hour on annual business plans.
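
To make the per-hour rates concrete, here is a small sketch that estimates the bill for a given volume of audio. The rates are the ones quoted in this article; the tier labels and the `estimate_cost` helper are hypothetical names chosen for illustration, and actual invoices may depend on plan details not covered here.

```python
# USD per audio hour, as quoted in this article. Tier names are illustrative
# labels for this sketch, not official plan names.
RATES_PER_HOUR = {
    "batch_entry": 0.40,
    "batch_business_low": 0.04,
    "batch_business_high": 0.07,
    "realtime_entry": 0.48,
    "realtime_annual_business": 0.28,
}


def estimate_cost(audio_hours: float, tier: str) -> float:
    """Return the estimated cost in USD, rounded to cents."""
    if audio_hours < 0:
        raise ValueError("audio_hours must be non-negative")
    if tier not in RATES_PER_HOUR:
        raise KeyError(f"unknown tier: {tier}")
    return round(audio_hours * RATES_PER_HOUR[tier], 2)


if __name__ == "__main__":
    for tier in RATES_PER_HOUR:
        print(f"{tier}: ${estimate_cost(1_000, tier):,.2f} per 1,000 hours")
```

At these rates, 1,000 hours of batch audio would cost about $400 at the entry price, and the same volume of realtime audio about $280 on an annual business plan.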

Key enterprise capabilities include:
- Keyterm Prompting: Improves niche accuracy by recognizing up to 100 domain-specific words.
- Entity Detection: Identifies and flags content across 56 categories, including PII and medical terms.
- Speaker and Audio Tagging: Automatically labels different speakers and captures non-speech events like laughter or applause.
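
As an illustration of how speaker labels and audio tags might be consumed downstream, the sketch below turns a word-level transcript into speaker-labelled lines. The input shape (a list of dictionaries with `text`, `type`, and `speaker_id` keys) is an assumption made for this example and is not a confirmed Scribe v2 response format; `format_transcript` is a hypothetical helper.

```python
from itertools import groupby


def format_transcript(words: list[dict]) -> list[str]:
    """Group consecutive words by speaker; wrap audio events in brackets."""
    lines = []
    # groupby only merges *consecutive* items, which preserves turn order.
    for speaker, turn in groupby(words, key=lambda w: w["speaker_id"]):
        parts = []
        for word in turn:
            if word["type"] == "audio_event":
                parts.append(f"[{word['text']}]")  # e.g. [laughter]
            else:
                parts.append(word["text"])
        lines.append(f"{speaker}: {' '.join(parts)}")
    return lines


if __name__ == "__main__":
    # Assumed, hand-written sample data (not real API output).
    sample = [
        {"text": "Welcome", "type": "word", "speaker_id": "speaker_0"},
        {"text": "everyone", "type": "word", "speaker_id": "speaker_0"},
        {"text": "applause", "type": "audio_event", "speaker_id": "speaker_0"},
        {"text": "Thanks", "type": "word", "speaker_id": "speaker_1"},
    ]
    print("\n".join(format_transcript(sample)))
```

Keeping audio events inline, rather than dropping them, is one way to preserve context for subtitles and meeting notes.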

Market Impact and Outlook

Industry analysts note that AI transcription is now surpassing the "good enough" threshold for professional media and compliance workflows. With accuracy approaching 99% in ideal conditions, teams can automate subtitles, meeting minutes, and audio searches without extensive manual cleanup. This efficiency, combined with sub-$1-per-hour pricing, pressures traditional transcription services and shifts specialist roles toward quality assurance and accessibility oversight.

With Scribe v2, ElevenLabs is cementing its position in the multilingual AI arena while maintaining low costs for developers. Whether independent benchmarks validate its accuracy claims will determine how quickly industries like broadcasting, customer service, and media creation adapt their workflows.

What makes ElevenLabs Scribe v2 different from earlier models?

According to ElevenLabs, Scribe v2 delivers the lowest word error rate recorded on industry-standard benchmarks such as FLEURS and Common Voice, covering 90+ languages. It adds keyterm prompting for up to 100 domain terms, entity detection across 56 categories (PII, health, payments), and automatic multi-language transcription without manual language tags.

How low is the word error rate in real numbers?

ElevenLabs has so far published only relative claims: the "lowest WER recorded" on FLEURS and Common Voice. Third-party tests of Scribe v1 showed roughly 13% WER on clean English, higher than the 3.3% figure cited above; Scribe v2 is positioned to beat the sub-5% WER that leading 2025 models achieve under optimal conditions. Exact v2 figures have not been released.

What does Scribe v2 cost and how can teams access it?

- Batch transcription: from $0.40/hour; high-volume API users on Business plans pay as little as $0.04-$0.07/hour.
- Realtime streaming API: from $0.48/hour down to $0.28/hour on annual Business tiers.

All features are available via REST/WebSocket API, Python/Node SDKs, and inside ElevenLabs Studio for subtitles and captions.

Which enterprise features are included?

- Ultra-low latency: ~150 ms for realtime conversations.
- Security: enterprise-grade infrastructure with built-in PII redaction.
- Scalability: long-form files, speaker diarization, dynamic audio tagging, and 90+ language support out of the box.
- Integration options: direct API, embedded in contact-center Agents, or white-label through partner stacks.

How will advanced transcription affect jobs and workflows?

AI systems reaching ~99% accuracy are turning hour-long meetings into searchable text in minutes, which by some estimates saves organizations $200k+ annually at a volume of 2,400 hours per year. Roles shift from manual typing to AI content operations, conversation analysis, and localization strategy, with specialists orchestrating the models and checking their output.
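
The savings figure can be sanity-checked with simple arithmetic. The sketch below back-solves the per-hour cost of the manual process that the $200k estimate implies, assuming the AI side is billed at the $0.40/hour batch entry rate quoted earlier. The article does not state the manual baseline, so the result is an inference from its own numbers, and `implied_manual_rate` is a hypothetical helper.

```python
def implied_manual_rate(
    annual_savings: float,
    hours_per_year: float,
    ai_rate_per_hour: float,
) -> float:
    """Per-hour manual cost implied by a savings claim.

    savings = hours * manual_rate - hours * ai_rate
    => manual_rate = savings / hours + ai_rate
    """
    if hours_per_year <= 0:
        raise ValueError("hours_per_year must be positive")
    return annual_savings / hours_per_year + ai_rate_per_hour


if __name__ == "__main__":
    rate = implied_manual_rate(200_000, 2_400, 0.40)
    ai_cost = 2_400 * 0.40
    print(f"Implied manual cost: ${rate:.2f} per audio hour")
    print(f"Annual AI cost at $0.40/hour: ${ai_cost:,.2f}")
```

Under these assumptions the claim implies a manual cost of roughly $84 per audio hour, against about $960 per year for automated transcription of the same 2,400 hours.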