CallScribe Model Card & WER Methodology
How we measure accuracy, which dataset we benchmark against, and where the model has known weaknesses.
Engine
CallScribe's transcription backend is OpenAI Whisper large-v3-turbo, hosted in our own infrastructure and served by a Python worker with GPU inference. Speaker diarization runs on pyannote.audio 3.x pretrained pipelines. Sentiment analysis runs on a multilingual transformer fine-tuned on Gulf Arabic call transcripts.
Benchmark Dataset
The reported word error rates below come from an internal evaluation dataset of over 200 GCC call recordings collected in March 2026, with written consent from the participants. The dataset includes:
- Roughly 60% Khaleeji (Emirati, Saudi, Kuwaiti, Qatari, Bahraini, Omani).
- Roughly 20% Levantine (Lebanese, Syrian, Jordanian, Palestinian).
- Roughly 10% Egyptian.
- Roughly 10% MSA (news-style and formal customer service).
- Audio SNR from 8 to 30 dB, with most recordings at 15 dB or above.
- Call durations between 45 seconds and 32 minutes, median 6 minutes.
- Mixed speaker counts (2 to 5), with diarization ground truth hand-annotated.
This methodology is a placeholder. It will be refined with a published benchmark and a released evaluation harness in a future update.
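The SNR figures in the dataset description are in decibels. As a rough illustration only, the sketch below computes an SNR value from a speech region and a noise-only region of a recording using the standard power-ratio formula. How CallScribe actually estimates SNR is not specified here; the function name `snr_db` and the idea of hand-picked speech and noise regions are assumptions made for this example.

```python
import math


def mean_power(samples):
    """Mean squared amplitude of a sequence of audio samples."""
    if not samples:
        raise ValueError("samples must not be empty")
    return sum(x * x for x in samples) / len(samples)


def snr_db(speech_samples, noise_samples):
    """Hypothetical helper: SNR in dB as 10 * log10(P_speech / P_noise).

    Assumes the caller has already separated a speech region from a
    noise-only region. The speech region still contains noise, so this
    slightly overestimates the true SNR.
    """
    p_speech = mean_power(speech_samples)
    p_noise = mean_power(noise_samples)
    if p_noise == 0:
        raise ValueError("noise power is zero; SNR is undefined")
    return 10.0 * math.log10(p_speech / p_noise)
```

A speech region with ten times the amplitude of the noise region has one hundred times the power, which corresponds to about 20 dB.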
Word Error Rate by Dialect
| Dialect | WER (clear audio) | WER (noisy, SNR < 15 dB) | Notes |
|---|---|---|---|
| Khaleeji (Gulf) | 8–12% | 18–25% | Primary tuning target |
| Levantine | 12–16% | 22–30% | Lebanese best-covered variant |
| Egyptian | 10–14% | 20–27% | Cairo urban accent |
| MSA | 6–9% | 15–20% | News and formal speech |
| Maghrebi | Not supported | Not supported | Darija on roadmap |
WER values are placeholder ranges from internal evaluation. A final single-point WER per dialect will be published alongside the open eval harness.
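For readers unfamiliar with the metric, word error rate is the word-level edit distance (substitutions, deletions, and insertions) between the reference and the machine transcript, divided by the number of reference words. The sketch below is a minimal illustration of that standard definition, pooled per dialect. It is not the CallScribe evaluation harness: the function names are hypothetical, and it assumes whitespace tokenization with no Arabic text normalization (diacritics, alef variants), which a real evaluation would need to define.

```python
from collections import defaultdict


def word_errors(reference: str, hypothesis: str) -> tuple[int, int]:
    """Return (word-level edit distance, number of reference words)."""
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] holds the distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1], len(ref)


def wer_by_group(samples):
    """Pooled WER per group.

    `samples` is an iterable of (group, reference, hypothesis) tuples,
    where group could be a dialect label. Errors and reference words are
    summed per group before dividing, so long calls weigh more than
    short ones.
    """
    errors = defaultdict(int)
    words = defaultdict(int)
    for group, reference, hypothesis in samples:
        e, n = word_errors(reference, hypothesis)
        errors[group] += e
        words[group] += n
    return {g: errors[g] / words[g] for g in words if words[g] > 0}
```

Pooling errors across recordings, rather than averaging per-recording WER, is one common choice. The two can differ noticeably when call lengths vary as widely as they do in this dataset.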
Known Limitations
- Heavy overlapping speech (more than 30% of segments overlapping) causes diarization drift.
- At SNR below 10 dB, transcripts should be treated as drafts.
- Long monologues with mumbling or background music bleed degrade accuracy.
- Code-switching with languages outside Arabic/English (e.g. Hindi or Urdu mid-sentence) is handled, but with a 5–10% accuracy penalty on the non-Arabic segment.
- Maghrebi dialects (Moroccan Darija, Algerian, Tunisian) are not officially supported.
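Two of the limitations above have numeric thresholds (SNR below 10 dB, more than 30% of segments overlapping), so a downstream consumer could flag affected transcripts for human review. The sketch below is one possible way to do that. It is not part of CallScribe: the `Segment` type, the flag names, and the interpretation of "overlapping" as a time overlap with a segment from a different speaker are all assumptions made for this example.

```python
from dataclasses import dataclass

# Thresholds taken from the limitations listed above.
SNR_DRAFT_THRESHOLD_DB = 10.0
OVERLAP_SHARE_THRESHOLD = 0.30


@dataclass(frozen=True)
class Segment:
    """Hypothetical diarization segment: one speaker turn, times in seconds."""
    speaker: str
    start: float
    end: float


def overlapping_share(segments):
    """Fraction of segments that overlap in time with at least one
    segment from a different speaker. Touching endpoints do not count."""
    if not segments:
        return 0.0
    overlapping = 0
    for i, a in enumerate(segments):
        for j, b in enumerate(segments):
            if (i != j and a.speaker != b.speaker
                    and a.start < b.end and b.start < a.end):
                overlapping += 1
                break
    return overlapping / len(segments)


def review_flags(snr_db, segments):
    """Return a list of hypothetical review flags for one recording."""
    flags = []
    if snr_db < SNR_DRAFT_THRESHOLD_DB:
        flags.append("draft_low_snr")
    if overlapping_share(segments) > OVERLAP_SHARE_THRESHOLD:
        flags.append("diarization_may_drift")
    return flags
```

The pairwise comparison is quadratic in the number of segments, which is acceptable for a single call but would need a sweep over sorted segments for very long recordings.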
Responsible Use
Machine transcripts must not be treated as legally authoritative without human review. CallScribe is a productivity tool, not a court-admissible record. Customers in regulated industries (healthcare, legal, financial services) must apply their own review and compliance process before acting on transcribed content.