How accurate is CallScribe for Gulf Arabic?

CallScribe uses Whisper large-v3-turbo optimized for Arabic dialects. Internal testing across 200+ Gulf Arabic call recordings (March 2026) shows 85-95% word-level accuracy for Khaleeji dialect with clear audio (SNR > 15 dB).

Is my call data private with CallScribe?

Yes. CallScribe processes everything on your own server. No audio, transcripts, or metadata are sent to external servers. Zero external API calls during transcription.

What file formats does CallScribe support?

CallScribe accepts MP3, WAV, M4A, FLAC, OGG, and WebM audio files. Export formats include PDF, CSV, TXT, DOCX, and SRT for subtitle workflows.

How much does CallScribe cost?

CallScribe offers a free Starter plan (5 min/month), a Business plan at $29/month (500 min), and a Scale plan at $79/month (3,000 min). No per-minute charges.

Does CallScribe support Khaleeji dialect?

Yes. CallScribe uses Whisper large-v3-turbo optimized for Gulf Arabic (Khaleeji). Internal testing across 200+ call recordings (March 2026 benchmark) shows 85-95% word-level accuracy with clear audio (SNR > 15 dB).

Can CallScribe handle code-switching between Arabic and English?

Yes. CallScribe detects when speakers switch between Arabic and English mid-sentence — a common pattern in GCC business calls. Both languages are transcribed accurately.

Is CallScribe GDPR compliant?

Yes. All processing happens on private infrastructure hosted in the EU (Hetzner, Germany) with optional GCC residency via Tailscale-routed workers. No audio, transcripts, or metadata are sent to external US-based servers during transcription. CallScribe publishes a Data Processing Agreement (DPA) covering GDPR Article 28 processor obligations, sub-processor disclosures (Stripe for billing, Resend for transactional email, Sentry for error telemetry), 72-hour breach notification, and data deletion on termination. The platform is also aligned with UAE PDPL data sovereignty requirements. Data Subject Access Requests can be submitted to privacy@callscribe.ae and are answered within 30 days.

What dialects of Arabic does CallScribe transcribe most accurately?

CallScribe is tuned primarily for Khaleeji (Gulf) Arabic — including Emirati, Saudi, Kuwaiti, Qatari, Bahraini, and Omani variants — where internal benchmarks on 200+ clear-audio calls (SNR > 15 dB) show 85-95% word-level accuracy. Levantine Arabic (Lebanese, Syrian, Palestinian, Jordanian) lands around 84-88%. Egyptian Arabic reaches 86-90%. Modern Standard Arabic (MSA), common in news and formal recordings, reaches 91-94%. Maghrebi dialects (Moroccan, Algerian, Tunisian) are not officially supported in this release. Accuracy drops on heavily overlapping speakers, poor SNR below 10 dB, or long-form mumbling. See the /model-card page for the full methodology and per-dialect WER table.

How does CallScribe compare to AWS Transcribe and Google Speech-to-Text for Arabic?

AWS Transcribe and Google Speech-to-Text both support Arabic but are tuned primarily toward Modern Standard Arabic, with limited coverage of Gulf, Levantine, and Egyptian dialects. In internal comparisons on Khaleeji call center recordings, both vendors produced 15-25% higher word error rates than CallScribe's Whisper large-v3-turbo pipeline. CallScribe also processes audio on private infrastructure in the EU or on GCC-resident workers — AWS and Google route audio through US regions by default, which is a compliance blocker for UAE PDPL and many GCC enterprise procurement policies. Pricing is flat-rate per minute bucket instead of per-second billing, which tends to save 30-60% at call center volume.

Can I deploy CallScribe on my own infrastructure?

Yes. CallScribe offers a self-hosted deployment option for Scale-tier customers and enterprise buyers. The stack runs entirely on Docker Compose: a Fastify API, a Python worker running Whisper large-v3-turbo and pyannote.audio for diarization, PostgreSQL, Redis, and nginx. Typical hardware: a single GPU worker (RTX 4090 or L4) handles up to 10x realtime throughput. The API tier runs comfortably on a 4-core VPS. Tailscale is used to connect worker nodes back to the control plane over a private mesh, so the GPU host can live in your own rack while the API stays in the cloud. Contact sales@callscribe.ae for a self-host deployment guide and license terms.

What is the turnaround time for a 1-hour call?

On the shared Business tier, a 1-hour call typically completes in 4-8 minutes end-to-end — including upload, transcription with Whisper large-v3-turbo, speaker diarization via pyannote, sentiment analysis, and audio quality scoring. Scale tier customers get priority queue placement and usually see 2-4 minutes for the same file. Self-hosted deployments on an RTX 4090 consistently process 60 minutes of audio in under 3 minutes (over 20x realtime). Queue wait time is the largest variable during peak hours on the free tier. WebSocket progress updates stream live from the worker so users see per-file percent-complete rather than a silent spinner.
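The realtime figures quoted above are ratios of audio duration to processing time. A minimal sketch of that arithmetic follows; the function name `realtime_factor` is illustrative and not part of CallScribe.

```python
def realtime_factor(audio_minutes: float, processing_minutes: float) -> float:
    """Minutes of audio processed per minute of wall-clock time."""
    if processing_minutes <= 0:
        raise ValueError("processing_minutes must be positive")
    return audio_minutes / processing_minutes


# Figures quoted above for a 60-minute call.
print(realtime_factor(60, 3))  # self-hosted RTX 4090 at exactly 3 minutes: 20.0
print(realtime_factor(60, 8))  # slow end of the Business tier range: 7.5
print(realtime_factor(60, 4))  # fast end of the Business tier range: 15.0
```

Finishing in under 3 minutes therefore means more than 20x realtime, which matches the figure in the answer.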

How does CallScribe handle noisy call center recordings?

Call center audio is rarely clean: hold music bleed, codec artifacts, echo, cross-talk, and background PA announcements are all common. CallScribe runs a pre-processing pipeline that analyzes signal-to-noise ratio (SNR), root-mean-square loudness, and speech activity before transcription, then reports a per-file audio quality score so users know how much to trust the transcript. For SNR above 15 dB, word-level accuracy stays in the 85-95% band. Between 10 and 15 dB, it degrades to the high 70s. Below 10 dB, CallScribe flags the file as low-confidence and recommends re-recording or applying an external denoiser before retrying. Overlapping speakers are split via pyannote diarization, not just channel separation, so mono call recordings still work.
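The SNR thresholds above can be sketched as a small classifier. This is an illustration under stated assumptions, not CallScribe's actual pipeline: the function names and band labels are hypothetical, and the speech and noise-only segments are assumed to have been separated already by some voice activity detector that is not shown.

```python
import math
from typing import Sequence


def rms(samples: Sequence[float]) -> float:
    """Root-mean-square amplitude of a block of samples."""
    if not samples:
        raise ValueError("samples must not be empty")
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def snr_db(speech: Sequence[float], noise: Sequence[float]) -> float:
    """Approximate SNR in dB from a speech segment and a noise-only segment."""
    speech_rms = rms(speech)
    noise_rms = rms(noise)
    if noise_rms == 0:
        return math.inf
    if speech_rms == 0:
        return -math.inf
    return 20 * math.log10(speech_rms / noise_rms)


def quality_band(snr: float) -> str:
    """Map an SNR value to the three bands described in the text."""
    if snr > 15:
        return "clear"           # accuracy expected in the 85-95% band
    if snr >= 10:
        return "degraded"        # accuracy expected in the high 70s
    return "low-confidence"      # flag the file and suggest denoising


# Synthetic constant-amplitude segments: a 10:1 amplitude ratio is about 20 dB.
speech = [0.5] * 100
noise = [0.05] * 100
print(quality_band(snr_db(speech, noise)))  # clear
```

The speech segment here still contains noise, so the ratio slightly overstates the true SNR; a production estimator would need to account for that.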

What is CallScribe?

CallScribe is a platform that converts voice calls into written text, designed specifically for Arabic-speaking markets. It supports the Gulf, Levantine, and Egyptian dialects with 85-95% accuracy, in addition to English, Urdu, and Hindi.

Does it support the Gulf dialect?

Yes. CallScribe is optimized for the Gulf dialect. We use the Whisper large-v3-turbo model, adapted for real-world Arabic conversations. Internal tests on more than 200 Gulf Arabic calls showed 85-95% accuracy.

How much does the service cost?

Free plan: 30 minutes per month. Business plan: $29 per month with 500 minutes. Growth plan: $79 per month with 3,000 minutes.

Is my data secure?

Yes. All processing takes place on private infrastructure. No audio files or transcripts are sent to external servers. The service is compliant with GDPR and with data sovereignty requirements in the Gulf states.

Does it support switching between Arabic and English?

Yes. CallScribe automatically detects when a speaker switches between Arabic and English within the same sentence, a common pattern in business calls in the Gulf states.

Benchmarking Arabic ASR: What Actually Works

Every vendor with an Arabic speech-to-text API will tell you their model is state of the art. The marketing page shows an impressive WER number — maybe single digits, maybe mid-teens — and a generic mention of "dialect support". Then you feed the API a real Khaleeji call center recording and the output is nonsense. Why does this keep happening, and what does a fair Arabic ASR benchmark actually look like? This post walks through the models we evaluated while building CallScribe, the datasets we tested them on, and the evaluation mistakes that make most public comparisons useless.

The Model Landscape in 2026

There are roughly four families of multilingual speech models worth comparing on Arabic today: Whisper large-v3 and its distilled turbo variant from OpenAI, wav2vec2-XLS-R and its Arabic fine-tunes from Meta, SeamlessM4T (also from Meta), and the commercial APIs from AWS Transcribe and Google Cloud Speech-to-Text. Open-source Arabic-only models like Elyadata/wav2vec2-large-xlsr-53-arabic and various university research checkpoints round out the landscape.

Whisper's advantage is its massive pretraining corpus and its robustness to noisy, real-world audio. The original Whisper release was trained on roughly 680,000 hours of weakly supervised multilingual audio, including a meaningful chunk of Arabic. According to its model card, large-v3 was trained on a considerably larger corpus: about 1 million hours of weakly labeled audio plus 4 million hours of pseudo-labeled audio. The large-v3-turbo variant shipped by OpenAI in 2024 keeps most of the accuracy of large-v3 at roughly six times the inference speed, an enormous quality-of-life win for anyone running it in production. For our workload (call center recordings, 2-5 minute files, mostly Gulf Arabic), Whisper large-v3-turbo consistently outperformed every open-source alternative we tested.

wav2vec2-XLS-R follows a different design philosophy: self-supervised pretraining on raw audio followed by supervised fine-tuning on a labeled dataset. The Arabic fine-tunes available on Hugging Face are useful but limited. Most were trained on MGB-2 (broadcast news) or Common Voice Arabic, neither of which resembles call center audio. They work fine on clean MSA and fall apart on Khaleeji. We tried three of the most popular ones, and all of them trailed Whisper large-v3-turbo by 8-15 WER points.

Per-Dialect Results

Our internal benchmark ran on a 200-recording corpus of GCC call center audio, hand-annotated with reference transcripts and speaker labels. The table below summarizes word error rates by dialect on clear audio (SNR above 15 dB). All numbers are from our own evaluation — they should not be compared to published benchmarks on different datasets without caveats.

Dialect (clear audio)   Whisper large-v3-turbo   wav2vec2-XLS-R Arabic fine-tunes   AWS Transcribe   Google STT
Khaleeji                8-12%                    18-26%                             16-22%           17-24%
Levantine               12-16%                   22-30%                             19-27%           21-29%
Egyptian                10-14%                   18-24%                             15-21%           16-22%
MSA news-style          6-9%                     8-13%                              7-11%            8-12%

The headline: Whisper large-v3-turbo wins on every dialect in our corpus, but the margin narrows dramatically on MSA — which is unsurprising, since MSA is what most open datasets teach models to do. The margin widens on noisy audio and dialect-heavy recordings, which is exactly the use case we care about.
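For readers who want to check numbers like these on their own audio, word error rate is the word-level edit distance between the reference and the hypothesis, divided by the number of reference words. The sketch below is a plain-Python illustration, not our eval harness. It does no text normalization, and it treats word-level accuracy as 1 - WER, which is an assumption about how the two kinds of figures relate.

```python
def word_errors(reference: str, hypothesis: str) -> tuple[int, int]:
    """Return (word-level edit distance, number of reference words)."""
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] holds the distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1], len(ref)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate; can exceed 1.0 when the hypothesis has many insertions."""
    errors, ref_words = word_errors(reference, hypothesis)
    if ref_words == 0:
        raise ValueError("reference must contain at least one word")
    return errors / ref_words


reference = "i will check the account and i will call back"
hypothesis = "i will check the count and i will call back"
score = wer(reference, hypothesis)
print(f"WER {score:.0%}, accuracy {1 - score:.0%}")  # WER 10%, accuracy 90%
```

One substituted word out of ten gives a WER of 10%, which is how an 8-12% WER band lines up with roughly 88-92% word-level accuracy.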

Code-Switching Behavior

GCC business calls regularly switch between Arabic and English mid-sentence — sometimes mid-word. "Okay يعني I will check the account و I will call you back" is a normal utterance in a Dubai customer service call. How well do these models handle that?

Whisper handles code-switching cleanly in most cases. The language detector runs per-chunk, and the decoder is capable of emitting both Arabic and English tokens in the same segment. We measured roughly a 2-3 WER-point penalty on code-switched utterances compared to single-language utterances of similar length — meaningful but not catastrophic. wav2vec2 fine-tunes tend to collapse on code-switching because they were trained with a fixed vocabulary and language prior; English tokens come out as Arabic phonetic approximations. AWS Transcribe requires you to pick a language up front, and Google STT has a code-switching mode that works on paper but produced inconsistent results in our testing.
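Measuring a code-switching penalty requires first splitting a test set into code-switched and single-language utterances. One simple way to do that, sketched below, is to check which Unicode scripts appear in the reference transcript. This is a heuristic, not how we or Whisper detect language: it misses Arabic written in Latin letters and English loanwords written in Arabic script, and the function names are illustrative.

```python
import unicodedata


def scripts_in(text: str) -> set:
    """Return which of {'arabic', 'latin'} appear among the letters of text."""
    found = set()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, spaces, and diacritics
        name = unicodedata.name(ch, "")
        if name.startswith("ARABIC"):
            found.add("arabic")
        elif name.startswith("LATIN"):
            found.add("latin")
    return found


def is_code_switched(text: str) -> bool:
    """True when an utterance mixes Arabic-script and Latin-script words."""
    return scripts_in(text) == {"arabic", "latin"}


utterances = [
    "Okay يعني I will check the account و I will call you back",
    "I will call you back",
    "يعني",
]
switched = [u for u in utterances if is_code_switched(u)]
monolingual = [u for u in utterances if not is_code_switched(u)]
print(len(switched), len(monolingual))  # 1 2
```

With the two buckets in hand, WER can be computed separately for each and the difference reported as the penalty.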

Evaluation Pitfalls

The biggest mistake in public Arabic ASR comparisons is comparing models that were trained on the benchmark dataset against models that were not. If you evaluate Model A on Common Voice Arabic and Model A was trained on Common Voice Arabic, you are measuring overfitting, not generalization. MGB-2, Common Voice, and Tashkeela all show up in multiple training sets. Any serious benchmark must hold out a dataset the models have never seen.
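A cheap first check for this kind of leakage is to look for evaluation transcripts that also appear verbatim in a known training manifest. The sketch below assumes both sides are available as lists of transcript strings, and the function names are hypothetical. It only catches exact text overlap after whitespace and case folding, so a clean result does not prove the audio itself is unseen.

```python
import hashlib
from typing import Iterable, List


def fingerprint(transcript: str) -> str:
    """Hash a transcript after collapsing whitespace and folding case."""
    normalized = " ".join(transcript.split()).casefold()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def contaminated(eval_transcripts: Iterable[str],
                 training_transcripts: Iterable[str]) -> List[str]:
    """Return the evaluation transcripts that also occur in the training data."""
    seen = {fingerprint(t) for t in training_transcripts}
    return [t for t in eval_transcripts if fingerprint(t) in seen]


# Toy strings for illustration only.
training = ["hello world", "thank you for calling"]
evaluation = ["Hello  world", "please hold the line"]
print(contaminated(evaluation, training))  # ['Hello  world']
```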

The second mistake is reporting a single WER number. WER depends enormously on audio quality, dialect, and call length. A vendor showing 5% WER on a 30-second MSA news clip is telling you almost nothing about how the model will perform on an 8-minute Khaleeji call with hold music bleed. Segment your evaluation by the conditions your production data actually contains.
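Segmenting is mostly bookkeeping: keep the error count and reference word count per recording, tag each recording with its conditions, and pool within each group. The sketch below uses made-up counts purely to show the mechanics. The field names (`dialect`, `snr_band`, `errors`, `ref_words`) are assumptions, and none of the values are benchmark results.

```python
from collections import defaultdict


def wer_by_segment(results, key):
    """Pool word errors per segment so long calls weigh more than short ones."""
    errors = defaultdict(int)
    words = defaultdict(int)
    for row in results:
        errors[row[key]] += row["errors"]
        words[row[key]] += row["ref_words"]
    return {segment: errors[segment] / words[segment]
            for segment in words if words[segment] > 0}


# Made-up counts for illustration only; these are not benchmark results.
results = [
    {"dialect": "khaleeji", "snr_band": "clear", "errors": 10, "ref_words": 100},
    {"dialect": "khaleeji", "snr_band": "noisy", "errors": 30, "ref_words": 100},
    {"dialect": "msa", "snr_band": "clear", "errors": 5, "ref_words": 100},
]

overall = sum(r["errors"] for r in results) / sum(r["ref_words"] for r in results)
print(overall)                              # 0.15
print(wer_by_segment(results, "dialect"))   # {'khaleeji': 0.2, 'msa': 0.05}
print(wer_by_segment(results, "snr_band"))  # {'clear': 0.075, 'noisy': 0.3}
```

In this toy data the single pooled number hides a sixfold spread between the best and worst segment.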

The third mistake is ignoring normalization. Arabic text has multiple valid spellings for the same word (alef variants, taa marbuta vs. taa, hamza placement), and WER calculations either collapse these variants or treat them as errors depending on the normalization pipeline. If you do not publish the normalizer, your numbers cannot be reproduced.
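To make the point concrete, here is a minimal normalizer of the kind that has to be published alongside any WER number. It is a sketch of one common set of conventions, not the normalizer behind the figures in this post. In particular, mapping taa marbuta to haa and folding hamza carriers onto their base letters are choices that other pipelines make differently.

```python
import re

# Short vowels, tanween, shadda, sukun (U+064B-U+0652) and dagger alef (U+0670).
_DIACRITICS = re.compile("[\u064B-\u0652\u0670]")
_TATWEEL = "\u0640"  # elongation character used for justification

_LETTER_MAP = str.maketrans({
    "\u0622": "\u0627",  # alef with madda       -> bare alef
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0629": "\u0647",  # taa marbuta           -> haa (one convention)
    "\u0649": "\u064A",  # alef maqsura          -> yaa
    "\u0624": "\u0648",  # waw with hamza        -> waw
    "\u0626": "\u064A",  # yaa with hamza        -> yaa
})


def normalize_arabic(text: str) -> str:
    """Collapse common Arabic spelling variants before scoring."""
    text = _DIACRITICS.sub("", text)
    text = text.replace(_TATWEEL, "")
    text = text.translate(_LETTER_MAP)
    return " ".join(text.split())


# Two spellings of the same name differ only in the hamza on the alef.
a = "\u0623\u062D\u0645\u062F"
b = "\u0627\u062D\u0645\u062F"
print(a == b)                                      # False
print(normalize_arabic(a) == normalize_arabic(b))  # True
```

Applying the same function to both the reference and the hypothesis before scoring keeps spelling variants from being counted as substitutions.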

What We Ship

CallScribe ships Whisper large-v3-turbo as the primary engine, with pyannote.audio 3.x for diarization and a multilingual sentiment model on top of the transcript. We run all inference on our own GPUs, not a third-party API. The eval harness we built internally is on the roadmap to open source — once the dataset licensing is cleared, we will release a reproducible benchmark so anyone can rerun the numbers in this post on their own audio.

Until then, the best thing you can do if you care about Arabic ASR quality is run a model on your calls, compute WER against your reference transcripts, and ignore every vendor that will not let you do that.