A Practical Guide to Arabic Dialect Coverage for ASR

Khaleeji sub-variants, Levantine regional differences, Egyptian urban speech, Maghrebi caveats, and the code-switching patterns that show up in real GCC calls.

"Arabic support" is a phrase with an enormous amount of hiding room. When a speech-to-text vendor tells you their model handles Arabic, the realistic question to ask is: which Arabic? The language as spoken in a Dubai call center sounds nothing like the language on Al Jazeera news. A Lebanese speaker and a Moroccan speaker, dropped into the same conversation, would not fully understand each other. If you are deploying ASR on Arabic audio, you need to understand this variation before you pick a model, because the model's training data baked in specific assumptions about which Arabic "counts". This post is a practical field guide to the dialect landscape from an ASR engineering perspective.

Khaleeji (Gulf Arabic)

Khaleeji is the cluster of dialects spoken across the Gulf Cooperation Council states — the UAE, Saudi Arabia, Kuwait, Qatar, Bahrain, and Oman. It is not a single dialect. An Emirati speaker and a Saudi Hejazi speaker will sound clearly distinct to a native ear, and an ASR model that only saw Saudi training data will underperform on Emirati audio in ways you can measure.

Major regional variants worth naming:

  • Emirati Arabic — UAE cities, especially Dubai and Abu Dhabi. Characterized by rapid delivery, heavy borrowing from English and Urdu in business contexts, and a specific set of demonstrative pronouns that differ from other Gulf variants.
  • Saudi Arabic — itself splits into Najdi (central, including Riyadh), Hejazi (western, including Mecca and Jeddah), and Southern Saudi. Najdi is what most ASR datasets label as "Saudi" and is the closest Gulf variant to MSA in phonology.
  • Kuwaiti Arabic — similar to Najdi but with distinctive vowel patterns and heavy borrowing from Persian and English.
  • Qatari and Bahraini Arabic — closely related, both close to the Emirati/Saudi Najdi spectrum.
  • Omani Arabic — has its own distinct vocabulary influenced by Swahili, Baluchi, and Persian due to Oman's historical trade networks. Often the hardest Gulf variant for models trained on standard Saudi data.

A well-trained Arabic ASR model should be evaluated on at least three of these variants before anyone claims "Gulf Arabic support". A model evaluated only on Riyadh studio recordings is going to fall over in Muscat or Salalah.
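
One way to make the "at least three variants" rule checkable is to count evaluation audio per variant before accepting a coverage claim. The sketch below is illustrative only: the manifest layout, the variant labels, and the 30-minute floor are assumptions made for this example, not part of any standard benchmark.

```python
from collections import defaultdict

# Hypothetical eval manifest rows: (clip_id, variant_label, duration_seconds).
# Labels and durations are made-up illustration data.
MANIFEST = [
    ("c001", "najdi", 1800.0),
    ("c002", "najdi", 2400.0),
    ("c003", "emirati", 1900.0),
    ("c004", "kuwaiti", 600.0),
    ("c005", "omani", 2000.0),
]

def variant_minutes(manifest):
    """Sum evaluation audio per variant, in minutes."""
    totals = defaultdict(float)
    for _clip_id, variant, seconds in manifest:
        totals[variant] += seconds / 60.0
    return dict(totals)

def covered_variants(manifest, min_minutes=30.0):
    """Variants with at least `min_minutes` of eval audio, sorted by name."""
    totals = variant_minutes(manifest)
    return sorted(v for v, minutes in totals.items() if minutes >= min_minutes)

def supports_gulf_claim(manifest, min_variants=3, min_minutes=30.0):
    """True if enough variants have enough audio to back a 'Gulf Arabic' claim."""
    return len(covered_variants(manifest, min_minutes)) >= min_variants

print(covered_variants(MANIFEST))
print(supports_gulf_claim(MANIFEST))
```

With this toy manifest, Kuwaiti falls below the floor (10 minutes), so the claim rests on Emirati, Najdi, and Omani audio only. The thresholds are placeholders; pick values that match how much audio you need for a stable error estimate.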

Levantine (Shami) Arabic

Levantine Arabic is the speech of Lebanon, Syria, Palestine, and Jordan. It is often the second-largest bucket in multi-dialect Arabic datasets after MSA, partly because of the media footprint of Lebanese and Syrian drama production. Major sub-dialects:

  • Lebanese Arabic — Beirut and northward. Very high code-switching with French and English, softer consonants, distinctive "halla2"/"hayda" vocabulary.
  • Syrian Arabic — Damascus and Aleppo are the two main urban centers. Damascene is closest to the "generic Levantine" that most models were trained on.
  • Palestinian Arabic — internal variation is significant: Ramallah vs. Nablus vs. Gaza.
  • Jordanian Arabic — urban Amman is close to Palestinian; rural Jordanian diverges more.

For a GCC business, Levantine audio mostly shows up in two scenarios: customers from the Levant who now live in the Gulf, and offshore call center agents in Jordan or Lebanon serving Gulf customers. Both scenarios matter for ASR coverage.

Egyptian Arabic

Egyptian Arabic is the single most widely understood Arabic dialect across the entire Arabic-speaking world, thanks to a century of Egyptian cinema and TV. Cairo urban speech dominates the dataset pool. ASR models generally do well on it. Upper Egyptian (Sa'idi) diverges more and is underrepresented in training data. In GCC call centers, Egyptian speakers are very common on the agent side and show up frequently in customer-facing calls even for non-Egyptian customers.

Maghrebi (Darija)

Moroccan, Algerian, and Tunisian dialects are collectively called Maghrebi (locally, Darija) and are effectively unsupported by most commercial Arabic ASR systems. The phonology, the vocabulary, the Berber (Amazigh) substrate, and the heavy French borrowing differ enough from Eastern Arabic that a model trained on Gulf, Levantine, Egyptian, and MSA data will produce gibberish on Casablanca street speech. This is a real gap in the market, but it is not the target for a GCC-focused platform. We do not support Maghrebi in CallScribe today, and we flag it clearly in our model card rather than pretend otherwise.

Code-Switching in GCC Business Calls

The single most distinctive feature of real GCC call audio is code-switching between Arabic and English. It is not occasional. It is constant, and it happens at every granularity:

  • Lexical — individual English words dropped into Arabic sentences: "okay", "really", "meeting", "invoice", "account number".
  • Phrasal — English phrases as interjections: "no problem", "to be honest", "you know what I mean".
  • Clausal — entire clauses in English within an otherwise Arabic utterance: "I will call you back لأن the system is down".
  • Turn-level — one speaker fully in Arabic, the other fully in English, alternating turn by turn.
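
The granularities above can be roughly approximated from a transcript by looking at which script each token is written in. The sketch below is a heuristic for triaging reference transcripts, not a linguistic classifier: the function names and the token-count thresholds are assumptions made for this example, and it treats any Latin-script token as English.

```python
import unicodedata

def token_script(token):
    """Label a token 'ar', 'en' (Latin script), or 'other' from its letters."""
    arabic = latin = 0
    for ch in token:
        if not ch.isalpha():
            continue  # ignore digits, punctuation, and diacritics
        name = unicodedata.name(ch, "")
        if name.startswith("ARABIC"):
            arabic += 1
        elif name.startswith("LATIN"):
            latin += 1
    if arabic == 0 and latin == 0:
        return "other"
    return "ar" if arabic >= latin else "en"

def script_runs(utterance):
    """Collapse an utterance into (script, token_count) runs, skipping 'other'."""
    runs = []
    for token in utterance.split():
        script = token_script(token)
        if script == "other":
            continue
        if runs and runs[-1][0] == script:
            runs[-1] = (script, runs[-1][1] + 1)
        else:
            runs.append((script, 1))
    return runs

def switch_granularity(utterance, clause_min_tokens=4):
    """Rough label based on the longest Latin-script run in a mixed utterance.

    Assumed thresholds: 1 token = lexical, 2-3 = phrasal, 4 or more = clausal.
    """
    runs = script_runs(utterance)
    if len({script for script, _ in runs}) < 2:
        return "none"
    longest_en = max(count for script, count in runs if script == "en")
    if longest_en == 1:
        return "lexical"
    if longest_en < clause_min_tokens:
        return "phrasal"
    return "clausal"

def is_turn_level_switch(turns):
    """turns: list of (speaker, text). True when every turn is single-script
    and the speakers do not all use the same script."""
    by_speaker = {}
    for speaker, text in turns:
        scripts = {script for script, _ in script_runs(text)}
        if len(scripts) != 1:
            return False
        by_speaker.setdefault(speaker, set()).update(scripts)
    if any(len(scripts) != 1 for scripts in by_speaker.values()):
        return False
    return len({next(iter(s)) for s in by_speaker.values()}) > 1

print(switch_granularity("I will call you back لأن the system is down"))
print(switch_granularity("عندي meeting بكرة"))
```

Token counts are a blunt proxy. A five-word interjection such as "you know what I mean" would be filed as clausal here, and a two-word term such as "account number" as phrasal, so treat the labels as a way to bucket test utterances rather than as ground truth.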

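A model that cannot handle code-switching gracefully will drop the English words entirely, render them as phonetic Arabic, or skip the segment altogether. Whisper large-v3-turbo copes with this reasonably well in practice, but not because of fine-grained language detection: Whisper conditions on a single language token per 30-second window. What helps is that its decoder uses one shared multilingual token vocabulary, so it can emit Latin-script English words inside an Arabic transcript. This is emergent behavior rather than a designed feature, so verify it on your own audio.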

What This Means for ASR Procurement

When you evaluate an Arabic ASR system, insist on a benchmark that covers the specific dialects in your audio. Do not accept a single "Arabic WER" number. Ask for per-dialect breakdowns, per-audio-quality breakdowns, and code-switching tests. Ask which variants were in the training set and which were only in the eval set. Require the vendor to run their model on a held-out sample of your actual calls before you sign a contract. A model that looks great on an MSA news benchmark will fail on a Khaleeji complaint line, and the only way to find out is to test it.
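
A per-dialect breakdown is cheap to compute once each test utterance carries a dialect tag. The following is a minimal sketch, assuming samples arrive as (dialect, reference, hypothesis) triples; the sample strings are placeholder tokens, not real transcripts, and a real Arabic evaluation would normally normalize text first (diacritics, alef and hamza variants), which is left out here.

```python
from collections import defaultdict

def word_edit_distance(ref_words, hyp_words):
    """Levenshtein distance over word lists (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]

def wer_by_group(samples):
    """samples: iterable of (group, reference, hypothesis). Returns {group: WER}.

    WER per group is total word errors divided by total reference words,
    so long utterances weigh more than short ones.
    """
    errors = defaultdict(int)
    words = defaultdict(int)
    for group, reference, hypothesis in samples:
        ref_words = reference.split()
        errors[group] += word_edit_distance(ref_words, hypothesis.split())
        words[group] += len(ref_words)
    return {g: errors[g] / words[g] for g in words if words[g] > 0}

# Placeholder data: tokens stand in for words, dialect tags are illustrative.
SAMPLES = [
    ("msa", "w1 w2 w3 w4", "w1 w2 w3 w4"),
    ("emirati", "w1 w2 w3 w4", "w1 x2 w3"),
    ("emirati", "w5 w6", "w5 w6"),
    ("omani", "w1 w2 w3 w4 w5", "w1 w9 w8 w4 w5 w6"),
]

per_dialect = wer_by_group(SAMPLES)
pooled = wer_by_group(("all", ref, hyp) for _, ref, hyp in SAMPLES)
print(per_dialect)
print(pooled)
```

On this toy data the pooled figure lands at one error in three words, while the per-dialect view shows MSA at zero and Omani at 0.6. That spread is exactly what a single "Arabic WER" number hides. The same grouping function works for audio-quality buckets or code-switching categories by changing the first element of each triple.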