Why Generative AI Underperforms in Non-English Languages
Lost in Translation
Conversational AI is dominated by English, with serious consequences for other languages. The gap is structural and can only be closed with significant effort.
Ask ChatGPT a complex question in English and you'll quite often get a correct, well-formulated answer that fits the context. Try the same in Hindi, Bengali or Yoruba, and the response tends to get shorter, vaguer, and occasionally plain wrong. In German, French or Spanish, the answers are more on-point, but they often still fall short of the content and linguistic quality of an English response.
Generative AI has a language problem. And it doesn’t only affect rare or endangered languages — it hits widely spoken ones too. The Brookings Institution describes the quality gap as a continuum: from English through European languages like German, French and Spanish, all the way to the roughly 7,000 languages spoken worldwide, of which only about 20 are considered “data-rich” — with the gap widening dramatically as you move down the list. This problem surfaces repeatedly in GenAI projects: non-English systems are less precise, hallucinate more frequently, and more readily fabricate content that doesn’t exist.
About this post
We explore the structural and technical reasons behind the strong English bias of most natural language processing models and evaluate the consequences for implementation and use.
Key takeaways
The dominance of English in NLP is structural and can't be easily engineered away. This creates disadvantages for non-English speaking users and requires additional work to produce good results with the technology.
The text was written by a human and submitted to an AI system for a final review covering grammar, typos, and logical consistency.
“If people feel like the AI doesn't understand them, or they can't access it, it brings them no benefit.”
Leslie Teo, AI Singapore
The core problem: these systems were built in English
This isn’t a bug that someone forgot to fix. It’s a structural issue baked into the architecture of virtually every language model out there. Models learn from data — and the data is overwhelmingly English. Of the content in the Common Crawl dataset, the backbone of most large language model training, over 40% is in English, while no other language reaches even 6%. Models know what they’ve seen, and most of it is in English.
| Language | Common Crawl Share (CC-MAIN-2026-12) | Total Speakers | % of World Population | Ratio (Web vs. Speakers) |
|---|---|---|---|---|
| English | 41.06 % | ~1.53 billion | ~18.7 | 2.2x |
| German | 5.98 % | ~135 million | ~1.6 | 3.7x |
| Chinese | 4.99 % | ~1.18 billion | ~14.4 | 0.35x |
| Spanish | 4.66 % | ~560 million | ~6.8 | 0.7x |
| French | 4.61 % | ~310 million | ~3.8 | 1.2x |
| Italian | 2.38 % | ~90 million | ~1.1 | 2.2x |
| Hindi | 0.22 % | ~610 million | ~7.4 | 0.03x |
Sources
https://commoncrawl.github.io/cc-crawl-statistics/plots/languages (accessed March 30, 2026).
Ethnologue 2025 (Eberhard, Simons & Fennig, eds., Ethnologue: Languages of the World, 27th ed., SIL International) — for total speaker counts (L1+L2).
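The ratio column is derived directly from the other two: a language’s Common Crawl share divided by its share of the world population. A value above 1 means the language is overrepresented on the web relative to its speakers. A minimal sketch, using the figures from the table above:

```python
# Web-representation ratio: Common Crawl share (%) divided by
# share of world population (%). Figures copied from the table above.
languages = {
    "English": (41.06, 18.7),
    "German":  (5.98, 1.6),
    "Chinese": (4.99, 14.4),
    "Spanish": (4.66, 6.8),
    "French":  (4.61, 3.8),
    "Italian": (2.38, 1.1),
    "Hindi":   (0.22, 7.4),
}

def web_ratio(cc_share: float, pop_share: float) -> float:
    """Ratio > 1: overrepresented on the web; < 1: underrepresented."""
    return cc_share / pop_share

for name, (cc, pop) in languages.items():
    print(f"{name:8s} {web_ratio(cc, pop):.2f}x")
```

Hindi comes out at roughly 0.03x — underrepresented by a factor of about 30 relative to its speaker base — while English lands at about 2.2x.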
The practical consequence: complex queries in non-English contexts get answered less precisely — especially in specialist domains like law or public administration. And even though German, French, Spanish and Italian are privileged as comparatively “data-rich” languages, they share the same structural problems — just in a milder form. There’s a secondary effect worth flagging too: content moderation instructions — for filtering hate speech or detecting statements that indicate serious mental health risk — are primarily designed and trained in English. In other languages, that precision degrades. Things get missed. Or get flagged when they shouldn’t be.
German, French, Spanish, Russian, Japanese and Chinese (all dialects combined) each account for less than 6% of Common Crawl content. European languages are actually somewhat overrepresented relative to their share of the world’s population — for other languages, the picture is far worse. A study presented at AAAI 2025 examined eight African languages — including Amharic, Igbo and Shona — with a combined speaker base of over 160 million people. The authors document a classic rich-get-richer effect: AI models are most useful to English speakers, who produce better content, which trains better models (arXiv 2412.12417). Hindi, spoken by over half a billion people, accounts for just 0.22% of Common Crawl. A CLAWS-Lab study found that GPT-3.5 delivers 38.6% fewer complete responses to Hindi queries than to English ones — a stark illustration of how language inequality translates directly into unequal access to AI.
These issues get amplified with smaller models — what the industry calls Small and Medium Language Models, with fewer than 15 billion and fewer than 100 billion parameters, respectively. These are well-suited for RAG-based standalone deployments, which sidestep the data privacy headaches of cloud solutions. But in non-English languages, they’re even clumsier than their larger counterparts. There are some targeted fixes — like the embeddings from Berlin-based Jina for the small Gemma language models — but none of them fully solve the underlying problem. Even Mistral — Europe’s own French-built LLM alternative — performs better in English than in its intended target languages, German and French.
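The standalone RAG setup mentioned above can be sketched in a few lines. This is a toy illustration under loudly stated assumptions: the document store, the bag-of-words retriever, and the prompt template are all hypothetical, and a real deployment would use a multilingual embedding model and a local LLM instead of the word-overlap scoring here.

```python
import math
import re
from collections import Counter

# Hypothetical document store; a real pipeline would hold chunked,
# embedded company documents.
docs = [
    "The ministry of education sets the school curriculum.",
    "The finance ministry publishes the annual budget.",
    "Common Crawl is a large archive of web pages.",
]

def vectorize(text: str) -> Counter:
    # Bag-of-words vector over lowercase word tokens. Stand-in for a
    # neural embedding model.
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, k: int = 1) -> list[str]:
    # Rank documents by similarity to the query, return the top k.
    q = vectorize(query)
    ranked = sorted(docs, key=lambda d: cosine(q, vectorize(d)), reverse=True)
    return ranked[:k]

# The retrieved context is prepended to the model prompt, grounding
# the generation step in local data.
question = "Who decides the school curriculum?"
context = retrieve(question)[0]
prompt = f"Answer using this context:\n{context}\n\nQuestion: {question}"
```

The retrieval step is exactly where underrepresented languages hurt twice: both the embeddings and the generator are weaker, so poor matches feed a model that is already more prone to fabricate.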

Summary: Non-English takes more work
The Brookings Institution opened its 2024 analysis of the AI language gap with a quote — and it still fits:
“The limits of my language mean the limits of my world.”
Ludwig Wittgenstein (1889–1951), Philosopher
The language you work in demonstrably shapes what AI can do for you. For people who don't speak English, this translates to a material disadvantage in a world where conversational AI is becoming a useful tool for solving problems and creating output.
From a business perspective, it shouldn’t come as a surprise, then, that non-English conversational AI projects underperform and often stumble as early as the prototype stage. What is consistently underestimated is the additional effort involved: foundational model training for the application’s specific language patterns, a well-designed RAG pipeline, and careful fine-tuning of the language generation. All of that makes good AI implementation more expensive in non-English environments.
It takes time, money, and a deliberate commitment to linguistic diversity in the development process. But the first step is simply acknowledging the gap exists. Ignore it, and the price is an AI application that delivers little — or worse, negative — value.
Many languages, including most European ones, are structurally more complex than English and typically need longer sentences to say the same thing. German compound nouns, for example, consume far more tokens (the processing units a model uses to handle text): “Bildungsministerium” as a single word is harder for a model than “ministry of education” — three simple words. The knock-on effects: higher cost per query, a context window that fills up faster (meaning weaker reasoning), and a demonstrably higher hallucination rate. A 2024 IEEE study identified undertrained tokens as a direct cause of hallucinations in models like GPT-4o on non-English text (arXiv 2406.11214).

Hallucinations are among the most frustrating aspects of working with AI: a model that handles most topics reliably will, without hesitation and with full confidence, produce answers that are simply wrong — and offer the user no way to tell — when there isn’t enough relevant data for the vector search to find solid matches. For well-covered subjects, large models produce very low error rates, somewhere between 1% and 5% depending on the test. But on specific topics — niche legal questions, lesser-known people, specialist science — rates of up to 50% are not unusual. In underrepresented languages, that effect compounds: models are most accurate precisely where users already know the answer and can tell right from wrong, and most likely to fabricate content exactly where users have no basis to spot the error.
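The tokenization effect described above can be demonstrated with a toy greedy longest-match tokenizer over a tiny, English-skewed vocabulary. Both the vocabulary and the resulting splits are illustrative assumptions, not the behavior of any real model's tokenizer — real BPE vocabularies are learned from data — but the mechanism is the same: frequent English words become single tokens, while German compounds shatter into subword pieces.

```python
# Hypothetical English-skewed vocabulary: the English phrase is
# covered by whole-word tokens, the German compound only by fragments.
VOCAB = {
    "ministry", "of", "education",       # whole English words
    "bild", "ungs", "min", "ister", "ium",  # German subword fragments
}

def tokenize(text: str) -> list[str]:
    """Greedy longest-prefix-match tokenization, character fallback."""
    tokens = []
    for word in text.lower().split():
        i = 0
        while i < len(word):
            # Take the longest vocabulary entry matching at position i.
            for j in range(len(word), i, -1):
                if word[i:j] in VOCAB:
                    tokens.append(word[i:j])
                    i = j
                    break
            else:
                tokens.append(word[i])  # unknown: fall back to one char
                i += 1
    return tokens

print(tokenize("ministry of education"))  # 3 tokens
print(tokenize("bildungsministerium"))    # 5 tokens
```

Three tokens versus five for the same concept — and with real multilingual text the spread is often larger, which is precisely the per-query cost and context-window penalty described above.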