A team of researchers at the University of Cape Town has built MzansiLM: the first publicly available AI language model trained from scratch on all 11 of South Africa’s official written languages, challenging decades of digital exclusion for millions of speakers.
Ask a popular AI assistant a question in isiNdebele or Sepedi, and the response is likely to be poor, inconsistent, or simply wrong. For millions of South Africans whose home languages are not English, this has been the reality of the artificial intelligence revolution: powerful tools built largely without them in mind.
A research team at the University of Cape Town (UCT) is trying to change that. In early May 2026, the team unveiled MzansiLM, a new AI language model trained specifically on South Africa’s 11 official written languages and believed to be the first publicly available decoder-only model to explicitly target all of them. Alongside it came MzansiText, a curated multilingual dataset built from the ground up to support those same languages.
The research has been submitted for presentation at the Language Resources and Evaluation Conference (LREC) in Mallorca, Spain — one of the world’s leading academic forums for natural language processing and computational linguistics.
The Problem: A Data Desert
The gap between English and most African languages in AI systems is, at its core, a data problem. Language models learn from text — enormous volumes of it. English dominates the internet. Mandarin is rising. European languages are heavily represented. African languages, despite being spoken by hundreds of millions of people across the continent, have historically been treated as statistical footnotes in the architecture of machine intelligence.
Nine of South Africa’s 11 official written languages fall into what researchers call the “low-resource” category. Even isiZulu and isiXhosa, which have attracted some academic attention globally, lag far behind English in terms of available training data. Languages like isiNdebele and Sepedi have been almost entirely overlooked.
“In language modelling, languages are considered low resource primarily because there are much fewer and smaller textual datasets available in these languages for training language models.”
— Dr Jan Buys, Senior Lecturer, UCT Department of Computer Science
Dr Buys, one of the project’s lead researchers, acknowledged that MzansiText is still small by global standards. But he emphasised that it is already larger than any previous dataset compiled specifically for South African languages — and that it covers all 11 of them, without exception.
The Languages: Who Gets Covered?
South Africa recognises 11 official written languages under its constitution, ranging from widely spoken languages like isiZulu and Afrikaans to those with smaller digital footprints. MzansiLM covers all of them:
isiZulu, isiXhosa, Afrikaans, English, Sesotho, Setswana, Xitsonga, Sepedi, Tshivenda, siSwati and isiNdebele.
The Team Behind It
The project was led by master’s student Anri Lombard and Dr Jan Buys from UCT’s Department of Computer Science, with Dr Francois Meyer and fellow researcher Simbarashe Mawere rounding out the core team. For Lombard, the work grew directly out of his master’s research into how language model architectures perform under low-resource conditions.
What MzansiLM Can (and Cannot) Do
It is important to be precise about the nature of the breakthrough. MzansiLM is not a chatbot. Unlike consumer-facing tools, it is not designed for open-ended conversation in any language. Rather, it is a base model — a foundational language system that developers and researchers can fine-tune and adapt for specific applications.
“Adapting MzansiLM for a limited use case might be more effective and affordable than relying on proprietary large language models, if you want users to be able to interact with a system in their home language.”
— Dr Francois Meyer, Lecturer, UCT Department of Computer Science
Practical applications could include tools for summarising government documents in Sepedi, annotating legal texts in isiZulu, or building basic customer service systems in Tshivenda — tasks that have been economically and technically out of reach for smaller organisations working in indigenous South African languages.
At 125 million parameters, MzansiLM is modest by the standards of today’s headline-grabbing commercial AI systems. Yet in benchmark tests, the model performed competitively and in some cases punched well above its weight. On isiXhosa text generation tasks, it produced results competitive with encoder-decoder models more than ten times its size.
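To put the 125-million-parameter figure in perspective, here is a rough back-of-envelope sketch of how much memory the model's weights alone would occupy. The storage precisions are assumptions for illustration (the article does not say how the weights are stored), and the estimate ignores activations, optimiser state, and any other runtime overhead.

```python
# Back-of-envelope estimate of weight storage for a 125M-parameter model.
PARAMS = 125_000_000  # parameter count reported for MzansiLM

# Assumed storage precisions in bytes per parameter (not stated in the article).
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_megabytes(num_params: int, bytes_per_param: int) -> float:
    """Return the size of the weights in megabytes (1 MB = 10**6 bytes)."""
    return num_params * bytes_per_param / 1_000_000


footprints = {
    name: weight_megabytes(PARAMS, size) for name, size in BYTES_PER_PARAM.items()
}
for name, megabytes in footprints.items():
    print(f"{name}: {megabytes:.0f} MB")

# "More than ten times its size" implies comparison models above this count.
ten_times = PARAMS * 10
print(f"10x parameters: {ten_times:,}")
```

Under these assumptions the weights come to roughly half a gigabyte at 32-bit precision, which is small enough to load on ordinary consumer hardware. That is part of why a model of this size can be cheaper to adapt than a much larger system.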
Why Even Big AI Struggles With South African Languages
The research also sheds light on a broader and much-discussed question: why do even the most powerful commercial AI systems — including those from major technology companies — still stumble when used in languages other than English?
The UCT team’s findings point directly to the data gap. MzansiLM, despite being purpose-built for these languages, can perform well when fine-tuned for specific tasks, but is not yet able to handle general-purpose user interaction reliably. This, the researchers note, is precisely because training data in these languages remains scarce.
“This helps to explain why even larger language models don’t yet work as well when used in languages other than English.”
— Dr Jan Buys
In short: the problem is not that African languages are inherently difficult to model. It is that the global AI industry has not invested in collecting and curating the data necessary to model them well.
A Foundation, Not a Finish Line
The team is careful not to overstate what they have built. MzansiLM is explicitly positioned as a baseline — a starting point from which future, more capable models can be developed. The researchers have made both MzansiText and MzansiLM freely and publicly available, a decision that reflects a deliberate philosophy about how progress gets made in underserved areas of AI research.
“A lot of the progress we were able to make depends on earlier open research from the African Natural Language Processing research community, so continuing that openness is essential.”
— Anri Lombard, Master’s student and project lead
Dr Meyer echoed the call for collective effort, noting that the research community’s willingness to share datasets, models, and findings openly is often what makes progress possible, particularly in comparison to proprietary systems where methodologies and data remain inaccessible.
What is needed next, the team says, is better and broader data sources, stronger evaluation benchmarks, and sustained investment in the kind of collaborative infrastructure that allows researchers around the world to build on each other’s work. The path from a 125-million-parameter baseline to a system capable of real-world, general-purpose use in isiNdebele or siSwati is long. But it now, at least, has a clear starting point.
Significance for the Continent
The UCT initiative arrives at a moment of growing awareness within the global AI community that linguistic diversity must become a design priority, not an afterthought. Across Africa, researchers and civil society organisations have warned that AI systems optimised for English and a handful of other dominant languages risk deepening existing inequalities, creating a two-tier information society in which access to functional AI depends on which language you happen to speak at home.
For South Africa, a country defined by its multilingual constitution and its history of language as a site of political struggle, the stakes of that question are especially pronounced. MzansiLM does not resolve them. But it marks, credibly and verifiably, a step in a different direction.
