Māori Built an AI Voice Model Without Giving Big Tech Their Language — And It's a Blueprint for Every Indigenous Community

Every AI Model Speaks Māori — But None of Them Belong to Māori

Ask ChatGPT to write in te reo Māori and it’ll do a surprisingly decent job. Claude and Perplexity can too. That fluency is built on text and audio produced by Māori communities — scraped, ingested, and processed without their permission, on servers outside New Zealand, through interfaces owned by companies that have never asked, let alone compensated.

For Māori, this isn’t a technical achievement. It’s a theft in progress.

“Our language is the most important conveyor we have for our knowledge,” says Te Taka Keegan, professor at the University of Waikato and codirector of its AI Institute. “Yet we see technology developed outside of Aotearoa get more and more control over the transfer of that knowledge.”

Keegan and his master’s student Kingsley Eng set out to do something that sounds simple but is almost unheard of in AI: build a voice model where the community owns everything. The data, the model, and the output all stay with the people who speak the language.

The Technical Challenge Was Real

Te reo Māori is a low-resource language — there’s relatively little digital text and audio available for training. But the challenges go deeper than data scarcity.

Vowel length changes meaning entirely. The words for “cake” (keke), “armpit” (kēkē), and “to creak” (kekē) differ only in how long each vowel is held, which the macron marks in writing. An AI model that gets vowel length wrong doesn’t just sound unnatural; it says the wrong word.
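To make the risk concrete, here is a minimal Python sketch of what happens when a text-cleaning step strips macrons before training. It is not the Waikato team’s pipeline; the function names `strip_macrons` and `vowel_lengths` are hypothetical, and the only data used is the three words quoted above.

```python
import unicodedata

# Glosses taken from the three example words in this article.
GLOSSES = {"keke": "cake", "kēkē": "armpit", "kekē": "to creak"}

LONG_VOWELS = "āēīōū"
SHORT_VOWELS = "aeiou"


def strip_macrons(word: str) -> str:
    """Mimic a careless cleaner: decompose to NFD and drop combining marks."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def vowel_lengths(word: str) -> list[str]:
    """Return 'long' or 'short' for each vowel, in order of appearance."""
    lengths = []
    # NFC makes sure a vowel plus combining macron is one precomposed character.
    for ch in unicodedata.normalize("NFC", word.lower()):
        if ch in LONG_VOWELS:
            lengths.append("long")
        elif ch in SHORT_VOWELS:
            lengths.append("short")
    return lengths


# After stripping macrons, three different words collapse into one spelling.
collapsed = {strip_macrons(word) for word in GLOSSES}

# Keeping macrons preserves a distinct length pattern for each word.
patterns = {word: vowel_lengths(word) for word in GLOSSES}
```

In this sketch `collapsed` ends up with a single entry, while `patterns` keeps three different short/long sequences. A speech model that only ever sees the collapsed spelling has no way to tell which word it is meant to say.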

Digraphs don’t work like English. The “wh” in te reo is usually pronounced “f.” An English-trained model will get this wrong unless specifically taught.
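One way a text front end might handle this is to split words into sound units where “wh” and “ng” each count as one unit, rather than as two separate letters. The sketch below is illustrative only: `split_units`, `apply_hints`, and the `SOUND_HINTS` table are hypothetical names, the single “wh” to “f” hint follows the “usually” in the paragraph above, and a real system would need dialect-specific rules agreed with speakers.

```python
import unicodedata

# Te reo Māori spelling uses two digraphs, each standing for a single sound.
DIGRAPHS = ("wh", "ng")

# Hypothetical, deliberately tiny hint table: "wh" is usually pronounced "f".
SOUND_HINTS = {"wh": "f"}


def split_units(word: str) -> list[str]:
    """Split a word into letters, keeping each digraph together as one unit."""
    text = unicodedata.normalize("NFC", word.lower())
    units = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in DIGRAPHS:
            units.append(pair)
            i += 2
        else:
            units.append(text[i])
            i += 1
    return units


def apply_hints(units: list[str]) -> list[str]:
    """Replace units that have a pronunciation hint; leave the rest unchanged."""
    return [SOUND_HINTS.get(unit, unit) for unit in units]


whanau_units = split_units("whānau")
whanau_sounds = apply_hints(whanau_units)
```

For “whānau” (family), the splitter keeps “wh” together and the hint table turns it into “f”, while the long “ā” passes through untouched. A letter-by-letter English-style reading would instead treat “w” and “h” as separate sounds.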

Then there’s dialect. Te reo Māori isn’t uniform: it has regional dialects tied to specific places and identities. Keegan’s team chose to build for Waikato-Maniapoto, one such regional dialect. “It’s in the dialects that you see the real beauty of language,” Keegan says. “They tie it to a specific place and sense of identity.”

Rather than training on scraped data, the team recruited Ngaringi Katipa, a translator, educator, and language mentor, to be the consenting human voice behind the model. One speaker, one dialect, full consent. It’s the antithesis of how Big Tech builds language models.

Why This Matters Beyond New Zealand

This isn’t just a NZ story. Keegan and Eng’s approach offers a replicable blueprint for any minority language community:

Consent first. Don’t scrape — recruit willing speakers who understand what the data will be used for.
Own everything. The model, the training data, the outputs — all community-owned.
Dialect-specific. Don’t homogenise a language into one “standard” version. Preserving dialect preserves identity.
Local infrastructure. Process and host within the community’s jurisdiction, not on AWS us-east-1.

There are roughly 3,000 endangered languages worldwide. Most of them will never have enough data for a GPT-scale model. But Keegan’s approach shows you don’t need a GPT-scale model. You need a focused model, built with consent, owned by the community, and accurate enough that people actually want to use it.

The NZ Angle: Te Kāhui Raraunga and the Bigger Framework

This isn’t happening in a vacuum. Iwi leaders, through Te Kāhui Raraunga (the working arm of the Data Iwi Leaders Group), have already unveiled an AI safeguards framework designed to protect Māori data use in AI systems. The framework establishes principles for when and how Māori data can be used, and by whom.

Meanwhile, the AI Forum of NZ has refreshed its “AI Blueprint for Aotearoa” vision to 2030, which includes indigenous data sovereignty as a core principle.

The contrast with Big Tech is stark. OpenAI, Anthropic, and Google all build multilingual models that “speak” te reo Māori — but they own the output, process it overseas, and make decisions about the language without Māori input. Keegan’s model inverts that power dynamic entirely.

The Uncomfortable Truth

Here’s what makes this hard: Keegan’s model will likely be less “capable” than ChatGPT’s Māori. It’s trained on less data, built for one dialect, and doesn’t have the resources of a $300 billion company behind it. If you just want the most fluent Māori chatbot, Big Tech wins on raw performance.

But raw performance isn’t the point. Ownership is. Every time a Māori speaker uses ChatGPT’s Māori, they may be helping OpenAI’s model get better at a language the company doesn’t own, and the community gets nothing back. Keegan’s model may be smaller, but every word belongs to the people who spoke it.

That’s a trade-off every indigenous community faces in the AI era: accept the better tool that extracts your language, or build the weaker tool that respects it.

❓ Frequently Asked Questions

Q: Can I use this Māori voice model? The model is being developed at the University of Waikato with community governance. Availability depends on the community’s decisions about access — which is exactly the point.

Q: How is this different from ChatGPT speaking Māori? ChatGPT learned Māori from scraped data it doesn’t own and processes it on overseas servers, and the community has zero control over how the language is used. Keegan’s model is community-owned, locally processed, and dialect-specific.

Q: Why does dialect matter? Standardised te reo Māori (what’s taught in schools and broadcast on TV) is important, but dialects carry specific cultural knowledge tied to place and identity. An AI that only knows “standard” Māori is like a map that only shows highways — you miss all the places people actually live.

🔍 THE BOTTOM LINE

An indigenous community built an AI voice model on their own terms — and showed the world that data sovereignty isn’t just possible, it’s practical. The question isn’t whether Big Tech can speak your language. It’s whether they have the right to.
