Building Language Technology for Icelandic: Insights from Almannarómur

Óttar Kolbeinsson Proppé

At the second annual TrustLLM consortium meeting in Reykjavik last June, we spoke with Óttar Kolbeinsson Proppé, Director of Policy, Coordination and International Relations at Almannarómur – The Icelandic Center for Language Technology. As a keynote speaker, Óttar shared how Iceland is tackling one of the toughest challenges in language technology: developing tools for a small language community – a topic that resonates strongly with TrustLLM’s ambition to build open and trustworthy language models for Germanic languages.

Data is gold

When developing language technology for a small language like Icelandic, the toughest part isn’t the technology itself – it’s finding enough data to make it work. With fewer than 400,000 native speakers, Icelandic is a small language community and overall, there is not much publicly available Icelandic data to begin with. This makes it even more remarkable that Almannarómur, along with its partner institutions, managed to involve a large percentage of the Icelandic population in their data collection process.

“We have had some very, in some ways, creative approaches to doing this and some very different approaches”, says Óttar.

With help from the Icelandic government and by openly communicating their goals, they were able to convince Icelandic companies, public institutions, and citizens to share data for open use. These steps were challenging, yet they marked a significant success for Iceland. The work has been carried out in close collaboration between all key players in Icelandic language technology: academia, industry, and public service organisations.

The team’s early efforts, starting around 2020, allowed them, for example, to gather an astounding three million recorded sentences, adding up to over 4,000 hours of speech, from the Icelandic public – an exemplary demonstration of Icelanders’ pride in their language and culture.

Once the data existed, the next challenge was convincing companies to use it. For years, the work focused on building foundational models that weren’t perfect or widely used. But now, Iceland has a growing collection of open datasets and models that anyone can build upon – a goldmine for innovation.

How Iceland is navigating data collection today

With developments in AI tools over recent years, data protection has become an increasingly important concern for many. At the moment, the speech data gathered by the Icelandic Center for Language Technology and its partners is enough to support high-quality speech analysis, so large-scale public collection is no longer the main focus of their efforts.

Instead, the team is concentrating on specialised projects. One of these projects is collecting children’s voices, to generate speech for young users with disabilities who rely on speech synthesis. Currently, some of the synthetic voices offered to children are adult voices, which don’t suit a young child in need of a voice. At the time of the interview, the project was still in its early stages, but it highlights how sensitive some types of data can be. The team is working closely with legal experts to ensure every recording is fully protected.

Other projects have a more economic focus, and though interest is growing, it takes persuasion to get companies on board. Icelandic banks, insurance companies, and other sectors are beginning to see the value of sharing non-sensitive text data to improve models tailored to their industry’s language and vocabulary. These conversations are still in the early stages, but the potential is huge.

The role of small language communities in Europe’s AI future

Smaller language communities like Icelandic have far more to contribute to the European AI ecosystem than many people assume. Iceland is a strong example of how a focused national effort can secure the digital future of a low-resource language. By involving the entire society, the country has shown that a language spoken by just 380,000 people can deliver the data and infrastructure needed to support advanced AI tools.

“I’d say the biggest takeaways are: involve the public, keep things open, and don’t underestimate what a focused national effort can achieve even in a small language community”, says Óttar.

A small country with one official language, like Iceland, can also act as an ideal test bed for studying the societal effects of AI. If you wanted to measure the national, macro-level impacts of tools like ChatGPT or Microsoft Copilot on education, productivity, or the economy, Iceland might be one of the best places to do it.

Iceland’s pride in its language and its collaborative spirit continue to make these efforts possible – and increasingly, highly impactful.
