Webinar: Tokenization and Cross-lingual Learning in LLMs

April 14, 10-11 (CET)

5 February, 12:00-13:00 CET

Garðar Ingvarsson Juto (Miðeind), Ilker Kesen (University of Copenhagen) and Andreas Holm (Alexandra Institute), from Work Package 4 of TrustLLM.

How do LLMs actually learn new languages? How can they learn small languages like Faroese more effectively? Can they learn languages from images rather than text?

In this three-part TrustLLM webinar, we look at one of the most overlooked parts of LLM training: tokenization. Often treated as a mere preprocessing step, tokenization shapes how a model learns and how it handles languages with very different amounts of training data:

What if we want to teach an LLM an entirely new language? What if we simply swap its vocabulary? In part 1 of the webinar, we walk through how, instead of retraining the entire model, we can keep the Transformer backbone and train only the embeddings for the new vocabulary, preserving the model's existing behavior.
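To make the vocabulary-swap idea concrete, here is a minimal, purely illustrative sketch (not the speakers' actual method): the backbone weights are untouched, and each new token's embedding is warm-started from the mean of the old-tokenizer pieces that spell it, a common initialization heuristic. All names and numbers are invented for illustration.

```python
# Toy sketch of vocabulary swapping: keep the backbone, rebuild embeddings.
# Everything here is illustrative, not the webinar's actual setup.

def mean_vector(vectors):
    """Element-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def init_new_embeddings(old_emb, old_tokenize, new_vocab):
    """Initialize each new token's embedding from the mean of the
    old-tokenizer subtoken embeddings that spell it (a warm-start heuristic),
    so the frozen backbone sees inputs in a familiar embedding space."""
    new_emb = {}
    for token in new_vocab:
        pieces = old_tokenize(token)      # segment with the *old* tokenizer
        new_emb[token] = mean_vector([old_emb[p] for p in pieces])
    return new_emb

# Tiny illustration: old vocab has characters; new vocab has whole words.
old_emb = {"h": [1.0, 0.0], "i": [0.0, 1.0]}
new_emb = init_new_embeddings(old_emb, old_tokenize=list, new_vocab=["hi"])
print(new_emb["hi"])  # [0.5, 0.5]
```

From this starting point, only the new embedding table needs further training, which is far cheaper than retraining the whole model.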

Faroese is a small language that is very different from English. With common tokenization strategies, a vocabulary built for one specific language can isolate that language and prevent it from benefiting from high-resource languages like English. In part 2 of the webinar, we show how dynamic tokenization can break that isolation.
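As a rough intuition for what "dynamic" means here, consider a toy segmenter that starts from characters and applies a merge table shared across languages on the fly, instead of committing to one fixed, language-specific vocabulary. This is only a sketch of the intuition, not the method presented in the webinar.

```python
# Toy sketch of dynamic tokenization: segment each input on the fly using
# merges shared across languages, rather than a fixed per-language vocabulary.
# Purely illustrative, not the webinar's method.

def dynamic_tokenize(text, shared_merges):
    """Start from characters and greedily apply shared merges until none fit."""
    tokens = list(text)
    changed = True
    while changed:
        changed = False
        for i in range(len(tokens) - 1):
            pair = tokens[i] + tokens[i + 1]
            if pair in shared_merges:
                tokens[i:i + 2] = [pair]   # merge adjacent pair in place
                changed = True
                break
    return tokens

# Merges learned mostly from high-resource data can still serve new input:
shared = {"in", "ing", "lear", "le", "ar", "lea"}
print(dynamic_tokenize("learning", shared))  # ['lear', 'n', 'ing']
```

Because the merge table is shared, a low-resource language is segmented with pieces that also appear in high-resource text, which is one way transfer can happen.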

What if we don’t use regular tokenization at all? In part 3 of the webinar, we show how Pixel models learn languages from the visual patterns of rendered characters rather than from sequences of discrete tokens. This avoids problems such as out-of-vocabulary words, but it comes with its own challenges.

Please note that this webinar will be recorded.

How can we make large language models more factually reliable? Can better data, external tools, and structured knowledge help reduce hallucinations? This TrustLLM webinar will focus on improving the factual trustworthiness of LLMs.

As large language models are increasingly used in real‑world and high‑stakes settings, their tendency to produce fluent but incorrect information remains a major challenge. This webinar presents three contributions from ongoing work in TrustLLM Work Package 3, which tackles factual reliability through data curation, tool learning, and structured knowledge extraction. Together, these perspectives show how better data, external tool integration, and structured knowledge representations can jointly strengthen the factual reliability and trustworthiness of large language models.

The first topic introduces JQL (Judging Quality across Languages). JQL is a scalable method for curating high‑quality multilingual datasets by distilling LLM‑based annotations into lightweight models built on cross‑lingual embeddings. This approach demonstrates how systematic data curation across languages can directly improve the factual grounding of LLMs.
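The distillation step in JQL can be pictured with a deliberately tiny stand-in: an expensive LLM judges a handful of documents once, and a lightweight "student" fitted on those judgements then scores everything else. The nearest-centroid student and the 2-D "embeddings" below are invented for illustration; JQL itself uses learned models over real cross-lingual embeddings.

```python
# Toy sketch of the JQL idea: distil LLM quality judgements into a cheap
# scorer over (pretend) cross-lingual embeddings. Data and the classifier
# are invented stand-ins, not JQL's actual components.

def centroid(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def train_scorer(embedded_docs, llm_labels):
    """Fit a nearest-centroid 'student' on LLM-annotated examples."""
    good = centroid([e for e, y in zip(embedded_docs, llm_labels) if y == 1])
    bad = centroid([e for e, y in zip(embedded_docs, llm_labels) if y == 0])
    def score(e):  # 1 when closer to the 'good' centroid, else 0
        dist = lambda c: sum((a - b) ** 2 for a, b in zip(e, c))
        return 1 if dist(good) < dist(bad) else 0
    return score

# The LLM judges a few documents once; the student then filters the rest.
embs = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
labels = [1, 1, 0, 0]   # pretend LLM quality annotations
scorer = train_scorer(embs, labels)
print(scorer([0.85, 0.15]))  # 1: kept as high quality
```

The economics are the point: the expensive annotator runs on a sample, while the distilled scorer runs over the full multilingual corpus.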
The second contribution explores how structured tool use can help anchor model outputs in real‑world information. Tool learning enables LLMs to interact with external systems—such as retrievers or specialized tools—allowing them to verify facts and reason over up‑to‑date sources rather than relying solely on internal representations.
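A minimal tool-use loop illustrates how this anchoring works in practice: the model requests a tool, the host executes it, and the final answer is grounded in the tool's output rather than in the model's parameters. Both the tool registry and the "model" below are invented stand-ins for illustration.

```python
# Minimal sketch of a tool-learning loop. The tool and the 'model' are
# stand-ins; a real system would call an LLM and real retrievers/APIs.

TOOLS = {"lookup_capital": lambda country: {"France": "Paris"}.get(country)}

def fake_model(messages):
    """Stand-in for an LLM: first requests a tool, then answers from its
    output instead of relying on internal knowledge."""
    last = messages[-1]
    if last["role"] == "user":
        return {"tool": "lookup_capital", "arg": "France"}
    return {"answer": f"The capital is {last['content']}."}

def run(question):
    messages = [{"role": "user", "content": question}]
    step = fake_model(messages)
    while "tool" in step:                    # execute requested tool calls
        result = TOOLS[step["tool"]](step["arg"])
        messages.append({"role": "tool", "content": result})
        step = fake_model(messages)
    return step["answer"]

print(run("What is the capital of France?"))  # The capital is Paris.
```

Because the answer is assembled from the tool result, updating the external source updates the model's answers without any retraining.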
Finally, we explore knowledge graph construction and ontology learning as a way to enhance factual consistency. By comparing single‑step and multi‑step reasoning strategies, this work investigates how LLMs can more reliably extract structured knowledge from text, supporting downstream reasoning and verification tasks.
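The single-step versus multi-step contrast can be sketched with toy string rules standing in for the LLM: one function extracts a (subject, relation, object) triple in a single pass, while the other first identifies entities and then links them. The relation names and rules below are invented for illustration.

```python
# Toy contrast of single-step vs multi-step triple extraction. Simple string
# rules stand in for LLM prompts; real systems would query a model per step.

def single_step(sentence):
    """One pass: split directly on a known relation phrase."""
    s, _, o = sentence.rstrip(".").partition(" is the capital of ")
    return [(s, "capital_of", o)] if o else []

def multi_step(sentence):
    """Step 1: find candidate entities. Step 2: link them with a relation."""
    entities = [w for w in sentence.rstrip(".").split() if w[0].isupper()]
    if len(entities) == 2 and "capital" in sentence:
        return [(entities[0], "capital_of", entities[1])]
    return []

text = "Paris is the capital of France."
print(single_step(text))  # [('Paris', 'capital_of', 'France')]
print(multi_step(text))   # [('Paris', 'capital_of', 'France')]
```

On easy inputs the two agree; the interesting question, and part of what the talk examines, is how the strategies diverge in reliability as sentences get harder.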

Please note that this webinar will be recorded!
