‘This research is highly relevant for Europe’
Many AI models reach their limits with underrepresented languages. Dr Simon Ostermann from the German Research Center for Artificial Intelligence (DFKI) is working on this issue. In this interview, the researcher discusses why open language data from the Global South is not only an act of fairness but also a genuine gain in knowledge for AI research.
Why are open-source language projects an important research topic for the DFKI?
Open-source language projects are particularly interesting for low-resource languages. They challenge existing assumptions in AI research, because many common models and methods were developed for a few dominant languages and are not easily transferable. Our work with low-resource languages opens up new research questions, for example, on the robustness of language models, on low-data scenarios and on language diversity and multilinguality. Such projects thus contribute directly to the further development of fundamental AI methods. Open-source language data from the Global South broadens the empirical basis significantly. It helps reduce biases in models and develop AI systems that can be used globally.
So German and European institutions benefit from this as well?
Research based on such language data is highly relevant for Europe in particular, because many of the languages spoken in the EU are still underrepresented. German and European research institutions benefit because it allows them to develop models that are more realistic, fair and scientifically robust. Furthermore, new possibilities for comparison are emerging across language families and cultural contexts.
FAIR Forward wants to strengthen open-source and trustworthy AI systems. How is this achieved?
FAIR Forward adheres to principles such as openness, transparency and sustainability in handling training data. Documentation, data quality and ethical responsibility are given strong emphasis. In my view, this creates new common standards that facilitate international collaboration by clearly defining expectations and working methods across countries and continents. At the same time, it lowers barriers to long-term cooperation between research, civil society and public actors.
Dr Simon Ostermann is a senior researcher and Interim Head of the Research Department Multilinguality and Language Technology at the German Research Center for Artificial Intelligence in Saarbrücken. He heads the research group for efficient and explainable natural language processing and is interim chair holder for translation-oriented language technologies at Saarland University. His work covers research on explainable AI, the efficiency of language models and the improvement of language technologies for low-resource languages.
Where do you see potential for closer cooperation between German research institutions and partners in countries like India, where FAIR Forward cooperates with the Indian Institute of Science, for example?
I see good potential in jointly developing and maintaining open-source datasets and language models that have an international, multilingual and multicultural focus right from the outset. Establishing joint research infrastructures, for example, for compute resources and data platforms, also offers opportunities for sustainable cooperation. In addition, bilateral research projects can help to bring different perspectives on AI systems together and to develop new methodological approaches jointly, although this naturally always depends on whether the right support mechanisms are available.
Many applications of artificial intelligence are built on language data. The underlying data comes primarily from widely spoken languages. Other languages are digitally underrepresented, meaning there is only a small amount of digital content in these languages from which AI can learn. As a result, they are effectively absent from AI models. Through the BMZ initiative FAIR Forward – Artificial Intelligence for All, implemented by GIZ, new standards for open and trustworthy AI systems are being developed. This enables people in countries of the Global South to gain better access to AI applications.
The German Research Center for Artificial Intelligence (DFKI) was involved in the FAIR Forward initiative primarily as a scientific and technical partner. The DFKI contributed expertise in language data, multilingual models and open-source AI technologies. Technical support was ensured through close cooperation with local partners, jointly defined requirements and advisory services for data collection and processing and for the development of open-source language resources. Important aspects also included knowledge transfer, for example, in workshops and expert discussions, and the provision of compute resources on the DFKI’s servers.