FinGPT: Ensuring equitable LLM access

Smaller languages deserve open source LLMs to build applications with...
25 July 2023

Interview with 

Sampo Pyysalo


English is the international language, particularly for science and international communications. So it naturally represents the largest pool of online training data for technology companies to use when training their AI systems.

But in Finland, the eco-friendly supercomputer ‘LUMI’ is helping a team of researchers to safeguard the place of the Finnish language in the AI industry.

And because the team’s model is trained on a narrower dataset from the Finnish national library, as well as the web crawls that power other LLMs, it’s also less susceptible to some of the hallucination problems affecting mainstream chatbots. On a recent trip to the country, James Tytko met up with the architect of the new project, the University of Turku’s Sampo Pyysalo…

Sampo - ChatGPT, the latest iteration, actually has remarkable capabilities to operate in languages other than English. You can talk to it also in Finnish. The thing that most motivates our work in building open language models for the Finnish language in particular isn't so much that the systems that the big multinational companies are building wouldn't have any capability for Finnish, but rather that those models are closed, they're not available for research, they're not available as a foundation for building independent applications. And of course they don't really have a focus on smaller languages such as Finnish. So by creating our own open models for smaller languages such as Finnish, we can ensure that the data represents the part of the Finnish language and culture that we find most relevant. And that will hopefully serve as the best basis for models that not only speak the Finnish language, but also to some degree share the perspective of Finland and its population.

James - That's a very interesting point because, to my shame, especially when I come to a country like this, I only speak one language and you come here and people seem to speak four at a minimum.

Sampo - A lot of central European, say Germanic or Romance languages, even the smaller languages, are part of bigger language families where they have close neighbours. So there is some expectation that the models will be able to draw on those languages or texts in those languages in order to learn some of the smaller languages. So Finnish is certainly not alone, but it is, to some degree, in a unique position within Europe in that it is in a very small language family and it happens to be quite distant from other members of that family. So we think it's quite important that we dedicate resources to having specifically Finnish texts in order to train these types of models.

James - The reason for doing what you're doing, it's got two prongs, really. It's to prevent a kind of cultural apocalypse where we move towards a single language world and destroy so much in the process, but also as a foundation for Finnish artificial intelligence in institutions.

Sampo - Yes, absolutely. So we wish to maintain, to preserve a degree of independence, not only for our language, but also for our academic and industry work, where we don't become reliant on assistants that are only running on servers in the US and only available via API. So it still very much remains to be seen to what degree these systems will actually form the foundation for a new type of industry or replace work in current ones, but the more they do, the more important I think it is for us to have national and European infrastructure that can compete, at least to a degree, with what the big multinationals are doing.

James - And all this is possible thanks to LUMI and the computational grunt and backing it gives you. Is it the case that some other countries, which perhaps don't have the reputation for computing that Finland does, might see their languages left behind in this?

Sampo - We certainly hope not. We're currently a member of a Horizon EU-funded project that seeks to develop similar models that would cover at least all official EU languages. So it is our goal to extend what we have now been able to do for Finnish to all European languages and hopefully also beyond.
