Supporting Regional Languages: A Story of Collaboration between the Occitan Institute of Culture and Wikidata

The Occitan language or lenga d’òc is a Romance language spoken in Occitania, a region that spans parts of southern France as well as Monaco and parts of Spain and Italy. Today, it is mostly spoken as a second language. Preserving a regional language in the age of the Internet is not a small task, but fortunately digitization offers new possibilities.

  • Léa Lacroix
  • 27. September 2019

This blog post is also available in German, French, and Occitan.

Wikidata, the free knowledge base that can be edited by everyone, offers the possibility to enter lexicographical data to describe elements of languages. Lo Congrès permanent de la lenga Occitana, an institution that helps to bring Occitan vocabulary into the digital age with computer linguistics and natural language processing (NLP), took on the task to create Occitan words (or lexemes) in Wikidata.

The process of importing the ancient Occitan language digitally into Wikidata is not unlike the construction of the fortified city of Carcassonne in Occitania — it’s all about the building blocks that stand the test of time.

In order to give more details about the project, we spoke to Aure and Vincent, who were both involved in preparing the import of Occitan words into Wikidata.

Can you introduce yourself and Lo Congrès, what are you doing?

„Lo Congrès permanent de la lenga Occitana“ is a scientific institution that aims to contribute to the development of Occitan through the production of tools regarding the different aspects of the Occitan language (lexicography, lexicology, terminology, neology, phonology, graphics, grammar and toponymy). We have been producing digital tools such as the online dictionary dicod’Òc, the verb conjugator verb’Òc and natural language processing (NLP) tools.

Vincent Gleizes: I am 25 years old and I live in France. I am a computer science student and more precisely I am moving towards data development and organization. I also went through a bachelor’s degree in Occitan in literature school, which led me to do an internship at Lo Congrès, in order to link two of my passions.

Aure Séguier: And I am Aure Séguier, natural language processing project manager at the Congress.

What kind of language-related data are you working with, for what purpose?

We work with data that is used to create tools in the field of natural language processing. For example, we have monolingual and bilingual dictionaries formatted in TEI, lexicons of flexed forms, corpuses of texts…

Why did you decide to import Occitan words into Wikidata?

In a few years, the Occitan’s formal processing tools will be completed. At that point, it will be necessary to focus on semantic analysis. However, for this to happen, the words of our language must be linked to concepts that can be understood by computers. Occitan does not have the human and financial resources to build a concept base from scratch. Fortunately, there is Wikidata, which is completely free to use. By adding Occitan Lexemes to Wikidata, and linking them to concepts, we offer the possibility for Occitan NLP actors to build tools working on the meaning of a text, such as chatbots, text summary tools, personal assistants.

How did you proceed? What were the different steps?

Vincent: The first step was to get more familiar with the Wikidata API, the lexicographic data organization model and the general functioning of Wikidata. The goal was to know how to reorganize Lo Congrès’ data to match it with the Wikidata data model. Then plenty of interesting question occurred: how to integrate this data, which API function to use for each functionality of the import script? How to check that the requests have worked well? How did the other contributors inform this or that information?

Together with the first task, we had to think about the very concept of how the import script would work, to write the main algorithm, and to answer ground questions such as: What are the characteristics to determine if two Lexemes are identical? I then went on to write the script functions and, after testing them separately, to write the algorithm itself.

The last step was a series of tests/corrections in the Wikidata test environment to highlight the slightest possible error. And finally test phases that were presented to the Wikidata community.

What did you experience during the process, what went well, what kind of issues did you encounter?

Vincent : Overall the project went very well. I would say that everything worked well, and yet it wasn’t easy: I had to write a bot (script) in a language I didn’t know at all and I had to extract the data from file formats I had never worked with before. I didn’t think I would be able to make a program functional after one month (the duration of my internship). But fortunately I was able to receive help and good advice around me which allowed me to progress at a good pace.

The main problem I encountered was mainly my lack of knowledge of Wikidata, its API and documentation. I often found myself stuck with questions like „how to get this information back?“ or „why this request doesn’t work?“ until I dared to ask, and an answer (clear, limpid and that seemed almost obvious) came back to me each time from members of the Wikidata community.

What are the next steps, how are you going to continue working with Wikidata & languages?

Aure: Once the Lexemes are imported, they will have to be linked to the concepts they represent. This will require the involvement of the community, as none of our partner organizations has enough staff to undertake such a project. To make such a task attractive, it will have to be presented in a playful way. Maybe via a mobile application like a game?

What could you do thanks to the imported data? What uses could you imagine?

Aure: Once the Lexemes are linked to the concepts, we can imagine anything. Wikidata will help to disambiguate the meaning of words. For example, if I say I want to buy „a mouse for my computer“ from a program, it will search for which concept is associated with the word „mouse“. He’s going to find two. He will look at which one is linked to another concept associated with the word „computer“. This will let him know that I don’t want to buy a laboratory mouse.

We can thus build tools that summarize texts, classify documents by theme, automatically answer a user’s questions… or even an intelligent personal assistant so that smartphones can also speak Occitan!

Interview held by Léa Lacroix, Nicolas Vigneron and Jens Ohlig.

Leave a comment

No comments yet

Leave a comment