Nepal Bhasa on Google Translate
Saving an endangered indigenous language by using latest AI and LLM tools
A young non-resident Newa person in Dallas emails a New York Times article to his grandfather in Nepal who cannot read English.
Grandad copies the English text, pastes it into Google Translate to read it in Newari.
This translation feature, available for the world’s major languages, can now also be used by the indigenous people of Kathmandu Valley to translate their mother tongue into English and vice versa.
This took three-and-a-half years of effort and will affect everyday life as well as the study and preservation of Newa culture. It also shows a path forward for Nepal’s 123 other languages and dialects.
The World Newah Organisation (WNO) in 2011 invited Newa people from around the world to a convention in London. Software engineer and historian Sanyukta Shrestha was involved, and now leads WNO.
The top priority for discussion was how to conserve the language, which UNESCO had classified as ‘definitely endangered’, with only 850,000 speakers left.
“The Newa people are diverse: there are different sects, religions. The language and the calendar are what is common,” explains Shrestha. “The strategy was to unite the community behind an existing technology.”
Ujjwal Rajbhandari, an engineer at Google who does not even speak the language, got involved. “I thought why not take advantage of a great tool that already exists,” Rajbhandari, who is co-founder at Qubrid AI in Austin, Texas, told us on the phone.
Work started in December 2020, but the prognosis was bleak. Engineers said that the project might take up to 10 years. Translate runs on a ‘transformer’, a machine learning neural network designed to take sequences as input, use math to model the context, and produce a probable output. It uses an ‘attention’ mechanism, which looks at the input text and tries to work out how each word relates to every other word.
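The ‘attention’ idea can be illustrated with a few lines of Python. This is only a minimal sketch of scaled dot-product self-attention, not Google Translate's actual code: the three two-number word vectors are made up for illustration, and the sketch assumes each vector serves as its own query, key and value, whereas real transformers learn separate projections for each role.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Scaled dot-product self-attention with no learned parameters.

    Assumption for brevity: each word vector acts as its own query,
    key and value. Real models learn a separate projection for each.
    """
    dim = len(vectors[0])
    weights, outputs = [], []
    for query in vectors:
        # Score this word against every word in the sentence, itself included.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        row = softmax(scores)
        weights.append(row)
        # The new representation is a weighted average of all word vectors.
        outputs.append(
            [sum(w * vec[i] for w, vec in zip(row, vectors)) for i in range(dim)]
        )
    return weights, outputs


# Hypothetical vectors for a three-word sentence; words 1 and 2 are similar.
toy_sentence = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
weights, outputs = self_attention(toy_sentence)
for row in weights:
    print([round(w, 2) for w in row])
```

Each printed row shows how much one word ‘attends’ to every word in the sentence. In this toy input the first word gives more weight to the similar second word than to the unrelated third one, which is the kind of relationship the mechanism is meant to capture.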
The more data you feed this statistical monster, the better it gets. But that was the first problem. There was a dire lack of digital material, whether in Newari text or in bilingual Newari-Nepali or Newari-English dictionaries.
“Tens of thousands of digital pages of content were needed, so in 2021 we met with the editors of Newa magazines and asked to scan and digitise their material,” Shrestha recalls. “We hunted down writers, publishers, songwriters, invitation cards - anything that was written in Nepalbhasa.”
Google gave WNO a sheet of 1,200 sentences from random contexts, to be translated. WNO formed a committee to complete this task, and by April 2021, Translate could at least identify the language.
Nepalbhasa can be written in 13 different scripts, of which the two most popular are Ranjana and Devanagari. The team chose the latter because there is a lot more material in Nepalbhasa in Devanagari, and many more people who can type in it. “Devanagari is what has allowed any of this work at all,” points out Shrestha. “Translation was the much harder problem to solve, converting output to another face is much easier.”
Now the model needed a large volume of input from the community. “Every AI model is dependent on the volume of data, even that which is introduced later. User corrections are fed back, which makes the model more robust,” says Rajbhandari.
This happened to be at the peak of the pandemic. People knew how to video-chat and had time on their hands. WNO members met for one or two hours in the evening, on a Zoom call that they also broadcast on Facebook. People worked together translating, which was very effective because they could get instant input or feedback from others on the call. “It was cool to see how this brought together elders who knew the language deeply, and the youngsters who have the tech skills,” says Rajbhandari.
By February 2024 there were 500,000 contributions, and Google deemed that this volume of data was sufficient for the model to ‘level up’. Translate added Nepalbhasa to its list of supported languages in June last year. The language had now been integrated into a tool that can be used for free from anywhere with an internet connection. Accuracy was unusually high, over 90%.
It opens up a whole different world on the internet for those who only know Nepalbhasa, since any website integrated with Google can be automatically and instantly converted into that language.
The tool is already being used in Eastern Europe to translate legal documents. Newa songwriters, like Ujan Shakya, use it to convert English lyrics. And when Rajbhandari speaks to WNO over video, his English is instantly translated into Newari subtitles.
“Older people are happy that language will now not die. Younger ones can use it to self-learn the language, now they can’t blame elders for not teaching them,” says Shrestha.
Yet another use is the integration with large language models, like ChatGPT. LLMs encounter Nepalbhasa in their training data, so they can already generate rudimentary responses.
Says Shrestha, “You can ask questions about a Newa book in English, and the LLM gives you a coherent answer and even cites the page number.” This makes it possible to use Newa books for research, and a lot faster.
Leading this mammoth task led Shrestha to a better appreciation for his mother tongue. “I saw the richness of Newari. There are onomatopoeia and proverbs that only elders use, and it is these and the dialects that need to be saved,” he says.
The next step is a text-to-speech and speech-to-text interface, which would open the tool up to those who speak and understand Nepalbhasa but may not read, write or type it. Tourists to Nepal would also benefit.
Rajbhandari wants to work on similar projects for other endangered languages of Nepal, and the team has shown that it does not need government to do such work.
Says Rajbhandari: “We have learned many things that would make integration of future languages more efficient so we can solve problems in other projects.”