Faculty at the Indian Institute of Technology Madras (IIT Madras) have developed artificial intelligence models and datasets for processing text in 11 Indian regional languages. The multilingual AI models and datasets built by this project provide crucial building blocks for students, faculty, start-ups, and industry to work on Indian language technologies and push the boundaries of the field. As we move into a new future, our languages need to find their place online. This calls for considerable innovation in developing input methods, datasets, and AI models for Indian languages.
Imagine, for example, a learner who posts a question on an e-learning website in Tamil, Hindi, or any of a number of other Indian regional languages. There is a need for software that can automatically process questions written in Indian languages and classify them by subject. Dr. Mitesh Khapra and Dr. Pratyush Kumar are affiliated with the Robert Bosch Computer Science and Artificial Intelligence Centre.
Dr. Pratyush Kumar, Assistant Professor at the Department of Computer Science and Engineering, IIT Madras, said, “This initiative is one of the few attempts by academia to build and publicly release such large-scale multilingual AI models containing millions of parameters trained on billions of tokens in 11 Indian languages, fully free and open-source.” The models exploit the similarity between Indian languages to make efficient use of the available data. With these models, researchers have been able to advance the state of the art in Indian language processing on multiple tasks, such as text classification, sentiment analysis, semantic matching, and paraphrase identification.
The lack of such data has hindered the creation of such models for Indian languages. As a result, Indian-language NLP has not advanced at the pace it should have. Dr. Anoop Kunchukuttan said, “We also hope that start-ups and social projects working on Indian language technology will be able to take advantage of our pre-trained models and adapt them to specific applications by gathering smaller volumes of in-domain data.” The research team hopes that this effort will serve as a “call to action” for academia, government, and industry to come together and create more robust datasets for Indian languages.
Data powers AI technology, and it’s time to make a significant investment in developing datasets for Indian languages.