Bangalore: A joint research team from IIT Bombay and the Indian Institute of Science has developed a large language model that outperforms GPT-4 and other leading AI systems on tasks involving Indian regional languages, in what researchers are calling a major breakthrough for inclusive artificial intelligence.
The model, named BharatGPT, was trained on over 400 billion tokens of text spanning 22 Indian languages, including Hindi, Tamil, Telugu, Bengali, Marathi, Gujarati, and Kannada, as well as English. In benchmark tests covering translation, question-answering, summarisation, and language understanding, BharatGPT scored significantly higher than GPT-4, Gemini, and Claude on the Indian-language tasks while remaining competitive on English ones.
Why This Matters
The development is significant because over 900 million Indians are not proficient in English, meaning they have been largely excluded from the benefits of the current AI revolution. Most leading AI models perform poorly on Indian languages because the training data available for these languages is a tiny fraction of what is available for English.
The IIT Bombay team addressed this by developing new techniques for generating synthetic training data in underrepresented Indian languages, effectively multiplying the available training material by a factor of 30.
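The article does not describe the team's actual data-generation method. As a purely illustrative sketch, one common way to multiply scarce training material is template-based synthetic generation, where seed sentence templates are filled with every combination of domain slot values. The templates, slot names, and values below are hypothetical examples, not from BharatGPT.

```python
from itertools import product

def expand_templates(templates, slots):
    """Generate synthetic sentences by filling each template with every
    combination of values for the slots it mentions. A generic sketch of
    data augmentation, not the researchers' actual technique."""
    examples = []
    for template in templates:
        # Only use slots that actually appear in this template.
        keys = [k for k in slots if "{" + k + "}" in template]
        for values in product(*(slots[k] for k in keys)):
            examples.append(template.format(**dict(zip(keys, values))))
    return examples

# Hypothetical seed templates and slot values for an agriculture-advisory domain.
templates = [
    "What is the price of {crop} in {city}?",
    "How do I treat {disease} in {crop}?",
]
slots = {
    "crop": ["rice", "wheat", "cotton"],
    "city": ["Pune", "Nagpur"],
    "disease": ["leaf blight", "rust"],
}

synthetic = expand_templates(templates, slots)
print(len(synthetic))  # 2 seed templates expand to 12 synthetic examples
```

In practice such expansion is typically combined with translation, paraphrasing, or model-generated text to add variety, but even this toy version shows how a small seed set can fan out multiplicatively.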
The researchers plan to release BharatGPT as an open-source model and are in discussions with the government about deploying it for public services including agriculture advisory, healthcare information, and legal aid.