Meet 'OpenHathi', the first-ever Hindi large language model
Sarvam AI, an Indian start-up, has launched 'OpenHathi-Hi-v0.1,' the first Hindi Large Language Model (LLM) in the OpenHathi series. The budget-friendly model extends Llama2-7B and, according to the company, offers GPT-3.5-like performance for Indic languages. Founded by Pratyush Kumar and Vivek Raghavan in July 2023, Sarvam AI has raised $41 million in Series A funding from Lightspeed Ventures, Peak XV Partners, and Khosla Ventures.
Two-phase training process and performance
OpenHathi extends Llama2-7B's tokenizer to a 48K-token vocabulary and uses a two-phase training process. In the first phase, embedding alignment, the model aligns the randomly initialized Hindi embeddings. In the second phase, bilingual language modeling, it learns cross-lingual attention across tokens. Sarvam AI says the model performs well on a range of Hindi tasks, with results similar to or better than OpenAI's GPT-3.5, while maintaining its English proficiency.
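As a rough illustration of the first phase, the toy sketch below appends randomly initialized rows to an embedding table and then updates only those new rows. It is not Sarvam AI's code: the 32,000-entry base vocabulary is Llama 2's published tokenizer size, while the embedding width, the dummy gradients, the learning rate, the choice to freeze the base rows, and the function names `extend_embeddings` and `phase1_step` are all assumptions made for the example.

```python
import random

BASE_VOCAB = 32_000      # Llama 2 tokenizer vocabulary size
EXTENDED_VOCAB = 48_000  # vocabulary size after adding Hindi tokens
DIM = 4                  # toy embedding width (assumption; real models are far wider)


def extend_embeddings(base, new_vocab_size, seed=0):
    """Return a copy of `base` with randomly initialized rows appended."""
    rng = random.Random(seed)
    dim = len(base[0])
    extra = [
        [rng.gauss(0.0, 0.02) for _ in range(dim)]
        for _ in range(new_vocab_size - len(base))
    ]
    return [row[:] for row in base] + extra


def phase1_step(emb, grads, first_new_id, lr=0.1):
    """Take one gradient step on the new rows only; base rows stay frozen.

    Freezing the base rows is an assumption of this sketch, not a detail
    confirmed by the article.
    """
    for token_id in range(first_new_id, len(emb)):
        emb[token_id] = [
            w - lr * g for w, g in zip(emb[token_id], grads[token_id])
        ]


# Stand-in for the pretrained embedding rows (hypothetical values).
base = [[0.5] * DIM for _ in range(BASE_VOCAB)]
emb = extend_embeddings(base, EXTENDED_VOCAB)

# Dummy gradients; a real run would compute them from a training loss.
grads = [[1.0] * DIM for _ in range(EXTENDED_VOCAB)]

before_new = emb[BASE_VOCAB][:]
phase1_step(emb, grads, first_new_id=BASE_VOCAB)
```

After the step, the first 32,000 rows are unchanged and only the 16,000 appended rows have moved. In a real training setup the same effect is usually achieved by masking gradients or marking parameters as non-trainable rather than by looping over rows.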
Collaboration with AI4Bharat and KissanAI
Sarvam AI developed OpenHathi in collaboration with academic partners at AI4Bharat, who provided language resources and benchmarks. The model was fine-tuned with KissanAI using data from a bot that converses with farmers in multiple languages. KissanAI recently launched Dhenu 1.0, an agriculture-focused large language model designed for Indian agricultural practices that understands queries in English, Hindi, and Hinglish.
Aiming to cater to India's unique needs
Drawing on a background in AI research and digital infrastructure development, Sarvam AI focuses on India's unique needs. The start-up emphasizes integrating generative AI across Indian languages and encourages collaborations to develop domain-specific AI models using enterprise data. OpenHathi-Hi-v0.1 is a significant step toward meeting the linguistic needs of the Indian market and highlights the potential of AI-driven language models in the country.