Building a Llama2-finetuned LLM for Odia Language Utilizing Domain Knowledge Instruction Set
Authors:
Guneet Singh Kohli,
Shantipriya Parida,
Sambit Sekhar,
Samirit Saha,
Nipun B Nair,
Parul Agarwal,
Sonal Khosla,
Kusumlata Patiyal,
Debasish Dhal
Abstract:
LLMs for languages other than English are in great demand because multilingual LLMs are often unavailable or perform poorly, for example in understanding local context. The problem is especially acute for low-resource languages, which need instruction sets. In a multilingual country like India, LLMs supporting Indic languages are needed to provide generative AI and LLM-based technologies and services to its citizens.
This paper presents our approach to i) generating a large Odia instruction set, including domain knowledge data suitable for LLM fine-tuning, and ii) building a Llama2-finetuned model tailored for enhanced performance in the Odia domain. The proposed work will help researchers build instruction sets and LLMs, particularly for Indic languages. We will release the model and instruction set to the public for research and noncommercial purposes.
Submitted 19 December, 2023;
originally announced December 2023.
HaVQA: A Dataset for Visual Question Answering and Multimodal Research in Hausa Language
Authors:
Shantipriya Parida,
Idris Abdulmumin,
Shamsuddeen Hassan Muhammad,
Aneesh Bose,
Guneet Singh Kohli,
Ibrahim Said Ahmad,
Ketan Kotwal,
Sayan Deb Sarkar,
Ondřej Bojar,
Habeebah Adamu Kakudi
Abstract:
This paper presents HaVQA, the first multimodal dataset for visual question-answering (VQA) tasks in the Hausa language. The dataset was created by manually translating 6,022 English question-answer pairs, which are associated with 1,555 unique images from the Visual Genome dataset. As a result, the dataset provides 12,044 gold-standard English-Hausa parallel sentences that were translated in a fashion that guarantees their semantic match with the corresponding visual information. We conducted several baseline experiments on the dataset, including visual question answering, visual question elicitation, and text-only and multimodal machine translation.
Submitted 28 May, 2023;
originally announced May 2023.