-
SemRel2024: A Collection of Semantic Textual Relatedness Datasets for 13 Languages
Authors:
Nedjma Ousidhoum,
Shamsuddeen Hassan Muhammad,
Mohamed Abdalla,
Idris Abdulmumin,
Ibrahim Said Ahmad,
Sanchit Ahuja,
Alham Fikri Aji,
Vladimir Araujo,
Abinew Ali Ayele,
Pavan Baswani,
Meriem Beloucif,
Chris Biemann,
Sofia Bourhim,
Christine De Kock,
Genet Shanko Dekebo,
Oumaima Hourrane,
Gopichand Kanumolu,
Lokesh Madasu,
Samuel Rutunda,
Manish Shrivastava,
Thamar Solorio,
Nirmal Surange,
Hailegnaw Getaneh Tilaye,
Krishnapriya Vishnubhotla,
Genta Winata
, et al. (2 additional authors not shown)
Abstract:
Exploring and quantifying semantic relatedness is central to representing language and holds significant implications across various NLP tasks. While earlier NLP research primarily focused on semantic similarity, often within the English language context, we instead investigate the broader phenomenon of semantic relatedness. In this paper, we present SemRel, a new semantic relatedness dataset collection annotated by native speakers across 13 languages: Afrikaans, Algerian Arabic, Amharic, English, Hausa, Hindi, Indonesian, Kinyarwanda, Marathi, Moroccan Arabic, Modern Standard Arabic, Spanish, and Telugu. These languages originate from five distinct language families and are predominantly spoken in Africa and Asia, regions characterised by a relatively limited availability of NLP resources. Each instance in the SemRel datasets is a sentence pair associated with a score that represents the degree of semantic textual relatedness between the two sentences. The scores are obtained using a comparative annotation framework. We describe the data collection and annotation processes, challenges when building the datasets, baseline experiments, and their impact and utility in NLP.
Submitted 31 May, 2024; v1 submitted 13 February, 2024;
originally announced February 2024.
-
Fine-grained Contract NER using instruction based model
Authors:
Hiranmai Sri Adibhatla,
Pavan Baswani,
Manish Shrivastava
Abstract:
Lately, instruction-based techniques have made significant strides in improving performance in few-shot learning scenarios. They achieve this by bridging the gap between pre-trained language models and fine-tuning for specific downstream tasks. Despite these advancements, the performance of Large Language Models (LLMs) in information extraction tasks like Named Entity Recognition (NER), using prompts or instructions, still falls short of supervised baselines. This performance gap can be attributed to the fundamental disparity between NER and LLMs. NER is inherently a sequence labeling task, where the model must assign entity-type labels to individual tokens within a sentence. In contrast, LLMs are designed for text generation. This distinction between sequence labeling and text generation leads to subpar performance. In this paper, we transform the NER task into a text-generation task that LLMs can readily adapt to. This involves enhancing source sentences with task-specific instructions and answer choices, allowing for the identification of entities and their types within natural language. We harness the strength of LLMs by integrating supervised learning within them. The goal of this combined strategy is to boost the performance of LLMs in extraction tasks like NER while simultaneously addressing hallucination issues often observed in LLM-generated content. We release a novel corpus, Contract NER, comprising seven frequently observed contract categories and encompassing named entities associated with 18 distinct legal entity types, along with our baseline models. Our models and dataset are available to the community for future research.
Submitted 24 January, 2024;
originally announced January 2024.
-
Unsupervised Approach to Evaluate Sentence-Level Fluency: Do We Really Need Reference?
Authors:
Gopichand Kanumolu,
Lokesh Madasu,
Pavan Baswani,
Ananya Mukherjee,
Manish Shrivastava
Abstract:
Fluency is a crucial goal of all Natural Language Generation (NLG) systems. Widely used automatic evaluation metrics fall short in capturing the fluency of machine-generated text. Assessing the fluency of NLG systems poses a challenge since these models are not limited to simply reusing words from the input but may also generate abstractions. Existing reference-based fluency evaluations, such as word overlap measures, often exhibit weak correlations with human judgments. This paper adapts an existing unsupervised technique for measuring text fluency without the need for any reference. Our approach leverages various word embeddings and trains language models using Recurrent Neural Network (RNN) architectures. We also experiment with other available multilingual Language Models (LMs). To assess the performance of the models, we conduct a comparative analysis across 10 Indic languages, correlating the obtained fluency scores with human judgments. Our code and human-annotated benchmark test set for fluency are available at https://github.com/AnanyaCoder/TextFluencyForIndicLanaguges.
Submitted 3 December, 2023;
originally announced December 2023.
-
Butterflies: A new source of inspiration for futuristic aerial robotics
Authors:
Chakravarthi Jada,
Lokesh Ch. R. S,
Ashok Urlana,
Shridi Swamy Yerubandi,
Kantha Rao Bora,
Gouse Basha Shaik,
Pavan Baswani,
Balaraju Karri
Abstract:
Nature is a habitat for an enormous number of species. All species perform complex activities using simple and elegant rules for their survival. The emergence of collective behaviour remarkably supports these activities. One form of collective behaviour is swarm intelligence, in which all agents possess the same rules and capabilities. This equality, along with local cooperation among the agents, leads to the achievement of global results. Swarm behaviours in nature include bird formations, fish school maneuvering, and ant movement. Recently, one school of research has studied these behaviours and proposed artificial paradigms such as Particle Swarm Optimization (PSO), Ant Colony Optimization (ACO), and Glowworm Swarm Optimization (GSO). Another school of research has used these models to design robotic platforms that detect (locate) multiple signal sources such as light, fire, plumes, and odour. The Kinbots platform is one such recent experiment. In the same line of thought, this extended abstract presents the recently proposed butterfly-inspired metaphor, along with corresponding simulations and ongoing experiments with their outcomes.
Submitted 24 August, 2022;
originally announced September 2022.