-
From Local Concepts to Universals: Evaluating the Multicultural Understanding of Vision-Language Models
Authors:
Mehar Bhatia,
Sahithya Ravi,
Aditya Chinchure,
Eunjeong Hwang,
Vered Shwartz
Abstract:
Despite recent advancements in vision-language models, their performance remains suboptimal on images from non-western cultures due to underrepresentation in training datasets. Various benchmarks have been proposed to test models' cultural inclusivity, but they have limited coverage of cultures and do not adequately assess cultural diversity across universal as well as culture-specific local conce…
▽ More
Despite recent advancements in vision-language models, their performance remains suboptimal on images from non-western cultures due to underrepresentation in training datasets. Various benchmarks have been proposed to test models' cultural inclusivity, but they have limited coverage of cultures and do not adequately assess cultural diversity across universal as well as culture-specific local concepts. To address these limitations, we introduce the GlobalRG benchmark, comprising two challenging tasks: retrieval across universals and cultural visual grounding. The former task entails retrieving culturally diverse images for universal concepts from 50 countries, while the latter aims at grounding culture-specific concepts within images from 15 countries. Our evaluation across a wide range of models reveals that the performance varies significantly across cultures -- underscoring the necessity for enhancing multicultural understanding in vision-language models.
△ Less
Submitted 28 June, 2024;
originally announced July 2024.
-
CulturalTeaming: AI-Assisted Interactive Red-Teaming for Challenging LLMs' (Lack of) Multicultural Knowledge
Authors:
Yu Ying Chiu,
Liwei Jiang,
Maria Antoniak,
Chan Young Park,
Shuyue Stella Li,
Mehar Bhatia,
Sahithya Ravi,
Yulia Tsvetkov,
Vered Shwartz,
Yejin Choi
Abstract:
Frontier large language models (LLMs) are developed by researchers and practitioners with skewed cultural backgrounds and on datasets with skewed sources. However, LLMs' (lack of) multicultural knowledge cannot be effectively assessed with current methods for developing benchmarks. Existing multicultural evaluations primarily rely on expensive and restricted human annotations or potentially outdat…
▽ More
Frontier large language models (LLMs) are developed by researchers and practitioners with skewed cultural backgrounds and on datasets with skewed sources. However, LLMs' (lack of) multicultural knowledge cannot be effectively assessed with current methods for developing benchmarks. Existing multicultural evaluations primarily rely on expensive and restricted human annotations or potentially outdated internet resources. Thus, they struggle to capture the intricacy, dynamics, and diversity of cultural norms. LLM-generated benchmarks are promising, yet risk propagating the same biases they are meant to measure. To synergize the creativity and expert cultural knowledge of human annotators and the scalability and standardizability of LLM-based automation, we introduce CulturalTeaming, an interactive red-teaming system that leverages human-AI collaboration to build truly challenging evaluation dataset for assessing the multicultural knowledge of LLMs, while improving annotators' capabilities and experiences. Our study reveals that CulturalTeaming's various modes of AI assistance support annotators in creating cultural questions, that modern LLMs fail at, in a gamified manner. Importantly, the increased level of AI assistance (e.g., LLM-generated revision hints) empowers users to create more difficult questions with enhanced perceived creativity of themselves, shedding light on the promises of involving heavier AI assistance in modern evaluation dataset creation procedures. Through a series of 1-hour workshop sessions, we gather CULTURALBENCH-V0.1, a compact yet high-quality evaluation dataset with users' red-teaming attempts, that different families of modern LLMs perform with accuracy ranging from 37.7% to 72.2%, revealing a notable gap in LLMs' multicultural proficiency.
△ Less
Submitted 9 April, 2024;
originally announced April 2024.
-
GD-COMET: A Geo-Diverse Commonsense Inference Model
Authors:
Mehar Bhatia,
Vered Shwartz
Abstract:
With the increasing integration of AI into everyday life, it's becoming crucial to design AI systems that serve users from diverse backgrounds by making them culturally aware. In this paper, we present GD-COMET, a geo-diverse version of the COMET commonsense inference model. GD-COMET goes beyond Western commonsense knowledge and is capable of generating inferences pertaining to a broad range of cu…
▽ More
With the increasing integration of AI into everyday life, it's becoming crucial to design AI systems that serve users from diverse backgrounds by making them culturally aware. In this paper, we present GD-COMET, a geo-diverse version of the COMET commonsense inference model. GD-COMET goes beyond Western commonsense knowledge and is capable of generating inferences pertaining to a broad range of cultures. We demonstrate the effectiveness of GD-COMET through a comprehensive human evaluation across 5 diverse cultures, as well as extrinsic evaluation on a geo-diverse task. The evaluation shows that GD-COMET captures and generates culturally nuanced commonsense knowledge, demonstrating its potential to benefit NLP applications across the board and contribute to making NLP more inclusive.
△ Less
Submitted 23 October, 2023;
originally announced October 2023.
-
A Novel Deep Learning based Model for Erythrocytes Classification and Quantification in Sickle Cell Disease
Authors:
Manish Bhatia,
Balram Meena,
Vipin Kumar Rathi,
Prayag Tiwari,
Amit Kumar Jaiswal,
Shagaf M Ansari,
Ajay Kumar,
Pekka Marttinen
Abstract:
The shape of erythrocytes or red blood cells is altered in several pathological conditions. Therefore, identifying and quantifying different erythrocyte shapes can help diagnose various diseases and assist in designing a treatment strategy. Machine Learning (ML) can be efficiently used to identify and quantify distorted erythrocyte morphologies. In this paper, we proposed a customized deep convolu…
▽ More
The shape of erythrocytes or red blood cells is altered in several pathological conditions. Therefore, identifying and quantifying different erythrocyte shapes can help diagnose various diseases and assist in designing a treatment strategy. Machine Learning (ML) can be efficiently used to identify and quantify distorted erythrocyte morphologies. In this paper, we proposed a customized deep convolutional neural network (CNN) model to classify and quantify the distorted and normal morphology of erythrocytes from the images taken from the blood samples of patients suffering from Sickle cell disease ( SCD). We chose SCD as a model disease condition due to the presence of diverse erythrocyte morphologies in the blood samples of SCD patients. For the analysis, we used 428 raw microscopic images of SCD blood samples and generated the dataset consisting of 10, 377 single-cell images. We focused on three well-defined erythrocyte shapes, including discocytes, oval, and sickle. We used 18 layered deep CNN architecture to identify and quantify these shapes with 81% accuracy, outperforming other models. We also used SHAP and LIME for further interpretability. The proposed model can be helpful for the quick and accurate analysis of SCD blood samples by the clinicians and help them make the right decision for better management of SCD.
△ Less
Submitted 2 May, 2023;
originally announced May 2023.
-
Perception Point: Identifying Critical Learning Periods in Speech for Bilingual Networks
Authors:
Anuj Saraswat,
Mehar Bhatia,
Yaman Kumar Singla,
Changyou Chen,
Rajiv Ratn Shah
Abstract:
Recent studies in speech perception have been closely linked to fields of cognitive psychology, phonology, and phonetics in linguistics. During perceptual attunement, a critical and sensitive developmental trajectory has been examined in bilingual and monolingual infants where they can best discriminate common phonemes. In this paper, we compare and identify these cognitive aspects on deep neural-…
▽ More
Recent studies in speech perception have been closely linked to fields of cognitive psychology, phonology, and phonetics in linguistics. During perceptual attunement, a critical and sensitive developmental trajectory has been examined in bilingual and monolingual infants where they can best discriminate common phonemes. In this paper, we compare and identify these cognitive aspects on deep neural-based visual lip-reading models. We conduct experiments on the two most extensive public visual speech recognition datasets for English and Mandarin. Through our experimental results, we observe a strong correlation between these theories in cognitive psychology and our unique modeling. We inspect how these computational models develop similar phases in speech perception and acquisitions.
△ Less
Submitted 13 October, 2021;
originally announced October 2021.
-
One to rule them all: Towards Joint Indic Language Hate Speech Detection
Authors:
Mehar Bhatia,
Tenzin Singhay Bhotia,
Akshat Agarwal,
Prakash Ramesh,
Shubham Gupta,
Kumar Shridhar,
Felix Laumann,
Ayushman Dash
Abstract:
This paper is a contribution to the Hate Speech and Offensive Content Identification in Indo-European Languages (HASOC) 2021 shared task. Social media today is a hotbed of toxic and hateful conversations, in various languages. Recent news reports have shown that current models struggle to automatically identify hate posted in minority languages. Therefore, efficiently curbing hate speech is a crit…
▽ More
This paper is a contribution to the Hate Speech and Offensive Content Identification in Indo-European Languages (HASOC) 2021 shared task. Social media today is a hotbed of toxic and hateful conversations, in various languages. Recent news reports have shown that current models struggle to automatically identify hate posted in minority languages. Therefore, efficiently curbing hate speech is a critical challenge and problem of interest. We present a multilingual architecture using state-of-the-art transformer language models to jointly learn hate and offensive speech detection across three languages namely, English, Hindi, and Marathi. On the provided testing corpora, we achieve Macro F1 scores of 0.7996, 0.7748, 0.8651 for sub-task 1A and 0.6268, 0.5603 during the fine-grained classification of sub-task 1B. These results show the efficacy of exploiting a multilingual training scheme.
△ Less
Submitted 28 September, 2021;
originally announced September 2021.
-
Evaluation Toolkit For Robustness Testing Of Automatic Essay Scoring Systems
Authors:
Anubha Kabra,
Mehar Bhatia,
Yaman Kumar,
Junyi Jessy Li,
Rajiv Ratn Shah
Abstract:
Automatic scoring engines have been used for scoring approximately fifteen million test-takers in just the last three years. This number is increasing further due to COVID-19 and the associated automation of education and testing. Despite such wide usage, the AI-based testing literature of these "intelligent" models is highly lacking. Most of the papers proposing new models rely only on quadratic…
▽ More
Automatic scoring engines have been used for scoring approximately fifteen million test-takers in just the last three years. This number is increasing further due to COVID-19 and the associated automation of education and testing. Despite such wide usage, the AI-based testing literature of these "intelligent" models is highly lacking. Most of the papers proposing new models rely only on quadratic weighted kappa (QWK) based agreement with human raters for showing model efficacy. However, this effectively ignores the highly multi-feature nature of essay scoring. Essay scoring depends on features like coherence, grammar, relevance, sufficiency and, vocabulary. To date, there has been no study testing Automated Essay Scoring: AES systems holistically on all these features. With this motivation, we propose a model agnostic adversarial evaluation scheme and associated metrics for AES systems to test their natural language understanding capabilities and overall robustness. We evaluate the current state-of-the-art AES models using the proposed scheme and report the results on five recent models. These models range from feature-engineering-based approaches to the latest deep learning algorithms. We find that AES models are highly overstable. Even heavy modifications(as much as 25%) with content unrelated to the topic of the questions do not decrease the score produced by the models. On the other hand, irrelevant content, on average, increases the scores, thus showing that the model evaluation strategy and rubrics should be reconsidered. We also ask 200 human raters to score both an original and adversarial response to seeing if humans can detect differences between the two and whether they agree with the scores assigned by auto scores.
△ Less
Submitted 14 November, 2021; v1 submitted 13 July, 2020;
originally announced July 2020.
-
Design and Simulation of an 8-bit Dedicated Processor for calculating the Sine and Cosine of an Angle using the CORDIC Algorithm
Authors:
Aman Chadha,
Divya Jyoti,
M. G. Bhatia
Abstract:
This paper describes the design and simulation of an 8-bit dedicated processor for calculating the Sine and Cosine of an Angle using CORDIC Algorithm (COordinate Rotation DIgital Computer), a simple and efficient algorithm to calculate hyperbolic and trigonometric functions. We have proposed a dedicated processor system, modeled by writing appropriate programs in VHDL, for calculating the Sine and…
▽ More
This paper describes the design and simulation of an 8-bit dedicated processor for calculating the Sine and Cosine of an Angle using CORDIC Algorithm (COordinate Rotation DIgital Computer), a simple and efficient algorithm to calculate hyperbolic and trigonometric functions. We have proposed a dedicated processor system, modeled by writing appropriate programs in VHDL, for calculating the Sine and Cosine of an angle. System simulation was carried out using ModelSim 6.3f and Xilinx ISE Design Suite 12.3. A maximum frequency of 81.353 MHz was reached with a minimum period of 12.292 ns. 126 (3%) slices were used. This paper attempts to survey the existing CORDIC algorithm with an eye towards implementation in Field Programmable Gate Arrays (FPGAs). A brief description of the theory behind the algorithm and the derivation of the Sine and Cosine of an angle using the CORDIC algorithm has been presented. The system can be implemented using Spartan3 XC3S400 with Xilinx ISE 12.3 and VHDL.
△ Less
Submitted 4 November, 2011;
originally announced November 2011.
-
A Cluster-based Approach for Outlier Detection in Dynamic Data Streams (KORM: k-median OutlieR Miner)
Authors:
Parneeta Dhaliwal,
M. P. S. Bhatia,
Priti Bansal
Abstract:
Outlier detection in data streams has gained wide importance presently due to the increasing cases of fraud in various applications of data streams. The techniques for outlier detection have been divided into either statistics based, distance based, density based or deviation based. Till now, most of the work in the field of fraud detection was distance based but it is incompetent from computati…
▽ More
Outlier detection in data streams has gained wide importance presently due to the increasing cases of fraud in various applications of data streams. The techniques for outlier detection have been divided into either statistics based, distance based, density based or deviation based. Till now, most of the work in the field of fraud detection was distance based but it is incompetent from computational point of view. In this paper we introduced a new clustering based approach, which divides the stream in chunks and clusters each chunk using kmedian into variable number of clusters. Instead of storing complete data stream chunk in memory, we replace it with the weighted medians found after mining a data stream chunk and pass that information along with the newly arrived data chunk to the next phase. The weighted medians found in each phase are tested for outlierness and after a given number of phases, it is either declared as a real outlier or an inlier. Our technique is theoretically better than the k-means as it does not fix the number of clusters to k rather gives a range to it and provides a more stable and better solution which runs in poly-logarithmic space.
△ Less
Submitted 21 February, 2010;
originally announced February 2010.