-
Two-Pronged Human Evaluation of ChatGPT Self-Correction in Radiology Report Simplification
Authors:
Ziyu Yang,
Santhosh Cherian,
Slobodan Vucetic
Abstract:
Radiology reports are highly technical documents aimed primarily at doctor-doctor communication. There has been increasing interest in sharing those reports with patients, which requires providing them with patient-friendly simplifications of the original reports. This study explores the suitability of large language models for automatically generating those simplifications. We examine the usefulness of chain-of-thought and self-correction prompting mechanisms in this domain. We also propose a new evaluation protocol that employs radiologists and laypeople, where radiologists verify the factual correctness of simplifications, and laypeople assess simplicity and comprehension. Our experimental results demonstrate the effectiveness of self-correction prompting in producing high-quality simplifications. Our findings illuminate the preferences of radiologists and laypeople regarding text simplification, informing future research on this topic.
Submitted 26 June, 2024;
originally announced June 2024.
-
LLMs' Classification Performance is Overclaimed
Authors:
Hanzi Xu,
Renze Lou,
Jiangshu Du,
Vahid Mahzoon,
Elmira Talebianaraki,
Zhuoan Zhou,
Elizabeth Garrison,
Slobodan Vucetic,
Wenpeng Yin
Abstract:
In many classification tasks designed for AI or humans to solve, gold labels are typically included within the label space by default, often posed as "which of the following is correct?" This standard setup has traditionally highlighted the strong performance of advanced AI, particularly top-performing Large Language Models (LLMs), in routine classification tasks. However, when the gold label is intentionally excluded from the label space, it becomes evident that LLMs still attempt to select from the available label candidates, even when none are correct. This raises a pivotal question: Do LLMs truly demonstrate their intelligence in understanding the essence of classification tasks?
In this study, we evaluate both closed-source and open-source LLMs across representative classification tasks, arguing that the perceived performance of LLMs is overstated due to their inability to exhibit the expected comprehension of the task. This paper makes a threefold contribution: i) To our knowledge, this is the first work to identify the limitations of LLMs in classification tasks when gold labels are absent. We define this task as Classify-w/o-Gold and propose it as a new testbed for LLMs. ii) We introduce a benchmark, Know-No, comprising two existing classification tasks and one new task, to evaluate Classify-w/o-Gold. iii) This work defines and advocates for a new evaluation metric, OmniAccuracy, which assesses LLMs' performance in classification tasks both when gold labels are present and absent.
Submitted 3 July, 2024; v1 submitted 23 June, 2024;
originally announced June 2024.
-
X-Shot: A Unified System to Handle Frequent, Few-shot and Zero-shot Learning Simultaneously in Classification
Authors:
Hanzi Xu,
Muhao Chen,
Lifu Huang,
Slobodan Vucetic,
Wenpeng Yin
Abstract:
In recent years, few-shot and zero-shot learning, which learn to predict labels with limited annotated instances, have garnered significant attention. Traditional approaches often treat frequent-shot (freq-shot; labels with abundant instances), few-shot, and zero-shot learning as distinct challenges, optimizing systems for just one of these scenarios. Yet, in real-world settings, label occurrences vary greatly. Some of them might appear thousands of times, while others might only appear sporadically or not at all. For practical deployment, it is crucial that a system can adapt to any label occurrence. We introduce a novel classification challenge: X-shot, reflecting a real-world context where freq-shot, few-shot, and zero-shot labels co-occur without predefined limits. Here, X can span from 0 to positive infinity. The crux of X-shot centers on open-domain generalization and devising a system versatile enough to manage various label scenarios. To solve X-shot, we propose BinBin (Binary INference Based on INstruction following) that leverages the Indirect Supervision from a large collection of NLP tasks via instruction following, bolstered by Weak Supervision provided by large language models. BinBin surpasses previous state-of-the-art techniques on three benchmark datasets across multiple domains. To our knowledge, this is the first work addressing X-shot learning, where X remains variable.
Submitted 6 March, 2024;
originally announced March 2024.
-
Collaborative Job Seeking for People with Autism: Challenges and Design Opportunities
Authors:
Zinat Ara,
Amrita Ganguly,
Donna Peppard,
Dongjun Chung,
Slobodan Vucetic,
Vivian Genaro Motti,
Sungsoo Ray Hong
Abstract:
Successful job search results from job seekers' well-shaped social communication. While well-known differences in communication exist between people with autism and neurotypicals, little is known about how people with autism collaborate with their social surroundings to strive in the job market. To better understand the practices and challenges of collaborative job seeking for people with autism, we interviewed 20 participants including applicants with autism, their social surroundings, and career experts. Through the interviews, we identified social challenges that people with autism face during their job seeking; the social support they leverage to be successful; and the technological limitations that hinder their collaboration. We designed four probes that represent major collaborative features found from the interviews--executive planning, communication, stage-wise preparation, and neurodivergent community formation--and discussed their potential usefulness and impact through three focus groups. We provide implications regarding how our findings can enhance collaborative job seeking experiences for people with autism through new designs.
Submitted 3 March, 2024;
originally announced March 2024.
-
Scholar Ranking 2023: Ranking of Computer Science Departments Based on Faculty Citations
Authors:
Sai Shi,
Aniruddha Maiti,
Ashis Kumar Chanda,
Slobodan Vucetic
Abstract:
Scholar Ranking 2023 is the second edition of a ranking of U.S. Computer Science (CS) departments based on faculty citation measures. Using Google Scholar, we gathered data about publication citations for 5,574 tenure-track faculty from 185 U.S. universities. For each faculty member, we extracted their t10 index, defined as the number of citations received by their 10th highest cited paper. For each department, we calculated four quality metrics: median t10 (m10), the geometric mean of t10 (g10), and the number of well-cited faculty with t10 above 40% (c40) and 60% (c60) of the national average. We fitted a linear regression model using those four measures to match the 2022 U.S. News ranking scores of CS doctoral programs. The resulting model provides Scholar Ranking 2023, which can be found at https://chi.temple.edu/csranking.
Submitted 11 January, 2023; v1 submitted 8 January, 2023;
originally announced January 2023.
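The t10 index and the four department metrics named in the Scholar Ranking 2023 abstract can be illustrated with a minimal sketch. The function names, the handling of faculty with fewer than ten papers, and the +1 offset inside the geometric mean are assumptions made for illustration; the abstract does not specify them.

```python
import math
from statistics import median


def t10(citations):
    """Citations received by the 10th most-cited paper.

    Returning 0 for fewer than ten papers is an assumption, not from the abstract.
    """
    ranked = sorted(citations, reverse=True)
    return ranked[9] if len(ranked) >= 10 else 0


def department_metrics(faculty_t10, national_avg):
    """Compute m10, g10, c40 and c60 for one department.

    faculty_t10: t10 values of the department's faculty (non-empty list).
    national_avg: national average of t10 (assumed here to be a plain mean
    supplied by the caller).
    """
    m10 = median(faculty_t10)
    # Geometric mean; the +1 / -1 shift tolerates t10 = 0 (assumption).
    g10 = math.exp(sum(math.log(t + 1) for t in faculty_t10) / len(faculty_t10)) - 1
    # Counts of well-cited faculty above 40% and 60% of the national average.
    c40 = sum(t > 0.4 * national_avg for t in faculty_t10)
    c60 = sum(t > 0.6 * national_avg for t in faculty_t10)
    return {"m10": m10, "g10": g10, "c40": c40, "c60": c60}
```

For example, `department_metrics([10, 50, 90], national_avg=100)` counts two faculty above the 40% threshold and one above the 60% threshold.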
-
OpenStance: Real-world Zero-shot Stance Detection
Authors:
Hanzi Xu,
Slobodan Vucetic,
Wenpeng Yin
Abstract:
Prior studies of zero-shot stance detection identify the attitude of texts towards unseen topics occurring in the same document corpus. Such task formulation has three limitations: (i) single domain/dataset: a system is optimized on a particular dataset from a single domain and therefore cannot work well on other datasets; (ii) the model is evaluated on a limited number of unseen topics; (iii) it is assumed that some of the topics have rich annotations, which might be impossible in real-world applications. These drawbacks lead to an impractical stance detection system that fails to generalize to open domains and open-form topics. This work defines OpenStance: open-domain zero-shot stance detection, aiming to handle stance detection in an open world with neither domain constraints nor topic-specific annotations. The key challenge of OpenStance lies in open-domain generalization: learning a system with fully unspecific supervision that is nevertheless capable of generalizing to any dataset. To solve OpenStance, we propose to combine indirect supervision, from textual entailment datasets, and weak supervision, from data generated automatically by pre-trained Language Models. Our single system, without any topic-specific supervision, outperforms the supervised method on three popular datasets. To our knowledge, this is the first work that studies stance detection under the open-domain zero-shot setting. All data and code are publicly released.
Submitted 25 October, 2022;
originally announced October 2022.
-
Group Activity Recognition in Basketball Tracking Data -- Neural Embeddings in Team Sports (NETS)
Authors:
Sandro Hauri,
Slobodan Vucetic
Abstract:
Like many team sports, basketball involves two groups of players who engage in collaborative and adversarial activities to win a game. Players and teams execute various complex strategies to gain an advantage over their opponents. Defining, identifying, and analyzing different types of activities is an important task in sports analytics, as it can lead to better strategies and decisions by the players and coaching staff. The objective of this paper is to automatically recognize basketball group activities from tracking data representing locations of players and the ball during a game. We propose a novel deep learning approach for group activity recognition (GAR) in team sports called NETS. To efficiently model the player relations in team sports, we combined a Transformer-based architecture with LSTM embedding and a team-wise pooling layer to recognize the group activity. Training such a neural network generally requires a large amount of annotated data, which incurs high labeling cost. To address the scarcity of manual labels, we generate weak labels and pretrain the neural network on a self-supervised trajectory prediction task. We used a large tracking data set from 632 NBA games to evaluate our approach. The results show that NETS is capable of learning group activities with high accuracy, and that self- and weak-supervised training in NETS have a positive impact on GAR accuracy.
Submitted 30 August, 2022;
originally announced September 2022.
-
Learning Semi-Structured Representations of Radiology Reports
Authors:
Tamara Katic,
Martin Pavlovski,
Danijela Sekulic,
Slobodan Vucetic
Abstract:
Beyond their primary diagnostic purpose, radiology reports have been an invaluable source of information in medical research. Given a corpus of radiology reports, researchers are often interested in identifying a subset of reports describing a particular medical finding. Because the space of medical findings in radiology reports is vast and potentially unlimited, recent studies proposed mapping free-text statements in radiology reports to semi-structured strings of terms taken from a limited vocabulary. This paper aims to present an approach for the automatic generation of semi-structured representations of radiology reports. The approach consists of matching sentences from radiology reports to manually created semi-structured representations, followed by learning a sequence-to-sequence neural model that maps matched sentences to their semi-structured representations. We evaluated the proposed approach on the OpenI corpus of manually annotated chest x-ray radiology reports. The results indicate that the proposed approach is superior to several baselines, both in terms of (1) quantitative measures such as BLEU, ROUGE, and METEOR and (2) qualitative judgment of a radiologist. The results also demonstrate that the trained model produces reasonable semi-structured representations on an out-of-sample corpus of chest x-ray radiology reports from a different medical provider.
Submitted 20 December, 2021;
originally announced December 2021.
-
Multi-Modal Trajectory Prediction of NBA Players
Authors:
Sandro Hauri,
Nemanja Djuric,
Vladan Radosavljevic,
Slobodan Vucetic
Abstract:
National Basketball Association (NBA) players are highly motivated and skilled experts who solve complex decision-making problems at every time point during a game. As a step towards understanding how players make their decisions, we focus on their movement trajectories during games. We propose a method that captures the multi-modal behavior of players, where they might consider multiple trajectories and select the most advantageous one. The method is built on an LSTM-based architecture predicting multiple trajectories and their probabilities, trained by a multi-modal loss function that updates the best trajectories. Experiments on large, fine-grained NBA tracking data show that the proposed method outperforms the state of the art. In addition, the results indicate that the approach generates more realistic trajectories and that it can learn individual playing styles of specific players.
Submitted 18 August, 2020;
originally announced August 2020.
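The abstract above mentions a multi-modal loss that updates the best trajectories but does not give its form. One common way to realize such a loss is a winner-takes-all formulation: only the predicted trajectory closest to the ground truth receives the regression penalty, and a log-likelihood term pushes probability toward that mode. The sketch below assumes this formulation; the function name, the average-displacement error, and the equal weighting of the two terms are illustrative choices, not the authors' method.

```python
import math


def winner_takes_all_loss(pred_trajs, pred_probs, true_traj):
    """Hypothetical winner-takes-all loss for multi-modal trajectory prediction.

    pred_trajs: list of M predicted trajectories, each a list of (x, y) points.
    pred_probs: list of M mode probabilities (assumed positive, summing to 1).
    true_traj:  observed trajectory, same length as each prediction.
    Returns (loss, index_of_best_mode).
    """
    def avg_displacement(traj):
        # Mean Euclidean distance between predicted and observed points.
        return sum(math.dist(p, q) for p, q in zip(traj, true_traj)) / len(true_traj)

    errors = [avg_displacement(t) for t in pred_trajs]
    best = min(range(len(errors)), key=errors.__getitem__)
    # Regression term for the best mode only, plus a negative log-likelihood
    # term that rewards assigning high probability to that mode.
    loss = errors[best] - math.log(pred_probs[best])
    return loss, best
```

In a trainable model, gradients would flow only through the best mode's trajectory and through the probability vector, which is what lets different modes specialize on different plausible movements.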
-
Cannot Predict Comment Volume of a News Article before (a few) Users Read It
Authors:
Lihong He,
Chen Shen,
Arjun Mukherjee,
Slobodan Vucetic,
Eduard Dragut
Abstract:
Many news outlets allow users to contribute comments on topics about daily world events. News articles are the seeds that spark users' interest in contributing content, i.e., comments. An article may attract apathetic user engagement (several tens of comments) or spontaneous, fervent user engagement (thousands of comments). In this paper, we study the problem of predicting the total number of user comments a news article will receive. Our main insight is that the early dynamics of user comments contribute the most to an accurate prediction, while news article specific factors have surprisingly little influence. This appears to be an interesting and understudied phenomenon: collective social behavior at a news outlet shapes user response and may even downplay the content of an article. We compile and analyze a large number of features, both established in the literature and novel. The features span a broad spectrum of facets including news article and comment contents, temporal dynamics, sentiment/linguistic features, and user behaviors. We show that the early arrival rate of comments is the best indicator of the eventual number of comments. We conduct an in-depth analysis of this feature across several dimensions, such as news outlets and news article categories. We show that the relationship between the early rate and the final number of comments, as well as the prediction accuracy, varies considerably across news outlets and news article categories (e.g., politics, sports, or health).
Submitted 17 October, 2020; v1 submitted 14 August, 2020;
originally announced August 2020.
-
Faculty citation measures are highly correlated with peer assessment of computer science doctoral programs
Authors:
Slobodan Vucetic,
Ashis Kumar Chanda,
Shanshan Zhang,
Tian Bai,
Aniruddha Maiti
Abstract:
We study the relationship between peer assessment of the quality of U.S. Computer Science (CS) doctoral programs and objective measures of the research strength of those programs. In Fall 2016, we collected Google Scholar citation data for 4,352 tenure-track CS faculty from 173 U.S. universities. The citations are measured by the t10 index, which represents the number of citations received by the 10th highest cited paper of a faculty member. To measure the research strength of a CS doctoral program, we use two groups of citation measures. The first group of measures averages the t10 of faculty in a program. The Pearson correlation of those measures with the peer assessment of U.S. CS doctoral programs published by the U.S. News in 2014 is as high as 0.890. The second group of measures counts the number of well-cited faculty in a program. The Pearson correlation of those measures with the peer assessment is as high as 0.909. By combining those two groups of measures using linear regression, we create the Scholar score, whose Pearson correlation with the peer assessment is 0.933 and which explains 87.2% of the variance in the peer assessment. Our evaluation shows that the 62 CS doctoral programs ranked highest by the U.S. News peer assessment are much more highly correlated with the Scholar score than the next 57 ranked programs, indicating the deficiencies of peer assessment of less-known CS programs. Our results also indicate that university reputation might have a sizeable impact on peer assessment of CS doctoral programs. To promote transparency, the raw data and the code used in this study are made available to the research community at http://www.dabi.temple.edu/~vucetic/CSranking/.
Submitted 17 August, 2017;
originally announced August 2017.
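The agreement figures quoted in the abstract above are Pearson correlations between a citation-based measure and the peer assessment score. A minimal sketch of that computation follows; the function name and the toy numbers in the usage note are hypothetical and are not data from the study.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Covariance numerator and the two standard-deviation numerators;
    # the common 1/n factors cancel in the ratio.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (dev_x * dev_y)
```

With made-up inputs such as `pearson([12, 30, 45, 80], [2.1, 3.0, 3.4, 4.6])`, a value close to 1 would indicate that the hypothetical citation measure tracks the hypothetical peer scores closely.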
-
Semi-supervised Discovery of Informative Tweets During the Emerging Disasters
Authors:
Shanshan Zhang,
Slobodan Vucetic
Abstract:
The first objective towards the effective use of microblogging services such as Twitter for situational awareness during emerging disasters is the discovery of disaster-related postings. Given the wide range of possible disasters, using a pre-selected set of disaster-related keywords for the discovery is suboptimal. An alternative that we focus on in this work is to train a classifier using a small set of labeled postings that become available as a disaster is emerging. Our hypothesis is that utilizing large quantities of historical microblogs could improve the quality of classification, as compared to training a classifier only on the labeled data. We propose to use unlabeled microblogs to cluster words into a limited number of clusters and use the word clusters as features for classification. To evaluate the proposed semi-supervised approach, we used Twitter data from 6 different disasters. Our results indicate that when the number of labeled tweets is 100 or fewer, the proposed approach is superior to the standard classification based on the bag-of-words feature representation. Our results also reveal that the choice of the unlabeled corpus, the choice of word clustering algorithm, and the choice of hyperparameters can have a significant impact on the classification accuracy.
Submitted 12 October, 2016;
originally announced October 2016.
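The idea of using word clusters as classification features, described in the abstract above, can be sketched as a "bag of clusters" representation: each tweet is mapped to a vector of counts over cluster ids instead of over individual words. The function name and the toy word-to-cluster mapping are hypothetical; in the paper's setting the mapping would come from clustering words on a large unlabeled corpus, which this sketch does not implement.

```python
from collections import Counter


def cluster_features(tokens, word_to_cluster, num_clusters):
    """Map a tokenized posting to a vector of word-cluster counts.

    tokens:          list of words in the posting.
    word_to_cluster: dict from word to cluster id in [0, num_clusters);
                     assumed to be learned beforehand from unlabeled text.
    Words missing from the mapping are ignored (an assumption).
    """
    counts = Counter(word_to_cluster[w] for w in tokens if w in word_to_cluster)
    return [counts.get(c, 0) for c in range(num_clusters)]
```

Because many rare words share a cluster, the resulting vectors are much shorter and denser than bag-of-words vectors, which is one plausible reason such features help when only a handful of labeled tweets are available.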
-
Non-linear Label Ranking for Large-scale Prediction of Long-Term User Interests
Authors:
Nemanja Djuric,
Mihajlo Grbovic,
Vladan Radosavljevic,
Narayan Bhamidipati,
Slobodan Vucetic
Abstract:
We consider the problem of personalization of online services from the viewpoint of ad targeting, where we seek to find the best ad categories to be shown to each user, resulting in improved user experience and increased advertisers' revenue. We propose to address this problem as a task of ranking the ad categories depending on a user's preference, and introduce a novel label ranking approach capable of efficiently learning non-linear, highly accurate models in large-scale settings. Experiments on a real-world advertising data set with more than 3.2 million users show that the proposed algorithm outperforms the existing solutions in terms of both rank loss and top-K retrieval performance, strongly suggesting the benefit of using the proposed model on large-scale ranking problems.
Submitted 29 June, 2016;
originally announced June 2016.