-
Diverse Perspectives, Divergent Models: Cross-Cultural Evaluation of Depression Detection on Twitter
Authors:
Nuredin Ali,
Charles Chuankai Zhang,
Ned Mayo,
Stevie Chancellor
Abstract:
Social media data has been used for detecting users with mental disorders, such as depression. Despite the global significance of cross-cultural representation and its potential impact on model performance, publicly available datasets often lack crucial metadata related to this aspect. In this work, we evaluate how well AI models built on benchmark datasets generalize to cross-cultural Twitter data. We gather a custom geo-located Twitter dataset of depressed users from seven countries as a test dataset. Our results show that depression detection models do not generalize globally. The models perform worse on Global South users than on Global North users. Pre-trained language models generalize better than Logistic Regression, though they still show significant gaps in performance on depressed and non-Western users. We quantify our findings and provide several actionable suggestions to mitigate this issue.
Submitted 31 March, 2024;
originally announced June 2024.
-
COVID-19 as Reflected in University President Bulk Email
Authors:
Ruoyan Kong,
Charles Chuankai Zhang,
Jin Kang,
Haiyi Zhu,
Joseph A. Konstan
Abstract:
E-mail "Messages From the President" to university students, staff, and faculty have long been used to keep campus communities aware of the latest policies, events, and news. But during the COVID-19 pandemic, as universities quickly closed facilities, sent students home, and canceled travel, these messages took on even greater importance. We report on a content analysis of bulk emails from different universities' presidents to their students and employees before and during three stages of the pandemic. We find that these messages change as universities move towards and through closure. During the pandemic, 1) presidential bulk emails tend to be more informative, more positive, and clearer than before; 2) they tend to use more personal and collective language; 3) university presidents tend to mention more local political leaders and fewer other university leaders. Our results can inform research on digital crisis communication and may be useful for researchers interested in automatically identifying crisis situations from communication streams.
Submitted 20 January, 2024;
originally announced January 2024.
-
Understanding Structured Knowledge Production: A Case Study of Wikidata's Representation Injustice
Authors:
Jeffrey Jun-jie Ma,
Charles Chuankai Zhang
Abstract:
Wikidata is a multi-language knowledge base that is edited and maintained by editors from different language communities. Due to the structured nature of its content, contributions take various forms, including manual edits, tool-assisted edits, and automated edits, with the majority of edits being imports from wiki-internal or external datasets. Because of the outsized power of bots and tools, reflected in their large volume of edits, knowledge contributions within Wikidata can easily cause epistemic injustice for both internal and external reasons. In this case study, we compared the coverage and edit history of human pages in two countries. By shedding light on these disparities and offering actionable solutions, our study aims to enhance the fairness and inclusivity of knowledge representation within Wikidata, ultimately contributing to a more equitable and comprehensive global knowledge base.
Submitted 5 November, 2023;
originally announced November 2023.
-
Getting the Most from Eye-Tracking: User-Interaction Based Reading Region Estimation Dataset and Models
Authors:
Ruoyan Kong,
Ruixuan Sun,
Charles Chuankai Zhang,
Chen Chen,
Sneha Patri,
Gayathri Gajjela,
Joseph A. Konstan
Abstract:
A single digital newsletter usually contains many messages (regions). The time users spend reading each message and their read level (skip/skim/read-in-detail) are important for platforms to understand their users' interests, personalize their content, and make recommendations. Based on accurate but expensive-to-collect eyetracker-recorded data, we built models that predict per-region reading time from easy-to-collect JavaScript browser tracking data.
With eye-tracking, we collected 200k ground-truth datapoints on participants reading news in browsers. Then we trained machine learning and deep learning models to predict message-level reading time based on user interactions like mouse position, scrolling, and clicking. We reached a 27% percentage error in reading time estimation with a two-tower neural network based on user interactions only, against the eye-tracking ground truth data, while the heuristic baselines have around 46% percentage error. We also discovered the benefits of replacing per-session models with per-timestamp models, and of adding user pattern features. We conclude with suggestions on developing message-level reading estimation techniques based on available data.
Submitted 12 June, 2023;
originally announced June 2023.