Human-Computer Interaction
- [1] arXiv:2407.11198 [pdf, html, other]
Title: A Framework For Discussing LLMs as Tools for Qualitative Analysis
Comments: 4 pages, 1 table. Presented at the "LLMs as Research Tools" workshop at CHI 2024 (this https URL)
Subjects: Human-Computer Interaction (cs.HC)
We review discourses about the philosophy of science in qualitative research and evidence from cognitive linguistics in order to ground a framework for discussing the use of Large Language Models (LLMs) to support the qualitative analysis process. This framework involves asking two key questions: "is the LLM proposing or refuting a qualitative model?" and "is the human researcher checking the LLM's decision-making directly?". We then discuss an implication of this framework: that using LLMs to surface counter-examples for human review represents a promising space for the adoption of LLMs into the qualitative research process. This space is promising because it is a site of overlap between researchers working from a variety of philosophical assumptions, enabling productive cross-paradigm collaboration on tools and practices.
- [2] arXiv:2407.11205 [pdf, html, other]
Title: Impact on clinical guideline adherence of Orient-COVID, a CDSS based on dynamic medical decision trees for COVID19 management: a randomized simulation trial
Authors: Mouin Jammal, Antoine Saab, Cynthia Abi Khalil, Charbel Mourad, Rosy Tsopra, Melody Saikali, Jean-Baptiste Lamy
Comments: 8 pages, 5 figures
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Background: The adherence of clinicians to clinical practice guidelines is known to be low, including for the management of COVID-19, because guidelines are complex and difficult to use at the point of care. Clinical decision support systems have been proposed to implement guidelines and improve adherence. One approach is to let clinicians navigate the recommendations, presented as a decision tree, but the size of the tree often limits this approach and may cause erroneous navigation, especially when the tree does not fit on a single screen. Methods: We proposed an innovative visual interface that allows clinicians to easily navigate decision trees for the management of COVID-19 patients. It combines a multi-path tree model with the fisheye visual technique, allowing large decision trees to be visualized on a single screen. To evaluate the impact of this tool on guideline adherence, we conducted a randomized controlled trial in a near-real simulation setting, comparing the decisions made by medical students using Orient-COVID with those made with paper guidelines or without guidance, across six realistic clinical cases. Results: Paper guidelines had no impact (p=0.97), while Orient-COVID significantly improved guideline adherence compared to both other groups (p<0.0003). A significant impact of Orient-COVID was identified on several key points in the management of COVID-19: ordering troponin lab tests and prescribing anticoagulant and oxygen therapy. A multifactor analysis showed no difference between male and female participants. Conclusions: The use of an interactive decision tree for the management of COVID-19 significantly improved clinician adherence to guidelines. Future work will focus on integrating the system into electronic health records and on adapting it to other clinical conditions.
- [3] arXiv:2407.11218 [pdf, other]
Title: You'll Never Walk Alone: An Experiment on Controlling the Mobile Robot 'Spot' with Voice and Gestures
Authors: Renchi Zhang, Jesse van der Linden, Dimitra Dodou, Harleigh Seyffert, Yke Bauke Eisma, Joost C. F. de Winter
Subjects: Human-Computer Interaction (cs.HC); Robotics (cs.RO)
Robots are becoming increasingly intelligent and can autonomously perform tasks such as navigating between locations. However, human oversight remains crucial. This study compared two hands-free methods for directing mobile robots: voice control and gesture control. These methods were tested with the human either stationary or walking freely. We hypothesized that walking with the robot would lead to higher intuitiveness ratings and better task performance due to increased stimulus-response compatibility, assuming humans align themselves with the robot. In a 2x2 within-subject design, 218 participants guided the quadrupedal robot Spot using 90-degree rotation and walk-forward commands. After each trial, participants rated the intuitiveness of the command mapping, while post-experiment interviews were used to gather the participants' preferences. Results showed that voice control combined with walking with Spot was the most favored and intuitive, while gesture control while standing caused confusion for left/right commands. Despite this, 29% of participants preferred gesture control, citing task engagement and visual congruence as reasons. An odometry-based analysis revealed that participants aligned behind Spot, particularly in the gesture control condition, when allowed to walk. In conclusion, voice control with walking produced the best outcomes. Improving physical ergonomics and adjusting gesture types could improve the effectiveness of gesture control.
- [4] arXiv:2407.11225 [pdf, html, other]
Title: (De)Noise: Moderating the Inconsistency Between Human Decision-Makers
Comments: To appear in CSCW 2024
Subjects: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
Prior research in psychology has found that people's decisions are often inconsistent. An individual's decisions vary across time, and decisions vary even more across people. Inconsistencies have been identified not only in subjective matters, like matters of taste, but also in settings one might expect to be more objective, such as sentencing, job performance evaluations, or real estate appraisals. In our study, we explore whether algorithmic decision aids can be used to moderate the degree of inconsistency in human decision-making in the context of real estate appraisal. In a large-scale human-subject experiment, we study how different forms of algorithmic assistance influence the way that people review and update their estimates of real estate prices. We find that both (i) asking respondents to review their estimates in a series of algorithmically chosen pairwise comparisons and (ii) providing respondents with traditional machine advice are effective strategies for influencing human responses. Compared to simply reviewing initial estimates one by one, the aforementioned strategies lead to (i) a higher propensity to update initial estimates, (ii) a higher accuracy of post-review estimates, and (iii) a higher degree of consistency between the post-review estimates of different respondents. While these effects are more pronounced with traditional machine advice, the approach of reviewing algorithmically chosen pairs can be implemented in a wider range of settings, since it does not require access to ground truth data.
- [5] arXiv:2407.11387 [pdf, html, other]
Title: A Framework for Evaluating Appropriateness, Trustworthiness, and Safety in Mental Wellness AI Chatbots
Subjects: Human-Computer Interaction (cs.HC)
Large language model (LLM) chatbots are susceptible to biases and hallucinations, but current evaluations of mental wellness technologies lack comprehensive case studies to evaluate their practical applications. Here, we address this gap by introducing the MHealth-EVAL framework, a new role-play based interactive evaluation method designed specifically for evaluating the appropriateness, trustworthiness, and safety of mental wellness chatbots. We also introduce Psyfy, a new chatbot leveraging LLMs to facilitate transdiagnostic Cognitive Behavioral Therapy (CBT). We demonstrate the MHealth-EVAL framework's utility through a comparative study of two versions of Psyfy against standard baseline chatbots. Our results showed that Psyfy chatbots outperformed the baseline chatbots in delivering appropriate responses, engaging users, and avoiding untrustworthy responses. However, both Psyfy and the baseline chatbots exhibited some limitations, such as providing predominantly US-centric resources. While Psyfy chatbots were able to identify most unsafe situations and avoid giving unsafe responses, they sometimes struggled to recognize subtle harmful intentions when prompted in role play scenarios. Our study demonstrates a practical application of the MHealth-EVAL framework and showcases Psyfy's utility in harnessing LLMs to enhance user engagement and provide flexible and appropriate responses aligned with an evidence-based CBT approach.
- [6] arXiv:2407.11396 [pdf, html, other]
Title: It might be balanced, but is it actually good? An Empirical Evaluation of Game Level Balancing
Comments: 4 pages, 2 figures, 1 table. Accepted at the IEEE Conference on Games (IEEE CoG) 2024
Subjects: Human-Computer Interaction (cs.HC)
Achieving optimal balance in games is essential to their success, yet reliant on extensive manual work and playtesting. To facilitate this process, the Procedural Content Generation via Reinforcement Learning (PCGRL) framework has recently been effectively used to improve the balance of existing game levels. This approach, however, only assesses balance heuristically, neglecting actual human perception. For this reason, this work presents a survey to empirically evaluate the created content paired with human playtesting. Participants in four different scenarios are asked about their perception of changes made to the level both before and after balancing, and vice versa. Based on descriptive and statistical analysis, our findings indicate that the PCGRL-based balancing positively influences players' perceived balance for most scenarios, albeit with differences in aspects of the balancing between scenarios.
- [7] arXiv:2407.11467 [pdf, html, other]
Title: TEXasGAN: Tactile Texture Exploration and Synthesis System Using Generative Adversarial Network
Subjects: Human-Computer Interaction (cs.HC)
To create more realistic experiences in human-virtual object interactions, texture rendering has become a research hotspot in recent years. Different frequency components of designed vibrations can activate texture-related sensations due to similar receptors. However, designing specific vibrations for numerous real-world materials is impractical. Therefore, this study proposes a human-in-the-loop vibration generation model based on user preferences. To enable users to easily control the generation of vibration samples with large parameter spaces, we introduce an optimization model based on Differential Subspace Search (DSS) and a Generative Adversarial Network (GAN). With DSS, users can use a one-dimensional slider to easily modify the high-dimensional latent space so that the GAN can generate desired vibrations. We trained the generative model using an open dataset of tactile vibration data and selected five types of vibrations as target samples for the generation experiment. Extensive user experiments were conducted using the generated and real samples. The results indicate that our system can generate distinguishable samples that match the target characteristics. Moreover, the results also reveal a correlation between subjects' ability to distinguish real samples and their ability to distinguish generated samples.
- [8] arXiv:2407.11497 [pdf, html, other]
Title: "I Came Across a Junk": Understanding Design Flaws of Data Visualization from the Public's Perspective
Subjects: Human-Computer Interaction (cs.HC); Graphics (cs.GR)
The visualization community has a rich history of reflecting upon flaws of visualization design, and research in this direction has remained lively until now. However, three main gaps still exist. First, most existing work characterizes design flaws from the perspective of researchers rather than the perspective of general users. Second, little work has been done to infer why these design flaws occur. Third, due to problems such as unclear terminology and ambiguous research scope, a better framework that systematically outlines various design flaws and helps distinguish different types of flaws is desired. To address the above gaps, this work investigated visualization design flaws through the lens of the public, constructed a framework to summarize and categorize the identified flaws, and explored why these flaws occur. Specifically, we analyzed 2227 flawed data visualizations collected from an online gallery and derived a design task-associated taxonomy containing 76 specific design flaws. These flaws were further classified into three high-level categories (i.e., misinformation, uninformativeness, unsociableness) and ten subcategories (e.g., inaccuracy, unfairness, ambiguity). Next, we organized five focus groups to explore why these design flaws occur and identified seven causes of the flaws. Finally, we proposed a set of reflections and implications arising from the research.
- [9] arXiv:2407.11612 [pdf, html, other]
Title: Improving Engagement and Efficacy of mHealth Micro-Interventions for Stress Coping: an In-The-Wild Study
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Sustaining long-term user engagement with mobile health (mHealth) interventions while preserving their high efficacy remains an ongoing challenge in real-world well-being applications. To address this issue, we introduce a new algorithm, the Personalized, Context-Aware Recommender (PCAR), for intervention selection and evaluate its performance in a field experiment. In a four-week, in-the-wild experiment involving 29 parents of young children, we delivered personalized stress-reducing micro-interventions through a mobile chatbot. We assessed their impact on stress reduction using ecological momentary assessments (EMAs) of momentary stress level before and after each intervention. Our findings demonstrate the superiority of PCAR intervention selection in enhancing the engagement and efficacy of mHealth micro-interventions for stress coping compared to random intervention selection and a control group that did not receive any intervention. Furthermore, we show that even brief, one-minute interventions can significantly reduce perceived stress levels (p=0.001). We observe that individuals are most receptive to one-minute interventions during transitional periods between activities, such as transitioning from afternoon activities to bedtime routines. Our study contributes to the literature by introducing a personalized context-aware intervention selection algorithm that improves engagement and efficacy of mHealth interventions, identifying key timing for stress interventions, and offering insights into mechanisms to improve stress coping.
- [10] arXiv:2407.11748 [pdf, html, other]
Title: Ubiquitous Metadata: Design and Fabrication of Embedded Markers for Real-World Object Identification and Interaction
Comments: MIT PhD Thesis
Subjects: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET); Graphics (cs.GR)
The convergence of the physical and digital realms has ushered in a new era of immersive experiences and seamless interactions. As the boundaries between the real world and virtual environments blur and result in a "mixed reality," there arises a need for robust and efficient methods to connect physical objects with their virtual counterparts. In this thesis, we present a novel approach to bridging this gap through the design, fabrication, and detection of embedded machine-readable markers.
We categorize the proposed marking approaches into three distinct categories: natural markers, structural markers, and internal markers. Natural markers, such as those used in SensiCut, are inherent fingerprints of objects repurposed as machine-readable identifiers, while structural markers, such as StructCode and G-ID, leverage the structural artifacts in objects that emerge during the fabrication process itself. Internal markers, such as InfraredTag and BrightMarker, are embedded inside fabricated objects using specialized materials. Leveraging a combination of methods from computer vision, machine learning, computational imaging, and material science, the presented approaches offer robust and versatile solutions for object identification, tracking, and interaction.
These markers, seamlessly integrated into real-world objects, effectively communicate an object's identity, origin, function, and interaction, functioning as gateways to "ubiquitous metadata" - a concept where metadata is embedded into physical objects, similar to metadata in digital files. Across the different chapters, we demonstrate the applications of the presented methods in diverse domains, including product design, manufacturing, retail, logistics, education, entertainment, security, and sustainability.
- [11] arXiv:2407.11837 [pdf, html, other]
Title: The Patchkeeper: An Integrated Wearable Electronic Stethoscope with Multiple Sensors
Comments: Submitted for IEEE Sensors Conference 2024
Subjects: Human-Computer Interaction (cs.HC); Signal Processing (eess.SP)
Many parts of the human body generate internal sounds during biological processes, and these sounds are rich sources of information for understanding health and wellbeing. Despite a long history of development and usage of stethoscopes, there is still a lack of proper tools for recording internal body sound together with complementary sensors for long-term monitoring. In this paper, we present our development of a wearable electronic stethoscope, coined Patchkeeper (PK), that can be used for internal body sound recording over long periods of time. Patchkeeper also integrates several state-of-the-art biological sensors, including electrocardiogram (ECG), photoplethysmography (PPG), and inertial measurement unit (IMU) sensors. As a wearable device, Patchkeeper can be placed on various parts of the body to collect sound from particular organs, including the heart, lungs, stomach, and joints. We show in this paper that several vital signals can be recorded simultaneously with high quality. As Patchkeeper can be operated directly by the user, e.g., without involving health care professionals, we believe it could be a useful tool for telemedicine and remote diagnostics.
New submissions for Wednesday, 17 July 2024 (showing 11 of 11 entries)
- [12] arXiv:2407.10989 (cross-list from cs.CL) [pdf, other]
Title: Do Large Language Models Understand Verbal Indicators of Romantic Attraction?
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
What makes people 'click' on a first date and become mutually attracted to one another? While understanding and predicting the dynamics of romantic interactions used to be exclusive to human judgment, we show that Large Language Models (LLMs) can detect romantic attraction during brief getting-to-know-you interactions. Examining data from 964 speed dates, we show that ChatGPT (and Claude 3) can predict both objective and subjective indicators of speed dating success (r=0.12-0.23). ChatGPT's predictions of actual matching (i.e., the exchange of contact information) were not only on par with those of human judges who had access to the same information but also incremental to speed daters' own predictions. While some of the variance in ChatGPT's predictions can be explained by common content dimensions (such as the valence of the conversations), the substantial proportion of variance that remains unexplained suggests that ChatGPT also picks up on conversational dynamics. In addition, ChatGPT's judgments showed substantial overlap with those made by the human observers (mean r=0.29), highlighting similarities in their representations of romantic attraction that are partially independent of accuracy.
- [13] arXiv:2407.10994 (cross-list from cs.CL) [pdf, html, other]
Title: Panza: A Personalized Text Writing Assistant via Data Playback and Local Fine-Tuning
Authors: Armand Nicolicioiu, Eugenia Iofinova, Eldar Kurtic, Mahdi Nikdan, Andrei Panferov, Ilia Markov, Nir Shavit, Dan Alistarh
Comments: Panza is available at this https URL
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
The availability of powerful open-source large language models (LLMs) opens exciting use cases, such as automated personal assistants that adapt to the user's unique data and demands. Two key desiderata for such assistants are personalization (in the sense that the assistant should reflect the user's own style) and privacy (in the sense that users may prefer to always store their personal data locally, on their own computing device). We present a new design for such an automated assistant, for the specific use case of a personal assistant for email generation, which we call Panza. Specifically, Panza can be both trained and run for inference locally on commodity hardware, and is personalized to the user's writing style. Panza's personalization features are based on a new technique called data playback, which allows us to fine-tune an LLM to better reflect a user's writing style using limited data. We show that, by combining efficient fine-tuning and inference methods, Panza can be executed entirely locally using limited resources; specifically, it can be executed within the same resources as a free Google Colab instance. Finally, our key methodological contribution is a careful study of evaluation metrics, and of how different choices of system components (e.g., the use of Retrieval-Augmented Generation or different fine-tuning approaches) impact the system's performance.
- [14] arXiv:2407.10996 (cross-list from cs.CL) [pdf, html, other]
Title: Visualization Literacy of Multimodal Large Language Models: A Comparative Study
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
The recent introduction of multimodal large language models (MLLMs) combines the inherent power of large language models (LLMs) with the renewed capabilities to reason about the multimodal context. The potential usage scenarios for MLLMs significantly outpace their text-only counterparts. Many recent works in visualization have demonstrated MLLMs' capability to understand and interpret visualization results and explain the content of the visualization to users in natural language. In the machine learning community, the general vision capabilities of MLLMs have been evaluated and tested through various visual understanding benchmarks. However, the ability of MLLMs to accomplish specific visualization tasks based on visual perception has not been properly explored and evaluated, particularly from a visualization-centric perspective.
In this work, we aim to fill the gap by utilizing the concept of visualization literacy to evaluate MLLMs. We assess MLLMs' performance over two popular visualization literacy evaluation datasets (VLAT and mini-VLAT). Under the framework of visualization literacy, we develop a general setup to compare different multimodal large language models (e.g., GPT4-o, Claude 3 Opus, Gemini 1.5 Pro) as well as against existing human baselines. Our study demonstrates MLLMs' competitive performance in visualization literacy, where they outperform humans in certain tasks such as identifying correlations, clusters, and hierarchical structures.
- [15] arXiv:2407.11000 (cross-list from cs.CL) [pdf, other]
Title: Autonomous Prompt Engineering in Large Language Models
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Prompt engineering is a crucial yet challenging task for optimizing the performance of large language models (LLMs) on customized tasks. This pioneering research introduces the Automatic Prompt Engineering Toolbox (APET), which enables GPT-4 to autonomously apply prompt engineering techniques. By leveraging sophisticated strategies such as Expert Prompting, Chain of Thought, and Tree of Thoughts, APET empowers GPT-4 to dynamically optimize prompts, resulting in substantial improvements in tasks like Word Sorting (4.4% increase) and Geometric Shapes (6.8% increase). Despite encountering challenges in complex tasks such as Checkmate in One (-14.8%), these findings demonstrate the transformative potential of APET in automating complex prompt optimization processes without the use of external data. Overall, this research represents a significant leap in AI development, presenting a robust framework for future innovations in autonomous AI systems and highlighting the ability of GPT-4 to bring prompt engineering theory to practice. It establishes a foundation for enhancing performance on complex tasks and broadening the practical applications of these techniques in real-world scenarios.
- [16] arXiv:2407.11013 (cross-list from cs.LG) [pdf, html, other]
Title: Quantum-tunnelling deep neural networks for sociophysical neuromorphic AI
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Neural and Evolutionary Computing (cs.NE); Physics and Society (physics.soc-ph); Quantum Physics (quant-ph)
The discovery of the quantum tunnelling effect -- the transmission of particles through a high potential barrier -- was one of the most impressive achievements of quantum mechanics made in the 1920s. Responding to the contemporary challenges, I introduce a novel deep neural network (DNN) architecture that processes information using the effect of quantum tunnelling. I demonstrate the ability of the quantum tunnelling DNN (QT-DNN) to recognise optical illusions like a human. Hardware implementation of QT-DNN is expected to result in an inexpensive and energy-efficient neuromorphic chip suitable for applications in autonomous vehicles. The optical illusions recognition tests developed in this paper should lay foundations for cognitive benchmarking tasks for AI systems of the future, benefiting the fields of sociophysics and behavioural science.
- [17] arXiv:2407.11204 (cross-list from cs.CV) [pdf, html, other]
Title: EyeDentify: A Dataset for Pupil Diameter Estimation based on Webcam Images
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
In this work, we introduce EyeDentify, a dataset specifically designed for pupil diameter estimation based on webcam images. EyeDentify addresses the lack of available datasets for pupil diameter estimation, a crucial domain for understanding physiological and psychological states traditionally dominated by highly specialized sensor systems such as Tobii. In contrast to these advanced sensor systems and their associated costs, webcam images are more commonly available in practice. Yet, deep learning models that can estimate pupil diameters using standard webcam data are scarce. By providing a dataset of cropped eye images alongside corresponding pupil diameter information, EyeDentify enables the development and refinement of models designed specifically for less-equipped environments, democratizing pupil diameter estimation by making it more accessible and broadly applicable, which in turn contributes to multiple domains of understanding human activity and supporting healthcare. Our dataset is available at this https URL.
- [18] arXiv:2407.11229 (cross-list from cs.CL) [pdf, html, other]
Title: Unraveling the Truth: Do LLMs really Understand Charts? A Deep Dive into Consistency and Robustness
Comments: 22 pages, 7 Tables, 3 Figures, 25 examples
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Chart question answering (CQA) is a crucial area of Visual Language Understanding. However, the robustness and consistency of current Visual Language Models (VLMs) in this field remain under-explored. This paper evaluates state-of-the-art VLMs on comprehensive datasets, developed specifically for this study, encompassing diverse question categories and chart formats. We investigate two key aspects: 1) the models' ability to handle varying levels of chart and question complexity, and 2) their robustness across different visual representations of the same underlying data. Our analysis reveals significant performance variations based on question and chart types, highlighting both strengths and weaknesses of current models. Additionally, we identify areas for improvement and propose future research directions to build more robust and reliable CQA systems. This study sheds light on the limitations of current models and paves the way for future advancements in the field.
- [19] arXiv:2407.11442 (cross-list from cs.AI) [pdf, html, other]
Title: EARN Fairness: Explaining, Asking, Reviewing and Negotiating Artificial Intelligence Fairness Metrics Among Stakeholders
Subjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Numerous fairness metrics have been proposed and employed by artificial intelligence (AI) experts to quantitatively measure bias and define fairness in AI models. Recognizing the need to accommodate stakeholders' diverse fairness understandings, efforts are underway to solicit their input. However, conveying AI fairness metrics to stakeholders without AI expertise, capturing their personal preferences, and seeking a collective consensus remain challenging and underexplored. To bridge this gap, we propose a new framework, EARN Fairness, which facilitates collective metric decisions among stakeholders without requiring AI expertise. The framework features an adaptable interactive system and a stakeholder-centered EARN Fairness process to Explain fairness metrics, Ask stakeholders' personal metric preferences, Review metrics collectively, and Negotiate a consensus on metric selection. To gather empirical results, we applied the framework to a credit rating scenario and conducted a user study involving 18 decision subjects without AI knowledge. We identified their personal metric preferences and their acceptable level of unfairness in individual sessions. Subsequently, we uncovered how they reached metric consensus in team sessions. Our work shows that the EARN Fairness framework enables stakeholders to express personal preferences and reach consensus, providing practical guidance for implementing human-centered AI fairness in high-risk contexts. Through this approach, we aim to harmonize fairness expectations of diverse stakeholders, fostering more equitable and inclusive AI fairness.
- [20] arXiv:2407.11625 (cross-list from cs.CV) [pdf, html, other]
Title: Beware of Validation by Eye: Visual Validation of Linear Trends in Scatterplots
Comments: Preprint and Author Version of a Full Paper, accepted to the 2024 IEEE Visualization Conference (VIS)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Human-Computer Interaction (cs.HC)
Visual validation of regression models in scatterplots is a common practice for assessing model quality, yet its efficacy remains unquantified. We conducted two empirical experiments to investigate individuals' ability to visually validate linear regression models (linear trends) and to examine the impact of common visualization designs on validation quality. The first experiment showed that the level of accuracy for visual estimation of slope (i.e., fitting a line to data) is higher than for visual validation of slope (i.e., accepting a shown line). Notably, we found bias toward slopes that are "too steep" in both cases. This led to the novel insight that participants naturally assessed regression using orthogonal distances between the points and the line (i.e., orthogonal distance regression, ODR) rather than the common vertical distances (i.e., ordinary least squares, OLS). In the second experiment, we investigated whether incorporating common designs for regression visualization (error lines, bounding boxes, and confidence intervals) would improve visual validation. Even though error lines reduced validation bias, results failed to show the desired improvements in accuracy for any design. Overall, our findings suggest caution in using visual model validation for linear trends in scatterplots.
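The link between orthogonal distances and "too steep" slopes in the abstract above can be illustrated numerically. The following is a minimal sketch, not the authors' analysis: it uses synthetic data, assumes equal error scale on both axes for the orthogonal fit, and the function names `ols_slope` and `odr_slope` are hypothetical.

```python
import numpy as np


def ols_slope(x: np.ndarray, y: np.ndarray) -> float:
    """Slope that minimizes squared vertical distances (ordinary least squares)."""
    xc, yc = x - x.mean(), y - y.mean()
    return float(np.dot(xc, yc) / np.dot(xc, xc))


def odr_slope(x: np.ndarray, y: np.ndarray) -> float:
    """Slope that minimizes squared orthogonal distances (total least squares).

    The best-fit direction is the first right singular vector of the
    centered point cloud. Assumes the fitted line is not vertical.
    """
    pts = np.column_stack([x - x.mean(), y - y.mean()])
    _, _, vt = np.linalg.svd(pts, full_matrices=False)
    dx, dy = vt[0]
    return float(dy / dx)


# Synthetic, illustrative data only: a noisy linear trend.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.8 * x + rng.normal(scale=0.6, size=200)

print("OLS slope:", ols_slope(x, y))
print("ODR slope:", odr_slope(x, y))
```

For noisy data the orthogonal fit is at least as steep in magnitude as the vertical-distance fit, so a viewer who judges by orthogonal distance would tend to prefer a line that looks too steep relative to OLS.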
- [21] arXiv:2407.11671 (cross-list from cs.RO) [pdf, other]
-
Title: A Comparative Analysis of Interactive Reinforcement Learning Algorithms in Warehouse Robot Grid Based EnvironmentSubjects: Robotics (cs.RO); Human-Computer Interaction (cs.HC)
The field of warehouse robotics is currently in high demand, with major technology and logistics companies making significant investments in these advanced systems. Training robots to operate in such complex environments is challenging, often requiring human supervision for adaptation and learning. Interactive reinforcement learning (IRL) is a key training methodology in human-computer interaction. This paper presents a comparative study of two IRL algorithms: Q-learning and SARSA, both trained in a virtual grid-simulation-based warehouse environment. To maintain consistent feedback rewards and avoid bias, feedback was provided by the same individual throughout the study.
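As a rough illustration of how the two compared algorithms differ, the sketch below shows the textbook tabular update rules for Q-learning and SARSA. It assumes the human feedback is delivered as the scalar reward `r`; the abstract does not state how feedback was encoded, and the state names, action names, and hyperparameters here are hypothetical.

```python
def q_learning_update(q, s, a, r, s_next, actions, alpha=0.5, gamma=0.9):
    """Off-policy update: bootstrap from the best action available in s_next."""
    best_next = max(q.get((s_next, b), 0.0) for b in actions)
    old = q.get((s, a), 0.0)
    q[(s, a)] = old + alpha * (r + gamma * best_next - old)


def sarsa_update(q, s, a, r, s_next, a_next, alpha=0.5, gamma=0.9):
    """On-policy update: bootstrap from the action actually taken in s_next."""
    old = q.get((s, a), 0.0)
    q[(s, a)] = old + alpha * (r + gamma * q.get((s_next, a_next), 0.0) - old)


# Hypothetical two-cell example: the robot moves from cell "A" to cell "B".
actions = ["left", "right"]
q_table_ql = {("B", "left"): 1.0, ("B", "right"): 2.0}
q_table_sarsa = dict(q_table_ql)

q_learning_update(q_table_ql, "A", "right", 1.0, "B", actions)
sarsa_update(q_table_sarsa, "A", "right", 1.0, "B", "left")
```

The only difference is the bootstrap term: Q-learning uses the maximum value over next actions, while SARSA uses the value of the next action the agent actually selects.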
- [22] arXiv:2407.11823 (cross-list from cs.LG) [pdf, html, other]
-
Title: Harmonizing Safety and Speed: A Human-Algorithm Approach to Enhance the FDA's Medical Device Clearance PolicySubjects: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC); Optimization and Control (math.OC); Machine Learning (stat.ML)
The United States Food and Drug Administration's (FDA's) Premarket Notification 510(k) pathway allows manufacturers to gain approval for a medical device by demonstrating its substantial equivalence to another legally marketed device. However, the inherent ambiguity of this regulatory procedure has led to high recall rates for many devices cleared through this pathway. This trend has raised significant concerns regarding the efficacy of the FDA's current approach, prompting a reassessment of the 510(k) regulatory framework. In this paper, we develop a combined human-algorithm approach to assist the FDA in improving its 510(k) medical device clearance process by reducing the risk of potential recalls and the workload imposed on the FDA. We first develop machine learning methods to estimate the risk of recall of 510(k) medical devices based on the information available at the time of submission. We then propose a data-driven clearance policy that recommends acceptance, rejection, or deferral to the FDA's committees for in-depth evaluation. We conduct an empirical study using a unique large-scale dataset of over 31,000 medical devices and 12,000 national and international manufacturers from over 65 countries that we assembled based on data sources from the FDA and the Centers for Medicare and Medicaid Services (CMS). A conservative evaluation of our proposed policy based on this data shows a 38.9% improvement in the recall rate and a 43.0% reduction in the FDA's workload. Our analyses also indicate that implementing our policy could result in significant annual cost savings of between $2.4 billion and $2.7 billion, which highlights the value of using a holistic and data-driven approach to improve the FDA's current 510(k) medical device evaluation pathway.
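One simple way to picture a three-way accept/reject/defer recommendation is a pair of thresholds on an estimated recall risk. The sketch below is purely illustrative: the abstract does not describe the form of the authors' policy, and the function name and both threshold values are hypothetical.

```python
def clearance_recommendation(recall_risk, accept_below=0.2, reject_above=0.8):
    """Map an estimated recall probability to a recommendation.

    The two thresholds are hypothetical placeholders, not values from the paper.
    Submissions with intermediate risk are deferred to human reviewers.
    """
    if not 0.0 <= recall_risk <= 1.0:
        raise ValueError("recall_risk must be a probability in [0, 1]")
    if recall_risk < accept_below:
        return "accept"
    if recall_risk > reject_above:
        return "reject"
    return "defer"


for risk in (0.05, 0.5, 0.95):
    print(risk, clearance_recommendation(risk))
```

Widening the gap between the two thresholds sends more submissions to committee review, which trades a higher workload for fewer automated decisions.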
Cross submissions for Wednesday, 17 July 2024 (showing 11 of 11 entries)
- [23] arXiv:2309.12367 (replaced) [pdf, html, other]
-
Title: Examining the Influence of Varied Levels of Domain Knowledge Base Inclusion in GPT-based Intelligent TutorsJournal-ref: Proceedings of the 17th International Conference on Educational Data Mining, Pages 649-657, 2024Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Recent advancements in large language models (LLMs) have facilitated the development of chatbots with sophisticated conversational capabilities. However, LLMs frequently give inaccurate responses to queries, hindering applications in educational settings. In this paper, we investigate the effectiveness of integrating a knowledge base (KB) with LLM intelligent tutors to increase response reliability. To achieve this, we design a scalable KB that affords educational supervisors seamless integration of lesson curricula, which is automatically processed by the intelligent tutoring system. We then detail an evaluation in which student participants were asked to respond to questions about the artificial intelligence curriculum. GPT-4 intelligent tutors with varying hierarchies of KB access and human domain experts then assessed these responses. Lastly, students compared the intelligent tutors' responses with the domain experts' and ranked their various pedagogical abilities. Results suggest that, although these intelligent tutors still demonstrate lower accuracy than domain experts, the accuracy of the intelligent tutors increases when access to a KB is granted. We also observe that the intelligent tutors with KB access are better than domain experts at speaking like a teacher and understanding students, while their ability to help students still lags behind that of domain experts.
- [24] arXiv:2402.14947 (replaced) [pdf, html, other]
-
Title: An Avalanche of Images on Telegram Preceded Russia's Full-Scale Invasion of UkraineComments: 20 pages, 7 figuresSubjects: Human-Computer Interaction (cs.HC); Multimedia (cs.MM); Social and Information Networks (cs.SI)
Governments use propaganda, including through visual content -- or Politically Salient Image Patterns (PSIP) -- on social media, to influence and manipulate public opinion. In the present work, we collected the Telegram post history of 989 Russian milbloggers to better understand the social and political narratives that circulated online in the months surrounding Russia's 2022 full-scale invasion of Ukraine. Overall, we found an 8,925% increase (p<0.001) in the number of posts and a 5,352% increase (p<0.001) in the number of images posted by these accounts in the two weeks prior to the invasion. We also observed a similar increase in the number and intensity of politically salient manipulated images that circulated on Telegram. Although this paper does not evaluate malice or coordination in these activities, we do conclude with a call for further research into the role that manipulated visual media has in the lead-up to instability events and armed conflict.
- [25] arXiv:2405.08948 (replaced) [pdf, html, other]
-
Title: Analyzing Nursing Assistant Attitudes Towards Empathic Geriatric Caregiving Using Quantitative EthnographyComments: B. Kiafar, S. Daher, S. Sharmin, A. Ahmmed, L. Thiamwong, and R. L. Barmaki, "Analyzing Nursing Assistant Attitudes Towards Geriatric Caregiving Using Epistemic Network Analysis", International Conference in Quantitative Ethnography (ICQE 24), Philadelphia, PA, USA, Nov 3 - 5, 2024 (Accepted)Subjects: Human-Computer Interaction (cs.HC)
An emergent challenge in geriatric care is improving the quality of care, which requires insight from stakeholders. Qualitative methods offer detailed insights, but they can be biased and have limited generalizability, while quantitative methods may miss nuances. Network-based approaches, such as quantitative ethnography (QE), can bridge this methodological gap. By leveraging the strengths of both methods, QE provides profound insights into need-finding interviews. In this paper, to better understand geriatric care attitudes, we interviewed ten nursing assistants, used QE to analyze the data, and compared their daily activities in real life with training experiences. A two-sample t-test with a large effect size (Cohen's d=1.63) indicated a significant difference between real-life and training activities. The findings suggested incorporating more empathetic training scenarios into the future design of our geriatric care simulation. The results have implications for human-computer interaction and human factors. This is illustrated by presenting an example of using QE to analyze expert interviews with nursing assistants as caregivers to inform subsequent design processes.
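For readers unfamiliar with the effect size reported above, the following sketch computes Cohen's d for two independent samples using the pooled standard deviation. The function name and the sample values are hypothetical; the abstract does not specify which variant of d the authors used.

```python
import math
import statistics


def cohens_d(sample_a, sample_b):
    """Cohen's d for two independent samples, using the pooled standard deviation."""
    n_a, n_b = len(sample_a), len(sample_b)
    var_a = statistics.variance(sample_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(sample_b)
    pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (statistics.mean(sample_a) - statistics.mean(sample_b)) / pooled_sd


# Hypothetical scores for two conditions.
condition_a = [2.0, 4.0, 6.0]
condition_b = [1.0, 3.0, 5.0]
print("Cohen's d:", cohens_d(condition_a, condition_b))
```

By the commonly cited convention, values of d around 0.8 or above are described as large, so the reported d=1.63 indicates group means separated by well over one pooled standard deviation.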
- [26] arXiv:2406.18675 (replaced) [pdf, html, other]
-
Title: Human-AI Collaborative Taxonomy Construction: A Case Study in Profession-Specific Writing AssistantsComments: Accepted to CHI 2024 In2Writing WorkshopSubjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Large Language Models (LLMs) have assisted humans in several writing tasks, including text revision and story generation. However, their effectiveness in supporting domain-specific writing, particularly in business contexts, is relatively less explored. Our formative study with industry professionals revealed the limitations in current LLMs' understanding of the nuances in such domain-specific writing. To address this gap, we propose an approach of human-AI collaborative taxonomy development to serve as a guideline for domain-specific writing assistants. This method integrates iterative feedback from domain experts and multiple interactions between these experts and LLMs to refine the taxonomy. Through larger-scale experiments, we aim to validate this methodology and thus improve LLM-powered writing assistance, tailoring it to meet the unique needs of different stakeholders.
- [27] arXiv:2407.08850 (replaced) [pdf, html, other]
-
Title: UICrit: Enhancing Automated Design Evaluation with a UICritique DatasetSubjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Automated UI evaluation can be beneficial for the design process; for example, to compare different UI designs, or conduct automated heuristic evaluation. LLM-based UI evaluation, in particular, holds the promise of generalizability to a wide variety of UI types and evaluation tasks. However, current LLM-based techniques do not yet match the performance of human evaluators. We hypothesize that automatic evaluation can be improved by collecting a targeted UI feedback dataset and then using this dataset to enhance the performance of general-purpose LLMs. We present a targeted dataset of 3,059 design critiques and quality ratings for 983 mobile UIs, collected from seven experienced designers. We carried out an in-depth analysis to characterize the dataset's features. We then applied this dataset to achieve a 55% performance gain in LLM-generated UI feedback via various few-shot and visual prompting techniques. We also discuss future applications of this dataset, including training a reward model for generative UI techniques, and fine-tuning a tool-agnostic multi-modal LLM that automates UI evaluation.
- [28] arXiv:2407.09528 (replaced) [pdf, other]
-
Title: Prism XR -- A Curated Exhibition Experience in Virtual Reality with Peer Annotation Features and Virtual Guides for Art and Archaeology ClassesSubjects: Human-Computer Interaction (cs.HC)
The Prism XR project is a curated exhibition experience in virtual reality (VR) for art and archaeology education with features designed to enhance interactivity and collaborative learning. The project integrates peer annotations and a virtual exhibition guide to augment educational experiences. The peer annotation features are intended to facilitate visitor critiques and comments, which are pivotal in fostering dialogue between the curator and the audience and among visitors in art and archaeology education, and which have been shown to have positive impacts on learning motivation and learning outcomes. The virtual exhibition guide is intended to address the issue of isolation in the virtual exhibition space and to increase interactivity in the virtual curatorial experience.
- [29] arXiv:2306.03928 (replaced) [pdf, html, other]
-
Title: Designing Decision Support Systems Using Counterfactual Prediction SetsComments: Best paper award in the ICML 2023 AI&HCI Workshop, spotlight paper at ICML 2024Subjects: Machine Learning (cs.LG); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Methodology (stat.ME); Machine Learning (stat.ML)
Decision support systems for classification tasks are predominantly designed to predict the value of the ground truth labels. However, since their predictions are not perfect, these systems also need to make human experts understand when and how to use these predictions to update their own predictions. Unfortunately, this has proven challenging. In this context, it has been recently argued that an alternative type of decision support system may circumvent this challenge. Rather than providing a single label prediction, these systems provide a set of label prediction values constructed using a conformal predictor, namely a prediction set, and forcefully ask experts to predict a label value from the prediction set. However, the design and evaluation of these systems have so far relied on stylized expert models, questioning their promise. In this paper, we revisit the design of this type of system from the perspective of online learning and develop a methodology that neither requires nor assumes an expert model. Our methodology leverages the nested structure of the prediction sets provided by any conformal predictor and a natural counterfactual monotonicity assumption to achieve an exponential improvement in regret in comparison to vanilla bandit algorithms. We conduct a large-scale human subject study (n = 2,751) to compare our methodology to several competitive baselines. The results show that, for decision support systems based on prediction sets, limiting experts' level of agency leads to greater performance than allowing experts to always exercise their own agency. We have made available the data gathered in our human subject study as well as an open source implementation of our system at this https URL.
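The nested structure of conformal prediction sets can be seen in a small sketch of standard split conformal prediction. This is not the authors' implementation: the nonconformity score (one minus the predicted probability of a label), the calibration scores, and the class probabilities below are all assumptions chosen for illustration.

```python
import math


def conformal_threshold(calibration_scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score.

    If that rank exceeds n, every label must be included, so return infinity.
    """
    n = len(calibration_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return float("inf")
    return sorted(calibration_scores)[rank - 1]


def prediction_set(label_probs, threshold):
    """Keep every label whose nonconformity score (1 - probability) is within the threshold."""
    return {label for label, p in label_probs.items() if 1.0 - p <= threshold}


# Hypothetical calibration scores: 1 - probability assigned to the true label.
calibration_scores = [0.1, 0.3, 0.5, 0.6, 0.7, 0.8, 0.9]
# Hypothetical classifier output for one new sample.
label_probs = {"cat": 0.6, "dog": 0.3, "fox": 0.1}

narrow_set = prediction_set(label_probs, conformal_threshold(calibration_scores, alpha=0.5))
wide_set = prediction_set(label_probs, conformal_threshold(calibration_scores, alpha=0.25))
print(narrow_set, wide_set)
```

Lowering `alpha` can only raise the threshold, so the set for a smaller `alpha` always contains the set for a larger one. This is the nesting property that the abstract says the methodology exploits.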
- [30] arXiv:2403.05060 (replaced) [pdf, html, other]
-
Title: Multimodal Infusion Tuning for Large ModelsSubjects: Multimedia (cs.MM); Human-Computer Interaction (cs.HC)
Recent advancements in large-scale models have showcased remarkable generalization capabilities in various tasks. However, integrating multimodal processing into these models presents a significant challenge, as it often comes with a high computational burden. To address this challenge, we introduce a new parameter-efficient multimodal tuning strategy for large models in this paper, referred to as Multimodal Infusion Tuning (MiT). MiT leverages decoupled self-attention mechanisms within large language models to effectively integrate information from diverse modalities such as images and acoustics. In MiT, we also design a novel adaptive rescaling strategy at the attention head level, which optimizes the representation of infused multimodal features. Notably, all foundation models are kept frozen during the tuning process to reduce the computational burden, and only 2.5% of parameters are tunable. We conduct experiments across a range of multimodal tasks, including image-related tasks like referring segmentation and non-image tasks such as sentiment analysis. Our results showcase that MiT achieves state-of-the-art performance in multimodal understanding while significantly reducing computational overhead (10% of previous methods). Moreover, our tuned model exhibits robust reasoning abilities even in complex scenarios.
- [31] arXiv:2406.10273 (replaced) [pdf, html, other]
-
Title: Beyond Words: On Large Language Models Actionability in Mission-Critical Risk AnalysisSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Human-Computer Interaction (cs.HC)
Context. Risk analysis assesses potential risks in specific scenarios. Risk analysis principles are context-less; the same methodology can be applied to a risk connected to health and information technology security. Risk analysis requires a vast knowledge of national and international regulations and standards and is time- and effort-intensive. A large language model can summarize information in less time than a human and can be fine-tuned to specific tasks. Aim. Our empirical study aims to investigate the effectiveness of Retrieval-Augmented Generation (RAG) and fine-tuned LLMs in risk analysis. To our knowledge, no prior study has explored their capabilities in risk analysis. Method. We manually curated \totalscenarios unique scenarios leading to \totalsamples representative samples from over 50 mission-critical analyses archived by the industrial context team in the last five years. We compared the base GPT-3.5 and GPT-4 models versus their Retrieval-Augmented Generation and fine-tuned counterparts. We employ two human experts as competitors of the models and three other human experts to review the models' and the former human experts' analyses. The reviewers analyzed 5,000 scenario analyses. Results and Conclusions. Human experts (HEs) demonstrated higher accuracy, but LLMs are quicker and more actionable. Moreover, our findings show that RAG-assisted LLMs have the lowest hallucination rates, effectively uncovering hidden risks and complementing human expertise. Thus, the choice of model depends on specific needs, with fine-tuned models (FTMs) for accuracy, RAG for hidden risks discovery, and base models for comprehensiveness and actionability. Therefore, experts can leverage LLMs as an effective complementary companion in risk analysis within a condensed timeframe. They can also save costs by averting unnecessary expenses associated with implementing unwarranted countermeasures.