-
LiveBench: A Challenging, Contamination-Free LLM Benchmark
Authors:
Colin White,
Samuel Dooley,
Manley Roberts,
Arka Pal,
Ben Feuer,
Siddhartha Jain,
Ravid Shwartz-Ziv,
Neel Jain,
Khalid Saifullah,
Siddartha Naidu,
Chinmay Hegde,
Yann LeCun,
Tom Goldstein,
Willie Neiswanger,
Micah Goldblum
Abstract:
Test set contamination, wherein test data from a benchmark ends up in a newer model's training set, is a well-documented obstacle for fair LLM evaluation and can quickly render benchmarks obsolete. To mitigate this, many recent benchmarks crowdsource new prompts and evaluations from human or LLM judges; however, these can introduce significant biases, and break down when scoring hard questions. In this work, we introduce a new benchmark for LLMs designed to be immune to both test set contamination and the pitfalls of LLM judging and human crowdsourcing. We release LiveBench, the first benchmark that (1) contains frequently-updated questions from recent information sources, (2) scores answers automatically according to objective ground-truth values, and (3) contains a wide variety of challenging tasks, spanning math, coding, reasoning, language, instruction following, and data analysis. To achieve this, LiveBench contains questions that are based on recently-released math competitions, arXiv papers, news articles, and datasets, and it contains harder, contamination-free versions of tasks from previous benchmarks such as Big-Bench Hard, AMPS, and IFEval. We evaluate many prominent closed-source models, as well as dozens of open-source models ranging from 0.5B to 110B in size. LiveBench is difficult, with top models achieving below 65% accuracy. We release all questions, code, and model answers. Questions will be added and updated on a monthly basis, and we will release new tasks and harder versions of tasks over time so that LiveBench can distinguish between the capabilities of LLMs as they improve in the future. We welcome community engagement and collaboration for expanding the benchmark tasks and models.
Submitted 27 June, 2024;
originally announced June 2024.
-
Discovering influential text using convolutional neural networks
Authors:
Megan Ayers,
Luke Sanford,
Margaret Roberts,
Eddie Yang
Abstract:
Experimental methods for estimating the impacts of text on human evaluation have been widely used in the social sciences. However, researchers in experimental settings are usually limited to testing a small number of pre-specified text treatments. While efforts to mine unstructured texts for features that causally affect outcomes have been ongoing in recent years, these models have primarily focused on the topics or specific words of text, which may not always be the mechanism of the effect. We connect these efforts with NLP interpretability techniques and present a method for flexibly discovering clusters of similar text phrases that are predictive of human reactions to texts using convolutional neural networks. When used in an experimental setting, this method can identify text treatments and their effects under certain assumptions. We apply the method to two datasets. The first enables direct validation of the model's ability to detect phrases known to cause the outcome. The second demonstrates its ability to flexibly discover text treatments with varying textual structures. In both cases, the model learns a greater variety of text treatments compared to benchmark methods, and these text features quantitatively meet or exceed the ability of benchmark methods to predict the outcome.
Submitted 21 June, 2024; v1 submitted 14 June, 2024;
originally announced June 2024.
-
Large Language Models Must Be Taught to Know What They Don't Know
Authors:
Sanyam Kapoor,
Nate Gruver,
Manley Roberts,
Katherine Collins,
Arka Pal,
Umang Bhatt,
Adrian Weller,
Samuel Dooley,
Micah Goldblum,
Andrew Gordon Wilson
Abstract:
When using large language models (LLMs) in high-stakes applications, we need to know when we can trust their predictions. Some works argue that prompting high-performance LLMs is sufficient to produce calibrated uncertainties, while others introduce sampling methods that can be prohibitively expensive. In this work, we first argue that prompting on its own is insufficient to achieve good calibration and then show that fine-tuning on a small dataset of correct and incorrect answers can create an uncertainty estimate with good generalization and small computational overhead. We show that a thousand graded examples are sufficient to outperform baseline methods and that training through the features of a model is necessary for good performance and tractable for large open-source models when using LoRA. We also investigate the mechanisms that enable reliable LLM uncertainty estimation, finding that many models can be used as general-purpose uncertainty estimators, applicable not just to their own uncertainties but also the uncertainty of other models. Lastly, we show that uncertainty estimates inform human use of LLMs in human-AI collaborative settings through a user study.
Submitted 12 June, 2024;
originally announced June 2024.
-
A study on the adequacy of common IQA measures for medical images
Authors:
Anna Breger,
Clemens Karner,
Ian Selby,
Janek Gröhl,
Sören Dittmer,
Edward Lilley,
Judith Babar,
Jake Beckford,
Timothy J Sadler,
Shahab Shahipasand,
Arthikkaa Thavakumar,
Michael Roberts,
Carola-Bibiane Schönlieb
Abstract:
Image quality assessment (IQA) is standard practice in the development stage of novel machine learning algorithms that operate on images. The most commonly used IQA measures have been developed and tested for natural images, but not in the medical setting. Reported inconsistencies arising in medical images are not surprising, as they have different properties than natural images. In this study, we test the applicability of common IQA measures for medical image data by comparing their assessment to manually rated chest X-ray (5 experts) and photoacoustic image data (1 expert). Moreover, we include supplementary studies on grayscale natural images and accelerated brain MRI data. The results of all experiments show a similar outcome in line with previous findings for medical imaging: PSNR and SSIM in the default setting are in the lower range of the result list and HaarPSI outperforms the other tested measures in the overall performance. Also among the top performers in our medical experiments are the full reference measures DISTS, FSIM, LPIPS and MS-SSIM. Generally, the results on natural images yield considerably higher correlations, suggesting that the additional employment of tailored IQA measures for medical imaging algorithms is needed.
Submitted 29 May, 2024;
originally announced May 2024.
-
A study of why we need to reassess full reference image quality assessment with medical images
Authors:
Anna Breger,
Ander Biguri,
Malena Sabaté Landman,
Ian Selby,
Nicole Amberg,
Elisabeth Brunner,
Janek Gröhl,
Sepideh Hatamikia,
Clemens Karner,
Lipeng Ning,
Sören Dittmer,
Michael Roberts,
AIX-COVNET Collaboration,
Carola-Bibiane Schönlieb
Abstract:
Image quality assessment (IQA) is not just indispensable in clinical practice to ensure high standards, but also in the development stage of novel algorithms that operate on medical images with reference data. This paper provides a structured and comprehensive collection of examples where the two most common full reference (FR) image quality measures prove to be unsuitable for the assessment of novel algorithms using different kinds of medical images, including real-world MRI, CT, OCT, X-Ray, digital pathology and photoacoustic imaging data. In particular, the FR-IQA measures PSNR and SSIM are known and tested for working successfully in many natural imaging tasks, but discrepancies in medical scenarios have been noted in the literature. Inconsistencies arising in medical images are not surprising, as they have very different properties than natural images which have not been targeted nor tested in the development of the mentioned measures, and therefore might imply wrong judgement of novel methods for medical images. Therefore, improvement is urgently needed in particular in this era of AI to increase explainability, reproducibility and generalizability in machine learning for medical imaging and beyond. On top of the pitfalls we will provide ideas for future research as well as suggesting guidelines for the usage of FR-IQA measures applied to medical images.
Submitted 29 May, 2024;
originally announced May 2024.
-
FedMAP: Unlocking Potential in Personalized Federated Learning through Bi-Level MAP Optimization
Authors:
Fan Zhang,
Carlos Esteve-Yagüe,
Sören Dittmer,
Carola-Bibiane Schönlieb,
Michael Roberts
Abstract:
Federated Learning (FL) enables collaborative training of machine learning models on decentralized data while preserving data privacy. However, data across clients often differs significantly due to class imbalance, feature distribution skew, sample size imbalance, and other phenomena. Leveraging information from these not identically distributed (non-IID) datasets poses substantial challenges. FL methods based on a single global model cannot effectively capture the variations in client data and underperform in non-IID settings. Consequently, Personalized FL (PFL) approaches that adapt to each client's data distribution but leverage other clients' data are essential but currently underexplored. We propose a novel Bayesian PFL framework using bi-level optimization to tackle the data heterogeneity challenges. Our proposed framework utilizes the global model as a prior distribution within a Maximum A Posteriori (MAP) estimation of personalized client models. This approach facilitates PFL by integrating shared knowledge from the prior, thereby enhancing local model performance, generalization ability, and communication efficiency. We extensively evaluated our bi-level optimization approach on real-world and synthetic datasets, demonstrating significant improvements in model accuracy compared to existing methods while reducing communication overhead. This study contributes to PFL by establishing a solid theoretical foundation for the proposed method and offering a robust, ready-to-use framework that effectively addresses the challenges posed by non-IID data in FL.
Submitted 29 May, 2024;
originally announced May 2024.
-
When AI Eats Itself: On the Caveats of Data Pollution in the Era of Generative AI
Authors:
Xiaodan Xing,
Fadong Shi,
Jiahao Huang,
Yinzhe Wu,
Yang Nan,
Sheng Zhang,
Yingying Fang,
Mike Roberts,
Carola-Bibiane Schönlieb,
Javier Del Ser,
Guang Yang
Abstract:
Generative artificial intelligence (AI) technologies and large models are producing realistic outputs across various domains, such as images, text, speech, and music. Creating these advanced generative models requires significant resources, particularly large and high-quality datasets. To minimize training expenses, many algorithm developers use data created by the models themselves as a cost-effective training solution. However, not all synthetic data effectively improve model performance, necessitating a strategic balance in the use of real versus synthetic data to optimize outcomes.
Currently, the previously well-controlled integration of real and synthetic data is becoming uncontrollable. The widespread and unregulated dissemination of synthetic data online leads to the contamination of datasets traditionally compiled through web scraping, now mixed with unlabeled synthetic data. This trend portends a future where generative AI systems may increasingly rely blindly on consuming self-generated data, raising concerns about model performance and ethical issues. What will happen if generative AI continuously consumes itself without discernment? What measures can we take to mitigate the potential adverse effects?
There is a significant gap in the scientific literature regarding the impact of synthetic data use in generative AI, particularly in terms of the fusion of multimodal information. To address this research gap, this review investigates the consequences of integrating synthetic data blindly on training generative AI on both image and text modalities and explores strategies to mitigate these effects. The goal is to offer a comprehensive view of synthetic data's role, advocating for a balanced approach to its use and exploring practices that promote the sustainable development of generative AI technologies in the era of large models.
Submitted 15 May, 2024;
originally announced May 2024.
-
Automatically Learning HTN Methods from Landmarks
Authors:
Ruoxi Li,
Dana Nau,
Mark Roberts,
Morgan Fine-Morris
Abstract:
Hierarchical Task Network (HTN) planning usually requires a domain engineer to provide manual input about how to decompose a planning problem. Even HTN-MAKER, a well-known method-learning algorithm, requires a domain engineer to annotate the tasks with information about what to learn. We introduce CURRICULAMA, an HTN method learning algorithm that completely automates the learning process. It uses landmark analysis to compose annotated tasks and leverages curriculum learning to order the learning of methods from simpler to more complex. This eliminates the need for manual input, resolving a core issue with HTN-MAKER. We prove CURRICULAMA's soundness, and show experimentally that it has a substantially similar convergence rate in learning a complete set of methods to HTN-MAKER.
Submitted 9 April, 2024;
originally announced April 2024.
-
Optimized Model Selection for Estimating Treatment Effects from Costly Simulations of the US Opioid Epidemic
Authors:
Abdulrahman A. Ahmed,
M. Amin Rahimian,
Mark S. Roberts
Abstract:
Agent-based simulation with a synthetic population can help us compare different treatment conditions while keeping everything else constant within the same population (i.e., as digital twins). Such population-scale simulations require large computational power (i.e., CPU resources) to get accurate estimates for treatment effects. We can use meta models of the simulation results to circumvent the need to simulate every treatment condition. Selecting the best estimating model at a given sample size (number of simulation runs) is a crucial problem. Depending on the sample size, the ability of the method to estimate accurately can change significantly. In this paper, we discuss different methods to explore what model works best at a specific sample size. In addition to the empirical results, we provide a mathematical analysis of the MSE equation and how its components decide which model to select and why a specific method behaves that way in a range of sample sizes. The analysis showed why the direction estimation method is better than model-based methods in larger sample sizes and how the between-group variation and the within-group variation affect the MSE equation.
Submitted 23 March, 2024;
originally announced March 2024.
-
Goal-Oriented End-User Programming of Robots
Authors:
David Porfirio,
Mark Roberts,
Laura M. Hiatt
Abstract:
End-user programming (EUP) tools must balance user control with the robot's ability to plan and act autonomously. Many existing task-oriented EUP tools enforce a specific level of control, e.g., by requiring that users hand-craft detailed sequences of actions, rather than offering users the flexibility to choose the level of task detail they wish to express. We therefore created a novel EUP system, Polaris, that, in contrast to most existing EUP tools, uses goal predicates as the fundamental building block of programs. Users can thereby express high-level robot objectives or lower-level checkpoints of their choosing, while an off-the-shelf task planner fills in any remaining program detail. To ensure that goal-specified programs adhere to user expectations of robot behavior, Polaris is equipped with a Plan Visualizer that exposes the planner's output to the user before runtime. In what follows, we describe our design of Polaris and its evaluation with 32 human participants. Our results support the Plan Visualizer's ability to help users craft higher-quality programs. Furthermore, there are strong associations between user perception of the robot and Plan Visualizer usage, and evidence that robot familiarity has a key role in shaping user experience.
Submitted 20 March, 2024;
originally announced March 2024.
-
Considerations for End-User Development in the Caregiving Domain
Authors:
Laura Stegner,
David Porfirio,
Mark Roberts,
Laura M. Hiatt
Abstract:
As service robots become more capable of autonomous behaviors, it becomes increasingly important to consider how people communicate with a robot what task it should perform and how to do the task. Accordingly, there has been a rise in attention to end-user development (EUD) interfaces, which enable non-roboticist end users to specify tasks for autonomous robots to perform. However, state-of-the-art EUD interfaces are often constrained through simplified domains or restrictive end-user interaction. Motivated by prior qualitative design work that explores how to integrate a care robot in an assisted living community, we discuss the challenges of EUD in this complex domain. One set of challenges stems from different user-facing representations, e.g., certain tasks may lend themselves better to rule-based trigger-action representations, whereas other tasks may be easier to specify via sequences of actions. The other stems from considering the needs of multiple stakeholders, e.g., caregivers and residents of the facility may all create tasks for the robot, but the robot may not be able to share information about all tasks with all residents due to privacy concerns. We present scenarios that illustrate these challenges and also discuss possible solutions.
Submitted 27 February, 2024;
originally announced February 2024.
-
Smaug: Fixing Failure Modes of Preference Optimisation with DPO-Positive
Authors:
Arka Pal,
Deep Karkhanis,
Samuel Dooley,
Manley Roberts,
Siddartha Naidu,
Colin White
Abstract:
Direct Preference Optimisation (DPO) is effective at significantly improving the performance of large language models (LLMs) on downstream tasks such as reasoning, summarisation, and alignment. Using pairs of preferred and dispreferred data, DPO models the relative probability of picking one response over another. In this work, first we show theoretically that the standard DPO loss can lead to a reduction of the model's likelihood of the preferred examples, as long as the relative probability between the preferred and dispreferred classes increases. We then show empirically that this phenomenon occurs when fine-tuning LLMs on common datasets, especially datasets in which the edit distance between pairs of completions is low. Using these insights, we design DPO-Positive (DPOP), a new loss function and training procedure which avoids this failure mode. Surprisingly, we find that DPOP outperforms DPO and other fine-tuning procedures across a wide variety of datasets and downstream tasks, including datasets with high edit distances between completions. Furthermore, we find that the DPOP-tuned model outperforms the DPO-tuned model (all else equal) on benchmarks independent of the fine-tuning data, such as MT-Bench. Finally, using DPOP, we create and open-source Smaug-34B and Smaug-72B, with the latter becoming the first open-source LLM to surpass an average accuracy of 80% on the HuggingFace Open LLM Leaderboard.
Submitted 3 July, 2024; v1 submitted 20 February, 2024;
originally announced February 2024.
-
Human-Centric Goal Reasoning with Ripple-Down Rules
Authors:
Kenji Brameld,
Germán Castro,
Claude Sammut,
Mark Roberts,
David W. Aha
Abstract:
ActorSim is a goal reasoning framework developed at the Naval Research Laboratory. Originally, all goal reasoning rules were hand-crafted. This work extends ActorSim with the capability of learning by demonstration, that is, when a human trainer disagrees with a decision made by the system, the trainer can take over and show the system the correct decision. The learning component uses Ripple-Down Rules (RDR) to build new decision rules to correctly handle similar cases in the future. The system is demonstrated using the RoboCup Rescue Agent Simulation, which simulates a city-wide disaster, requiring emergency services, including fire, ambulance and police, to be dispatched to different sites to evacuate civilians from dangerous situations. The RDRs are implemented in a scripting language, FrameScript, which is used to mediate between ActorSim and the agent simulator. Using Ripple-Down Rules, ActorSim can scale to an order of magnitude more goals than the previous version.
Submitted 30 January, 2024;
originally announced February 2024.
-
The curious case of the test set AUROC
Authors:
Michael Roberts,
Alon Hazan,
Sören Dittmer,
James H. F. Rudd,
Carola-Bibiane Schönlieb
Abstract:
Whilst the size and complexity of ML models have rapidly and significantly increased over the past decade, the methods for assessing their performance have not kept pace. In particular, among the many potential performance metrics, the ML community stubbornly continues to use (a) the area under the receiver operating characteristic curve (AUROC) for a validation and test cohort (distinct from training data) or (b) the sensitivity and specificity for the test data at an optimal threshold determined from the validation ROC. However, we argue that considering scores derived from the test ROC curve alone gives only a narrow insight into how a model performs and its ability to generalise.
Submitted 19 December, 2023;
originally announced December 2023.
-
New Horizons: Pioneering Pharmaceutical R&D with Generative AI from lab to the clinic -- an industry perspective
Authors:
Guy Doron,
Sam Genway,
Mark Roberts,
Sai Jasti
Abstract:
The rapid advance of generative AI is reshaping the strategic vision for R&D across industries. The unique challenges of pharmaceutical R&D will see applications of generative AI deliver value along the entire value chain from early discovery to regulatory approval. This perspective reviews these challenges and takes a three-horizon approach to explore the generative AI applications already delivering impact, the disruptive opportunities which are just around the corner, and the longer-term transformation which will shape the future of the industry. Selected applications are reviewed for their potential to increase productivity, accelerate timelines, improve the quality of research, data and decision making, and support a sustainable future for the industry. Recommendations are given for Pharma R&D leaders developing a generative AI strategy today which will lay the groundwork for getting real value from the technology and safeguarding future growth. Generative AI is today providing new, efficient routes to accessing and combining organisational data to drive productivity. Next, this impact will reach clinical development, enhancing the patient experience, driving operational efficiency, and unlocking digital innovation to better tackle the future burden of disease. Looking to the furthest horizon, rapid acquisition of rich multi-omics data, which capture the 'language of life', in combination with next generation AI technologies will allow organisations to close the loop around phases of the pipeline through rapid, automated generation and testing of hypotheses from bench to bedside. This provides a vision for the future of R&D with sustainability at the core, with reduced timescales and reduced dependency on resources, while offering new hope to patients to treat the untreatable and ultimately cure diseases.
Submitted 19 December, 2023;
originally announced December 2023.
-
Data Contamination Through the Lens of Time
Authors:
Manley Roberts,
Himanshu Thakur,
Christine Herlihy,
Colin White,
Samuel Dooley
Abstract:
Recent claims about the impressive abilities of large language models (LLMs) are often supported by evaluating publicly available benchmarks. Since LLMs train on wide swaths of the internet, this practice raises concerns of data contamination, i.e., evaluating on examples that are explicitly or implicitly included in the training data. Data contamination remains notoriously challenging to measure and mitigate, even with partial attempts like controlled experimentation of training data, canary strings, or embedding similarities. In this work, we conduct the first thorough longitudinal analysis of data contamination in LLMs by using the natural experiment of training cutoffs in GPT models to look at benchmarks released over time. Specifically, we consider two code/mathematical problem-solving datasets, Codeforces and Project Euler, and find statistically significant trends among LLM pass rate vs. GitHub popularity and release date that provide strong evidence of contamination. By open-sourcing our dataset, raw results, and evaluation framework, our work paves the way for rigorous analyses of data contamination in modern models. We conclude with a discussion of best practices and future steps for publicly releasing benchmarks in the age of LLMs that train on webscale data.
Submitted 16 October, 2023;
originally announced October 2023.
-
Recent Methodological Advances in Federated Learning for Healthcare
Authors:
Fan Zhang,
Daniel Kreuter,
Yichen Chen,
Sören Dittmer,
Samuel Tull,
Tolou Shadbahr,
BloodCounts! Collaboration,
Jacobus Preller,
James H. F. Rudd,
John A. D. Aston,
Carola-Bibiane Schönlieb,
Nicholas Gleadall,
Michael Roberts
Abstract:
For healthcare datasets, it is often not possible to combine data samples from multiple sites due to ethical, privacy or logistical concerns. Federated learning allows for the utilisation of powerful machine learning algorithms without requiring the pooling of data. Healthcare data has many simultaneous challenges which require new methodologies to address, such as highly-siloed data, class imbalance, missing data, distribution shifts and non-standardised variables. Federated learning adds significant methodological complexity to conventional centralised machine learning, requiring distributed optimisation, communication between nodes, aggregation of models and redistribution of models. In this systematic review, we consider all papers on Scopus that were published between January 2015 and February 2023 and which describe new federated learning methodologies for addressing challenges with healthcare data. We performed a detailed review of the 89 papers which fulfilled these criteria. Significant systemic issues were identified throughout the literature which compromise the methodologies in many of the papers reviewed. We give detailed recommendations to help improve the quality of the methodology development for federated learning in healthcare.
Submitted 4 October, 2023;
originally announced October 2023.
-
Estimating Treatment Effects Using Costly Simulation Samples from a Population-Scale Model of Opioid Use Disorder
Authors:
Abdulrahman A. Ahmed,
M. Amin Rahimian,
Mark S. Roberts
Abstract:
Large-scale models require substantial computational resources for analysis and studying treatment conditions. Specifically, estimating treatment effects using simulations may require an infeasible amount of resources at every treatment condition. Therefore, it is essential to develop efficient methods to allocate computational resources for estimating treatment effects. Agent-based simulation allows us to generate highly realistic simulation samples. FRED (A Framework for Reconstructing Epidemiological Dynamics) is an agent-based modeling system with a geospatial perspective using a synthetic population constructed based on the U.S. census data. Given its synthetic population, FRED simulations present a baseline for comparable results across different treatment conditions. In this paper, we show three other methods for estimating treatment effects. In the first method, we resort to brute-force allocation, where all treatment conditions have an equal number of samples with a relatively large number of simulation runs. In the second method, we try to reduce the number of simulation runs by customizing individual samples required for each treatment effect based on the width of confidence intervals around the mean estimates. In the third method, we use a regression model, which allows us to learn across the treatment conditions such that simulation samples allocated for a treatment condition will help better estimate treatment effects in other conditions. We show that the regression-based methods result in a comparable estimate of treatment effects with fewer computational resources. The reduced variability and faster convergence of model-based estimates come at the cost of increased bias, and the bias-variance trade-off can be controlled by adjusting the number of model parameters (e.g., including higher-order interaction terms in the regression model).
Submitted 24 August, 2023;
originally announced August 2023.
-
Giraffe: Adventures in Expanding Context Lengths in LLMs
Authors:
Arka Pal,
Deep Karkhanis,
Manley Roberts,
Samuel Dooley,
Arvind Sundararajan,
Siddartha Naidu
Abstract:
Modern large language models (LLMs) that rely on attention mechanisms are typically trained with fixed context lengths which enforce upper limits on the length of input sequences that they can handle at evaluation time. To use these models on sequences longer than the train-time context length, one might employ techniques from the growing family of context length extrapolation methods -- most of which focus on modifying the system of positional encodings used in the attention mechanism to indicate where tokens or activations are located in the input sequence. We conduct a wide survey of existing methods of context length extrapolation on a base LLaMA or LLaMA 2 model, and introduce some of our own design as well -- in particular, a new truncation strategy for modifying the basis for the position encoding.
We test these methods using three new evaluation tasks (FreeFormQA, AlteredNumericQA, and LongChat-Lines) as well as perplexity, which we find to be less fine-grained as a measure of long context performance of LLMs. We release the three tasks publicly as datasets on HuggingFace. We discover that linear scaling is the best method for extending context length, and show that further gains can be achieved by using longer scales at evaluation time. We also discover promising extrapolation capabilities in the truncated basis. To support further research in this area, we release three new 13B parameter long-context models which we call Giraffe: 4k and 16k context models trained from base LLaMA-13B, and a 32k context model trained from base LLaMA2-13B. We also release the code to replicate our results.
Submitted 21 August, 2023;
originally announced August 2023.
-
REFORMS: Reporting Standards for Machine Learning Based Science
Authors:
Sayash Kapoor,
Emily Cantrell,
Kenny Peng,
Thanh Hien Pham,
Christopher A. Bail,
Odd Erik Gundersen,
Jake M. Hofman,
Jessica Hullman,
Michael A. Lones,
Momin M. Malik,
Priyanka Nanayakkara,
Russell A. Poldrack,
Inioluwa Deborah Raji,
Michael Roberts,
Matthew J. Salganik,
Marta Serra-Garcia,
Brandon M. Stewart,
Gilles Vandewiele,
Arvind Narayanan
Abstract:
Machine learning (ML) methods are proliferating in scientific research. However, the adoption of these methods has been accompanied by failures of validity, reproducibility, and generalizability. These failures can hinder scientific progress, lead to false consensus around invalid claims, and undermine the credibility of ML-based science. ML methods are often applied and fail in similar ways across disciplines. Motivated by this observation, our goal is to provide clear reporting standards for ML-based science. Drawing from an extensive review of past literature, we present the REFORMS checklist ($\textbf{Re}$porting Standards $\textbf{For}$ $\textbf{M}$achine Learning Based $\textbf{S}$cience). It consists of 32 questions and a paired set of guidelines. REFORMS was developed based on a consensus of 19 researchers across computer science, data science, mathematics, social sciences, and biomedical sciences. REFORMS can serve as a resource for researchers when designing and implementing a study, for referees when reviewing papers, and for journals when enforcing standards for transparency and reproducibility.
Submitted 19 September, 2023; v1 submitted 15 August, 2023;
originally announced August 2023.
-
Reinterpreting survival analysis in the universal approximator age
Authors:
Sören Dittmer,
Michael Roberts,
Jacobus Preller,
AIX COVNET,
James H. F. Rudd,
John A. D. Aston,
Carola-Bibiane Schönlieb
Abstract:
Survival analysis is an integral part of the statistical toolbox. However, while most domains of classical statistics have embraced deep learning, survival analysis only recently gained some minor attention from the deep learning community. This recent development is likely in part motivated by the COVID-19 pandemic. We aim to provide the tools needed to fully harness the potential of survival analysis in deep learning. On the one hand, we discuss how survival analysis connects to classification and regression. On the other hand, we provide technical tools. We provide a new loss function, evaluation metrics, and the first universal approximating network that provably produces survival curves without numeric integration. We show that the loss function and model outperform other approaches using a large numerical study.
Submitted 25 July, 2023;
originally announced July 2023.
-
Inferring epidemic dynamics using Gaussian process emulation of agent-based simulations
Authors:
Abdulrahman A. Ahmed,
M. Amin Rahimian,
Mark S. Roberts
Abstract:
Computational models help decision makers understand epidemic dynamics to optimize public health interventions. Agent-based simulation of disease spread in synthetic populations allows us to compare and contrast different effects across identical populations or to investigate the effect of interventions keeping every other factor constant between ``digital twins''. FRED (A Framework for Reconstructing Epidemiological Dynamics) is an agent-based modeling system with a geo-spatial perspective using a synthetic population that is constructed based on the U.S. census data. In this paper, we show how Gaussian process regression can be used on FRED-synthesized data to infer the differing spatial dispersion of the epidemic dynamics for two disease conditions that start from the same initial conditions and spread among identical populations. Our results showcase the utility of agent-based simulation frameworks such as FRED for inferring differences between conditions where controlling for all confounding factors for such comparisons is next to impossible without synthetic data.
Submitted 11 September, 2023; v1 submitted 22 July, 2023;
originally announced July 2023.
-
Dis-AE: Multi-domain & Multi-task Generalisation on Real-World Clinical Data
Authors:
Daniel Kreuter,
Samuel Tull,
Julian Gilbey,
Jacobus Preller,
BloodCounts! Consortium,
John A. D. Aston,
James H. F. Rudd,
Suthesh Sivapalaratnam,
Carola-Bibiane Schönlieb,
Nicholas Gleadall,
Michael Roberts
Abstract:
Clinical data is often affected by clinically irrelevant factors such as discrepancies between measurement devices or differing processing methods between sites. In the field of machine learning (ML), these factors are known as domains and the distribution differences they cause in the data are known as domain shifts. ML models trained using data from one domain often perform poorly when applied to data from another domain, potentially leading to wrong predictions. As such, developing machine learning models that can generalise well across multiple domains is a challenging yet essential task in the successful application of ML in clinical practice. In this paper, we propose a novel disentangled autoencoder (Dis-AE) neural network architecture that can learn domain-invariant data representations for multi-label classification of medical measurements even when the data is influenced by multiple interacting domain shifts at once. The model utilises adversarial training to produce data representations from which the domain can no longer be determined. We evaluate the model's domain generalisation capabilities on synthetic datasets and full blood count (FBC) data from blood donors as well as primary and secondary care patients, showing that Dis-AE improves model generalisation on multiple domains simultaneously while preserving clinically relevant information.
Submitted 15 June, 2023;
originally announced June 2023.
-
Algorithmic Censoring in Dynamic Learning Systems
Authors:
Jennifer Chien,
Margaret Roberts,
Berk Ustun
Abstract:
Dynamic learning systems subject to selective labeling exhibit censoring, i.e. persistent negative predictions assigned to one or more subgroups of points. In applications like consumer finance, this results in groups of applicants that are persistently denied and thus never enter into the training data. In this work, we formalize censoring, demonstrate how it can arise, and highlight difficulties in detection. We consider safeguards against censoring - recourse and randomized-exploration - both of which ensure we collect labels for points that would otherwise go unobserved. The resulting techniques allow examples from censored groups to enter into the training data and correct the model. Our results highlight the otherwise unmeasured harms of censoring and demonstrate the effectiveness of mitigation strategies across a range of data generating processes.
Submitted 29 June, 2023; v1 submitted 15 May, 2023;
originally announced May 2023.
-
Estimating defection in subscription-type markets: empirical analysis from the scholarly publishing industry
Authors:
Michael Roberts,
J. Ignacio Deza,
Hisham Ihshaish,
Yanhui Zhu
Abstract:
We present the first empirical study on customer churn prediction in the scholarly publishing industry. The study examines our proposed method for prediction on customer subscription data over a period of 6.5 years, which was provided by a major academic publisher. We explore the subscription-type market within the context of customer defection and modelling, and provide analysis of the business model of such markets, and how these characterise the academic publishing business. The proposed method for prediction attempts to provide inference of a customer's likelihood of defection on the basis of their re-sampled use of provider resources; in this context, the volume and frequency of content downloads. We show that this approach can be both accurate and uniquely useful in the business-to-business context, with which the scholarly publishing business model shares similarities. The main findings of this work suggest that whilst all predictive models examined, especially ensemble methods of machine learning, achieve substantially accurate prediction of churn, nearly a year ahead, this can furthermore be achieved even when the specific behavioural attributes that can be associated with each customer's probability of churn are overlooked, thus allowing highly accurate inference of churn from minimal data. We show that modelling churn on the basis of re-sampling customers' use of resources over subscription time is a better (simplified) approach than when considering the high granularity that can often characterise consumption behaviour.
Submitted 17 November, 2022;
originally announced November 2022.
-
Navigating the challenges in creating complex data systems: a development philosophy
Authors:
Sören Dittmer,
Michael Roberts,
Julian Gilbey,
Ander Biguri,
AIX-COVNET Collaboration,
Jacobus Preller,
James H. F. Rudd,
John A. D. Aston,
Carola-Bibiane Schönlieb
Abstract:
In this perspective, we argue that despite the democratization of powerful tools for data science and machine learning over the last decade, developing the code for a trustworthy and effective data science system (DSS) is getting harder. Perverse incentives and a lack of widespread software engineering (SE) skills are among many root causes we identify that naturally give rise to the current systemic crisis in reproducibility of DSSs. We analyze why SE and building large complex systems is, in general, hard. Based on these insights, we identify how SE addresses those difficulties and how we can apply and generalize SE methods to construct DSSs that are fit for purpose. We advocate two key development philosophies, namely that one should incrementally grow -- not biphasically plan and build -- DSSs, and one should always employ two types of feedback loops during development: one which tests the code's correctness and another that evaluates the code's efficacy.
Submitted 21 October, 2022;
originally announced October 2022.
-
Understanding CNN Fragility When Learning With Imbalanced Data
Authors:
Damien Dablain,
Kristen N. Jacobson,
Colin Bellinger,
Mark Roberts,
Nitesh Chawla
Abstract:
Convolutional neural networks (CNNs) have achieved impressive results on imbalanced image data, but they still have difficulty generalizing to minority classes and their decisions are difficult to interpret. These problems are related because the method by which CNNs generalize to minority classes, which requires improvement, is wrapped in a blackbox. To demystify CNN decisions on imbalanced data, we focus on their latent features. Although CNNs embed the pattern knowledge learned from a training set in model parameters, the effect of this knowledge is contained in feature and classification embeddings (FE and CE). These embeddings can be extracted from a trained model and their global, class properties (e.g., frequency, magnitude and identity) can be analyzed. We find that important information regarding the ability of a neural network to generalize to minority classes resides in the class top-K CE and FE. We show that a CNN learns a limited number of class top-K CE per category, and that their number and magnitudes vary based on whether the same class is balanced or imbalanced. This calls into question whether a CNN has learned intrinsic class features, or merely frequently occurring ones that happen to exist in the sampled class distribution. We also hypothesize that latent class diversity is as important as the number of class examples, which has important implications for re-sampling and cost-sensitive methods. These methods generally focus on rebalancing model weights, class numbers and margins, instead of diversifying class latent features through augmentation. We also demonstrate that a CNN has difficulty generalizing to test data if the magnitudes of its top-K latent features do not match the training set. We use three popular image datasets and two cost-sensitive algorithms commonly employed in imbalanced learning for our experiments.
Submitted 17 October, 2022;
originally announced October 2022.
-
Retrospectives on the Embodied AI Workshop
Authors:
Matt Deitke,
Dhruv Batra,
Yonatan Bisk,
Tommaso Campari,
Angel X. Chang,
Devendra Singh Chaplot,
Changan Chen,
Claudia Pérez D'Arpino,
Kiana Ehsani,
Ali Farhadi,
Li Fei-Fei,
Anthony Francis,
Chuang Gan,
Kristen Grauman,
David Hall,
Winson Han,
Unnat Jain,
Aniruddha Kembhavi,
Jacob Krantz,
Stefan Lee,
Chengshu Li,
Sagnik Majumder,
Oleksandr Maksymets,
Roberto Martín-Martín,
Roozbeh Mottaghi
, et al. (14 additional authors not shown)
Abstract:
We present a retrospective on the state of Embodied AI research. Our analysis focuses on 13 challenges presented at the Embodied AI Workshop at CVPR. These challenges are grouped into three themes: (1) visual navigation, (2) rearrangement, and (3) embodied vision-and-language. We discuss the dominant datasets within each theme, evaluation metrics for the challenges, and the performance of state-of-the-art models. We highlight commonalities between top approaches to the challenges and identify potential future directions for Embodied AI research.
Submitted 4 December, 2022; v1 submitted 13 October, 2022;
originally announced October 2022.
-
Unsupervised Learning under Latent Label Shift
Authors:
Manley Roberts,
Pranav Mani,
Saurabh Garg,
Zachary C. Lipton
Abstract:
What sorts of structure might enable a learner to discover classes from unlabeled data? Traditional approaches rely on feature-space similarity and heroic assumptions on the data. In this paper, we introduce unsupervised learning under Latent Label Shift (LLS), where we have access to unlabeled data from multiple domains such that the label marginals $p_d(y)$ can shift across domains but the class conditionals $p(\mathbf{x}|y)$ do not. This work instantiates a new principle for identifying classes: elements that shift together group together. For finite input spaces, we establish an isomorphism between LLS and topic modeling: inputs correspond to words, domains to documents, and labels to topics. Addressing continuous data, we prove that when each label's support contains a separable region, analogous to an anchor word, oracle access to $p(d|\mathbf{x})$ suffices to identify $p_d(y)$ and $p_d(y|\mathbf{x})$ up to permutation. Thus motivated, we introduce a practical algorithm that leverages domain-discriminative models as follows: (i) push examples through domain discriminator $p(d|\mathbf{x})$; (ii) discretize the data by clustering examples in $p(d|\mathbf{x})$ space; (iii) perform non-negative matrix factorization on the discrete data; (iv) combine the recovered $p(y|d)$ with the discriminator outputs $p(d|\mathbf{x})$ to compute $p_d(y|x) \; \forall d$. With semi-synthetic experiments, we show that our algorithm can leverage domain information to improve upon competitive unsupervised classification methods. We reveal a failure mode of standard unsupervised classification methods when feature-space similarity does not indicate true groupings, and show empirically that our method better handles this case. Our results establish a deep connection between distribution shift and topic modeling, opening promising lines for future work.
Submitted 1 December, 2022; v1 submitted 26 July, 2022;
originally announced July 2022.
-
Classification of datasets with imputed missing values: does imputation quality matter?
Authors:
Tolou Shadbahr,
Michael Roberts,
Jan Stanczuk,
Julian Gilbey,
Philip Teare,
Sören Dittmer,
Matthew Thorpe,
Ramon Vinas Torne,
Evis Sala,
Pietro Lio,
Mishal Patel,
AIX-COVNET Collaboration,
James H. F. Rudd,
Tuomas Mirtti,
Antti Rannikko,
John A. D. Aston,
Jing Tang,
Carola-Bibiane Schönlieb
Abstract:
Classifying samples in incomplete datasets is a common aim for machine learning practitioners, but is non-trivial. Missing data is found in most real-world datasets and these missing values are typically imputed using established methods, followed by classification of the now complete, imputed, samples. The focus of the machine learning researcher is then to optimise the downstream classification performance. In this study, we highlight that it is imperative to consider the quality of the imputation. We demonstrate how the commonly used measures for assessing quality are flawed and propose a new class of discrepancy scores which focus on how well the method recreates the overall distribution of the data. To conclude, we highlight the compromised interpretability of classifier models trained using poorly imputed data.
Submitted 16 June, 2022;
originally announced June 2022.
-
Data Harmonisation for Information Fusion in Digital Healthcare: A State-of-the-Art Systematic Review, Meta-Analysis and Future Research Directions
Authors:
Yang Nan,
Javier Del Ser,
Simon Walsh,
Carola Schönlieb,
Michael Roberts,
Ian Selby,
Kit Howard,
John Owen,
Jon Neville,
Julien Guiot,
Benoit Ernst,
Ana Pastor,
Angel Alberich-Bayarri,
Marion I. Menzel,
Sean Walsh,
Wim Vos,
Nina Flerin,
Jean-Paul Charbonnier,
Eva van Rikxoort,
Avishek Chatterjee,
Henry Woodruff,
Philippe Lambin,
Leonor Cerdá-Alberich,
Luis Martí-Bonmatí,
Francisco Herrera
, et al. (1 additional author not shown)
Abstract:
Removing the bias and variance of multicentre data has always been a challenge in large scale digital healthcare studies, which requires the ability to integrate clinical features extracted from data acquired by different scanners and protocols to improve stability and robustness. Previous studies have described various computational approaches to fuse single modality multicentre datasets. However, these surveys rarely focused on evaluation metrics and lacked a checklist for computational data harmonisation studies. In this systematic review, we summarise the computational data harmonisation approaches for multi-modality data in the digital healthcare field, including harmonisation strategies and evaluation metrics based on different theories. In addition, a comprehensive checklist that summarises common practices for data harmonisation studies is proposed to guide researchers to report their research findings more effectively. Last but not least, flowcharts presenting possible ways for methodology and metric selection are proposed and the limitations of different methods have been surveyed for future research.
Submitted 17 January, 2022;
originally announced January 2022.
-
Advancing COVID-19 Diagnosis with Privacy-Preserving Collaboration in Artificial Intelligence
Authors:
Xiang Bai,
Hanchen Wang,
Liya Ma,
Yongchao Xu,
Jiefeng Gan,
Ziwei Fan,
Fan Yang,
Ke Ma,
Jiehua Yang,
Song Bai,
Chang Shu,
Xinyu Zou,
Renhao Huang,
Changzheng Zhang,
Xiaowu Liu,
Dandan Tu,
Chuou Xu,
Wenqing Zhang,
Xi Wang,
Anguo Chen,
Yu Zeng,
Dehua Yang,
Ming-Wei Wang,
Nagaraj Holalkere,
Neil J. Halin
, et al. (21 additional authors not shown)
Abstract:
Artificial intelligence (AI) provides a promising substitution for streamlining COVID-19 diagnoses. However, concerns surrounding security and trustworthiness impede the collection of large-scale representative medical data, posing a considerable challenge for training a well-generalised model in clinical practices. To address this, we launch the Unified CT-COVID AI Diagnostic Initiative (UCADI), where the AI model can be distributedly trained and independently executed at each host institution under a federated learning framework (FL) without data sharing. Here we show that our FL model outperformed all the local models by a large yield (test sensitivity /specificity in China: 0.973/0.951, in the UK: 0.730/0.942), achieving comparable performance with a panel of professional radiologists. We further evaluated the model on the hold-out (collected from another two hospitals leaving out the FL) and heterogeneous (acquired with contrast materials) data, provided visual explanations for decisions made by the model, and analysed the trade-offs between the model performance and the communication costs in the federated training process. Our study is based on 9,573 chest computed tomography scans (CTs) from 3,336 patients collected from 23 hospitals located in China and the UK. Collectively, our work advanced the prospects of utilising federated learning for privacy-preserving AI in digital health.
Submitted 17 November, 2021;
originally announced November 2021.
-
Substructural fixed-point theorems and the diagonal argument: theme and variations
Authors:
David Michael Roberts
Abstract:
This article re-examines Lawvere's abstract, category-theoretic proof of the fixed-point theorem whose contrapositive is a `universal' diagonal argument. The main result is that the necessary axioms for both the fixed-point theorem and the diagonal argument can be stripped back further, to a semantic analogue of a weak substructural logic lacking weakening or exchange.
Submitted 9 August, 2023; v1 submitted 1 October, 2021;
originally announced October 2021.
-
Binary self-dual codes of various lengths with new weight enumerators from a modified bordered construction and neighbours
Authors:
Joe Gildea,
Adrian Korban,
Adam Michael Roberts,
Alexander Tylyshchak
Abstract:
In this work, we define a modification of a bordered construction for self-dual codes which utilises $λ$-circulant matrices. We provide the necessary conditions for the construction to produce self-dual codes over finite commutative Frobenius rings of characteristic 2. Using the modified construction together with the neighbour construction, we construct many binary self-dual codes of lengths 54, 68, 82 and 94 with weight enumerators that have previously not been known to exist.
Submitted 2 September, 2021;
originally announced September 2021.
-
Causal Inference in Natural Language Processing: Estimation, Prediction, Interpretation and Beyond
Authors:
Amir Feder,
Katherine A. Keith,
Emaad Manzoor,
Reid Pryzant,
Dhanya Sridhar,
Zach Wood-Doughty,
Jacob Eisenstein,
Justin Grimmer,
Roi Reichart,
Margaret E. Roberts,
Brandon M. Stewart,
Victor Veitch,
Diyi Yang
Abstract:
A fundamental goal of scientific research is to learn about causal relationships. However, despite its critical role in the life and social sciences, causality has not had the same importance in Natural Language Processing (NLP), which has traditionally placed more emphasis on predictive tasks. This distinction is beginning to fade, with an emerging area of interdisciplinary research at the convergence of causal inference and language processing. Still, research on causality in NLP remains scattered across domains without unified definitions, benchmark datasets and clear articulations of the challenges and opportunities in the application of causal inference to the textual domain, with its unique properties. In this survey, we consolidate research across academic areas and situate it in the broader NLP landscape. We introduce the statistical challenge of estimating causal effects with text, encompassing settings where text is used as an outcome, treatment, or to address confounding. In addition, we explore potential uses of causal inference to improve the robustness, fairness, and interpretability of NLP models. We thus provide a unified overview of causal inference for the NLP community.
Submitted 30 July, 2022; v1 submitted 2 September, 2021;
originally announced September 2021.
-
New binary self-dual codes of lengths 56, 62, 78, 92 and 94 from a bordered construction
Authors:
Joe Gildea,
Adrian Korban,
Adam Michael Roberts,
Alexander Tylyshchak
Abstract:
In this paper, we present a new bordered construction for self-dual codes which employs $λ$-circulant matrices. We give the necessary conditions for our construction to produce self-dual codes over a finite commutative Frobenius ring of characteristic 2. Moreover, using our bordered construction together with the well-known building-up and neighbour methods, we construct many binary self-dual codes of lengths 56, 62, 78, 92 and 94 with parameters in their weight enumerators that were not known in the literature before.
Submitted 3 February, 2022; v1 submitted 20 August, 2021;
originally announced August 2021.
-
Group LCD and Group Reversible LCD Codes
Authors:
Steven T. Dougherty,
Joe Gildea,
Adrian Korban,
Adam M. Roberts
Abstract:
In this paper, we give a new method for constructing LCD codes. We employ group rings and a well-known map that sends group ring elements to a subring of the $n \times n$ matrices to obtain LCD codes. Our construction method guarantees that our LCD codes are also group codes, namely, the codes are ideals in a group ring. We show that with a certain condition on the group ring element $v$, one can construct non-trivial group LCD codes. Moreover, we also show that by adding more constraints on the group ring element $v$, one can construct group LCD codes that are reversible. We present many examples of binary group LCD codes, some of which are optimal, and group reversible LCD codes with different parameters.
Submitted 11 August, 2021;
originally announced August 2021.
-
New binary self-dual codes of lengths 80, 84 and 96 from composite matrices
Authors:
Joe Gildea,
Adrian Korban,
Adam Michael Roberts
Abstract:
In this work, we apply the idea of composite matrices arising from group rings to derive a number of different techniques for constructing self-dual codes over finite commutative Frobenius rings. By applying these techniques over different alphabets, we construct best known singly-even binary self-dual codes of lengths 80, 84 and 96 as well as doubly-even binary self-dual codes of length 96 that were not known in the literature before.
Submitted 23 June, 2021;
originally announced June 2021.
-
Quaternary Hermitian self-dual codes of lengths 26, 32, 36, 38 and 40 from modifications of well-known circulant constructions
Authors:
Adam Michael Roberts
Abstract:
In this work, we give three new techniques for constructing Hermitian self-dual codes over commutative Frobenius rings with a non-trivial involutory automorphism using $λ$-circulant matrices. The new constructions are derived as modifications of various well-known circulant constructions of self-dual codes. Applying these constructions together with the building-up construction, we construct many new best known quaternary Hermitian self-dual codes of lengths 26, 32, 36, 38 and 40.
Submitted 24 February, 2021;
originally announced February 2021.
-
New binary self-dual codes of lengths 56, 58, 64, 80 and 92 from a modification of the four circulant construction
Authors:
Joe Gildea,
Adrian Korban,
Adam Michael Roberts
Abstract:
In this work, we give a new technique for constructing self-dual codes over commutative Frobenius rings using $λ$-circulant matrices. The new construction was derived as a modification of the well-known four circulant construction of self-dual codes. Applying this technique together with the building-up construction, we construct singly-even binary self-dual codes of lengths 56, 58, 64, 80 and 92 that were not known in the literature before. Singly-even self-dual codes of length 80 with $β\in\{2,4,5,6,8\}$ in their weight enumerators are constructed for the first time in the literature.
Submitted 23 June, 2021; v1 submitted 20 February, 2021;
originally announced February 2021.
-
Semantics and Axiomatization for Stochastic Differential Dynamic Logic
Authors:
Michael Roberts,
Alexei Kopylov,
Aleksey Nogin
Abstract:
Building on previous work by André Platzer, we present a formal language for Stochastic Differential Dynamic Logic, and define its semantics, axioms and inference rules. Compared to the previous effort, our account of the Stochastic Differential Dynamic Logic follows closer to and is more compatible with the traditional account of the regular Differential Dynamic Logic. We resolve an issue with the well-definedness of the original work's semantics, while showing how to make the logic more expressive by incorporating nondeterministic choice, definite descriptions and differential terms. Definite descriptions necessitate using a three-valued truth semantics. We also give the first Uniform Substitution calculus for Stochastic Differential Dynamic Logic, making it more practical to implement in proof assistants.
Submitted 28 April, 2021; v1 submitted 18 February, 2021;
originally announced February 2021.
-
Censorship of Online Encyclopedias: Implications for NLP Models
Authors:
Eddie Yang,
Margaret E. Roberts
Abstract:
While artificial intelligence provides the backbone for many tools people use around the world, recent work has brought to attention that the algorithms powering AI are not free of politics, stereotypes, and bias. While most work in this area has focused on the ways in which AI can exacerbate existing inequalities and discrimination, very little work has studied how governments actively shape training data. We describe how censorship has affected the development of Wikipedia corpuses, text data which are regularly used for pre-trained inputs into NLP algorithms. We show that word embeddings trained on Baidu Baike, an online Chinese encyclopedia, have very different associations between adjectives and a range of concepts about democracy, freedom, collective action, equality, and people and historical events in China than its regularly blocked but uncensored counterpart - Chinese language Wikipedia. We examine the implications of these discrepancies by studying their use in downstream AI applications. Our paper shows how government repression, censorship, and self-censorship may impact training data and the applications that draw from them.
Submitted 22 January, 2021;
originally announced January 2021.
-
Universal Semantics for the Stochastic Lambda-Calculus
Authors:
Pedro Amorim,
Dexter Kozen,
Radu Mardare,
Prakash Panangaden,
Michael Roberts
Abstract:
We define sound and adequate denotational and operational semantics for the stochastic lambda calculus. These two semantic approaches build on previous work that used similar techniques to reason about higher-order probabilistic programs, but for the first time admit an adequacy theorem relating the operational and denotational views. This resolves the main issue left open in (Bacci et al. 2018).
Submitted 14 May, 2021; v1 submitted 26 November, 2020;
originally announced November 2020.
-
Hypersim: A Photorealistic Synthetic Dataset for Holistic Indoor Scene Understanding
Authors:
Mike Roberts,
Jason Ramapuram,
Anurag Ranjan,
Atulit Kumar,
Miguel Angel Bautista,
Nathan Paczan,
Russ Webb,
Joshua M. Susskind
Abstract:
For many fundamental scene understanding tasks, it is difficult or impossible to obtain per-pixel ground truth labels from real images. We address this challenge by introducing Hypersim, a photorealistic synthetic dataset for holistic indoor scene understanding. To create our dataset, we leverage a large repository of synthetic scenes created by professional artists, and we generate 77,400 images of 461 indoor scenes with detailed per-pixel labels and corresponding ground truth geometry. Our dataset: (1) relies exclusively on publicly available 3D assets; (2) includes complete scene geometry, material information, and lighting information for every scene; (3) includes dense per-pixel semantic instance segmentations and complete camera information for every image; and (4) factors every image into diffuse reflectance, diffuse illumination, and a non-diffuse residual term that captures view-dependent lighting effects.
We analyze our dataset at the level of scenes, objects, and pixels, and we analyze costs in terms of money, computation time, and annotation effort. Remarkably, we find that it is possible to generate our entire dataset from scratch, for roughly half the cost of training a popular open-source natural language processing model. We also evaluate sim-to-real transfer performance on two real-world scene understanding tasks - semantic segmentation and 3D shape prediction - where we find that pre-training on our dataset significantly improves performance on both tasks, and achieves state-of-the-art performance on the most challenging Pix3D test set. All of our rendered image data, as well as all the code we used to generate our dataset and perform our experiments, is available online.
Submitted 17 August, 2021; v1 submitted 4 November, 2020;
originally announced November 2020.
-
Common pitfalls and recommendations for using machine learning to detect and prognosticate for COVID-19 using chest radiographs and CT scans
Authors:
Michael Roberts,
Derek Driggs,
Matthew Thorpe,
Julian Gilbey,
Michael Yeung,
Stephan Ursprung,
Angelica I. Aviles-Rivero,
Christian Etmann,
Cathal McCague,
Lucian Beer,
Jonathan R. Weir-McCall,
Zhongzhao Teng,
Effrossyni Gkrania-Klotsas,
James H. F. Rudd,
Evis Sala,
Carola-Bibiane Schönlieb
Abstract:
Machine learning methods offer great promise for fast and accurate detection and prognostication of COVID-19 from standard-of-care chest radiographs (CXR) and computed tomography (CT) images. Many articles have been published in 2020 describing new machine learning-based models for both of these tasks, but it is unclear which are of potential clinical utility. In this systematic review, we search EMBASE via OVID, MEDLINE via PubMed, bioRxiv, medRxiv and arXiv for published papers and preprints uploaded from January 1, 2020 to October 3, 2020 which describe new machine learning models for the diagnosis or prognosis of COVID-19 from CXR or CT images. Our search identified 2,212 studies, of which 415 were included after initial screening and, after quality screening, 61 studies were included in this systematic review. Our review finds that none of the models identified are of potential clinical use due to methodological flaws and/or underlying biases. This is a major weakness, given the urgency with which validated COVID-19 models are needed. To address this, we give many recommendations which, if followed, will solve these issues and lead to higher quality model development and well documented manuscripts.
Submitted 5 January, 2021; v1 submitted 14 August, 2020;
originally announced August 2020.
-
Watching the Watchers: Nonce-based Inverse Surveillance to Remotely Detect Monitoring
Authors:
Laura M. Roberts,
David Plonka
Abstract:
Internet users and service providers do not often know when traffic is being watched but desire a way to determine when, where, and by whom. We present NOISE, the Nonce Observatory for Inverse Surveillance of Eavesdroppers, a method and system that detects monitoring by disseminating nonces - unique, pseudorandom values - in traffic and seeing if they are acted upon unexpectedly, indicating that the nonce-laden traffic is being monitored. Specifically, we embed 64-bit nonces innocuously into IPv6 addresses and disseminate these nonces Internet-wide using a modified traceroute-like tool that makes each outbound probe's source address unique. We continually monitor for subsequent nonce propagation, i.e., activity or interest involving these nonces, e.g., via packet capture on our system's infrastructure. Across three experiments and four months, NOISE detects monitoring more than 200k times, ostensibly in 268 networks, for probes destined for 437 networks. Our results reveal: (a) data collection for security incident handling, (b) traffic information being shared with third parties, and (c) eavesdropping in or near a large commercial peering exchange.
Submitted 5 June, 2020; v1 submitted 15 May, 2020;
originally announced May 2020.
-
Understanding How and Why University Students Use Virtual Private Networks
Authors:
Agnieszka Dutkowska-Zuk,
Austin Hounsel,
Andre Xiong,
Molly Roberts,
Brandon Stewart,
Marshini Chetty,
Nick Feamster
Abstract:
We study how and why university students choose and use VPNs, and whether they are aware of the security and privacy risks that VPNs pose. To answer these questions, we conducted 32 in-person interviews and a survey with 349 respondents, all university students in the United States. We find that students are mostly concerned with access to content, and that privacy concerns were often secondary. They made tradeoffs to achieve a particular goal, such as using a free commercial VPN that may collect their online activities to access an online service in a geographic area. Many users expected that their VPNs were collecting data about them, although they did not understand how VPNs work. We conclude with a discussion of ways to help users make choices about VPNs.
Submitted 22 February, 2021; v1 submitted 26 February, 2020;
originally announced February 2020.
-
On Constructing a Knowledge Base of Chinese Criminal Cases
Authors:
Xiaohan Wu,
Benjamin L. Liebman,
Rachel E. Stern,
Margaret E. Roberts,
Amarnath Gupta
Abstract:
We are developing a knowledge base over Chinese judicial decision documents to facilitate landscape analyses of Chinese Criminal Cases. We view judicial decision documents as a mixed-granularity semi-structured text where different levels of the text carry different semantic constructs and entailments. We use a combination of context-sensitive grammar, dependency parsing and discourse analysis to…
▽ More
We are developing a knowledge base over Chinese judicial decision documents to facilitate landscape analyses of Chinese Criminal Cases. We view judicial decision documents as a mixed-granularity semi-structured text where different levels of the text carry different semantic constructs and entailments. We use a combination of context-sensitive grammar, dependency parsing and discourse analysis to extract a formal and interpretable representation of these documents. Our knowledge base is developed by constructing associations between different elements of these documents. The interpretability is contributed in part by our formal representation of the Chinese criminal laws, also as semi-structured documents. The landscape analyses utilize these two representations and enable a law researcher to ask legal pattern analysis queries.
△ Less
Submitted 14 October, 2019;
originally announced October 2019.
-
Chan-Vese Reformulation for Selective Image Segmentation
Authors:
Michael Roberts,
Jack Spencer
Abstract:
Selective segmentation involves incorporating user input to partition an image into foreground and background, by discriminating between objects of a similar type. Typically, such methods involve introducing additional constraints to generic segmentation approaches. However, we show that this is often inconsistent with respect to common assumptions about the image. The proposed method introduces a…
▽ More
Selective segmentation involves incorporating user input to partition an image into foreground and background, by discriminating between objects of a similar type. Typically, such methods involve introducing additional constraints to generic segmentation approaches. However, we show that this is often inconsistent with respect to common assumptions about the image. The proposed method introduces a new fitting term that is more useful in practice than the Chan-Vese framework. In particular, the idea is to define a term that allows for the background to consist of multiple regions of inhomogeneity. We provide comparative experimental results against alternative approaches to demonstrate the advantages of the proposed method, broadening the possible application of these methods.
△ Less
Submitted 5 July, 2019; v1 submitted 21 November, 2018;
originally announced November 2018.
-
Using Machine Learning to Discern Eruption in Noisy Environments: A Case Study using CO2-driven Cold-Water Geyser in Chimayo, New Mexico
Authors:
B. Yuan,
Y. J. Tan,
M. K. Mudunuru,
O. E. Marcillo,
A. A. Delorey,
P. M. Roberts,
J. D. Webster,
C. N. L. Gammans,
S. Karra,
G. D. Guthrie,
P. A. Johnson
Abstract:
We present an approach based on machine learning (ML) to distinguish eruption and precursory signals of Chimayó geyser (New Mexico, USA) under noisy environments. This geyser can be considered as a natural analog of $\mathrm{CO}_2$ intrusion into shallow water aquifers. By studying this geyser, we can understand upwelling of $\mathrm{CO}_2$-rich fluids from depth, which has relevance to leak monit…
▽ More
We present an approach based on machine learning (ML) to distinguish eruption and precursory signals of Chimayó geyser (New Mexico, USA) under noisy environments. This geyser can be considered as a natural analog of $\mathrm{CO}_2$ intrusion into shallow water aquifers. By studying this geyser, we can understand upwelling of $\mathrm{CO}_2$-rich fluids from depth, which has relevance to leak monitoring in a $\mathrm{CO}_2$ sequestration project. ML methods such as Random Forests (RF) are known to be robust multi-class classifiers and perform well under unfavorable noisy conditions. However, the extent of the RF method's accuracy is poorly understood for this $\mathrm{CO}_2$-driven geysering application. The current study aims to quantify the performance of RF-classifiers to discern the geyser state. Towards this goal, we first present the data collected from the seismometer that is installed near the Chimayó geyser. The seismic signals collected at this site contain different types of noise, such as daily temperature variations, seasonal trends, animal movement near the geyser, and human activity. We then filter these noises out of the signals by combining the Butterworth-Highpass filter and an Autoregressive method in a multi-level fashion. We show that combining these filtering techniques in a hierarchical fashion reduces the noise in the seismic data without removing the precursors and eruption event signals. We then use RF on the filtered data to classify the state of the geyser into three classes -- remnant noise, precursor, and eruption states. We show that the classification accuracy using RF on the filtered data is greater than 90\%. These aspects make the proposed ML framework attractive for event discrimination and signal enhancement under noisy conditions, with strong potential for application to monitoring leaks in $\mathrm{CO}_2$ sequestration.
△ Less
Submitted 1 October, 2018;
originally announced October 2018.