
Merlin: A Vision Language Foundation Model for 3D Computed Tomography

Authors: Louis Blankemeier, Joseph Paul Cohen, Ashwin Kumar, Dave Van Veen, Syed Jamal Safdar Gardezi, Magdalini Paschali, Zhihong Chen, Jean-Benoit Delbrouck, Eduardo Reis, Cesar Truyts, Christian Bluethgen, Malte Engmann Kjeldskov Jensen, Sophie Ostmeier, Maya Varma, Jeya Maria Jose Valanarasu, Zhongnan Fang, Zepeng Huo, Zaid Nabulsi, Diego Ardila, Wei-Hung Weng, Edson Amaro Junior, Neera Ahuja, Jason Fries, Nigam H. Shah, Andrew Johnston, et al. (6 additional authors not shown)

Abstract: Over 85 million computed tomography (CT) scans are performed annually in the US, of which approximately one quarter focus on the abdomen. Given the current radiologist shortage, there is a large impetus to use artificial intelligence to alleviate the burden of interpreting these complex imaging studies. Prior state-of-the-art approaches for automated medical image interpretation leverage vision language models (VLMs). However, current medical VLMs are generally limited to 2D images and short reports, and do not leverage electronic health record (EHR) data for supervision. We introduce Merlin - a 3D VLM that we train using paired CT scans (6+ million images from 15,331 CTs), EHR diagnosis codes (1.8+ million codes), and radiology reports (6+ million tokens). We evaluate Merlin on 6 task types and 752 individual tasks. The non-adapted (off-the-shelf) tasks include zero-shot findings classification (31 findings), phenotype classification (692 phenotypes), and zero-shot cross-modal retrieval (image to findings and image to impressions), while model adapted tasks include 5-year disease prediction (6 diseases), radiology report generation, and 3D semantic segmentation (20 organs). We perform internal validation on a test set of 5,137 CTs, and external validation on 7,000 clinical CTs and on two public CT datasets (VerSe, TotalSegmentator). Beyond these clinically-relevant evaluations, we assess the efficacy of various network architectures and training strategies to depict that Merlin has favorable performance to existing task-specific baselines. We derive data scaling laws to empirically assess training data needs for requisite downstream task performance. Furthermore, unlike conventional VLMs that require hundreds of GPUs for training, we perform all training on a single GPU.

Submitted 10 June, 2024; originally announced June 2024.

Comments: 18 pages, 7 figures

arXiv:2401.12208 [pdf, other]

CheXagent: Towards a Foundation Model for Chest X-Ray Interpretation

Authors: Zhihong Chen, Maya Varma, Jean-Benoit Delbrouck, Magdalini Paschali, Louis Blankemeier, Dave Van Veen, Jeya Maria Jose Valanarasu, Alaa Youssef, Joseph Paul Cohen, Eduardo Pontes Reis, Emily B. Tsai, Andrew Johnston, Cameron Olsen, Tanishq Mathew Abraham, Sergios Gatidis, Akshay S. Chaudhari, Curtis Langlotz

Abstract: Chest X-rays (CXRs) are the most frequently performed imaging test in clinical practice. Recent advances in the development of vision-language foundation models (FMs) give rise to the possibility of performing automated CXR interpretation, which can assist physicians with clinical decision-making and improve patient outcomes. However, developing FMs that can accurately interpret CXRs is challenging due to the (1) limited availability of large-scale vision-language datasets in the medical image domain, (2) lack of vision and language encoders that can capture the complexities of medical data, and (3) absence of evaluation frameworks for benchmarking the abilities of FMs on CXR interpretation. In this work, we address these challenges by first introducing CheXinstruct - a large-scale instruction-tuning dataset curated from 28 publicly-available datasets. We then present CheXagent - an instruction-tuned FM capable of analyzing and summarizing CXRs. To build CheXagent, we design a clinical large language model (LLM) for parsing radiology reports, a vision encoder for representing CXR images, and a network to bridge the vision and language modalities. Finally, we introduce CheXbench - a novel benchmark designed to systematically evaluate FMs across 8 clinically-relevant CXR interpretation tasks. Extensive quantitative evaluations and qualitative reviews with five expert radiologists demonstrate that CheXagent outperforms previously-developed general- and medical-domain FMs on CheXbench tasks. Furthermore, in an effort to improve model transparency, we perform a fairness evaluation across factors of sex, race and age to highlight potential performance disparities. Our project is at https://stanford-aimi.github.io/chexagent.html.

Submitted 22 January, 2024; originally announced January 2024.

Comments: 24 pages, 8 figures

arXiv:2308.08576 [pdf]

Artistic control over the glitch in AI-generated motion capture

Authors: Jamal Knight, Andrew Johnston, Adam Berry

Abstract: Artificial intelligence (AI) models are prevalent today and provide a valuable tool for artists. However, a lesser-known artifact that comes with AI models that is not always discussed is the glitch. Glitches occur for various reasons; sometimes, they are known, and sometimes they are a mystery. Artists who use AI models to generate art might not understand the reason for the glitch but often want to experiment and explore novel ways of augmenting the output of the glitch. This paper discusses some of the questions artists have when leveraging the glitch in AI art production. It explores the unexpected positive outcomes produced by glitches in the specific context of motion capture and performance art.

Submitted 15 August, 2023; originally announced August 2023.

arXiv:2307.15273 [pdf, other]

Recovering high-quality FODs from a reduced number of diffusion-weighted images using a model-driven deep learning architecture

Authors: J Bartlett, C E Davey, L A Johnston, J Duan

Abstract: Fibre orientation distribution (FOD) reconstruction using deep learning has the potential to produce accurate FODs from a reduced number of diffusion-weighted images (DWIs), decreasing total imaging time. Diffusion acquisition invariant representations of the DWI signals are typically used as input to these methods to ensure that they can be applied flexibly to data with different b-vectors and b-values; however, this means the network cannot condition its output directly on the DWI signal. In this work, we propose a spherical deconvolution network, a model-driven deep learning FOD reconstruction architecture, that ensures intermediate and output FODs produced by the network are consistent with the input DWI signals. Furthermore, we implement a fixel classification penalty within our loss function, encouraging the network to produce FODs that can subsequently be segmented into the correct number of fixels and improve downstream fixel-based analysis. Our results show that the model-based deep learning architecture achieves competitive performance compared to a state-of-the-art FOD super-resolution network, FOD-Net. Moreover, we show that the fixel classification penalty can be tuned to offer improved performance with respect to metrics that rely on accurately segmented FODs. Our code is publicly available at https://github.com/Jbartlett6/SDNet.

Submitted 27 July, 2023; originally announced July 2023.

Comments: 10 pages, 7 figures, This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

arXiv:2211.15621 [pdf, other]

doi 10.1007/978-3-031-29573-7_9

A Boosting Approach to Constructing an Ensemble Stack

Authors: Zhilei Zhou, Ziyu Qiu, Brad Niblett, Andrew Johnston, Jeffrey Schwartzentruber, Nur Zincir-Heywood, Malcolm Heywood

Abstract: An approach to evolutionary ensemble learning for classification is proposed in which boosting is used to construct a stack of programs. Each application of boosting identifies a single champion and a residual dataset, i.e. the training records that thus far were not correctly classified. The next program is only trained against the residual, with the process iterating until some maximum ensemble size or no further residual remains. Training against a residual dataset actively reduces the cost of training. Deploying the ensemble as a stack also means that only one classifier might be necessary to make a prediction, so improving interpretability. Benchmarking studies are conducted to illustrate competitiveness with the prediction accuracy of current state-of-the-art evolutionary ensemble learning algorithms, while providing solutions that are orders of magnitude simpler. Further benchmarking with a high cardinality dataset indicates that the proposed method is also more accurate and efficient than XGBoost.

Submitted 28 November, 2022; originally announced November 2022.

Comments: 16 pages, 3 figures, 6 tables

Report number: 13986

Journal ref: LNCS, 29 March 2023

arXiv:2210.10391 [pdf]

Machine Learning for a Sustainable Energy Future

Authors: Zhenpeng Yao, Yanwei Lum, Andrew Johnston, Luis Martin Mejia-Mendoza, Xin Zhou, Yonggang Wen, Alan Aspuru-Guzik, Edward H. Sargent, Zhi Wei Seh

Abstract: Transitioning from fossil fuels to renewable energy sources is a critical global challenge; it demands advances at the levels of materials, devices, and systems for the efficient harvesting, storage, conversion, and management of renewable energy. Researchers globally have begun incorporating machine learning (ML) techniques with the aim of accelerating these advances. ML technologies leverage statistical trends in data to build models for prediction of material properties, generation of candidate structures, optimization of processes, among other uses; as a result, they can be incorporated into discovery and development pipelines to accelerate progress. Here we review recent advances in ML-driven energy research, outline current and future challenges, and describe what is required moving forward to best lever ML techniques. To start, we give an overview of key ML concepts. We then introduce a set of key performance indicators to help compare the benefits of different ML-accelerated workflows for energy research. We discuss and evaluate the latest advances in applying ML to the development of energy harvesting (photovoltaics), storage (batteries), conversion (electrocatalysis), and management (smart grids). Finally, we offer an outlook of potential research areas in the energy field that stand to further benefit from the application of ML.

Submitted 19 October, 2022; originally announced October 2022.

arXiv:2206.06217 [pdf, other]

Towards an Approximation-Aware Computational Workflow Framework for Accelerating Large-Scale Discovery Tasks

Authors: Michael A. Johnston, Vassilis Vassiliadis

Abstract: The use of approximation is fundamental in computational science. Almost all computational methods adopt approximations in some form in order to obtain a favourable cost/accuracy trade-off and there are usually many approximations that could be used. As a result, when a researcher wishes to measure a property of a system with a computational technique, they are faced with an array of options. Current computational workflow frameworks focus on helping researchers automate a sequence of steps on a particular platform. The aim is often to obtain a computational measurement of a property. However, these frameworks are unaware that there may be a large number of ways to do so. As such, they cannot support researchers in making these choices during development or at execution-time. We argue that computational workflow frameworks should be designed to be approximation-aware - that is, support the fact that a given workflow description represents a task that could be performed in different ways. This is key to unlocking the potential of computational workflows to accelerate discovery tasks, particularly those involving searches of large entity spaces. It will enable efficiently obtaining measurements of entity properties, given a set of constraints, by directly leveraging the space of choices available. In this paper we describe the basic functions that an approximation-aware workflow framework should provide, how those functions can be realized in practice, and illustrate some of the powerful capabilities it would enable, including approximate memoization, surrogate model support, and automated workflow composition.

Submitted 13 June, 2022; originally announced June 2022.

Comments: Pre-print of paper in ApPLIED 2022 (part of PODC 2022)

arXiv:2003.13951 [pdf, other]

Self-supervised Monocular Trained Depth Estimation using Self-attention and Discrete Disparity Volume

Authors: Adrian Johnston, Gustavo Carneiro

Abstract: Monocular depth estimation has become one of the most studied applications in computer vision, where the most accurate approaches are based on fully supervised learning models. However, the acquisition of accurate and large ground truth data sets to model these fully supervised methods is a major challenge for the further development of the area. Self-supervised methods trained with monocular videos constitute one of the most promising approaches to mitigate the challenge mentioned above due to the wide-spread availability of training data. Consequently, they have been intensively studied, where the main ideas explored consist of different types of model architectures, loss functions, and occlusion masks to address non-rigid motion. In this paper, we propose two new ideas to improve self-supervised monocular trained depth estimation: 1) self-attention, and 2) discrete disparity prediction. Compared with the usual localised convolution operation, self-attention can explore a more general contextual information that allows the inference of similar disparity values at non-contiguous regions of the image. Discrete disparity prediction has been shown by fully supervised methods to provide a more robust and sharper depth estimation than the more common continuous disparity prediction, besides enabling the estimation of depth uncertainty. We show that the extension of the state-of-the-art self-supervised monocular trained depth estimator Monodepth2 with these two ideas allows us to design a model that produces the best results in the field in KITTI 2015 and Make3D, closing the gap with respect to self-supervised stereo training and fully supervised approaches.

Submitted 31 March, 2020; originally announced March 2020.

arXiv:1904.02382 [pdf, other]

Inferring Dynamic Representations of Facial Actions from a Still Image

Authors: Siyang Song, Enrique Sánchez-Lozano, Linlin Shen, Alan Johnston, Michel Valstar

Abstract: Facial actions are spatio-temporal signals by nature, and therefore their modeling is crucially dependent on the availability of temporal information. In this paper, we focus on inferring such temporal dynamics of facial actions when no explicit temporal information is available, i.e. from still images. We present a novel approach to capture multiple scales of such temporal dynamics, with an application to facial Action Unit (AU) intensity estimation and dimensional affect estimation. In particular, 1) we propose a framework that infers a dynamic representation (DR) from a still image, which captures the bi-directional flow of time within a short time-window centered at the input image; 2) we show that we can train our method without the need of explicitly generating target representations, allowing the network to represent dynamics more broadly; and 3) we propose to apply a multiple temporal scale approach that infers DRs for different window lengths (MDR) from a still image. We empirically validate the value of our approach on the task of frame ranking, and show how our proposed MDR attains state of the art results on BP4D for AU intensity estimation and on SEMAINE for dimensional affect estimation, using only still images at test time.

Submitted 4 April, 2019; originally announced April 2019.

Comments: 10 pages, 5 figures

MSC Class: 65D19

arXiv:1901.02399 [pdf, ps, other]

Service Rate Region of Content Access from Erasure Coded Storage

Authors: Sarah Anderson, Ann Johnston, Gauri Joshi, Gretchen Matthews, Carolyn Mayer, Emina Soljanin

Abstract: We consider storage systems in which $K$ files are stored over $N$ nodes. A node may be systematic for a particular file in the sense that access to it gives access to the file. Alternatively, a node may be coded, meaning that it gives access to a particular file only when combined with other nodes (which may be coded or systematic). Requests for file $f_k$ arrive at rate $λ_k$, and we are interested in the rate that can be served by a particular system. In this paper, we determine the set of request arrival rates for a $3$-file coded storage system. We also provide an algorithm to maximize the rate of requests served for file $K$ given $λ_1,\dots, λ_{K-1}$ in a general $K$-file case.

Submitted 8 January, 2019; originally announced January 2019.

Comments: To be published in the Proceedings of the 2018 Information Theory Workshop

arXiv:1710.03376 [pdf, other]

On the Service Capacity Region of Accessing Erasure Coded Content

Authors: Mehmet Aktas, Sarah E. Anderson, Ann Johnston, Gauri Joshi, Swanand Kadhe, Gretchen L. Matthews, Carolyn Mayer, Emina Soljanin

Abstract: Cloud storage systems generally add redundancy in storing content files such that $K$ files are replicated or erasure coded and stored on $N > K$ nodes. In addition to providing reliability against failures, the redundant copies can be used to serve a larger volume of content access requests. A request for one of the files can either be sent to a systematic node, or one of the repair groups. In this paper, we seek to maximize the service capacity region, that is, the set of request arrival rates for the $K$ files that can be supported by a coded storage system. We explore two aspects of this problem: 1) for a given erasure code, how to optimally split incoming requests between systematic nodes and repair groups, and 2) choosing an underlying erasure code that maximizes the achievable service capacity region. In particular, we consider MDS and Simplex codes. Our analysis demonstrates that erasure coding makes the system more robust to skews in file popularity than simply replicating a file at multiple servers, and that coding and replication together can make the capacity region larger than either alone.

Submitted 9 October, 2017; originally announced October 2017.

Comments: To be published in 2017 55th Annual Allerton Conference on Communication, Control, and Computing

arXiv:1209.4365 [pdf, other]

Stochastic Stabilization of Partially Observed and Multi-Sensor Systems Driven by Gaussian Noise under Fixed-Rate Information Constraints

Authors: Andrew P. Johnston, Serdar Yüksel

Abstract: We investigate the stabilization of unstable multidimensional partially observed single-sensor and multi-sensor linear systems driven by unbounded noise and controlled over discrete noiseless channels under fixed-rate information constraints. Stability is achieved under fixed-rate communication requirements that are asymptotically tight in the limit of large sampling periods. Through the use of similarity transforms, sampling and random-time drift conditions we obtain a coding and control policy leading to the existence of a unique invariant distribution and finite second moment for the sampled state. We use a vector stabilization scheme in which all modes of the linear system visit a compact set together infinitely often. We prove tight necessary and sufficient conditions for the general multi-sensor case under an assumption related to the Jordan form structure of such systems. In the absence of this assumption, we give sufficient conditions for stabilization.

Submitted 19 September, 2012; originally announced September 2012.

Comments: 31 pages, 2 figures. This paper is to appear in part at the IEEE Conference on Decision and Control, Hawaii, 2012

arXiv:1106.6333 [pdf]

SIP APIs for Voice and Video Communications on the Web

Authors: Carol Davids, Alan Johnston, Kundan Singh, Henry Sinnreich, Wilhelm Wimmreuter

Abstract: Existing standard protocols for the web and Internet telephony fail to deliver real-time interactive communication from within a web browser. In particular, the client-server web protocol over reliable TCP is not always suitable for end-to-end low latency media path needed for interactive voice and video communication. To solve this, we compare the available platform options using the existing technologies such as modifying the web programming language and protocol, using an existing web browser plugin, and a separate host resident application that the web browser can talk to. We argue that using a separate application as an adaptor is a promising short term as well as long-term strategy for voice and video communications on the web. Our project aims at developing the open technology and sample implementations for web-based real-time voice and video communication applications. We describe the architecture of our project including (1) a RESTful web communication API over HTTP inspired by SIP message flows, (2) a web-friendly set of metadata for session description, and (3) a UDP-based end-to-end media path. All other telephony functions reside in the web application itself and/or in web feature servers. The adaptor approach allows us to easily add new voice and video codecs and NAT traversal technologies such as Host Identity Protocol. We want to make web-based communication accessible to millions of web developers, maximize the end user experience and security, and preserve the huge global investment in and experience from SIP systems while adhering to web standards and development tools as much as possible. We have created an open source prototype that allows you to freely use the conference application by directing a browser to the conference URL.

Submitted 30 June, 2011; originally announced June 2011.

Comments: Accepted at IPTcomm 2011, 7 pages, 4 figures

Showing 1–13 of 13 results for author: Johnston, A