subscribe to arXiv mailings

BiomedParse: a biomedical foundation model for image parsing of everything everywhere all at once

Authors: Theodore Zhao, Yu Gu, Jianwei Yang, Naoto Usuyama, Ho Hin Lee, Tristan Naumann, Jianfeng Gao, Angela Crabtree, Jacob Abel, Christine Moung-Wen, Brian Piening, Carlo Bifulco, Mu Wei, Hoifung Poon, Sheng Wang

Abstract: Biomedical image analysis is fundamental for biomedical discovery in cell biology, pathology, radiology, and many other biomedical domains. Holistic image analysis comprises interdependent subtasks such as segmentation, detection, and recognition of relevant objects. Here, we propose BiomedParse, a biomedical foundation model for imaging parsing that can jointly conduct segmentation, detection, an… ▽ More Biomedical image analysis is fundamental for biomedical discovery in cell biology, pathology, radiology, and many other biomedical domains. Holistic image analysis comprises interdependent subtasks such as segmentation, detection, and recognition of relevant objects. Here, we propose BiomedParse, a biomedical foundation model for imaging parsing that can jointly conduct segmentation, detection, and recognition for 82 object types across 9 imaging modalities. Through joint learning, we can improve accuracy for individual tasks and enable novel applications such as segmenting all relevant objects in an image through a text prompt, rather than requiring users to laboriously specify the bounding box for each object. We leveraged readily available natural-language labels or descriptions accompanying those datasets and use GPT-4 to harmonize the noisy, unstructured text information with established biomedical object ontologies. We created a large dataset comprising over six million triples of image, segmentation mask, and textual description. On image segmentation, we showed that BiomedParse is broadly applicable, outperforming state-of-the-art methods on 102,855 test image-mask-label triples across 9 imaging modalities (everything). On object detection, which aims to locate a specific object of interest, BiomedParse again attained state-of-the-art performance, especially on objects with irregular shapes (everywhere). On object recognition, which aims to identify all objects in a given image along with their semantic types, we showed that BiomedParse can simultaneously segment and label all biomedical objects in an image (all at once). In summary, BiomedParse is an all-in-one tool for biomedical image analysis by jointly solving segmentation, detection, and recognition for all major biomedical image modalities, paving the path for efficient and accurate image-based biomedical discovery. △ Less

Submitted 4 June, 2024; v1 submitted 21 May, 2024; originally announced May 2024.

Comments: Project page: https://aka.ms/biomedparse-project

arXiv:2403.10134 [pdf, other]

Measurement of groomed event shape observables in deep-inelastic electron-proton scattering at HERA

Authors: The H1 collaboration, V. Andreev, M. Arratia, A. Baghdasaryan, A. Baty, K. Begzsuren, A. Bolz, V. Boudry, G. Brandt, D. Britzger, A. Buniatyan, L. Bystritskaya, A. J. Campbell, K. B. Cantun Avila, K. Cerny, V. Chekelian, Z. Chen, J. G. Contreras, J. Cvach, J. B. Dainton, K. Daum, A. Deshpande, C. Diaconu, A. Drees, G. Eckerlin , et al. (123 additional authors not shown)

Abstract: The H1 Collaboration at HERA reports the first measurement of groomed event shape observables in deep inelastic electron-proton scattering (DIS) at $\sqrt{s}=319$ GeV, using data recorded between the years 2003 and 2007 with an integrated luminosity of $351$ pb$^{-1}$. Event shapes provide incisive probes of perturbative and non-perturbative QCD. Grooming techniques have been used for jet measurem… ▽ More The H1 Collaboration at HERA reports the first measurement of groomed event shape observables in deep inelastic electron-proton scattering (DIS) at $\sqrt{s}=319$ GeV, using data recorded between the years 2003 and 2007 with an integrated luminosity of $351$ pb$^{-1}$. Event shapes provide incisive probes of perturbative and non-perturbative QCD. Grooming techniques have been used for jet measurements in hadronic collisions; this paper presents the first application of grooming to DIS data. The analysis is carried out in the Breit frame, utilizing the novel Centauro jet clustering algorithm that is designed for DIS event topologies. Events are required to have squared momentum-transfer $Q^2 > 150$ GeV$^2$ and inelasticity $ 0.2 < y < 0.7$. We report measurements of the production cross section of groomed event 1-jettiness and groomed invariant mass for several choices of grooming parameter. Monte Carlo model calculations and analytic calculations based on Soft Collinear Effective Theory are compared to the measurements. △ Less

Submitted 15 March, 2024; originally announced March 2024.

Comments: 32 pages, 17 tables, 7 figures, submitted to EPJ C

Report number: DESY-24-036

arXiv:2403.10109 [pdf, other]

Measurement of the 1-jettiness event shape observable in deep-inelastic electron-proton scattering at HERA

Authors: The H1 collaboration, V. Andreev, M. Arratia, A. Baghdasaryan, A. Baty, K. Begzsuren, A. Bolz, V. Boudry, G. Brandt, D. Britzger, A. Buniatyan, L. Bystritskaya, A. J. Campbell, K. B. Cantun Avila, K. Cerny, V. Chekelian, Z. Chen, J. G. Contreras, J. Cvach, J. B. Dainton, K. Daum, A. Deshpande, C. Diaconu, A. Drees, G. Eckerlin , et al. (124 additional authors not shown)

Abstract: The H1 Collaboration reports the first measurement of the 1-jettiness event shape observable $τ_1^b$ in neutral-current deep-inelastic electron-proton scattering (DIS). The observable $τ_1^b$ is equivalent to a thrust observable defined in the Breit frame. The data sample was collected at the HERA $ep$ collider in the years 2003-2007 with center-of-mass energy of $\sqrt{s}=319\,\text{GeV}$, corres… ▽ More The H1 Collaboration reports the first measurement of the 1-jettiness event shape observable $τ_1^b$ in neutral-current deep-inelastic electron-proton scattering (DIS). The observable $τ_1^b$ is equivalent to a thrust observable defined in the Breit frame. The data sample was collected at the HERA $ep$ collider in the years 2003-2007 with center-of-mass energy of $\sqrt{s}=319\,\text{GeV}$, corresponding to an integrated luminosity of $351.1\,\text{pb}^{-1}$. Triple differential cross sections are provided as a function of $τ_1^b$, event virtuality $Q^2$, and inelasticity $y$, in the kinematic region $Q^2>150\,\text{GeV}^{2}$. Single differential cross section are provided as a function of $τ_1^b$ in a limited kinematic range. Double differential cross sections are measured, in contrast, integrated over $τ_1^b$ and represent the inclusive neutral-current DIS cross section measured as a function of $Q^2$ and $y$. The data are compared to a variety of predictions and include classical and modern Monte Carlo event generators, predictions in fixed-order perturbative QCD where calculations up to $\mathcal{O}(α_s^3)$ are available for $τ_1^b$ or inclusive DIS, and resummed predictions at next-to-leading logarithmic accuracy matched to fixed order predictions at $\mathcal{O}(α_s^2)$. These comparisons reveal sensitivity of the 1-jettiness observable to QCD parton shower and resummation effects, as well as the modeling of hadronization and fragmentation. Within their range of validity, the fixed-order predictions provide a good description of the data. Monte Carlo event generators are predictive over the full measured range and hence their underlying models and parameters can be constrained by comparing to the presented data. △ Less

Submitted 15 March, 2024; originally announced March 2024.

Comments: 45 pages, 38 tables, 13 figures

Report number: DESY-24-035

arXiv:2403.08982 [pdf, other]

Observation and differential cross section measurement of neutral current DIS events with an empty hemisphere in the Breit frame

Authors: The H1 collaboration, V. Andreev, M. Arratia, A. Baghdasaryan, A. Baty, K. Begzsuren, A. Bolz, V. Boudry, G. Brandt, D. Britzger, A. Buniatyan, L. Bystritskaya, A. J. Campbell, K. B. Cantun Avila, K. Cerny, V. Chekelian, Z. Chen, J. G. Contreras, J. Cvach, J. B. Dainton, K. Daum, A. Deshpande, C. Diaconu, A. Drees, G. Eckerlin , et al. (124 additional authors not shown)

Abstract: The Breit frame provides a natural frame to analyze lepton-proton scattering events. In this reference frame, the parton model hard interactions between a quark and an exchanged boson defines the coordinate system such that the struck quark is back-scattered along the virtual photon momentum direction. In Quantum Chromodynamics (QCD), higher order perturbative or non-perturbative effects can chang… ▽ More The Breit frame provides a natural frame to analyze lepton-proton scattering events. In this reference frame, the parton model hard interactions between a quark and an exchanged boson defines the coordinate system such that the struck quark is back-scattered along the virtual photon momentum direction. In Quantum Chromodynamics (QCD), higher order perturbative or non-perturbative effects can change this picture drastically. As Bjorken-$x$ decreases below one half, a rather peculiar event signature is predicted with increasing probability, where no radiation is present in one of the two Breit-frame hemispheres and all emissions are to be found in the other hemisphere. At higher orders in $α_s$ or in the presence of soft QCD effects, predictions of the rate of these events are far from trivial, and that motivates measurements with real data. We report on the first observation of the empty current hemisphere events in electron-proton collisions at the HERA collider using data recorded with the H1 detector at a center-of-mass energy of 319 GeV. The fraction of inclusive neutral-current DIS events with an empty hemisphere is found to be $0.0112 \pm 3.9\,\%_\text{stat} \pm 4.5\,\%_\text{syst} \pm 1.6\,\%_\text{mod}$ in the selected kinematic region of $150< Q^2<1500$ GeV$^2$ and inelasticity $0.14< y<0.7$. The data sample corresponds to an integrated luminosity of 351.1 pb$^{-1}$, sufficient to enable differential cross section measurements of these events. The results show an enhanced discriminating power at lower Bjorken-$x$ among different Monte Carlo event generator predictions. △ Less

Submitted 13 March, 2024; originally announced March 2024.

Comments: 13 pages, 5 figures, 2 Tables

Report number: DESY-24-034

arXiv:2403.08002 [pdf, other]

Towards a clinically accessible radiology foundation model: open-access and lightweight, with automated evaluation

Authors: Juan Manuel Zambrano Chaves, Shih-Cheng Huang, Yanbo Xu, Hanwen Xu, Naoto Usuyama, Sheng Zhang, Fei Wang, Yujia Xie, Mahmoud Khademi, Ziyi Yang, Hany Awadalla, Julia Gong, Houdong Hu, Jianwei Yang, Chunyuan Li, Jianfeng Gao, Yu Gu, Cliff Wong, Mu Wei, Tristan Naumann, Muhao Chen, Matthew P. Lungren, Akshay Chaudhari, Serena Yeung-Levy, Curtis P. Langlotz , et al. (2 additional authors not shown)

Abstract: The scaling laws and extraordinary performance of large foundation models motivate the development and utilization of such models in biomedicine. However, despite early promising results on some biomedical benchmarks, there are still major challenges that need to be addressed before these models can be used in real-world clinics. Frontier general-domain models such as GPT-4V still have significant… ▽ More The scaling laws and extraordinary performance of large foundation models motivate the development and utilization of such models in biomedicine. However, despite early promising results on some biomedical benchmarks, there are still major challenges that need to be addressed before these models can be used in real-world clinics. Frontier general-domain models such as GPT-4V still have significant performance gaps in multimodal biomedical applications. More importantly, less-acknowledged pragmatic issues, including accessibility, model cost, and tedious manual evaluation make it hard for clinicians to use state-of-the-art large models directly on private patient data. Here, we explore training open-source small multimodal models (SMMs) to bridge competency gaps for unmet clinical needs in radiology. To maximize data efficiency, we adopt a modular approach by incorporating state-of-the-art pre-trained models for image and text modalities, and focusing on training a lightweight adapter to ground each modality to the text embedding space, as exemplified by LLaVA-Med. For training, we assemble a large dataset of over 697 thousand radiology image-text pairs. For evaluation, we propose CheXprompt, a GPT-4-based metric for factuality evaluation, and demonstrate its parity with expert evaluation. For best practice, we conduct a systematic ablation study on various choices in data engineering and multimodal training. The resulting LlaVA-Rad (7B) model attains state-of-the-art results on standard radiology tasks such as report generation and cross-modal retrieval, even outperforming much larger models such as GPT-4V and Med-PaLM M (84B). The inference of LlaVA-Rad is fast and can be performed on a single V100 GPU in private settings, offering a promising state-of-the-art tool for real-world clinical applications. △ Less

Submitted 26 June, 2024; v1 submitted 12 March, 2024; originally announced March 2024.

arXiv:2403.01628 [pdf, ps, other]

Recent Advances, Applications, and Open Challenges in Machine Learning for Health: Reflections from Research Roundtables at ML4H 2023 Symposium

Authors: Hyewon Jeong, Sarah Jabbour, Yuzhe Yang, Rahul Thapta, Hussein Mozannar, William Jongwon Han, Nikita Mehandru, Michael Wornow, Vladislav Lialin, Xin Liu, Alejandro Lozano, Jiacheng Zhu, Rafal Dariusz Kocielnik, Keith Harrigian, Haoran Zhang, Edward Lee, Milos Vukadinovic, Aparna Balagopalan, Vincent Jeanselme, Katherine Matton, Ilker Demirel, Jason Fries, Parisa Rashidi, Brett Beaulieu-Jones, Xuhai Orson Xu , et al. (18 additional authors not shown)

Abstract: The third ML4H symposium was held in person on December 10, 2023, in New Orleans, Louisiana, USA. The symposium included research roundtable sessions to foster discussions between participants and senior researchers on timely and relevant topics for the \ac{ML4H} community. Encouraged by the successful virtual roundtables in the previous year, we organized eleven in-person roundtables and four vir… ▽ More The third ML4H symposium was held in person on December 10, 2023, in New Orleans, Louisiana, USA. The symposium included research roundtable sessions to foster discussions between participants and senior researchers on timely and relevant topics for the \ac{ML4H} community. Encouraged by the successful virtual roundtables in the previous year, we organized eleven in-person roundtables and four virtual roundtables at ML4H 2022. The organization of the research roundtables at the conference involved 17 Senior Chairs and 19 Junior Chairs across 11 tables. Each roundtable session included invited senior chairs (with substantial experience in the field), junior chairs (responsible for facilitating the discussion), and attendees from diverse backgrounds with interest in the session's topic. Herein we detail the organization process and compile takeaways from these roundtable discussions, including recent advances, applications, and open challenges for each topic. We conclude with a summary and lessons learned across all roundtables. This document serves as a comprehensive review paper, summarizing the recent advancements in machine learning for healthcare as contributed by foremost researchers in the field. △ Less

Submitted 5 April, 2024; v1 submitted 3 March, 2024; originally announced March 2024.

Comments: ML4H 2023, Research Roundtables

arXiv:2403.01002 [pdf, other]

Attribute Structuring Improves LLM-Based Evaluation of Clinical Text Summaries

Authors: Zelalem Gero, Chandan Singh, Yiqing Xie, Sheng Zhang, Tristan Naumann, Jianfeng Gao, Hoifung Poon

Abstract: Summarizing clinical text is crucial in health decision-support and clinical research. Large language models (LLMs) have shown the potential to generate accurate clinical text summaries, but still struggle with issues regarding grounding and evaluation, especially in safety-critical domains such as health. Holistically evaluating text summaries is challenging because they may contain unsubstantiat… ▽ More Summarizing clinical text is crucial in health decision-support and clinical research. Large language models (LLMs) have shown the potential to generate accurate clinical text summaries, but still struggle with issues regarding grounding and evaluation, especially in safety-critical domains such as health. Holistically evaluating text summaries is challenging because they may contain unsubstantiated information. Here, we explore a general mitigation framework using Attribute Structuring (AS), which structures the summary evaluation process. It decomposes the evaluation process into a grounded procedure that uses an LLM for relatively simple structuring and scoring tasks, rather than the full task of holistic summary evaluation. Experiments show that AS consistently improves the correspondence between human annotations and automated metrics in clinical text summarization. Additionally, AS yields interpretations in the form of a short text span corresponding to each output, which enables efficient human auditing, paving the way towards trustworthy evaluation of clinical information in resource-constrained scenarios. We release our code, prompts, and an open-source benchmark at https://github.com/microsoft/attribute-structuring. △ Less

Submitted 1 March, 2024; originally announced March 2024.

Comments: 4 pages

arXiv:2311.09581 [pdf, other]

DocLens: Multi-aspect Fine-grained Evaluation for Medical Text Generation

Authors: Yiqing Xie, Sheng Zhang, Hao Cheng, Pengfei Liu, Zelalem Gero, Cliff Wong, Tristan Naumann, Hoifung Poon, Carolyn Rose

Abstract: Medical text generation aims to assist with administrative work and highlight salient information to support decision-making. To reflect the specific requirements of medical text, in this paper, we propose a set of metrics to evaluate the completeness, conciseness, and attribution of the generated text at a fine-grained level. The metrics can be computed by various types of evaluators including in… ▽ More Medical text generation aims to assist with administrative work and highlight salient information to support decision-making. To reflect the specific requirements of medical text, in this paper, we propose a set of metrics to evaluate the completeness, conciseness, and attribution of the generated text at a fine-grained level. The metrics can be computed by various types of evaluators including instruction-following (both proprietary and open-source) and supervised entailment models. We demonstrate the effectiveness of the resulting framework, DocLens, with three evaluators on three tasks: clinical note generation, radiology report summarization, and patient question summarization. A comprehensive human study shows that DocLens exhibits substantially higher agreement with the judgments of medical experts than existing metrics. The results also highlight the need to improve open-source evaluators and suggest potential directions. △ Less

Submitted 18 February, 2024; v1 submitted 16 November, 2023; originally announced November 2023.

arXiv:2311.01301 [pdf, other]

TRIALSCOPE: A Unifying Causal Framework for Scaling Real-World Evidence Generation with Biomedical Language Models

Authors: Javier González, Cliff Wong, Zelalem Gero, Jass Bagga, Risa Ueno, Isabel Chien, Eduard Oravkin, Emre Kiciman, Aditya Nori, Roshanthi Weerasinghe, Rom S. Leidner, Brian Piening, Tristan Naumann, Carlo Bifulco, Hoifung Poon

Abstract: The rapid digitization of real-world data offers an unprecedented opportunity for optimizing healthcare delivery and accelerating biomedical discovery. In practice, however, such data is most abundantly available in unstructured forms, such as clinical notes in electronic medical records (EMRs), and it is generally plagued by confounders. In this paper, we present TRIALSCOPE, a unifying framework… ▽ More The rapid digitization of real-world data offers an unprecedented opportunity for optimizing healthcare delivery and accelerating biomedical discovery. In practice, however, such data is most abundantly available in unstructured forms, such as clinical notes in electronic medical records (EMRs), and it is generally plagued by confounders. In this paper, we present TRIALSCOPE, a unifying framework for distilling real-world evidence from population-level observational data. TRIALSCOPE leverages biomedical language models to structure clinical text at scale, employs advanced probabilistic modeling for denoising and imputation, and incorporates state-of-the-art causal inference techniques to combat common confounders. Using clinical trial specification as generic representation, TRIALSCOPE provides a turn-key solution to generate and reason with clinical hypotheses using observational data. In extensive experiments and analyses on a large-scale real-world dataset with over one million cancer patients from a large US healthcare network, we show that TRIALSCOPE can produce high-quality structuring of real-world data and generates comparable results to marquee cancer trials. In addition to facilitating in-silicon clinical trial design and optimization, TRIALSCOPE may be used to empower synthetic controls, pragmatic trials, post-market surveillance, as well as support fine-grained patient-like-me reasoning in precision diagnosis and treatment. △ Less

Submitted 6 November, 2023; v1 submitted 2 November, 2023; originally announced November 2023.

Comments: 6 Figures, 22 Pages, 3 Tables

arXiv:2308.02180 [pdf, other]

Scaling Clinical Trial Matching Using Large Language Models: A Case Study in Oncology

Authors: Cliff Wong, Sheng Zhang, Yu Gu, Christine Moung, Jacob Abel, Naoto Usuyama, Roshanthi Weerasinghe, Brian Piening, Tristan Naumann, Carlo Bifulco, Hoifung Poon

Abstract: Clinical trial matching is a key process in health delivery and discovery. In practice, it is plagued by overwhelming unstructured data and unscalable manual processing. In this paper, we conduct a systematic study on scaling clinical trial matching using large language models (LLMs), with oncology as the focus area. Our study is grounded in a clinical trial matching system currently in test deplo… ▽ More Clinical trial matching is a key process in health delivery and discovery. In practice, it is plagued by overwhelming unstructured data and unscalable manual processing. In this paper, we conduct a systematic study on scaling clinical trial matching using large language models (LLMs), with oncology as the focus area. Our study is grounded in a clinical trial matching system currently in test deployment at a large U.S. health network. Initial findings are promising: out of box, cutting-edge LLMs, such as GPT-4, can already structure elaborate eligibility criteria of clinical trials and extract complex matching logic (e.g., nested AND/OR/NOT). While still far from perfect, LLMs substantially outperform prior strong baselines and may serve as a preliminary solution to help triage patient-trial candidates with humans in the loop. Our study also reveals a few significant growth areas for applying LLMs to end-to-end clinical trial matching, such as context limitation and accuracy, especially in structuring patient information from longitudinal medical records. △ Less

Submitted 18 August, 2023; v1 submitted 4 August, 2023; originally announced August 2023.

Comments: 24 pages, 5 figures, accepted at Machine Learning for Healthcare (MLHC) 2023

arXiv:2307.06439 [pdf, other]

Distilling Large Language Models for Biomedical Knowledge Extraction: A Case Study on Adverse Drug Events

Authors: Yu Gu, Sheng Zhang, Naoto Usuyama, Yonas Woldesenbet, Cliff Wong, Praneeth Sanapathi, Mu Wei, Naveen Valluri, Erika Strandberg, Tristan Naumann, Hoifung Poon

Abstract: Large language models (LLMs), such as GPT-4, have demonstrated remarkable capabilities across a wide range of tasks, including health applications. In this paper, we study how LLMs can be used to scale biomedical knowledge curation. We find that while LLMs already possess decent competency in structuring biomedical text, by distillation into a task-specific student model through self-supervised le… ▽ More Large language models (LLMs), such as GPT-4, have demonstrated remarkable capabilities across a wide range of tasks, including health applications. In this paper, we study how LLMs can be used to scale biomedical knowledge curation. We find that while LLMs already possess decent competency in structuring biomedical text, by distillation into a task-specific student model through self-supervised learning, substantial gains can be attained over out-of-box LLMs, with additional advantages such as cost, efficiency, and white-box model access. We conduct a case study on adverse drug event (ADE) extraction, which is an important area for improving care. On standard ADE extraction evaluation, a GPT-3.5 distilled PubMedBERT model attained comparable accuracy as supervised state-of-the-art models without using any labeled data. Despite being over 1,000 times smaller, the distilled model outperformed its teacher GPT-3.5 by over 6 absolute points in F1 and GPT-4 by over 5 absolute points. Ablation studies on distillation model choice (e.g., PubMedBERT vs BioGPT) and ADE extraction architecture shed light on best practice for biomedical knowledge extraction. Similar gains were attained by distillation for other standard biomedical knowledge extraction tasks such as gene-disease associations and protected health information, further illustrating the promise of this approach. △ Less

Submitted 12 July, 2023; originally announced July 2023.

arXiv:2306.00890 [pdf, other]

LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day

Authors: Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan Naumann, Hoifung Poon, Jianfeng Gao

Abstract: Conversational generative AI has demonstrated remarkable promise for empowering biomedical practitioners, but current investigations focus on unimodal text. Multimodal conversational AI has seen rapid progress by leveraging billions of image-text pairs from the public web, but such general-domain vision-language models still lack sophistication in understanding and conversing about biomedical imag… ▽ More Conversational generative AI has demonstrated remarkable promise for empowering biomedical practitioners, but current investigations focus on unimodal text. Multimodal conversational AI has seen rapid progress by leveraging billions of image-text pairs from the public web, but such general-domain vision-language models still lack sophistication in understanding and conversing about biomedical images. In this paper, we propose a cost-efficient approach for training a vision-language conversational assistant that can answer open-ended research questions of biomedical images. The key idea is to leverage a large-scale, broad-coverage biomedical figure-caption dataset extracted from PubMed Central, use GPT-4 to self-instruct open-ended instruction-following data from the captions, and then fine-tune a large general-domain vision-language model using a novel curriculum learning method. Specifically, the model first learns to align biomedical vocabulary using the figure-caption pairs as is, then learns to master open-ended conversational semantics using GPT-4 generated instruction-following data, broadly mimicking how a layperson gradually acquires biomedical knowledge. This enables us to train a Large Language and Vision Assistant for BioMedicine (LLaVA-Med) in less than 15 hours (with eight A100s). LLaVA-Med exhibits excellent multimodal conversational capability and can follow open-ended instruction to assist with inquiries about a biomedical image. On three standard biomedical visual question answering datasets, LLaVA-Med outperforms previous supervised state-of-the-art on certain metrics. To facilitate biomedical multimodal research, we will release our instruction-following data and the LLaVA-Med model. △ Less

Submitted 1 June, 2023; originally announced June 2023.

Comments: 17 pages; Website: https://aka.ms/llava-med

arXiv:2306.00024 [pdf, other]

Self-Verification Improves Few-Shot Clinical Information Extraction

Authors: Zelalem Gero, Chandan Singh, Hao Cheng, Tristan Naumann, Michel Galley, Jianfeng Gao, Hoifung Poon

Abstract: Extracting patient information from unstructured text is a critical task in health decision-support and clinical research. Large language models (LLMs) have shown the potential to accelerate clinical curation via few-shot in-context learning, in contrast to supervised learning which requires much more costly human annotations. However, despite drastic advances in modern LLMs such as GPT-4, they st… ▽ More Extracting patient information from unstructured text is a critical task in health decision-support and clinical research. Large language models (LLMs) have shown the potential to accelerate clinical curation via few-shot in-context learning, in contrast to supervised learning which requires much more costly human annotations. However, despite drastic advances in modern LLMs such as GPT-4, they still struggle with issues regarding accuracy and interpretability, especially in mission-critical domains such as health. Here, we explore a general mitigation framework using self-verification, which leverages the LLM to provide provenance for its own extraction and check its own outputs. This is made possible by the asymmetry between verification and generation, where the latter is often much easier than the former. Experimental results show that our method consistently improves accuracy for various LLMs in standard clinical information extraction tasks. Additionally, self-verification yields interpretations in the form of a short text span corresponding to each output, which makes it very efficient for human experts to audit the results, paving the way towards trustworthy extraction of clinical information in resource-constrained scenarios. To facilitate future research in this direction, we release our code and prompts. △ Less

Submitted 30 May, 2023; originally announced June 2023.

Journal ref: IMLH 2023

arXiv:2305.17588 [pdf, other]

Diagnosing Transformers: Illuminating Feature Spaces for Clinical Decision-Making

Authors: Aliyah R. Hsu, Yeshwanth Cherapanamjeri, Briton Park, Tristan Naumann, Anobel Y. Odisho, Bin Yu

Abstract: Pre-trained transformers are often fine-tuned to aid clinical decision-making using limited clinical notes. Model interpretability is crucial, especially in high-stakes domains like medicine, to establish trust and ensure safety, which requires human engagement. We introduce SUFO, a systematic framework that enhances interpretability of fine-tuned transformer feature spaces. SUFO utilizes a range… ▽ More Pre-trained transformers are often fine-tuned to aid clinical decision-making using limited clinical notes. Model interpretability is crucial, especially in high-stakes domains like medicine, to establish trust and ensure safety, which requires human engagement. We introduce SUFO, a systematic framework that enhances interpretability of fine-tuned transformer feature spaces. SUFO utilizes a range of analytic and visualization techniques, including Supervised probing, Unsupervised similarity analysis, Feature dynamics, and Outlier analysis to address key questions about model trust and interpretability. We conduct a case study investigating the impact of pre-training data where we focus on real-world pathology classification tasks, and validate our findings on MedNLI. We evaluate five 110M-sized pre-trained transformer models, categorized into general-domain (BERT, TNLR), mixed-domain (BioBERT, Clinical BioBERT), and domain-specific (PubMedBERT) groups. Our SUFO analyses reveal that: (1) while PubMedBERT, the domain-specific model, contains valuable information for fine-tuning, it can overfit to minority classes when class imbalances exist. In contrast, mixed-domain models exhibit greater resistance to overfitting, suggesting potential improvements in domain-specific model robustness; (2) in-domain pre-training accelerates feature disambiguation during fine-tuning; and (3) feature spaces undergo significant sparsification during this process, enabling clinicians to identify common outlier modes among fine-tuned models as demonstrated in this paper. These findings showcase the utility of SUFO in enhancing trust and safety when using transformers in medicine, and we believe SUFO can aid practitioners in evaluating fine-tuned language models for other applications in medicine and in more critical domains. △ Less

Submitted 26 February, 2024; v1 submitted 27 May, 2023; originally announced May 2023.

arXiv:2305.07615 [pdf, other]

What are the Desired Characteristics of Calibration Sets? Identifying Correlates on Long Form Scientific Summarization

Authors: Griffin Adams, Bichlien H Nguyen, Jake Smith, Yingce Xia, Shufang Xie, Anna Ostropolets, Budhaditya Deb, Yuan-Jyue Chen, Tristan Naumann, Noémie Elhadad

Abstract: Summarization models often generate text that is poorly calibrated to quality metrics because they are trained to maximize the likelihood of a single reference (MLE). To address this, recent work has added a calibration step, which exposes a model to its own ranked outputs to improve relevance or, in a separate line of work, contrasts positive and negative sets to improve faithfulness. While effec… ▽ More Summarization models often generate text that is poorly calibrated to quality metrics because they are trained to maximize the likelihood of a single reference (MLE). To address this, recent work has added a calibration step, which exposes a model to its own ranked outputs to improve relevance or, in a separate line of work, contrasts positive and negative sets to improve faithfulness. While effective, much of this work has focused on how to generate and optimize these sets. Less is known about why one setup is more effective than another. In this work, we uncover the underlying characteristics of effective sets. For each training instance, we form a large, diverse pool of candidates and systematically vary the subsets used for calibration fine-tuning. Each selection strategy targets distinct aspects of the sets, such as lexical diversity or the size of the gap between positive and negatives. On three diverse scientific long-form summarization datasets (spanning biomedical, clinical, and chemical domains), we find, among others, that faithfulness calibration is optimal when the negative sets are extractive and more likely to be generated, whereas for relevance calibration, the metric margin between candidates should be maximized and surprise--the disagreement between model and metric defined candidate rankings--minimized. Code to create, select, and optimize calibration sets is available at https://github.com/griff4692/calibrating-summaries △ Less

Submitted 12 May, 2023; originally announced May 2023.

Comments: ACL 2023

arXiv:2303.13620 [pdf, other]

doi 10.1016/j.physletb.2023.138101

Unbinned Deep Learning Jet Substructure Measurement in High $Q^2$ ep collisions at HERA

Authors: The H1 collaboration, V. Andreev, M. Arratia, A. Baghdasaryan, A. Baty, K. Begzsuren, A. Bolz, V. Boudry, G. Brandt, D. Britzger, A. Buniatyan, L. Bystritskaya, A. J. Campbell, K. B. Cantun Avila, K. Cerny, V. Chekelian, Z. Chen, J. G. Contreras, J. Cvach, J. B. Dainton, K. Daum, A. Deshpande, C. Diaconu, A. Drees, G. Eckerlin , et al. (120 additional authors not shown)

Abstract: The radiation pattern within high energy quark- and gluon-initiated jets (jet substructure) is used extensively as a precision probe of the strong force as well as an environment for optimizing event generators with numerous applications in high energy particle and nuclear physics. Looking at electron-proton collisions is of particular interest as many of the complications present at hadron collid… ▽ More The radiation pattern within high energy quark- and gluon-initiated jets (jet substructure) is used extensively as a precision probe of the strong force as well as an environment for optimizing event generators with numerous applications in high energy particle and nuclear physics. Looking at electron-proton collisions is of particular interest as many of the complications present at hadron colliders are absent. A detailed study of modern jet substructure observables, jet angularities, in electron-proton collisions is presented using data recorded using the H1 detector at HERA. The measurement is unbinned and multi-dimensional, using machine learning to correct for detector effects. All of the available reconstructed object information of the respective jets is interpreted by a graph neural network, achieving superior precision on a selected set of jet angularities. Training these networks was enabled by the use of a large number of GPUs in the Perlmutter supercomputer at Berkeley Lab. The particle jets are reconstructed in the laboratory frame, using the $k_{\mathrm{T}}$ jet clustering algorithm. Results are reported at high transverse momentum transfer $Q^2>150$ GeV${}^2$, and inelasticity $0.2 < y < 0.7$. The analysis is also performed in sub-regions of $Q^2$, thus probing scale dependencies of the substructure variables. The data are compared with a variety of predictions and point towards possible improvements of such models. △ Less

Submitted 14 September, 2023; v1 submitted 23 March, 2023; originally announced March 2023.

Comments: 25 pages, 10 figures, 8 tables, version accepted by Physics Letters B

Report number: DESY-23-034

Journal ref: PLB 844 (2023) 138101

arXiv:2303.13386 [pdf, other]

Compositional Zero-Shot Domain Transfer with Text-to-Text Models

Authors: Fangyu Liu, Qianchu Liu, Shruthi Bannur, Fernando Pérez-García, Naoto Usuyama, Sheng Zhang, Tristan Naumann, Aditya Nori, Hoifung Poon, Javier Alvarez-Valle, Ozan Oktay, Stephanie L. Hyland

Abstract: Label scarcity is a bottleneck for improving task performance in specialised domains. We propose a novel compositional transfer learning framework (DoT5 - domain compositional zero-shot T5) for zero-shot domain transfer. Without access to in-domain labels, DoT5 jointly learns domain knowledge (from MLM of unlabelled in-domain free text) and task knowledge (from task training on more readily availa… ▽ More Label scarcity is a bottleneck for improving task performance in specialised domains. We propose a novel compositional transfer learning framework (DoT5 - domain compositional zero-shot T5) for zero-shot domain transfer. Without access to in-domain labels, DoT5 jointly learns domain knowledge (from MLM of unlabelled in-domain free text) and task knowledge (from task training on more readily available general-domain data) in a multi-task manner. To improve the transferability of task training, we design a strategy named NLGU: we simultaneously train NLG for in-domain label-to-data generation which enables data augmentation for self-finetuning and NLU for label prediction. We evaluate DoT5 on the biomedical domain and the resource-lean subdomain of radiology, focusing on NLI, text summarisation and embedding learning. DoT5 demonstrates the effectiveness of compositional transfer learning through multi-task learning. In particular, DoT5 outperforms the current SOTA in zero-shot transfer by over 7 absolute points in accuracy on RadNLI. We validate DoT5 with ablations and a case study demonstrating its ability to solve challenging NLI examples requiring in-domain expertise. △ Less

Submitted 23 March, 2023; originally announced March 2023.

Comments: Accepted at TACL, pre-MIT Press publication version. 16 pages, 4 figures

arXiv:2303.00915 [pdf, other]

BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs

Authors: Sheng Zhang, Yanbo Xu, Naoto Usuyama, Hanwen Xu, Jaspreet Bagga, Robert Tinn, Sam Preston, Rajesh Rao, Mu Wei, Naveen Valluri, Cliff Wong, Andrea Tupini, Yu Wang, Matt Mazzola, Swadheen Shukla, Lars Liden, Jianfeng Gao, Matthew P. Lungren, Tristan Naumann, Sheng Wang, Hoifung Poon

Abstract: Biomedical data is inherently multimodal, comprising physical measurements and natural language narratives. A generalist biomedical AI model needs to simultaneously process different modalities of data, including text and images. Therefore, training an effective generalist biomedical model requires high-quality multimodal data, such as parallel image-text pairs. Here, we present PMC-15M, a novel d… ▽ More Biomedical data is inherently multimodal, comprising physical measurements and natural language narratives. A generalist biomedical AI model needs to simultaneously process different modalities of data, including text and images. Therefore, training an effective generalist biomedical model requires high-quality multimodal data, such as parallel image-text pairs. Here, we present PMC-15M, a novel dataset that is two orders of magnitude larger than existing biomedical multimodal datasets such as MIMIC-CXR, and spans a diverse range of biomedical image types. PMC-15M contains 15 million biomedical image-text pairs collected from 4.4 million scientific articles. Based on PMC-15M, we have pretrained BiomedCLIP, a multimodal foundation model, with domain-specific adaptations tailored to biomedical vision-language processing. We conducted extensive experiments and ablation studies on standard biomedical imaging tasks from retrieval to classification to visual question-answering (VQA). BiomedCLIP achieved new state-of-the-art results in a wide range of standard datasets, substantially outperforming prior approaches. Intriguingly, by large-scale pretraining on diverse biomedical image types, BiomedCLIP even outperforms state-of-the-art radiology-specific models such as BioViL in radiology-specific tasks such as RSNA pneumonia detection. In summary, BiomedCLIP is a fully open-access foundation model that achieves state-of-the-art performance on various biomedical tasks, paving the way for transformative multimodal biomedical discovery and applications. We release our models at https://aka.ms/biomedclip to facilitate future research in multimodal biomedical AI. △ Less

Submitted 16 January, 2024; v1 submitted 1 March, 2023; originally announced March 2023.

Comments: The models are released at https://aka.ms/biomedclip

arXiv:2212.10823 [pdf, other]

Continual Contrastive Finetuning Improves Low-Resource Relation Extraction

Authors: Wenxuan Zhou, Sheng Zhang, Tristan Naumann, Muhao Chen, Hoifung Poon

Abstract: Relation extraction (RE), which has relied on structurally annotated corpora for model training, has been particularly challenging in low-resource scenarios and domains. Recent literature has tackled low-resource RE by self-supervised learning, where the solution involves pretraining the entity pair embedding by RE-based objective and finetuning on labeled data by classification-based objective. H… ▽ More Relation extraction (RE), which has relied on structurally annotated corpora for model training, has been particularly challenging in low-resource scenarios and domains. Recent literature has tackled low-resource RE by self-supervised learning, where the solution involves pretraining the entity pair embedding by RE-based objective and finetuning on labeled data by classification-based objective. However, a critical challenge to this approach is the gap in objectives, which prevents the RE model from fully utilizing the knowledge in pretrained representations. In this paper, we aim at bridging the gap and propose to pretrain and finetune the RE model using consistent objectives of contrastive learning. Since in this kind of representation learning paradigm, one relation may easily form multiple clusters in the representation space, we further propose a multi-center contrastive loss that allows one relation to form multiple clusters to better align with pretraining. Experiments on two document-level RE datasets, BioRED and Re-DocRED, demonstrate the effectiveness of our method. Particularly, when using 1% end-task training data, our method outperforms PLM-based RE classifier by 10.5% and 6.1% on the two datasets, respectively. △ Less

Submitted 31 May, 2023; v1 submitted 21 December, 2022; originally announced December 2022.

Comments: ACL 2023

arXiv:2205.02752

A collection of invited non-archival papers for the Conference on Health, Inference, and Learning (CHIL) 2022

Authors: Gerardo Flores, George H. Chen, Tom Pollard, Joyce C. Ho, Tristan Naumann

Abstract: A collection of invited non-archival papers for the Conference on Health, Inference, and Learning (CHIL) 2022. This index is incomplete as some authors of invited non-archival presentations opted not to include their papers in this index. A collection of invited non-archival papers for the Conference on Health, Inference, and Learning (CHIL) 2022. This index is incomplete as some authors of invited non-archival presentations opted not to include their papers in this index. △ Less

Submitted 28 March, 2022; originally announced May 2022.

arXiv:2204.09817 [pdf, other]

doi 10.1007/978-3-031-20059-5_1

Making the Most of Text Semantics to Improve Biomedical Vision--Language Processing

Authors: Benedikt Boecking, Naoto Usuyama, Shruthi Bannur, Daniel C. Castro, Anton Schwaighofer, Stephanie Hyland, Maria Wetscherek, Tristan Naumann, Aditya Nori, Javier Alvarez-Valle, Hoifung Poon, Ozan Oktay

Abstract: Multi-modal data abounds in biomedicine, such as radiology images and reports. Interpreting this data at scale is essential for improving clinical care and accelerating clinical research. Biomedical text with its complex semantics poses additional challenges in vision--language modelling compared to the general domain, and previous work has used insufficiently adapted models that lack domain-speci… ▽ More Multi-modal data abounds in biomedicine, such as radiology images and reports. Interpreting this data at scale is essential for improving clinical care and accelerating clinical research. Biomedical text with its complex semantics poses additional challenges in vision--language modelling compared to the general domain, and previous work has used insufficiently adapted models that lack domain-specific language understanding. In this paper, we show that principled textual semantic modelling can substantially improve contrastive learning in self-supervised vision--language processing. We release a language model that achieves state-of-the-art results in radiology natural language inference through its improved vocabulary and novel language pretraining objective leveraging semantics and discourse characteristics in radiology reports. Further, we propose a self-supervised joint vision--language approach with a focus on better text modelling. It establishes new state of the art results on a wide range of publicly available benchmarks, in part by leveraging our new domain-specific language model. We release a new dataset with locally-aligned phrase grounding annotations by radiologists to facilitate the study of complex semantic modelling in biomedical vision--language processing. A broad evaluation, including on this new dataset, shows that our contrastive learning approach, aided by textual-semantic modelling, outperforms prior methods in segmentation tasks, despite only using a global-alignment objective. △ Less

Submitted 21 July, 2022; v1 submitted 20 April, 2022; originally announced April 2022.

Comments: To appear in ECCV 2022. Code: https://aka.ms/biovil-code Dataset: https://aka.ms/ms-cxr Demo Notebook: https://aka.ms/biovil-demo-notebook

Journal ref: Computer Vision - ECCV 2022, LNCS vol 13696, pp 1-21

arXiv:2203.10442 [pdf, other]

Towards Structuring Real-World Data at Scale: Deep Learning for Extracting Key Oncology Information from Clinical Text with Patient-Level Supervision

Authors: Sam Preston, Mu Wei, Rajesh Rao, Robert Tinn, Naoto Usuyama, Michael Lucas, Roshanthi Weerasinghe, Soohee Lee, Brian Piening, Paul Tittel, Naveen Valluri, Tristan Naumann, Carlo Bifulco, Hoifung Poon

Abstract: Objective: The majority of detailed patient information in real-world data (RWD) is only consistently available in free-text clinical documents. Manual curation is expensive and time-consuming. Developing natural language processing (NLP) methods for structuring RWD is thus essential for scaling real-world evidence generation. Materials and Methods: Traditional rule-based systems are vulnerable… ▽ More Objective: The majority of detailed patient information in real-world data (RWD) is only consistently available in free-text clinical documents. Manual curation is expensive and time-consuming. Developing natural language processing (NLP) methods for structuring RWD is thus essential for scaling real-world evidence generation. Materials and Methods: Traditional rule-based systems are vulnerable to the prevalent linguistic variations and ambiguities in clinical text, and prior applications of machine-learning methods typically require sentence-level or report-level labeled examples that are hard to produce at scale. We propose leveraging patient-level supervision from medical registries, which are often readily available and capture key patient information, for general RWD applications. To combat the lack of sentence-level or report-level annotations, we explore advanced deep-learning methods by combining domain-specific pretraining, recurrent neural networks, and hierarchical attention. Results: We conduct an extensive study on 135,107 patients from the cancer registry of a large integrated delivery network (IDN) comprising healthcare systems in five western US states. Our deep learning methods attain test AUROC of 94-99% for key tumor attributes and comparable performance on held-out data from separate health systems and states. Discussion and Conclusion: Ablation results demonstrate clear superiority of these advanced deep-learning methods over prior approaches. Error analysis shows that our NLP system sometimes even corrects errors in registrar labels. We also conduct a preliminary investigation in accelerating registry curation and general RWD structuring via assisted curation for over 1.2 million cancer patients in this healthcare network. △ Less

Submitted 19 March, 2022; originally announced March 2022.

arXiv:2112.07887 [pdf, other]

Knowledge-Rich Self-Supervision for Biomedical Entity Linking

Authors: Sheng Zhang, Hao Cheng, Shikhar Vashishth, Cliff Wong, Jinfeng Xiao, Xiaodong Liu, Tristan Naumann, Jianfeng Gao, Hoifung Poon

Abstract: Entity linking faces significant challenges such as prolific variations and prevalent ambiguities, especially in high-value domains with myriad entities. Standard classification approaches suffer from the annotation bottleneck and cannot effectively handle unseen entities. Zero-shot entity linking has emerged as a promising direction for generalizing to new entities, but it still requires example… ▽ More Entity linking faces significant challenges such as prolific variations and prevalent ambiguities, especially in high-value domains with myriad entities. Standard classification approaches suffer from the annotation bottleneck and cannot effectively handle unseen entities. Zero-shot entity linking has emerged as a promising direction for generalizing to new entities, but it still requires example gold entity mentions during training and canonical descriptions for all entities, both of which are rarely available outside of Wikipedia. In this paper, we explore Knowledge-RIch Self-Supervision ($\tt KRISS$) for biomedical entity linking, by leveraging readily available domain knowledge. In training, it generates self-supervised mention examples on unlabeled text using a domain ontology and trains a contextual encoder using contrastive learning. For inference, it samples self-supervised mentions as prototypes for each entity and conducts linking by mapping the test mention to the most similar prototype. Our approach can easily incorporate entity descriptions and gold mention labels if available. We conducted extensive experiments on seven standard datasets spanning biomedical literature and clinical notes. Without using any labeled information, our method produces $\tt KRISSBERT$, a universal entity linker for four million UMLS entities that attains new state of the art, outperforming prior self-supervised methods by as much as 20 absolute points in accuracy. △ Less

Submitted 23 May, 2022; v1 submitted 15 December, 2021; originally announced December 2021.

arXiv:2112.07869 [pdf, other]

Fine-Tuning Large Neural Language Models for Biomedical Natural Language Processing

Authors: Robert Tinn, Hao Cheng, Yu Gu, Naoto Usuyama, Xiaodong Liu, Tristan Naumann, Jianfeng Gao, Hoifung Poon

Abstract: Motivation: A perennial challenge for biomedical researchers and clinical practitioners is to stay abreast with the rapid growth of publications and medical notes. Natural language processing (NLP) has emerged as a promising direction for taming information overload. In particular, large neural language models facilitate transfer learning by pretraining on unlabeled text, as exemplified by the suc… ▽ More Motivation: A perennial challenge for biomedical researchers and clinical practitioners is to stay abreast with the rapid growth of publications and medical notes. Natural language processing (NLP) has emerged as a promising direction for taming information overload. In particular, large neural language models facilitate transfer learning by pretraining on unlabeled text, as exemplified by the successes of BERT models in various NLP applications. However, fine-tuning such models for an end task remains challenging, especially with small labeled datasets, which are common in biomedical NLP. Results: We conduct a systematic study on fine-tuning stability in biomedical NLP. We show that finetuning performance may be sensitive to pretraining settings, especially in low-resource domains. Large models have potential to attain better performance, but increasing model size also exacerbates finetuning instability. We thus conduct a comprehensive exploration of techniques for addressing fine-tuning instability. We show that these techniques can substantially improve fine-tuning performance for lowresource biomedical NLP applications. Specifically, freezing lower layers is helpful for standard BERT-BASE models, while layerwise decay is more effective for BERT-LARGE and ELECTRA models. For low-resource text similarity tasks such as BIOSSES, reinitializing the top layer is the optimal strategy. Overall, domainspecific vocabulary and pretraining facilitate more robust models for fine-tuning. Based on these findings, we establish new state of the art on a wide range of biomedical NLP applications. Availability and implementation: To facilitate progress in biomedical NLP, we release our state-of-the-art pretrained and fine-tuned models: https://aka.ms/BLURB. △ Less

Submitted 14 December, 2021; originally announced December 2021.

arXiv:2112.01120 [pdf, other]

Impact of jet-production data on the next-to-next-to-leading-order determination of HERAPDF2.0 parton distributions

Authors: H1, ZEUS Collaborations, :, I. Abt, R. Aggarwal, V. Andreev, M. Arratia, V. Aushev, A. Baghdasaryan, A. Baty, K. Begzsuren, O. Behnke, A. Belousov, A. Bertolin, I. Bloch, V. Boudry, G. Brandt, I. Brock, N. H. Brook, R. Brugnera, A. Bruni, A. Buniatyan, P. J. Bussey, L. Bystritskaya, A. Caldwell , et al. (212 additional authors not shown)

Abstract: The HERAPDF2.0 ensemble of parton distribution functions (PDFs) was introduced in 2015. The final stage is presented, a next-to-next-to-leading-order (NNLO) analysis of the HERA data on inclusive deep inelastic $ep$ scattering together with jet data as published by the H1 and ZEUS collaborations. A perturbative QCD fit, simultaneously of $α_s(M_Z^2)$ and and the PDFs, was performed with the result… ▽ More The HERAPDF2.0 ensemble of parton distribution functions (PDFs) was introduced in 2015. The final stage is presented, a next-to-next-to-leading-order (NNLO) analysis of the HERA data on inclusive deep inelastic $ep$ scattering together with jet data as published by the H1 and ZEUS collaborations. A perturbative QCD fit, simultaneously of $α_s(M_Z^2)$ and and the PDFs, was performed with the result $α_s(M_Z^2) = 0.1156 \pm 0.0011~{\rm (exp)}~ ^{+0.0001}_{-0.0002}~ {\rm (model}$ ${\rm +~parameterisation)}~ \pm 0.0029~{\rm (scale)}$. The PDF sets of HERAPDF2.0Jets NNLO were determined with separate fits using two fixed values of $α_s(M_Z^2)$, $α_s(M_Z^2)=0.1155$ and $0.118$, since the latter value was already chosen for the published HERAPDF2.0 NNLO analysis based on HERA inclusive DIS data only. The different sets of PDFs are presented, evaluated and compared. The consistency of the PDFs determined with and without the jet data demonstrates the consistency of HERA inclusive and jet-production cross-section data. The inclusion of the jet data reduced the uncertainty on the gluon PDF. Predictions based on the PDFs of HERAPDF2.0Jets NNLO give an excellent description of the jet-production data used as input. △ Less

Submitted 2 December, 2021; originally announced December 2021.

Comments: 43 pages, 24 figures, to be submitted to Eur. Phys. J. C

Report number: DESY-21-206

arXiv:2109.11836 [pdf]

B20-MnSi films grown on Si(100) substrates with magnetic skyrmion signature

Authors: Zichao Li, Ye Yuan, René Hübner, Viktor Begeza, Thomas Naumann, Lars Rebohle, Olav Hellwig, Manfred Helm, Kornelius Nielsch, Slawomir Prucnal, Shengqiang Zhou

Abstract: Magnetic skyrmions have been suggested as information carriers for future spintronic devices. As the first material with experimentally confirmed skyrmions, B20-type MnSi was the research focus for decades. Although B20-MnSi films have been successfully grown on Si(111) substrates, there is no report about B20-MnSi films on Si(100) substrates, which would be more preferred for practical applicatio… ▽ More Magnetic skyrmions have been suggested as information carriers for future spintronic devices. As the first material with experimentally confirmed skyrmions, B20-type MnSi was the research focus for decades. Although B20-MnSi films have been successfully grown on Si(111) substrates, there is no report about B20-MnSi films on Si(100) substrates, which would be more preferred for practical applications. In this letter, we present the first preparation of B20-MnSi on Si(100) substrates. It is realized by sub-second solid-state reaction between Mn and Si via flash-lamp annealing at ambient pressure. The regrown layer shows an enhanced Curie temperature of 43 K compared with bulk B20-MnSi. The magnetic skyrmion signature is proved in our films by magnetic and transport measurements. The millisecond-range flash annealing provides a promising avenue for the fabrication of Si-based skyrmionic devices. △ Less

Submitted 24 September, 2021; originally announced September 2021.

Comments: 17 pages, accepted at Materials Today Physics

arXiv:2109.05362 [pdf, other]

Modular Self-Supervision for Document-Level Relation Extraction

Authors: Sheng Zhang, Cliff Wong, Naoto Usuyama, Sarthak Jain, Tristan Naumann, Hoifung Poon

Abstract: Extracting relations across large text spans has been relatively underexplored in NLP, but it is particularly important for high-value domains such as biomedicine, where obtaining high recall of the latest findings is crucial for practical applications. Compared to conventional information extraction confined to short text spans, document-level relation extraction faces additional challenges in bo… ▽ More Extracting relations across large text spans has been relatively underexplored in NLP, but it is particularly important for high-value domains such as biomedicine, where obtaining high recall of the latest findings is crucial for practical applications. Compared to conventional information extraction confined to short text spans, document-level relation extraction faces additional challenges in both inference and learning. Given longer text spans, state-of-the-art neural architectures are less effective and task-specific self-supervision such as distant supervision becomes very noisy. In this paper, we propose decomposing document-level relation extraction into relation detection and argument resolution, taking inspiration from Davidsonian semantics. This enables us to incorporate explicit discourse modeling and leverage modular self-supervision for each sub-problem, which is less noise-prone and can be further refined end-to-end via variational EM. We conduct a thorough evaluation in biomedical machine reading for precision oncology, where cross-paragraph relation mentions are prevalent. Our method outperforms prior state of the art, such as multi-scale learning and graph neural networks, by over 20 absolute F1 points. The gain is particularly pronounced among the most challenging relation instances whose arguments never co-occur in a paragraph. △ Less

Submitted 11 September, 2021; originally announced September 2021.

Comments: EMNLP 2021

arXiv:2108.12376 [pdf, other]

Measurement of lepton-jet correlation in deep-inelastic scattering with the H1 detector using machine learning for unfolding

Authors: H1 Collaboration, V. Andreev, M. Arratia, A. Baghdasaryan, A. Baty, K. Begzsuren, A. Belousov, A. Bolz, V. Boudry, G. Brandt, D. Britzger, A. Buniatyan, L. Bystritskaya, A. J. Campbell, K. B. Cantun Avila, K. Cerny, V. Chekelian, Z. Chen, J. G. Contreras, L. Cunqueiro Mendez, J. Cvach, J. B. Dainton, K. Daum, A. Deshpande, C. Diaconu , et al. (120 additional authors not shown)

Abstract: The first measurement of lepton-jet momentum imbalance and azimuthal correlation in lepton-proton scattering at high momentum transfer is presented. These data, taken with the H1 detector at HERA, are corrected for detector effects using an unbinned machine learning algorithm OmniFold, which considers eight observables simultaneously in this first application. The unfolded cross sections are compa… ▽ More The first measurement of lepton-jet momentum imbalance and azimuthal correlation in lepton-proton scattering at high momentum transfer is presented. These data, taken with the H1 detector at HERA, are corrected for detector effects using an unbinned machine learning algorithm OmniFold, which considers eight observables simultaneously in this first application. The unfolded cross sections are compared to calculations performed within the context of collinear or transverse-momentum-dependent (TMD) factorization in Quantum Chromodynamics (QCD) as well as Monte Carlo event generators. The measurement probes a wide range of QCD phenomena, including TMD parton distribution functions and their evolution with energy in so far unexplored kinematic regions. △ Less

Submitted 1 April, 2022; v1 submitted 27 August, 2021; originally announced August 2021.

Comments: 17 pages, 7 figures, 4 tables, version accepted by PRL

Report number: DESY 21-130

arXiv:2106.13375 [pdf, other]

doi 10.1145/3447548.3469053

Domain-Specific Pretraining for Vertical Search: Case Study on Biomedical Literature

Authors: Yu Wang, Jinchao Li, Tristan Naumann, Chenyan Xiong, Hao Cheng, Robert Tinn, Cliff Wong, Naoto Usuyama, Richard Rogahn, Zhihong Shen, Yang Qin, Eric Horvitz, Paul N. Bennett, Jianfeng Gao, Hoifung Poon

Abstract: Information overload is a prevalent challenge in many high-value domains. A prominent case in point is the explosion of the biomedical literature on COVID-19, which swelled to hundreds of thousands of papers in a matter of months. In general, biomedical literature expands by two papers every minute, totalling over a million new papers every year. Search in the biomedical realm, and many other vert… ▽ More Information overload is a prevalent challenge in many high-value domains. A prominent case in point is the explosion of the biomedical literature on COVID-19, which swelled to hundreds of thousands of papers in a matter of months. In general, biomedical literature expands by two papers every minute, totalling over a million new papers every year. Search in the biomedical realm, and many other vertical domains is challenging due to the scarcity of direct supervision from click logs. Self-supervised learning has emerged as a promising direction to overcome the annotation bottleneck. We propose a general approach for vertical search based on domain-specific pretraining and present a case study for the biomedical domain. Despite being substantially simpler and not using any relevance labels for training or development, our method performs comparably or better than the best systems in the official TREC-COVID evaluation, a COVID-related biomedical search competition. Using distributed computing in modern cloud infrastructure, our system can scale to tens of millions of articles on PubMed and has been deployed as Microsoft Biomedical Search, a new search experience for biomedical literature: https://aka.ms/biomedsearch. △ Less

Submitted 16 September, 2021; v1 submitted 24 June, 2021; originally announced June 2021.

Comments: ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD) 2021 Applied Data Science Track

arXiv:2007.15779 [pdf, other]

doi 10.1145/3458754

Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing

Authors: Yu Gu, Robert Tinn, Hao Cheng, Michael Lucas, Naoto Usuyama, Xiaodong Liu, Tristan Naumann, Jianfeng Gao, Hoifung Poon

Abstract: Pretraining large neural language models, such as BERT, has led to impressive gains on many natural language processing (NLP) tasks. However, most pretraining efforts focus on general domain corpora, such as newswire and Web. A prevailing assumption is that even domain-specific pretraining can benefit by starting from general-domain language models. In this paper, we challenge this assumption by s… ▽ More Pretraining large neural language models, such as BERT, has led to impressive gains on many natural language processing (NLP) tasks. However, most pretraining efforts focus on general domain corpora, such as newswire and Web. A prevailing assumption is that even domain-specific pretraining can benefit by starting from general-domain language models. In this paper, we challenge this assumption by showing that for domains with abundant unlabeled text, such as biomedicine, pretraining language models from scratch results in substantial gains over continual pretraining of general-domain language models. To facilitate this investigation, we compile a comprehensive biomedical NLP benchmark from publicly-available datasets. Our experiments show that domain-specific pretraining serves as a solid foundation for a wide range of biomedical NLP tasks, leading to new state-of-the-art results across the board. Further, in conducting a thorough evaluation of modeling choices, both for pretraining and task-specific fine-tuning, we discover that some common practices are unnecessary with BERT models, such as using complex tagging schemes in named entity recognition (NER). To help accelerate research in biomedical NLP, we have released our state-of-the-art pretrained and task-specific models for the community, and created a leaderboard featuring our BLURB benchmark (short for Biomedical Language Understanding & Reasoning Benchmark) at https://aka.ms/BLURB. △ Less

Submitted 16 September, 2021; v1 submitted 30 July, 2020; originally announced July 2020.

Comments: ACM Transactions on Computing for Healthcare (HEALTH)

arXiv:2002.01584

ML4H Abstract Track 2019

Authors: Matthew B. A. McDermott, Emily Alsentzer, Sam Finlayson, Michael Oberst, Fabian Falck, Tristan Naumann, Brett K. Beaulieu-Jones, Adrian V. Dalca

Abstract: A collection of the accepted abstracts for the Machine Learning for Health (ML4H) workshop at NeurIPS 2019. This index is not complete, as some accepted abstracts chose to opt-out of inclusion. A collection of the accepted abstracts for the Machine Learning for Health (ML4H) workshop at NeurIPS 2019. This index is not complete, as some accepted abstracts chose to opt-out of inclusion. △ Less

Submitted 4 February, 2020; originally announced February 2020.

arXiv:1912.04370 [pdf, other]

Cross-Language Aphasia Detection using Optimal Transport Domain Adaptation

Authors: Aparna Balagopalan, Jekaterina Novikova, Matthew B. A. McDermott, Bret Nestor, Tristan Naumann, Marzyeh Ghassemi

Abstract: Multi-language speech datasets are scarce and often have small sample sizes in the medical domain. Robust transfer of linguistic features across languages could improve rates of early diagnosis and therapy for speakers of low-resource languages when detecting health conditions from speech. We utilize out-of-domain, unpaired, single-speaker, healthy speech data for training multiple Optimal Transpo… ▽ More Multi-language speech datasets are scarce and often have small sample sizes in the medical domain. Robust transfer of linguistic features across languages could improve rates of early diagnosis and therapy for speakers of low-resource languages when detecting health conditions from speech. We utilize out-of-domain, unpaired, single-speaker, healthy speech data for training multiple Optimal Transport (OT) domain adaptation systems. We learn mappings from other languages to English and detect aphasia from linguistic characteristics of speech, and show that OT domain adaptation improves aphasia detection over unilingual baselines for French (6% increased F1) and Mandarin (5% increased F1). Further, we show that adding aphasic data to the domain adaptation system significantly increases performance for both French and Mandarin, increasing the F1 scores further (10% and 8% increase in F1 scores for French and Mandarin, respectively, over unilingual baselines). △ Less

Submitted 4 December, 2019; originally announced December 2019.

Comments: Accepted to ML4H at NeurIPS 2019

arXiv:1908.00690 [pdf, other]

Feature Robustness in Non-stationary Health Records: Caveats to Deployable Model Performance in Common Clinical Machine Learning Tasks

Authors: Bret Nestor, Matthew B. A. McDermott, Willie Boag, Gabriela Berner, Tristan Naumann, Michael C. Hughes, Anna Goldenberg, Marzyeh Ghassemi

Abstract: When training clinical prediction models from electronic health records (EHRs), a key concern should be a model's ability to sustain performance over time when deployed, even as care practices, database systems, and population demographics evolve. Due to de-identification requirements, however, current experimental practices for public EHR benchmarks (such as the MIMIC-III critical care dataset) a… ▽ More When training clinical prediction models from electronic health records (EHRs), a key concern should be a model's ability to sustain performance over time when deployed, even as care practices, database systems, and population demographics evolve. Due to de-identification requirements, however, current experimental practices for public EHR benchmarks (such as the MIMIC-III critical care dataset) are time agnostic, assigning care records to train or test sets without regard for the actual dates of care. As a result, current benchmarks cannot assess how well models trained on one year generalise to another. In this work, we obtain a Limited Data Use Agreement to access year of care for each record in MIMIC and show that all tested state-of-the-art models decay in prediction quality when trained on historical data and tested on future data, particularly in response to a system-wide record-keeping change in 2008 (0.29 drop in AUROC for mortality prediction, 0.10 drop in AUROC for length-of-stay prediction with a random forest classifier). We further develop a simple yet effective mitigation strategy: by aggregating raw features into expert-defined clinical concepts, we see only a 0.06 drop in AUROC for mortality prediction and a 0.03 drop in AUROC for length-of-stay prediction. We demonstrate that this aggregation strategy outperforms other automatic feature preprocessing techniques aimed at increasing robustness to data drift. We release our aggregated representations and code to encourage more deployable clinical prediction models. △ Less

Submitted 1 August, 2019; originally announced August 2019.

arXiv:1907.08322 [pdf, other]

doi 10.1145/3368555.3384469

MIMIC-Extract: A Data Extraction, Preprocessing, and Representation Pipeline for MIMIC-III

Authors: Shirly Wang, Matthew B. A. McDermott, Geeticka Chauhan, Michael C. Hughes, Tristan Naumann, Marzyeh Ghassemi

Abstract: Robust machine learning relies on access to data that can be used with standardized frameworks in important tasks and the ability to develop models whose performance can be reasonably reproduced. In machine learning for healthcare, the community faces reproducibility challenges due to a lack of publicly accessible data and a lack of standardized data processing frameworks. We present MIMIC-Extract… ▽ More Robust machine learning relies on access to data that can be used with standardized frameworks in important tasks and the ability to develop models whose performance can be reasonably reproduced. In machine learning for healthcare, the community faces reproducibility challenges due to a lack of publicly accessible data and a lack of standardized data processing frameworks. We present MIMIC-Extract, an open-source pipeline for transforming raw electronic health record (EHR) data for critical care patients contained in the publicly-available MIMIC-III database into dataframes that are directly usable in common machine learning pipelines. MIMIC-Extract addresses three primary challenges in making complex health records data accessible to the broader machine learning community. First, it provides standardized data processing functions, including unit conversion, outlier detection, and aggregating semantically equivalent features, thus accounting for duplication and reducing missingness. Second, it preserves the time series nature of clinical data and can be easily integrated into clinically actionable prediction tasks in machine learning for health. Finally, it is highly extensible so that other researchers with related questions can easily use the same pipeline. We demonstrate the utility of this pipeline by showcasing several benchmark tasks and baseline results. △ Less

Submitted 19 August, 2020; v1 submitted 18 July, 2019; originally announced July 2019.

arXiv:1904.03323 [pdf, other]

Publicly Available Clinical BERT Embeddings

Authors: Emily Alsentzer, John R. Murphy, Willie Boag, Wei-Hung Weng, Di Jin, Tristan Naumann, Matthew B. A. McDermott

Abstract: Contextual word embedding models such as ELMo (Peters et al., 2018) and BERT (Devlin et al., 2018) have dramatically improved performance for many natural language processing (NLP) tasks in recent months. However, these models have been minimally explored on specialty corpora, such as clinical text; moreover, in the clinical domain, no publicly-available pre-trained BERT models yet exist. In this… ▽ More Contextual word embedding models such as ELMo (Peters et al., 2018) and BERT (Devlin et al., 2018) have dramatically improved performance for many natural language processing (NLP) tasks in recent months. However, these models have been minimally explored on specialty corpora, such as clinical text; moreover, in the clinical domain, no publicly-available pre-trained BERT models yet exist. In this work, we address this need by exploring and releasing BERT models for clinical text: one for generic clinical text and another for discharge summaries specifically. We demonstrate that using a domain-specific model yields performance improvements on three common clinical NLP tasks as compared to nonspecific embeddings. These domain-specific models are not as performant on two clinical de-identification tasks, and argue that this is a natural consequence of the differences between de-identified source text and synthetically non de-identified task text. △ Less

Submitted 20 June, 2019; v1 submitted 5 April, 2019; originally announced April 2019.

Comments: Clinical Natural Language Processing (ClinicalNLP) Workshop at NAACL 2019

arXiv:1812.02275 [pdf, other]

Generalizability of predictive models for intensive care unit patients

Authors: Alistair E. W. Johnson, Tom J. Pollard, Tristan Naumann

Abstract: A large volume of research has considered the creation of predictive models for clinical data; however, much existing literature reports results using only a single source of data. In this work, we evaluate the performance of models trained on the publicly-available eICU Collaborative Research Database. We show that cross-validation using many distinct centers provides a reasonable estimate of mod… ▽ More A large volume of research has considered the creation of predictive models for clinical data; however, much existing literature reports results using only a single source of data. In this work, we evaluate the performance of models trained on the publicly-available eICU Collaborative Research Database. We show that cross-validation using many distinct centers provides a reasonable estimate of model performance in new centers. We further show that a single model trained across centers transfers well to distinct hospitals, even compared to a model retrained using hospital-specific data. Our results motivate the use of multi-center datasets for model development and highlight the need for data sharing among hospitals to maximize model performance. △ Less

Submitted 5 December, 2018; originally announced December 2018.

Comments: Machine Learning for Health (ML4H) Workshop at NeurIPS 2018 arXiv:1811.07216

Report number: ML4H/2018/233

arXiv:1811.12583 [pdf, other]

Rethinking clinical prediction: Why machine learning must consider year of care and feature aggregation

Authors: Bret Nestor, Matthew B. A. McDermott, Geeticka Chauhan, Tristan Naumann, Michael C. Hughes, Anna Goldenberg, Marzyeh Ghassemi

Abstract: Machine learning for healthcare often trains models on de-identified datasets with randomly-shifted calendar dates, ignoring the fact that data were generated under hospital operation practices that change over time. These changing practices induce definitive changes in observed data which confound evaluations which do not account for dates and limit the generalisability of date-agnostic models. I… ▽ More Machine learning for healthcare often trains models on de-identified datasets with randomly-shifted calendar dates, ignoring the fact that data were generated under hospital operation practices that change over time. These changing practices induce definitive changes in observed data which confound evaluations which do not account for dates and limit the generalisability of date-agnostic models. In this work, we establish the magnitude of this problem on MIMIC, a public hospital dataset, and showcase a simple solution. We augment MIMIC with the year in which care was provided and show that a model trained using standard feature representations will significantly degrade in quality over time. We find a deterioration of 0.3 AUC when evaluating mortality prediction on data from 10 years later. We find a similar deterioration of 0.15 AUC for length-of-stay. In contrast, we demonstrate that clinically-oriented aggregates of raw features significantly mitigate future deterioration. Our suggested aggregated representations, when retrained yearly, have prediction quality comparable to year-agnostic models. △ Less

Submitted 29 November, 2018; originally announced November 2018.

Comments: Machine Learning for Health (ML4H) Workshop at NeurIPS 2018 arXiv:1811.07216

Report number: ML4H/2018/189

arXiv:1811.07216

Machine Learning for Health (ML4H) Workshop at NeurIPS 2018

Authors: Natalia Antropova, Andrew L. Beam, Brett K. Beaulieu-Jones, Irene Chen, Corey Chivers, Adrian Dalca, Sam Finlayson, Madalina Fiterau, Jason Alan Fries, Marzyeh Ghassemi, Mike Hughes, Bruno Jedynak, Jasvinder S. Kandola, Matthew McDermott, Tristan Naumann, Peter Schulam, Farah Shamout, Alexandre Yahi

Abstract: This volume represents the accepted submissions from the Machine Learning for Health (ML4H) workshop at the conference on Neural Information Processing Systems (NeurIPS) 2018, held on December 8, 2018 in Montreal, Canada. This volume represents the accepted submissions from the Machine Learning for Health (ML4H) workshop at the conference on Neural Information Processing Systems (NeurIPS) 2018, held on December 8, 2018 in Montreal, Canada. △ Less

Submitted 24 November, 2018; v1 submitted 17 November, 2018; originally announced November 2018.

Comments: Machine Learning for Health (ML4H) Workshop at NeurIPS 2018 arXiv:1811.07216

arXiv:1806.04820 [pdf]

Natural Language Processing for EHR-Based Computational Phenotyping

Authors: Zexian Zeng, Yu Deng, Xiaoyu Li, Tristan Naumann, Yuan Luo

Abstract: This article reviews recent advances in applying natural language processing (NLP) to Electronic Health Records (EHRs) for computational phenotyping. NLP-based computational phenotyping has numerous applications including diagnosis categorization, novel phenotype discovery, clinical trial screening, pharmacogenomics, drug-drug interaction (DDI) and adverse drug event (ADE) detection, as well as ge… ▽ More This article reviews recent advances in applying natural language processing (NLP) to Electronic Health Records (EHRs) for computational phenotyping. NLP-based computational phenotyping has numerous applications including diagnosis categorization, novel phenotype discovery, clinical trial screening, pharmacogenomics, drug-drug interaction (DDI) and adverse drug event (ADE) detection, as well as genome-wide and phenome-wide association studies. Significant progress has been made in algorithm development and resource construction for computational phenotyping. Among the surveyed methods, well-designed keyword search and rule-based systems often achieve good performance. However, the construction of keyword and rule lists requires significant manual effort, which is difficult to scale. Supervised machine learning models have been favored because they are capable of acquiring both classification patterns and structures from data. Recently, deep learning and unsupervised learning have received growing attention, with the former favored for its performance and the latter for its ability to find novel phenotypes. Integrating heterogeneous data sources have become increasingly important and have shown promise in improving model performance. Often better performance is achieved by combining multiple modalities of information. Despite these many advances, challenges and opportunities remain for NLP-based computational phenotyping, including better model interpretability and generalizability, and proper characterization of feature relations in clinical narratives △ Less

Submitted 14 June, 2018; v1 submitted 12 June, 2018; originally announced June 2018.

arXiv:1806.00397 [pdf, other]

Visualizing Patient Timelines in the Intensive Care Unit

Authors: Dina Levy-Lambert, Jen J. Gong, Tristan Naumann, Tom J. Pollard, John V. Guttag

Abstract: Electronic Health Records (EHRs) contain a large volume of heterogeneous patient data, which are useful at the point of care and for retrospective research. These data are typically stored in relational databases. Gaining an integrated view of these data for a single patient typically requires complex SQL queries joining multiple tables. In this work, we present a visualization tool that integrate… ▽ More Electronic Health Records (EHRs) contain a large volume of heterogeneous patient data, which are useful at the point of care and for retrospective research. These data are typically stored in relational databases. Gaining an integrated view of these data for a single patient typically requires complex SQL queries joining multiple tables. In this work, we present a visualization tool that integrates heterogeneous health care data (e.g., clinical notes, laboratory test values, vital signs) into a single timeline. We train risk models offline and dynamically generate and present their predictions alongside patient data. Our visualization is designed to enable users to understand the heterogeneous temporal data quickly and comprehensively, and to place the output of analytic models in the context of the underlying data. △ Less

Submitted 1 June, 2018; originally announced June 2018.

arXiv:1806.00388 [pdf]

A Review of Challenges and Opportunities in Machine Learning for Health

Authors: Marzyeh Ghassemi, Tristan Naumann, Peter Schulam, Andrew L. Beam, Irene Y. Chen, Rajesh Ranganath

Abstract: Modern electronic health records (EHRs) provide data to answer clinically meaningful questions. The growing data in EHRs makes healthcare ripe for the use of machine learning. However, learning in a clinical setting presents unique challenges that complicate the use of common machine learning methodologies. For example, diseases in EHRs are poorly labeled, conditions can encompass multiple underly… ▽ More Modern electronic health records (EHRs) provide data to answer clinically meaningful questions. The growing data in EHRs makes healthcare ripe for the use of machine learning. However, learning in a clinical setting presents unique challenges that complicate the use of common machine learning methodologies. For example, diseases in EHRs are poorly labeled, conditions can encompass multiple underlying endotypes, and healthy individuals are underrepresented. This article serves as a primer to illuminate these challenges and highlights opportunities for members of the machine learning community to contribute to healthcare. △ Less

Submitted 5 December, 2019; v1 submitted 1 June, 2018; originally announced June 2018.

Comments: Updated version

arXiv:1803.02728 [pdf, other]

Towards the Creation of a Large Corpus of Synthetically-Identified Clinical Notes

Authors: Willie Boag, Tristan Naumann, Peter Szolovits

Abstract: Clinical notes often describe the most important aspects of a patient's physiology and are therefore critical to medical research. However, these notes are typically inaccessible to researchers without prior removal of sensitive protected health information (PHI), a natural language processing (NLP) task referred to as deidentification. Tools to automatically de-identify clinical notes are needed… ▽ More Clinical notes often describe the most important aspects of a patient's physiology and are therefore critical to medical research. However, these notes are typically inaccessible to researchers without prior removal of sensitive protected health information (PHI), a natural language processing (NLP) task referred to as deidentification. Tools to automatically de-identify clinical notes are needed but are difficult to create without access to those very same notes containing PHI. This work presents a first step toward creating a large synthetically-identified corpus of clinical notes and corresponding PHI annotations in order to facilitate the development de-identification tools. Further, one such tool is evaluated against this corpus in order to understand the advantages and shortcomings of this approach. △ Less

Submitted 7 March, 2018; originally announced March 2018.

arXiv:1803.02245 [pdf, other]

CliNER 2.0: Accessible and Accurate Clinical Concept Extraction

Authors: Willie Boag, Elena Sergeeva, Saurabh Kulshreshtha, Peter Szolovits, Anna Rumshisky, Tristan Naumann

Abstract: Clinical notes often describe important aspects of a patient's stay and are therefore critical to medical research. Clinical concept extraction (CCE) of named entities - such as problems, tests, and treatments - aids in forming an understanding of notes and provides a foundation for many downstream clinical decision-making tasks. Historically, this task has been posed as a standard named entity re… ▽ More Clinical notes often describe important aspects of a patient's stay and are therefore critical to medical research. Clinical concept extraction (CCE) of named entities - such as problems, tests, and treatments - aids in forming an understanding of notes and provides a foundation for many downstream clinical decision-making tasks. Historically, this task has been posed as a standard named entity recognition (NER) sequence tagging problem, and solved with feature-based methods using handengineered domain knowledge. Recent advances, however, have demonstrated the efficacy of LSTM-based models for NER tasks, including CCE. This work presents CliNER 2.0, a simple-to-install, open-source tool for extracting concepts from clinical text. CliNER 2.0 uses a word- and character- level LSTM model, and achieves state-of-the-art performance. For ease of use, the tool also includes pre-trained models available for public use. △ Less

Submitted 6 March, 2018; originally announced March 2018.

arXiv:1709.07251 [pdf, other]

Determination of the strong coupling constant $α_s(M_Z)$ in next-to-next-to-leading order QCD using H1 jet cross section measurements

Authors: H1 collaboration, V. Andreev, A. Baghdasaryan, K. Begzsuren, A. Belousov, V. Bertone, A. Bolz, V. Boudry, G. Brandt, V. Brisson, D. Britzger, A. Buniatyan, A. Bylinkin, L. Bystritskaya, A. J. Campbell, K. B. Cantun Avila, K. Cerny, V. Chekelian, J. G. Contreras, J. Cvach, J. Currie, J. B. Dainton, K. Daum, C. Diaconu, M. Dobre , et al. (123 additional authors not shown)

Abstract: The strong coupling constant $α_s(M_Z)$ is determined from inclusive jet and dijet cross sections in neutral-current deep-inelastic $ep$ scattering (DIS) measured at HERA by the H1 collaboration using next-to-next-to-leading order (NNLO) QCD predictions. The dependence of the NNLO predictions and of the resulting value of $α_s(M_Z)$ at the $Z$-boson mass $m_Z$ are studied as a function of the choi… ▽ More The strong coupling constant $α_s(M_Z)$ is determined from inclusive jet and dijet cross sections in neutral-current deep-inelastic $ep$ scattering (DIS) measured at HERA by the H1 collaboration using next-to-next-to-leading order (NNLO) QCD predictions. The dependence of the NNLO predictions and of the resulting value of $α_s(M_Z)$ at the $Z$-boson mass $m_Z$ are studied as a function of the choice of the renormalisation and factorisation scales. Using inclusive jet and dijet data together, the strong coupling constant is determined to be $α_s(M_Z)=0.1166\,(19)_{\rm exp}\,(24)_{\rm th}$. Complementary, $α_s(M_Z)$ is determined together with parton distribution functions of the proton (PDFs) from jet and inclusive DIS data measured by the H1 experiment. The value $α_s(M_Z)=0.1147\,(25)_{\rm tot}$ obtained is consistent with the determination from jet data alone. The impact of the jet data on the PDFs is studied. The running of the strong coupling is tested at different values of the renormalisation scale and the results are found to be in agreement with expectations. △ Less

Submitted 16 June, 2021; v1 submitted 21 September, 2017; originally announced September 2017.

Comments: 45 pages, 17 figures, with changes discussed in an erratum submitted to EPJ C

Report number: DESY17-137

arXiv:1705.08863 [pdf, ps, other]

doi 10.1016/j.physletb.2017.11.002

Running of the Charm-Quark Mass from HERA Deep-Inelastic Scattering Data

Authors: A. Gizhko, A. Geiser, S. Moch, I. Abt, O. Behnke, A. Bertolin, J. Blümlein, D. Britzger, R. Brugnera, A. Buniatyan, P. J. Bussey, R. Carlin, A. M. Cooper-Sarkar, K. Daum, S. Dusini, E. Elsen, L. Favart, J. Feltesse, B. Foster, A. Garfagnini, M. Garzelli, J. Gayler, D. Haidt, J. Hladky, A. W. Jung , et al. (25 additional authors not shown)

Abstract: Combined HERA data on charm production in deep-inelastic scattering have previously been used to determine the charm-quark running mass $m_c(m_c)$ in the MSbar renormalisation scheme. Here, the same data are used as a function of the photon virtuality $Q^2$ to evaluate the charm-quark running mass at different scales to one-loop order, in the context of a next-to-leading order QCD analysis. The sc… ▽ More Combined HERA data on charm production in deep-inelastic scattering have previously been used to determine the charm-quark running mass $m_c(m_c)$ in the MSbar renormalisation scheme. Here, the same data are used as a function of the photon virtuality $Q^2$ to evaluate the charm-quark running mass at different scales to one-loop order, in the context of a next-to-leading order QCD analysis. The scale dependence of the mass is found to be consistent with QCD expectations. △ Less

Submitted 24 May, 2017; originally announced May 2017.

Comments: 12 pages, 4 figures

Report number: DESY-17-048

arXiv:0901.0512 [pdf]

Expected Performance of the ATLAS Experiment - Detector, Trigger and Physics

Authors: The ATLAS Collaboration, G. Aad, E. Abat, B. Abbott, J. Abdallah, A. A. Abdelalim, A. Abdesselam, O. Abdinov, B. Abi, M. Abolins, H. Abramowicz, B. S. Acharya, D. L. Adams, T. N. Addy, C. Adorisio, P. Adragna, T. Adye, J. A. Aguilar-Saavedra, M. Aharrouche, S. P. Ahlen, F. Ahles, A. Ahmad, H. Ahmed, G. Aielli, T. Akdogan , et al. (2587 additional authors not shown)

Abstract: A detailed study is presented of the expected performance of the ATLAS detector. The reconstruction of tracks, leptons, photons, missing energy and jets is investigated, together with the performance of b-tagging and the trigger. The physics potential for a variety of interesting physics processes, within the Standard Model and beyond, is examined. The study comprises a series of notes based on… ▽ More A detailed study is presented of the expected performance of the ATLAS detector. The reconstruction of tracks, leptons, photons, missing energy and jets is investigated, together with the performance of b-tagging and the trigger. The physics potential for a variety of interesting physics processes, within the Standard Model and beyond, is examined. The study comprises a series of notes based on simulations of the detector and physics processes, with particular emphasis given to the data expected from the first years of operation of the LHC at CERN. △ Less

Submitted 14 August, 2009; v1 submitted 28 December, 2008; originally announced January 2009.

arXiv:hep-ph/9605276 [pdf, ps, other]

doi 10.1016/0370-2693(96)00976-8

On the Asymptotic Behaviour of F_2(x,Q^2)

Authors: A. DeRoeck, M. Klein, Th. Naumann

Abstract: We discuss how the proton structure function F2 is described in the HERA kinematic range by double asymptotic expressions for low x and large Q^2. From a NLO double asymptotic approximation of recent data from the H1 experiment at HERA we extract the strong coupling constant alpha_S(M^2_Z)=0.113+/-0.002(stat)+/-0.007(syst). The additional theoretical error can be as large as 0.007. We discuss how the proton structure function F2 is described in the HERA kinematic range by double asymptotic expressions for low x and large Q^2. From a NLO double asymptotic approximation of recent data from the H1 experiment at HERA we extract the strong coupling constant alpha_S(M^2_Z)=0.113+/-0.002(stat)+/-0.007(syst). The additional theoretical error can be as large as 0.007. △ Less

Submitted 10 May, 1996; originally announced May 1996.

Comments: 4 pages, latex, 1 Figures appended as uuencoded file

Report number: DESY 96-063

Journal ref: Phys.Lett. B385 (1996) 411-414

Showing 1–47 of 47 results for author: Naumann, T