subscribe to arXiv mailings

Gemma: Open Models Based on Gemini Research and Technology

Authors: Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, Pouya Tafti, Léonard Hussenot, Pier Giuseppe Sessa, Aakanksha Chowdhery, Adam Roberts, Aditya Barua, Alex Botev, Alex Castro-Ros, Ambrose Slone, Amélie Héliou, Andrea Tacchetti, Anna Bulanova, Antonia Paterson, Beth Tsai, Bobak Shahriari , et al. (83 additional authors not shown)

Abstract: This work introduces Gemma, a family of lightweight, state-of-the art open models built from the research and technology used to create Gemini models. Gemma models demonstrate strong performance across academic benchmarks for language understanding, reasoning, and safety. We release two sizes of models (2 billion and 7 billion parameters), and provide both pretrained and fine-tuned checkpoints. Ge… ▽ More This work introduces Gemma, a family of lightweight, state-of-the art open models built from the research and technology used to create Gemini models. Gemma models demonstrate strong performance across academic benchmarks for language understanding, reasoning, and safety. We release two sizes of models (2 billion and 7 billion parameters), and provide both pretrained and fine-tuned checkpoints. Gemma outperforms similarly sized open models on 11 out of 18 text-based tasks, and we present comprehensive evaluations of safety and responsibility aspects of the models, alongside a detailed description of model development. We believe the responsible release of LLMs is critical for improving the safety of frontier models, and for enabling the next wave of LLM innovations. △ Less

Submitted 16 April, 2024; v1 submitted 13 March, 2024; originally announced March 2024.

arXiv:2403.05530 [pdf, other]

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Authors: Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, Soroosh Mariooryad, Yifan Ding, Xinyang Geng, Fred Alcober, Roy Frostig, Mark Omernick, Lexi Walker, Cosmin Paduraru, Christina Sorokin, Andrea Tacchetti, Colin Gaffney, Samira Daruki, Olcan Sercinoglu, Zach Gleicher, Juliette Love , et al. (1092 additional authors not shown)

Abstract: In this report, we introduce the Gemini 1.5 family of models, representing the next generation of highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. The family includes two new models: (1) an updated Gemini 1.5 Pro, which exceeds the February… ▽ More In this report, we introduce the Gemini 1.5 family of models, representing the next generation of highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. The family includes two new models: (1) an updated Gemini 1.5 Pro, which exceeds the February version on the great majority of capabilities and benchmarks; (2) Gemini 1.5 Flash, a more lightweight variant designed for efficiency with minimal regression in quality. Gemini 1.5 models achieve near-perfect recall on long-context retrieval tasks across modalities, improve the state-of-the-art in long-document QA, long-video QA and long-context ASR, and match or surpass Gemini 1.0 Ultra's state-of-the-art performance across a broad set of benchmarks. Studying the limits of Gemini 1.5's long-context ability, we find continued improvement in next-token prediction and near-perfect retrieval (>99%) up to at least 10M tokens, a generational leap over existing models such as Claude 3.0 (200k) and GPT-4 Turbo (128k). Finally, we highlight real-world use cases, such as Gemini 1.5 collaborating with professionals on completing their tasks achieving 26 to 75% time savings across 10 different job categories, as well as surprising new capabilities of large language models at the frontier; when given a grammar manual for Kalamang, a language with fewer than 200 speakers worldwide, the model learns to translate English to Kalamang at a similar level to a person who learned from the same content. △ Less

Submitted 14 June, 2024; v1 submitted 8 March, 2024; originally announced March 2024.

arXiv:2403.02164 [pdf]

Cognition is All You Need -- The Next Layer of AI Above Large Language Models

Authors: Nova Spivack, Sam Douglas, Michelle Crames, Tim Connors

Abstract: Recent studies of the applications of conversational AI tools, such as chatbots powered by large language models, to complex real-world knowledge work have shown limitations related to reasoning and multi-step problem solving. Specifically, while existing chatbots simulate shallow reasoning and understanding they are prone to errors as problem complexity increases. The failure of these systems to… ▽ More Recent studies of the applications of conversational AI tools, such as chatbots powered by large language models, to complex real-world knowledge work have shown limitations related to reasoning and multi-step problem solving. Specifically, while existing chatbots simulate shallow reasoning and understanding they are prone to errors as problem complexity increases. The failure of these systems to address complex knowledge work is due to the fact that they do not perform any actual cognition. In this position paper, we present Cognitive AI, a higher-level framework for implementing programmatically defined neuro-symbolic cognition above and outside of large language models. Specifically, we propose a dual-layer functional architecture for Cognitive AI that serves as a roadmap for AI systems that can perform complex multi-step knowledge work. We propose that Cognitive AI is a necessary precursor for the evolution of higher forms of AI, such as AGI, and specifically claim that AGI cannot be achieved by probabilistic approaches on their own. We conclude with a discussion of the implications for large language models, adoption cycles in AI, and commercial Cognitive AI development. △ Less

Submitted 5 March, 2024; v1 submitted 4 March, 2024; originally announced March 2024.

Comments: 63 pages, 18 figures

ACM Class: I.2.0

arXiv:2312.11805 [pdf, other]

Gemini: A Family of Highly Capable Multimodal Models

Authors: Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Millican, David Silver, Melvin Johnson, Ioannis Antonoglou, Julian Schrittwieser, Amelia Glaese, Jilin Chen, Emily Pitler, Timothy Lillicrap, Angeliki Lazaridou, Orhan Firat, James Molloy, Michael Isard, Paul R. Barham, Tom Hennigan, Benjamin Lee , et al. (1325 additional authors not shown)

Abstract: This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultr… ▽ More This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultra model advances the state of the art in 30 of 32 of these benchmarks - notably being the first model to achieve human-expert performance on the well-studied exam benchmark MMLU, and improving the state of the art in every one of the 20 multimodal benchmarks we examined. We believe that the new capabilities of the Gemini family in cross-modal reasoning and language understanding will enable a wide variety of use cases. We discuss our approach toward post-training and deploying Gemini models responsibly to users through services including Gemini, Gemini Advanced, Google AI Studio, and Cloud Vertex AI. △ Less

Submitted 17 June, 2024; v1 submitted 18 December, 2023; originally announced December 2023.

arXiv:2312.02179 [pdf, other]

Training Chain-of-Thought via Latent-Variable Inference

Authors: Du Phan, Matthew D. Hoffman, David Dohan, Sholto Douglas, Tuan Anh Le, Aaron Parisi, Pavel Sountsov, Charles Sutton, Sharad Vikram, Rif A. Saurous

Abstract: Large language models (LLMs) solve problems more accurately and interpretably when instructed to work out the answer step by step using a ``chain-of-thought'' (CoT) prompt. One can also improve LLMs' performance on a specific task by supervised fine-tuning, i.e., by using gradient ascent on some tunable parameters to maximize the average log-likelihood of correct answers from a labeled training se… ▽ More Large language models (LLMs) solve problems more accurately and interpretably when instructed to work out the answer step by step using a ``chain-of-thought'' (CoT) prompt. One can also improve LLMs' performance on a specific task by supervised fine-tuning, i.e., by using gradient ascent on some tunable parameters to maximize the average log-likelihood of correct answers from a labeled training set. Naively combining CoT with supervised tuning requires supervision not just of the correct answers, but also of detailed rationales that lead to those answers; these rationales are expensive to produce by hand. Instead, we propose a fine-tuning strategy that tries to maximize the \emph{marginal} log-likelihood of generating a correct answer using CoT prompting, approximately averaging over all possible rationales. The core challenge is sampling from the posterior over rationales conditioned on the correct answer; we address it using a simple Markov-chain Monte Carlo (MCMC) expectation-maximization (EM) algorithm inspired by the self-taught reasoner (STaR), memoized wake-sleep, Markovian score climbing, and persistent contrastive divergence. This algorithm also admits a novel control-variate technique that drives the variance of our gradient estimates to zero as the model improves. Applying our technique to GSM8K and the tasks in BIG-Bench Hard, we find that this MCMC-EM fine-tuning technique typically improves the model's accuracy on held-out examples more than STaR or prompt-tuning with or without CoT. △ Less

Submitted 28 November, 2023; originally announced December 2023.

Comments: 23 pages, 37th Conference on Neural Information Processing Systems (NeurIPS 2023)

arXiv:2307.08753 [pdf, other]

A Novel Application of Conditional Normalizing Flows: Stellar Age Inference with Gyrochronology

Authors: Phil Van-Lane, Joshua S. Speagle, Stephanie Douglas

Abstract: Stellar ages are critical building blocks of evolutionary models, but challenging to measure for low mass main sequence stars. An unexplored solution in this regime is the application of probabilistic machine learning methods to gyrochronology, a stellar dating technique that is uniquely well suited for these stars. While accurate analytical gyrochronological models have proven challenging to deve… ▽ More Stellar ages are critical building blocks of evolutionary models, but challenging to measure for low mass main sequence stars. An unexplored solution in this regime is the application of probabilistic machine learning methods to gyrochronology, a stellar dating technique that is uniquely well suited for these stars. While accurate analytical gyrochronological models have proven challenging to develop, here we apply conditional normalizing flows to photometric data from open star clusters, and demonstrate that a data-driven approach can constrain gyrochronological ages with a precision comparable to other standard techniques. We evaluate the flow results in the context of a Bayesian framework, and show that our inferred ages recover literature values well. This work demonstrates the potential of a probabilistic data-driven solution to widen the applicability of gyrochronological stellar dating. △ Less

Submitted 17 July, 2023; originally announced July 2023.

Comments: Accepted at the ICML 2023 Workshop on Machine Learning for Astrophysics. 10 pages, 3 figures (+1 in appendices)

ACM Class: J.2.0

arXiv:2211.05102 [pdf, other]

Efficiently Scaling Transformer Inference

Authors: Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Anselm Levskaya, Jonathan Heek, Kefan Xiao, Shivani Agrawal, Jeff Dean

Abstract: We study the problem of efficient generative inference for Transformer models, in one of its most challenging settings: large deep models, with tight latency targets and long sequence lengths. Better understanding of the engineering tradeoffs for inference for large Transformer-based models is important as use cases of these models are growing rapidly throughout application areas. We develop a sim… ▽ More We study the problem of efficient generative inference for Transformer models, in one of its most challenging settings: large deep models, with tight latency targets and long sequence lengths. Better understanding of the engineering tradeoffs for inference for large Transformer-based models is important as use cases of these models are growing rapidly throughout application areas. We develop a simple analytical model for inference efficiency to select the best multi-dimensional partitioning techniques optimized for TPU v4 slices based on the application requirements. We combine these with a suite of low-level optimizations to achieve a new Pareto frontier on the latency and model FLOPS utilization (MFU) tradeoffs on 500B+ parameter models that outperforms the FasterTransformer suite of benchmarks. We further show that with appropriate partitioning, the lower memory requirements of multiquery attention (i.e. multiple query heads share single key/value head) enables scaling up to 32x larger context lengths. Finally, we achieve a low-batch-size latency of 29ms per token during generation (using int8 weight quantization) and a 76% MFU during large-batch-size processing of input tokens, while supporting a long 2048-token context length on the PaLM 540B parameter model. △ Less

Submitted 9 November, 2022; originally announced November 2022.

arXiv:2112.04495 [pdf, other]

doi 10.1016/j.media.2022.102730

Dynamic multi feature-class Gaussian process models

Authors: Jean-Rassaire Fouefack, Bhushan Borotikar, Marcel Lüthi, Tania S. Douglas, Valérie Burdin, Tinashe E. M. Mutsvangwa

Abstract: In model-based medical image analysis, three features of interest are the shape of structures of interest, their relative pose, and image intensity profiles representative of some physical property. Often, these are modelled separately through statistical models by decomposing the object's features into a set of basis functions through principal geodesic analysis or principal component analysis. T… ▽ More In model-based medical image analysis, three features of interest are the shape of structures of interest, their relative pose, and image intensity profiles representative of some physical property. Often, these are modelled separately through statistical models by decomposing the object's features into a set of basis functions through principal geodesic analysis or principal component analysis. This study presents a statistical modelling method for automatic learning of shape, pose and intensity features in medical images which we call the Dynamic multi feature-class Gaussian process models (DMFC-GPM). A DMFC-GPM is a Gaussian process (GP)-based model with a shared latent space that encodes linear and non-linear variation. Our method is defined in a continuous domain with a principled way to represent shape, pose and intensity feature classes in a linear space, based on deformation fields. A deformation field-based metric is adapted in the method for modelling shape and intensity feature variation as well as for comparing rigid transformations (pose). Moreover, DMFC-GPMs inherit properties intrinsic to GPs including marginalisation and regression. Furthermore, they allow for adding additional pose feature variability on top of those obtained from the image acquisition process; what we term as permutation modelling. For image analysis tasks using DMFC-GPMs, we adapt Metropolis-Hastings algorithms making the prediction of features fully probabilistic. We validate the method using controlled synthetic data and we perform experiments on bone structures from CT images of the shoulder to illustrate the efficacy of the model for pose and shape feature prediction. The model performance results suggest that this new modelling paradigm is robust, accurate, accessible, and has potential applications including the management of musculoskeletal disorders and clinical decision making △ Less

Submitted 8 December, 2021; originally announced December 2021.

Comments: 16

arXiv:2103.15569 [pdf, other]

Risk Bounds for Learning via Hilbert Coresets

Authors: Spencer Douglas, Piyush Kumar, R. K. Prasanth

Abstract: We develop a formalism for constructing stochastic upper bounds on the expected full sample risk for supervised classification tasks via the Hilbert coresets approach within a transductive framework. We explicitly compute tight and meaningful bounds for complex datasets and complex hypothesis classes such as state-of-the-art deep neural network architectures. The bounds we develop exhibit nice pro… ▽ More We develop a formalism for constructing stochastic upper bounds on the expected full sample risk for supervised classification tasks via the Hilbert coresets approach within a transductive framework. We explicitly compute tight and meaningful bounds for complex datasets and complex hypothesis classes such as state-of-the-art deep neural network architectures. The bounds we develop exhibit nice properties: i) the bounds are non-uniform in the hypothesis space, ii) in many practical examples, the bounds become effectively deterministic by appropriate choice of prior and training data-dependent posterior distributions on the hypothesis space, and iii) the bounds become significantly better with increase in the size of the training set. We also lay out some ideas to explore for future research. △ Less

Submitted 29 March, 2021; originally announced March 2021.

Comments: 16 pages, 2 figures

ACM Class: F.2.1; F.2.3

arXiv:2101.11775 [pdf, ps, other]

Moral and Social Ramifications of Autonomous Vehicles

Authors: Veljko Dubljević, Sean Douglas, Jovan Milojevich, Nirav Ajmeri, William A. Bauer, George F. List, Munindar P. Singh

Abstract: Autonomous Vehicles (AVs) raise important social and ethical concerns, especially about accountability, dignity, and justice. We focus on the specific concerns arising from how AV technology will affect the lives and livelihoods of professional and semi-professional drivers. Whereas previous studies of such concerns have focused on the opinions of experts, we seek to understand these ethical and s… ▽ More Autonomous Vehicles (AVs) raise important social and ethical concerns, especially about accountability, dignity, and justice. We focus on the specific concerns arising from how AV technology will affect the lives and livelihoods of professional and semi-professional drivers. Whereas previous studies of such concerns have focused on the opinions of experts, we seek to understand these ethical and societal challenges from the perspectives of the drivers themselves. To this end, we adopted a qualitative research methodology based on semi-structured interviews. This is an established social science methodology that helps understand the core concerns of stakeholders in depth by avoiding the biases of superficial methods such as surveys. We find that whereas drivers agree with the experts that AVs will significantly impact transportation systems, they are apprehensive about the prospects for their livelihoods and dismiss the suggestions that driving jobs are unsatisfying and their profession does not merit protection. By showing how drivers differ from the experts, our study has ramifications beyond AVs to AI and other advanced technologies. Our findings suggest that qualitative research applied to the relevant, especially disempowered, stakeholders is essential to ensuring that new technologies are introduced ethically. △ Less

Submitted 29 January, 2021; v1 submitted 27 January, 2021; originally announced January 2021.

arXiv:2001.07904 [pdf, other]

doi 10.1007/978-3-030-59719-1_73

Dynamic multi-object Gaussian process models: A framework for data-driven functional modelling of human joints

Authors: Jean-Rassaire Fouefack, Bhushan Borotikar, Tania S. Douglas, Valérie Burdin, Tinashe E. M. Mutsvangwa

Abstract: Statistical shape models (SSMs) are state-of-the-art medical image analysis tools for extracting and explaining features across a set of biological structures. However, a principled and robust way to combine shape and pose features has been illusive due to three main issues: 1) Non-homogeneity of the data (data with linear and non-linear natural variation across features), 2) non-optimal represent… ▽ More Statistical shape models (SSMs) are state-of-the-art medical image analysis tools for extracting and explaining features across a set of biological structures. However, a principled and robust way to combine shape and pose features has been illusive due to three main issues: 1) Non-homogeneity of the data (data with linear and non-linear natural variation across features), 2) non-optimal representation of the $3D$ motion (rigid transformation representations that are not proportional to the kinetic energy that move an object from one position to the other), and 3) artificial discretization of the models. In this paper, we propose a new framework for dynamic multi-object statistical modelling framework for the analysis of human joints in a continuous domain. Specifically, we propose to normalise shape and dynamic spatial features in the same linearized statistical space permitting the use of linear statistics; we adopt an optimal 3D motion representation for more accurate rigid transformation comparisons; and we provide a 3D shape and pose prediction protocol using a Markov chain Monte Carlo sampling-based fitting. The framework affords an efficient generative dynamic multi-object modelling platform for biological joints. We validate the framework using a controlled synthetic data. Finally, the framework is applied to an analysis of the human shoulder joint to compare its performance with standard SSM approaches in prediction of shape while adding the advantage of determining relative pose between bones in a complex. Excellent validity is observed and the shoulder joint shape-pose prediction results suggest that the novel framework may have utility for a range of medical image analysis applications. Furthermore, the framework is generic and can be extended to n$>$2 objects, making it suitable for clinical and diagnostic methods for the management of joint disorders. △ Less

Submitted 22 January, 2020; originally announced January 2020.

Comments: 15 pages, 14 figures

arXiv:1812.05981 [pdf, other]

Why ReLU Units Sometimes Die: Analysis of Single-Unit Error Backpropagation in Neural Networks

Authors: Scott C. Douglas, Jiutian Yu

Abstract: Recently, neural networks in machine learning use rectified linear units (ReLUs) in early processing layers for better performance. Training these structures sometimes results in "dying ReLU units" with near-zero outputs. We first explore this condition via simulation using the CIFAR-10 dataset and variants of two popular convolutive neural network architectures. Our explorations show that the out… ▽ More Recently, neural networks in machine learning use rectified linear units (ReLUs) in early processing layers for better performance. Training these structures sometimes results in "dying ReLU units" with near-zero outputs. We first explore this condition via simulation using the CIFAR-10 dataset and variants of two popular convolutive neural network architectures. Our explorations show that the output activation probability Pr[y>0] is generally less than 0.5 at system convergence for layers that do not employ skip connections, and this activation probability tends to decrease as one progresses from input layer to output layer. Employing a simplified model of a single ReLU unit trained by a variant of error backpropagation, we then perform a statistical convergence analysis to explore the model's evolutionary behavior. Our analysis describes the potentially-slower convergence speeds of dying ReLU units, and this issue can occur regardless of how the weights are initialized. △ Less

Submitted 14 December, 2018; originally announced December 2018.

Comments: 5 pages, 7 figures, Proc. 52nd Asilomar Conference on Signals, Systems, and Computers, Pacific Grove, CA, October 2018

arXiv:cs/0410009 [pdf, ps, other]

Diffusive Load Balancing of Loosely-Synchronous Parallel Programs over Peer-to-Peer Networks

Authors: Scott Douglas, Aaron Harwood

Abstract: The use of under-utilized Internet resources is widely recognized as a viable form of high performance computing. Sustained processing power of roughly 40T FLOPS using 4 million volunteered Internet hosts has been reported for embarrassingly parallel problems. At the same time, peer-to-peer (P2P) file sharing networks, with more than 50 million participants, have demonstrated the capacity for sc… ▽ More The use of under-utilized Internet resources is widely recognized as a viable form of high performance computing. Sustained processing power of roughly 40T FLOPS using 4 million volunteered Internet hosts has been reported for embarrassingly parallel problems. At the same time, peer-to-peer (P2P) file sharing networks, with more than 50 million participants, have demonstrated the capacity for scale in distributed systems. This paper contributes a study of load balancing techniques for a general class of loosely-synchronous parallel algorithms when executed over a P2P network. We show that decentralized, diffusive load balancing can be effective at balancing load and is facilitated by the dynamic properties of P2P. While a moderate degree of dynamicity can benefit load balancing, significant dynamicity hinders the parallel program performance due to the need for increased load migration. To the best of our knowledge this study provides new insight into the performance of loosely-synchronous parallel programs over the Internet. △ Less

Submitted 5 October, 2004; originally announced October 2004.

Comments: 14 pages with 10 figures

Showing 1–13 of 13 results for author: Douglas, S