-
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone
Authors:
Marah Abdin,
Sam Ade Jacobs,
Ammar Ahmad Awan,
Jyoti Aneja,
Ahmed Awadallah,
Hany Awadalla,
Nguyen Bach,
Amit Bahree,
Arash Bakhtiari,
Jianmin Bao,
Harkirat Behl,
Alon Benhaim,
Misha Bilenko,
Johan Bjorck,
Sébastien Bubeck,
Qin Cai,
Martin Cai,
Caio César Teodoro Mendes,
Weizhu Chen,
Vishrav Chaudhary,
Dong Chen,
Dongdong Chen,
Yen-Chun Chen,
Yi-Ling Chen,
Parul Chopra
, et al. (90 additional authors not shown)
Abstract:
We introduce phi-3-mini, a 3.8 billion parameter language model trained on 3.3 trillion tokens, whose overall performance, as measured by both academic benchmarks and internal testing, rivals that of models such as Mixtral 8x7B and GPT-3.5 (e.g., phi-3-mini achieves 69% on MMLU and 8.38 on MT-bench), despite being small enough to be deployed on a phone. The innovation lies entirely in our dataset…
▽ More
We introduce phi-3-mini, a 3.8 billion parameter language model trained on 3.3 trillion tokens, whose overall performance, as measured by both academic benchmarks and internal testing, rivals that of models such as Mixtral 8x7B and GPT-3.5 (e.g., phi-3-mini achieves 69% on MMLU and 8.38 on MT-bench), despite being small enough to be deployed on a phone. The innovation lies entirely in our dataset for training, a scaled-up version of the one used for phi-2, composed of heavily filtered publicly available web data and synthetic data. The model is also further aligned for robustness, safety, and chat format. We also provide some initial parameter-scaling results with a 7B and 14B models trained for 4.8T tokens, called phi-3-small and phi-3-medium, both significantly more capable than phi-3-mini (e.g., respectively 75% and 78% on MMLU, and 8.7 and 8.9 on MT-bench). Moreover, we also introduce phi-3-vision, a 4.2 billion parameter model based on phi-3-mini with strong reasoning capabilities for image and text prompts.
△ Less
Submitted 23 May, 2024; v1 submitted 22 April, 2024;
originally announced April 2024.
-
Language Is Not All You Need: Aligning Perception with Language Models
Authors:
Shaohan Huang,
Li Dong,
Wenhui Wang,
Yaru Hao,
Saksham Singhal,
Shuming Ma,
Tengchao Lv,
Lei Cui,
Owais Khan Mohammed,
Barun Patra,
Qiang Liu,
Kriti Aggarwal,
Zewen Chi,
Johan Bjorck,
Vishrav Chaudhary,
Subhojit Som,
Xia Song,
Furu Wei
Abstract:
A big convergence of language, multimodal perception, action, and world modeling is a key step toward artificial general intelligence. In this work, we introduce Kosmos-1, a Multimodal Large Language Model (MLLM) that can perceive general modalities, learn in context (i.e., few-shot), and follow instructions (i.e., zero-shot). Specifically, we train Kosmos-1 from scratch on web-scale multimodal co…
▽ More
A big convergence of language, multimodal perception, action, and world modeling is a key step toward artificial general intelligence. In this work, we introduce Kosmos-1, a Multimodal Large Language Model (MLLM) that can perceive general modalities, learn in context (i.e., few-shot), and follow instructions (i.e., zero-shot). Specifically, we train Kosmos-1 from scratch on web-scale multimodal corpora, including arbitrarily interleaved text and images, image-caption pairs, and text data. We evaluate various settings, including zero-shot, few-shot, and multimodal chain-of-thought prompting, on a wide range of tasks without any gradient updates or finetuning. Experimental results show that Kosmos-1 achieves impressive performance on (i) language understanding, generation, and even OCR-free NLP (directly fed with document images), (ii) perception-language tasks, including multimodal dialogue, image captioning, visual question answering, and (iii) vision tasks, such as image recognition with descriptions (specifying classification via text instructions). We also show that MLLMs can benefit from cross-modal transfer, i.e., transfer knowledge from language to multimodal, and from multimodal to language. In addition, we introduce a dataset of Raven IQ test, which diagnoses the nonverbal reasoning capability of MLLMs.
△ Less
Submitted 1 March, 2023; v1 submitted 27 February, 2023;
originally announced February 2023.
-
Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks
Authors:
Wenhui Wang,
Hangbo Bao,
Li Dong,
Johan Bjorck,
Zhiliang Peng,
Qiang Liu,
Kriti Aggarwal,
Owais Khan Mohammed,
Saksham Singhal,
Subhojit Som,
Furu Wei
Abstract:
A big convergence of language, vision, and multimodal pretraining is emerging. In this work, we introduce a general-purpose multimodal foundation model BEiT-3, which achieves state-of-the-art transfer performance on both vision and vision-language tasks. Specifically, we advance the big convergence from three aspects: backbone architecture, pretraining task, and model scaling up. We introduce Mult…
▽ More
A big convergence of language, vision, and multimodal pretraining is emerging. In this work, we introduce a general-purpose multimodal foundation model BEiT-3, which achieves state-of-the-art transfer performance on both vision and vision-language tasks. Specifically, we advance the big convergence from three aspects: backbone architecture, pretraining task, and model scaling up. We introduce Multiway Transformers for general-purpose modeling, where the modular architecture enables both deep fusion and modality-specific encoding. Based on the shared backbone, we perform masked "language" modeling on images (Imglish), texts (English), and image-text pairs ("parallel sentences") in a unified manner. Experimental results show that BEiT-3 obtains state-of-the-art performance on object detection (COCO), semantic segmentation (ADE20K), image classification (ImageNet), visual reasoning (NLVR2), visual question answering (VQAv2), image captioning (COCO), and cross-modal retrieval (Flickr30K, COCO).
△ Less
Submitted 30 August, 2022; v1 submitted 22 August, 2022;
originally announced August 2022.
-
Is High Variance Unavoidable in RL? A Case Study in Continuous Control
Authors:
Johan Bjorck,
Carla P. Gomes,
Kilian Q. Weinberger
Abstract:
Reinforcement learning (RL) experiments have notoriously high variance, and minor details can have disproportionately large effects on measured outcomes. This is problematic for creating reproducible research and also serves as an obstacle for real-world applications, where safety and predictability are paramount. In this paper, we investigate causes for this perceived instability. To allow for an…
▽ More
Reinforcement learning (RL) experiments have notoriously high variance, and minor details can have disproportionately large effects on measured outcomes. This is problematic for creating reproducible research and also serves as an obstacle for real-world applications, where safety and predictability are paramount. In this paper, we investigate causes for this perceived instability. To allow for an in-depth analysis, we focus on a specifically popular setup with high variance -- continuous control from pixels with an actor-critic agent. In this setting, we demonstrate that variance mostly arises early in training as a result of poor "outlier" runs, but that weight initialization and initial exploration are not to blame. We show that one cause for early variance is numerical instability which leads to saturating nonlinearities. We investigate several fixes to this issue and find that one particular method is surprisingly effective and simple -- normalizing penultimate features. Addressing the learning instability allows for larger learning rates, and significantly decreases the variance of outcomes. This demonstrates that the perceived variance in RL is not necessarily inherent to the problem definition and may be addressed through simple architectural modifications.
△ Less
Submitted 5 February, 2022; v1 submitted 21 October, 2021;
originally announced October 2021.
-
Towards Deeper Deep Reinforcement Learning with Spectral Normalization
Authors:
Johan Bjorck,
Carla P. Gomes,
Kilian Q. Weinberger
Abstract:
In computer vision and natural language processing, innovations in model architecture that increase model capacity have reliably translated into gains in performance. In stark contrast with this trend, state-of-the-art reinforcement learning (RL) algorithms often use small MLPs, and gains in performance typically originate from algorithmic innovations. It is natural to hypothesize that small datas…
▽ More
In computer vision and natural language processing, innovations in model architecture that increase model capacity have reliably translated into gains in performance. In stark contrast with this trend, state-of-the-art reinforcement learning (RL) algorithms often use small MLPs, and gains in performance typically originate from algorithmic innovations. It is natural to hypothesize that small datasets in RL necessitate simple models to avoid overfitting; however, this hypothesis is untested. In this paper we investigate how RL agents are affected by exchanging the small MLPs with larger modern networks with skip connections and normalization, focusing specifically on actor-critic algorithms. We empirically verify that naively adopting such architectures leads to instabilities and poor performance, likely contributing to the popularity of simple models in practice. However, we show that dataset size is not the limiting factor, and instead argue that instability from taking gradients through the critic is the culprit. We demonstrate that spectral normalization (SN) can mitigate this issue and enable stable training with large modern architectures. After smoothing with SN, larger models yield significant performance improvements -- suggesting that more "easy" gains may be had by focusing on model architectures in addition to algorithmic innovations.
△ Less
Submitted 3 January, 2022; v1 submitted 2 June, 2021;
originally announced June 2021.
-
Low-Precision Reinforcement Learning: Running Soft Actor-Critic in Half Precision
Authors:
Johan Bjorck,
Xiangyu Chen,
Christopher De Sa,
Carla P. Gomes,
Kilian Q. Weinberger
Abstract:
Low-precision training has become a popular approach to reduce compute requirements, memory footprint, and energy consumption in supervised learning. In contrast, this promising approach has not yet enjoyed similarly widespread adoption within the reinforcement learning (RL) community, partly because RL agents can be notoriously hard to train even in full precision. In this paper we consider conti…
▽ More
Low-precision training has become a popular approach to reduce compute requirements, memory footprint, and energy consumption in supervised learning. In contrast, this promising approach has not yet enjoyed similarly widespread adoption within the reinforcement learning (RL) community, partly because RL agents can be notoriously hard to train even in full precision. In this paper we consider continuous control with the state-of-the-art SAC agent and demonstrate that a naïve adaptation of low-precision methods from supervised learning fails. We propose a set of six modifications, all straightforward to implement, that leaves the underlying agent and its hyperparameters unchanged but improves the numerical stability dramatically. The resulting modified SAC agent has lower memory and compute requirements while matching full-precision rewards, demonstrating that low-precision training can substantially accelerate state-of-the-art RL without parameter tuning.
△ Less
Submitted 3 June, 2021; v1 submitted 26 February, 2021;
originally announced February 2021.
-
Understanding Decoupled and Early Weight Decay
Authors:
Johan Bjorck,
Kilian Weinberger,
Carla Gomes
Abstract:
Weight decay (WD) is a traditional regularization technique in deep learning, but despite its ubiquity, its behavior is still an area of active research. Golatkar et al. have recently shown that WD only matters at the start of the training in computer vision, upending traditional wisdom. Loshchilov et al. show that for adaptive optimizers, manually decaying weights can outperform adding an $l_2$ p…
▽ More
Weight decay (WD) is a traditional regularization technique in deep learning, but despite its ubiquity, its behavior is still an area of active research. Golatkar et al. have recently shown that WD only matters at the start of the training in computer vision, upending traditional wisdom. Loshchilov et al. show that for adaptive optimizers, manually decaying weights can outperform adding an $l_2$ penalty to the loss. This technique has become increasingly popular and is referred to as decoupled WD. The goal of this paper is to investigate these two recent empirical observations. We demonstrate that by applying WD only at the start, the network norm stays small throughout training. This has a regularizing effect as the effective gradient updates become larger. However, traditional generalizations metrics fail to capture this effect of WD, and we show how a simple scale-invariant metric can. We also show how the growth of network weights is heavily influenced by the dataset and its generalization properties. For decoupled WD, we perform experiments in NLP and RL where adaptive optimizers are the norm. We demonstrate that the primary issue that decoupled WD alleviates is the mixing of gradients from the objective function and the $l_2$ penalty in the buffers of Adam (which stores the estimates of the first-order moment). Adaptivity itself is not problematic and decoupled WD ensures that the gradients from the $l_2$ term cannot "drown out" the true objective, facilitating easier hyperparameter tuning.
△ Less
Submitted 26 December, 2020;
originally announced December 2020.
-
Automatic Detection and Compression for Passive Acoustic Monitoring of the African Forest Elephant
Authors:
Johan Bjorck,
Brendan H. Rappazzo,
Di Chen,
Richard Bernstein,
Peter H. Wrege,
Carla P. Gomes
Abstract:
In this work, we consider applying machine learning to the analysis and compression of audio signals in the context of monitoring elephants in sub-Saharan Africa. Earth's biodiversity is increasingly under threat by sources of anthropogenic change (e.g. resource extraction, land use change, and climate change) and surveying animal populations is critical for developing conservation strategies. How…
▽ More
In this work, we consider applying machine learning to the analysis and compression of audio signals in the context of monitoring elephants in sub-Saharan Africa. Earth's biodiversity is increasingly under threat by sources of anthropogenic change (e.g. resource extraction, land use change, and climate change) and surveying animal populations is critical for developing conservation strategies. However, manually monitoring tropical forests or deep oceans is intractable. For species that communicate acoustically, researchers have argued for placing audio recorders in the habitats as a cost-effective and non-invasive method, a strategy known as passive acoustic monitoring (PAM). In collaboration with conservation efforts, we construct a large labeled dataset of passive acoustic recordings of the African Forest Elephant via crowdsourcing, compromising thousands of hours of recordings in the wild. Using state-of-the-art techniques in artificial intelligence we improve upon previously proposed methods for passive acoustic monitoring for classification and segmentation. In real-time detection of elephant calls, network bandwidth quickly becomes a bottleneck and efficient ways to compress the data are needed. Most audio compression schemes are aimed at human listeners and are unsuitable for low-frequency elephant calls. To remedy this, we provide a novel end-to-end differentiable method for compression of audio signals that can be adapted to acoustic monitoring of any species and dramatically improves over naive coding strategies.
△ Less
Submitted 24 February, 2019;
originally announced February 2019.
-
Understanding Batch Normalization
Authors:
Johan Bjorck,
Carla Gomes,
Bart Selman,
Kilian Q. Weinberger
Abstract:
Batch normalization (BN) is a technique to normalize activations in intermediate layers of deep neural networks. Its tendency to improve accuracy and speed up training have established BN as a favorite technique in deep learning. Yet, despite its enormous success, there remains little consensus on the exact reason and mechanism behind these improvements. In this paper we take a step towards a bett…
▽ More
Batch normalization (BN) is a technique to normalize activations in intermediate layers of deep neural networks. Its tendency to improve accuracy and speed up training have established BN as a favorite technique in deep learning. Yet, despite its enormous success, there remains little consensus on the exact reason and mechanism behind these improvements. In this paper we take a step towards a better understanding of BN, following an empirical approach. We conduct several experiments, and show that BN primarily enables training with larger learning rates, which is the cause for faster convergence and better generalization. For networks without BN we demonstrate how large gradient updates can result in diverging loss and activations growing uncontrollably with network depth, which limits possible learning rates. BN avoids this problem by constantly correcting activations to be zero-mean and of unit standard deviation, which enables larger gradient steps, yields faster convergence and may help bypass sharp local minima. We further show various ways in which gradients and activations of deep unnormalized networks are ill-behaved. We contrast our results against recent findings in random matrix theory, shedding new light on classical initialization schemes and their consequences.
△ Less
Submitted 30 November, 2018; v1 submitted 31 May, 2018;
originally announced June 2018.
-
Scalable Relaxations of Sparse Packing Constraints: Optimal Biocontrol in Predator-Prey Network
Authors:
Johan Bjorck,
Yiwei Bai,
Xiaojian Wu,
Yexiang Xue,
Mark C. Whitmore,
Carla Gomes
Abstract:
Cascades represent rapid changes in networks. A cascading phenomenon of ecological and economic impact is the spread of invasive species in geographic landscapes. The most promising management strategy is often biocontrol, which entails introducing a natural predator able to control the invading population, a setting that can be treated as two interacting cascades of predator and prey populations.…
▽ More
Cascades represent rapid changes in networks. A cascading phenomenon of ecological and economic impact is the spread of invasive species in geographic landscapes. The most promising management strategy is often biocontrol, which entails introducing a natural predator able to control the invading population, a setting that can be treated as two interacting cascades of predator and prey populations. We formulate and study a nonlinear problem of optimal biocontrol: optimally seeding the predator cascade over time to minimize the harmful prey population. Recurring budgets, which typically face conservation organizations, naturally leads to sparse constraints which make the problem amenable to approximation algorithms. Available methods based on continuous relaxations scale poorly, to remedy this we develop a novel and scalable randomized algorithm based on a width relaxation, applicable to a broad class of combinatorial optimization problems. We evaluate our contributions in the context of biocontrol for the insect pest Hemlock Wolly Adelgid (HWA) in eastern North America. Our algorithm outperforms competing methods in terms of scalability and solution quality, and finds near optimal strategies for the control of the HWA for fine-grained networks -- an important problem in computational sustainability.
△ Less
Submitted 8 February, 2018; v1 submitted 17 November, 2017;
originally announced November 2017.
-
Surface acoustic wave unidirectional transducers for quantum applications
Authors:
Maria K. Ekström,
Thomas Aref,
Johan Runeson,
Johan Björck,
Isac Boström,
Per Delsing
Abstract:
The conversion efficiency of electric microwave signals into surface acoustic waves in different types of superconducting transducers is studied with the aim of quantum applications. We compare delay lines containing either conventional symmetric transducers (IDTs) or unidirectional transducers (UDTs) at 2.3 GHz and 10 mK. The UDT delay lines improve the insertion loss with 4.7 dB and a directivit…
▽ More
The conversion efficiency of electric microwave signals into surface acoustic waves in different types of superconducting transducers is studied with the aim of quantum applications. We compare delay lines containing either conventional symmetric transducers (IDTs) or unidirectional transducers (UDTs) at 2.3 GHz and 10 mK. The UDT delay lines improve the insertion loss with 4.7 dB and a directivity of 22 dB is found for each UDT, indicating that 99.4 % of the acoustic power goes in the desired direction. The power lost in the undesired direction accounts for more than 90 % of the total loss in IDT delay lines, but only ~3 % percent of the total loss in the FEUDT delay lines.
△ Less
Submitted 18 November, 2016;
originally announced November 2016.
-
Automated Phase Mapping with AgileFD and its Application to Light Absorber Discovery in the V-Mn-Nb Oxide System
Authors:
Santosh K. Suram,
Yexiang Xue,
Junwen Bai,
Ronan Le Bras,
Brendan Rappazzo,
Richard Bernstein,
Johan Bjorck,
Lan Zhou,
Robert B. van Dover,
Carla P. Gomes,
John M. Gregoire
Abstract:
Rapid construction of phase diagrams is a central tenet of combinatorial materials science with accelerated materials discovery efforts often hampered by challenges in interpreting combinatorial x-ray diffraction datasets, which we address by developing AgileFD, an artificial intelligence algorithm that enables rapid phase mapping from a combinatorial library of x-ray diffraction patterns. AgileFD…
▽ More
Rapid construction of phase diagrams is a central tenet of combinatorial materials science with accelerated materials discovery efforts often hampered by challenges in interpreting combinatorial x-ray diffraction datasets, which we address by developing AgileFD, an artificial intelligence algorithm that enables rapid phase mapping from a combinatorial library of x-ray diffraction patterns. AgileFD models alloying-based peak shifting through a novel expansion of convolutional nonnegative matrix factorization, which not only improves the identification of constituent phases but also maps their concentration and lattice parameter as a function of composition. By incorporating Gibbs phase rule into the algorithm, physically meaningful phase maps are obtained with unsupervised operation, and more refined solutions are attained by injecting expert knowledge of the system. The algorithm is demonstrated through investigation of the V-Mn-Nb oxide system where decomposition of eight oxide phases, including two with substantial alloying, provides the first phase map for this pseudo-ternary system. This phase map enables interpretation of high-throughput band gap data, leading to the discovery of new solar light absorbers and the alloying-based tuning of the direct-allowed band-gap energy of MnV2O6. The open-source family of AgileFD algorithms can be implemented into a broad range of high throughput workflows to accelerate materials discovery.
△ Less
Submitted 6 October, 2016;
originally announced October 2016.
-
Phase-Mapper: An AI Platform to Accelerate High Throughput Materials Discovery
Authors:
Yexiang Xue,
Junwen Bai,
Ronan Le Bras,
Brendan Rappazzo,
Richard Bernstein,
Johan Bjorck,
Liane Longpre,
Santosh K. Suram,
Robert B. van Dover,
John Gregoire,
Carla P. Gomes
Abstract:
High-Throughput materials discovery involves the rapid synthesis, measurement, and characterization of many different but structurally-related materials. A key problem in materials discovery, the phase map identification problem, involves the determination of the crystal phase diagram from the materials' composition and structural characterization data. We present Phase-Mapper, a novel AI platform…
▽ More
High-Throughput materials discovery involves the rapid synthesis, measurement, and characterization of many different but structurally-related materials. A key problem in materials discovery, the phase map identification problem, involves the determination of the crystal phase diagram from the materials' composition and structural characterization data. We present Phase-Mapper, a novel AI platform to solve the phase map identification problem that allows humans to interact with both the data and products of AI algorithms, including the incorporation of human feedback to constrain or initialize solutions. Phase-Mapper affords incorporation of any spectral demixing algorithm, including our novel solver, AgileFD, which is based on a convolutive non-negative matrix factorization algorithm. AgileFD can incorporate constraints to capture the physics of the materials as well as human feedback. We compare three solver variants with previously proposed methods in a large-scale experiment involving 20 synthetic systems, demonstrating the efficacy of imposing physical constrains using AgileFD. Phase-Mapper has also been used by materials scientists to solve a wide variety of phase diagrams, including the previously unsolved Nb-Mn-V oxide system, which is provided here as an illustrative example.
△ Less
Submitted 7 October, 2016; v1 submitted 3 October, 2016;
originally announced October 2016.