-
Highway Networks
Authors:
Rupesh Kumar Srivastava,
Klaus Greff,
Jürgen Schmidhuber
Abstract:
There is plenty of theoretical and empirical evidence that depth of neural networks is a crucial ingredient for their success. However, network training becomes more difficult with increasing depth and training of very deep networks remains an open problem. In this extended abstract, we introduce a new architecture designed to ease gradient-based training of very deep networks. We refer to network…
▽ More
There is plenty of theoretical and empirical evidence that depth of neural networks is a crucial ingredient for their success. However, network training becomes more difficult with increasing depth and training of very deep networks remains an open problem. In this extended abstract, we introduce a new architecture designed to ease gradient-based training of very deep networks. We refer to networks with this architecture as highway networks, since they allow unimpeded information flow across several layers on "information highways". The architecture is characterized by the use of gating units which learn to regulate the flow of information through a network. Highway networks with hundreds of layers can be trained directly using stochastic gradient descent and with a variety of activation functions, opening up the possibility of studying extremely deep and efficient architectures.
△ Less
Submitted 3 November, 2015; v1 submitted 2 May, 2015;
originally announced May 2015.
-
LSTM: A Search Space Odyssey
Authors:
Klaus Greff,
Rupesh Kumar Srivastava,
Jan Koutník,
Bas R. Steunebrink,
Jürgen Schmidhuber
Abstract:
Several variants of the Long Short-Term Memory (LSTM) architecture for recurrent neural networks have been proposed since its inception in 1995. In recent years, these networks have become the state-of-the-art models for a variety of machine learning problems. This has led to a renewed interest in understanding the role and utility of various computational components of typical LSTM variants. In t…
▽ More
Several variants of the Long Short-Term Memory (LSTM) architecture for recurrent neural networks have been proposed since its inception in 1995. In recent years, these networks have become the state-of-the-art models for a variety of machine learning problems. This has led to a renewed interest in understanding the role and utility of various computational components of typical LSTM variants. In this paper, we present the first large-scale analysis of eight LSTM variants on three representative tasks: speech recognition, handwriting recognition, and polyphonic music modeling. The hyperparameters of all LSTM variants for each task were optimized separately using random search, and their importance was assessed using the powerful fANOVA framework. In total, we summarize the results of 5400 experimental runs ($\approx 15$ years of CPU time), which makes our study the largest of its kind on LSTM networks. Our results show that none of the variants can improve upon the standard LSTM architecture significantly, and demonstrate the forget gate and the output activation function to be its most critical components. We further observe that the studied hyperparameters are virtually independent and derive guidelines for their efficient adjustment.
△ Less
Submitted 4 October, 2017; v1 submitted 13 March, 2015;
originally announced March 2015.
-
Assessment of algorithms for mitosis detection in breast cancer histopathology images
Authors:
Mitko Veta,
Paul J. van Diest,
Stefan M. Willems,
Haibo Wang,
Anant Madabhushi,
Angel Cruz-Roa,
Fabio Gonzalez,
Anders B. L. Larsen,
Jacob S. Vestergaard,
Anders B. Dahl,
Dan C. Cireşan,
Jürgen Schmidhuber,
Alessandro Giusti,
Luca M. Gambardella,
F. Boray Tek,
Thomas Walter,
Ching-Wei Wang,
Satoshi Kondo,
Bogdan J. Matuszewski,
Frederic Precioso,
Violet Snell,
Josef Kittler,
Teofilo E. de Campos,
Adnan M. Khan,
Nasir M. Rajpoot
, et al. (4 additional authors not shown)
Abstract:
The proliferative activity of breast tumors, which is routinely estimated by counting of mitotic figures in hematoxylin and eosin stained histology sections, is considered to be one of the most important prognostic markers. However, mitosis counting is laborious, subjective and may suffer from low inter-observer agreement. With the wider acceptance of whole slide images in pathology labs, automati…
▽ More
The proliferative activity of breast tumors, which is routinely estimated by counting of mitotic figures in hematoxylin and eosin stained histology sections, is considered to be one of the most important prognostic markers. However, mitosis counting is laborious, subjective and may suffer from low inter-observer agreement. With the wider acceptance of whole slide images in pathology labs, automatic image analysis has been proposed as a potential solution for these issues. In this paper, the results from the Assessment of Mitosis Detection Algorithms 2013 (AMIDA13) challenge are described. The challenge was based on a data set consisting of 12 training and 11 testing subjects, with more than one thousand annotated mitotic figures by multiple observers. Short descriptions and results from the evaluation of eleven methods are presented. The top performing method has an error rate that is comparable to the inter-observer agreement among pathologists.
△ Less
Submitted 21 November, 2014;
originally announced November 2014.
-
Understanding Locally Competitive Networks
Authors:
Rupesh Kumar Srivastava,
Jonathan Masci,
Faustino Gomez,
Jürgen Schmidhuber
Abstract:
Recently proposed neural network activation functions such as rectified linear, maxout, and local winner-take-all have allowed for faster and more effective training of deep neural architectures on large and complex datasets. The common trait among these functions is that they implement local competition between small groups of computational units within a layer, so that only part of the network i…
▽ More
Recently proposed neural network activation functions such as rectified linear, maxout, and local winner-take-all have allowed for faster and more effective training of deep neural architectures on large and complex datasets. The common trait among these functions is that they implement local competition between small groups of computational units within a layer, so that only part of the network is activated for any given input pattern. In this paper, we attempt to visualize and understand this self-modularization, and suggest a unified explanation for the beneficial properties of such networks. We also show how our insights can be directly useful for efficiently performing retrieval over large datasets using neural networks.
△ Less
Submitted 8 April, 2015; v1 submitted 5 October, 2014;
originally announced October 2014.
-
Deep Networks with Internal Selective Attention through Feedback Connections
Authors:
Marijn Stollenga,
Jonathan Masci,
Faustino Gomez,
Juergen Schmidhuber
Abstract:
Traditional convolutional neural networks (CNN) are stationary and feedforward. They neither change their parameters during evaluation nor use feedback from higher to lower layers. Real brains, however, do. So does our Deep Attention Selective Network (dasNet) architecture. DasNets feedback structure can dynamically alter its convolutional filter sensitivities during classification. It harnesses t…
▽ More
Traditional convolutional neural networks (CNN) are stationary and feedforward. They neither change their parameters during evaluation nor use feedback from higher to lower layers. Real brains, however, do. So does our Deep Attention Selective Network (dasNet) architecture. DasNets feedback structure can dynamically alter its convolutional filter sensitivities during classification. It harnesses the power of sequential processing to improve classification performance, by allowing the network to iteratively focus its internal attention on some of its convolutional filters. Feedback is trained through direct policy search in a huge million-dimensional parameter space, through scalable natural evolution strategies (SNES). On the CIFAR-10 and CIFAR-100 datasets, dasNet outperforms the previous state-of-the-art model.
△ Less
Submitted 28 July, 2014; v1 submitted 11 July, 2014;
originally announced July 2014.
-
Deep Learning in Neural Networks: An Overview
Authors:
Juergen Schmidhuber
Abstract:
In recent years, deep artificial neural networks (including recurrent ones) have won numerous contests in pattern recognition and machine learning. This historical survey compactly summarises relevant work, much of it from the previous millennium. Shallow and deep learners are distinguished by the depth of their credit assignment paths, which are chains of possibly learnable, causal links between…
▽ More
In recent years, deep artificial neural networks (including recurrent ones) have won numerous contests in pattern recognition and machine learning. This historical survey compactly summarises relevant work, much of it from the previous millennium. Shallow and deep learners are distinguished by the depth of their credit assignment paths, which are chains of possibly learnable, causal links between actions and effects. I review deep supervised learning (also recapitulating the history of backpropagation), unsupervised learning, reinforcement learning & evolutionary computation, and indirect search for short programs encoding deep and large networks.
△ Less
Submitted 8 October, 2014; v1 submitted 30 April, 2014;
originally announced April 2014.
-
A Clockwork RNN
Authors:
Jan Koutník,
Klaus Greff,
Faustino Gomez,
Jürgen Schmidhuber
Abstract:
Sequence prediction and classification are ubiquitous and challenging problems in machine learning that can require identifying complex dependencies between temporally distant inputs. Recurrent Neural Networks (RNNs) have the ability, in theory, to cope with these temporal dependencies by virtue of the short-term memory implemented by their recurrent (feedback) connections. However, in practice th…
▽ More
Sequence prediction and classification are ubiquitous and challenging problems in machine learning that can require identifying complex dependencies between temporally distant inputs. Recurrent Neural Networks (RNNs) have the ability, in theory, to cope with these temporal dependencies by virtue of the short-term memory implemented by their recurrent (feedback) connections. However, in practice they are difficult to train successfully when the long-term memory is required. This paper introduces a simple, yet powerful modification to the standard RNN architecture, the Clockwork RNN (CW-RNN), in which the hidden layer is partitioned into separate modules, each processing inputs at its own temporal granularity, making computations only at its prescribed clock rate. Rather than making the standard RNN models more complex, CW-RNN reduces the number of RNN parameters, improves the performance significantly in the tasks tested, and speeds up the network evaluation. The network is demonstrated in preliminary experiments involving two tasks: audio signal generation and TIMIT spoken word classification, where it outperforms both RNN and LSTM networks.
△ Less
Submitted 14 February, 2014;
originally announced February 2014.
-
Bounded Recursive Self-Improvement
Authors:
E. Nivel,
K. R. Thórisson,
B. R. Steunebrink,
H. Dindo,
G. Pezzulo,
M. Rodriguez,
C. Hernandez,
D. Ognibene,
J. Schmidhuber,
R. Sanz,
H. P. Helgason,
A. Chella,
G. K. Jonsson
Abstract:
We have designed a machine that becomes increasingly better at behaving in underspecified circumstances, in a goal-directed way, on the job, by modeling itself and its environment as experience accumulates. Based on principles of autocatalysis, endogeny, and reflectivity, the work provides an architectural blueprint for constructing systems with high levels of operational autonomy in underspecifie…
▽ More
We have designed a machine that becomes increasingly better at behaving in underspecified circumstances, in a goal-directed way, on the job, by modeling itself and its environment as experience accumulates. Based on principles of autocatalysis, endogeny, and reflectivity, the work provides an architectural blueprint for constructing systems with high levels of operational autonomy in underspecified circumstances, starting from a small seed. Through value-driven dynamic priority scheduling controlling the parallel execution of a vast number of reasoning threads, the system achieves recursive self-improvement after it leaves the lab, within the boundaries imposed by its designers. A prototype system has been implemented and demonstrated to learn a complex real-world task, real-time multimodal dialogue with humans, by on-line observation. Our work presents solutions to several challenges that must be solved for achieving artificial general intelligence.
△ Less
Submitted 24 December, 2013;
originally announced December 2013.
-
My First Deep Learning System of 1991 + Deep Learning Timeline 1962-2013
Authors:
Jürgen Schmidhuber
Abstract:
Deep Learning has attracted significant attention in recent years. Here I present a brief overview of my first Deep Learner of 1991, and its historic context, with a timeline of Deep Learning highlights.
Deep Learning has attracted significant attention in recent years. Here I present a brief overview of my first Deep Learner of 1991, and its historic context, with a timeline of Deep Learning highlights.
△ Less
Submitted 19 December, 2013;
originally announced December 2013.
-
Multi-Column Deep Neural Networks for Offline Handwritten Chinese Character Classification
Authors:
Dan Cireşan,
Jürgen Schmidhuber
Abstract:
Our Multi-Column Deep Neural Networks achieve best known recognition rates on Chinese characters from the ICDAR 2011 and 2013 offline handwriting competitions, approaching human performance.
Our Multi-Column Deep Neural Networks achieve best known recognition rates on Chinese characters from the ICDAR 2011 and 2013 offline handwriting competitions, approaching human performance.
△ Less
Submitted 1 September, 2013;
originally announced September 2013.
-
Testing Hypotheses by Regularized Maximum Mean Discrepancy
Authors:
Somayeh Danafar,
Paola M. V. Rancoita,
Tobias Glasmachers,
Kevin Whittingstall,
Juergen Schmidhuber
Abstract:
Do two data samples come from different distributions? Recent studies of this fundamental problem focused on embedding probability distributions into sufficiently rich characteristic Reproducing Kernel Hilbert Spaces (RKHSs), to compare distributions by the distance between their embeddings. We show that Regularized Maximum Mean Discrepancy (RMMD), our novel measure for kernel-based hypothesis tes…
▽ More
Do two data samples come from different distributions? Recent studies of this fundamental problem focused on embedding probability distributions into sufficiently rich characteristic Reproducing Kernel Hilbert Spaces (RKHSs), to compare distributions by the distance between their embeddings. We show that Regularized Maximum Mean Discrepancy (RMMD), our novel measure for kernel-based hypothesis testing, yields substantial improvements even when sample sizes are small, and excels at hypothesis tests involving multiple comparisons with power control. We derive asymptotic distributions under the null and alternative hypotheses, and assess power control. Outstanding results are obtained on: challenging EEG data, MNIST, the Berkley Covertype, and the Flare-Solar dataset.
△ Less
Submitted 2 May, 2013;
originally announced May 2013.
-
Fast Image Scanning with Deep Max-Pooling Convolutional Neural Networks
Authors:
Alessandro Giusti,
Dan C. Cireşan,
Jonathan Masci,
Luca M. Gambardella,
Jürgen Schmidhuber
Abstract:
Deep Neural Networks now excel at image classification, detection and segmentation. When used to scan images by means of a sliding window, however, their high computational complexity can bring even the most powerful hardware to its knees. We show how dynamic programming can speedup the process by orders of magnitude, even when max-pooling layers are present.
Deep Neural Networks now excel at image classification, detection and segmentation. When used to scan images by means of a sliding window, however, their high computational complexity can bring even the most powerful hardware to its knees. We show how dynamic programming can speedup the process by orders of magnitude, even when max-pooling layers are present.
△ Less
Submitted 7 February, 2013;
originally announced February 2013.
-
A Fast Learning Algorithm for Image Segmentation with Max-Pooling Convolutional Networks
Authors:
Jonathan Masci,
Alessandro Giusti,
Dan Cireşan,
Gabriel Fricout,
Jürgen Schmidhuber
Abstract:
We present a fast algorithm for training MaxPooling Convolutional Networks to segment images. This type of network yields record-breaking performance in a variety of tasks, but is normally trained on a computationally expensive patch-by-patch basis. Our new method processes each training image in a single pass, which is vastly more efficient.
We validate the approach in different scenarios and r…
▽ More
We present a fast algorithm for training MaxPooling Convolutional Networks to segment images. This type of network yields record-breaking performance in a variety of tasks, but is normally trained on a computationally expensive patch-by-patch basis. Our new method processes each training image in a single pass, which is vastly more efficient.
We validate the approach in different scenarios and report a 1500-fold speed-up. In an application to automated steel defect detection and segmentation, we obtain excellent performance with short training times.
△ Less
Submitted 7 February, 2013;
originally announced February 2013.
-
A Frequency-Domain Encoding for Neuroevolution
Authors:
Jan Koutník,
Juergen Schmidhuber,
Faustino Gomez
Abstract:
Neuroevolution has yet to scale up to complex reinforcement learning tasks that require large networks. Networks with many inputs (e.g. raw video) imply a very high dimensional search space if encoded directly. Indirect methods use a more compact genotype representation that is transformed into networks of potentially arbitrary size. In this paper, we present an indirect method where networks are…
▽ More
Neuroevolution has yet to scale up to complex reinforcement learning tasks that require large networks. Networks with many inputs (e.g. raw video) imply a very high dimensional search space if encoded directly. Indirect methods use a more compact genotype representation that is transformed into networks of potentially arbitrary size. In this paper, we present an indirect method where networks are encoded by a set of Fourier coefficients which are transformed into network weight matrices via an inverse Fourier-type transform. Because there often exist network solutions whose weight matrices contain regularity (i.e. adjacent weights are correlated), the number of coefficients required to represent these networks in the frequency domain is much smaller than the number of weights (in the same way that natural images can be compressed by ignore high-frequency components). This "compressed" encoding is compared to the direct approach where search is conducted in the weight space on the high-dimensional octopus arm task. The results show that representing networks in the frequency domain can reduce the search-space dimensionality by as much as two orders of magnitude, both accelerating convergence and yielding more general solutions.
△ Less
Submitted 28 December, 2012;
originally announced December 2012.
-
A Learning Framework for Morphological Operators using Counter-Harmonic Mean
Authors:
Jonathan Masci,
Jesús Angulo,
Jürgen Schmidhuber
Abstract:
We present a novel framework for learning morphological operators using counter-harmonic mean. It combines concepts from morphology and convolutional neural networks. A thorough experimental validation analyzes basic morphological operators dilation and erosion, opening and closing, as well as the much more complex top-hat transform, for which we report a real-world application from the steel indu…
▽ More
We present a novel framework for learning morphological operators using counter-harmonic mean. It combines concepts from morphology and convolutional neural networks. A thorough experimental validation analyzes basic morphological operators dilation and erosion, opening and closing, as well as the much more complex top-hat transform, for which we report a real-world application from the steel industry. Using online learning and stochastic gradient descent, our system learns both the structuring element and the composition of operators. It scales well to large datasets and online settings.
△ Less
Submitted 11 December, 2012;
originally announced December 2012.
-
First Experiments with PowerPlay
Authors:
Rupesh Kumar Srivastava,
Bas R. Steunebrink,
Jürgen Schmidhuber
Abstract:
Like a scientist or a playing child, PowerPlay not only learns new skills to solve given problems, but also invents new interesting problems by itself. By design, it continually comes up with the fastest to find, initially novel, but eventually solvable tasks. It also continually simplifies or compresses or speeds up solutions to previous tasks. Here we describe first experiments with PowerPlay. A…
▽ More
Like a scientist or a playing child, PowerPlay not only learns new skills to solve given problems, but also invents new interesting problems by itself. By design, it continually comes up with the fastest to find, initially novel, but eventually solvable tasks. It also continually simplifies or compresses or speeds up solutions to previous tasks. Here we describe first experiments with PowerPlay. A self-delimiting recurrent neural network SLIM RNN is used as a general computational problem solving architecture. Its connection weights can encode arbitrary, self-delimiting, halting or non-halting programs affecting both environment (through effectors) and internal states encoding abstractions of event sequences. Our PowerPlay-driven SLIM RNN learns to become an increasingly general solver of self-invented problems, continually adding new problem solving procedures to its growing skill repertoire. Extending a recent conference paper, we identify interesting, emerging, developmental stages of our open-ended system. We also show how it automatically self-modularizes, frequently re-using code for previously invented skills, always trying to invent novel tasks that can be quickly validated because they do not require too many weight changes affecting too many previous tasks.
△ Less
Submitted 31 October, 2012;
originally announced October 2012.
-
Self-Delimiting Neural Networks
Authors:
Juergen Schmidhuber
Abstract:
Self-delimiting (SLIM) programs are a central concept of theoretical computer science, particularly algorithmic information & probability theory, and asymptotically optimal program search (AOPS). To apply AOPS to (possibly recurrent) neural networks (NNs), I introduce SLIM NNs. Neurons of a typical SLIM NN have threshold activation functions. During a computational episode, activations are spreadi…
▽ More
Self-delimiting (SLIM) programs are a central concept of theoretical computer science, particularly algorithmic information & probability theory, and asymptotically optimal program search (AOPS). To apply AOPS to (possibly recurrent) neural networks (NNs), I introduce SLIM NNs. Neurons of a typical SLIM NN have threshold activation functions. During a computational episode, activations are spreading from input neurons through the SLIM NN until the computation activates a special halt neuron. Weights of the NN's used connections define its program. Halting programs form a prefix code. The reset of the initial NN state does not cost more than the latest program execution. Since prefixes of SLIM programs influence their suffixes (weight changes occurring early in an episode influence which weights are considered later), SLIM NN learning algorithms (LAs) should execute weight changes online during activation spreading. This can be achieved by applying AOPS to growing SLIM NNs. To efficiently teach a SLIM NN to solve many tasks, such as correctly classifying many different patterns, or solving many different robot control tasks, each connection keeps a list of tasks it is used for. The lists may be efficiently updated during training. To evaluate the overall effect of currently tested weight changes, a SLIM NN LA needs to re-test performance only on the efficiently computable union of tasks potentially affected by the current weight changes. Future SLIM NNs will be implemented on 3-dimensional brain-like multi-processor hardware. Their LAs will minimize task-specific total wire length of used connections, to encourage efficient solutions of subtasks by subsets of neurons that are physically close. The novel class of SLIM NN LAs is currently being probed in ongoing experiments to be reported in separate papers.
△ Less
Submitted 29 September, 2012;
originally announced October 2012.
-
Improving the Asymptotic Performance of Markov Chain Monte-Carlo by Inserting Vortices
Authors:
Yi Sun,
Faustino Gomez,
Juergen Schmidhuber
Abstract:
We present a new way of converting a reversible finite Markov chain into a non-reversible one, with a theoretical guarantee that the asymptotic variance of the MCMC estimator based on the non-reversible chain is reduced. The method is applicable to any reversible chain whose states are not connected through a tree, and can be interpreted graphically as inserting vortices into the state transition…
▽ More
We present a new way of converting a reversible finite Markov chain into a non-reversible one, with a theoretical guarantee that the asymptotic variance of the MCMC estimator based on the non-reversible chain is reduced. The method is applicable to any reversible chain whose states are not connected through a tree, and can be interpreted graphically as inserting vortices into the state transition graph. Our result confirms that non-reversible chains are fundamentally better than reversible ones in terms of asymptotic performance, and suggests interesting directions for further improving MCMC.
△ Less
Submitted 26 September, 2012;
originally announced September 2012.
-
Efficient Natural Evolution Strategies
Authors:
Yi Sun,
Daan Wierstra,
Tom Schaul,
Juergen Schmidhuber
Abstract:
Efficient Natural Evolution Strategies (eNES) is a novel alternative to conventional evolutionary algorithms, using the natural gradient to adapt the mutation distribution. Unlike previous methods based on natural gradients, eNES uses a fast algorithm to calculate the inverse of the exact Fisher information matrix, thus increasing both robustness and performance of its evolution gradient estimatio…
▽ More
Efficient Natural Evolution Strategies (eNES) is a novel alternative to conventional evolutionary algorithms, using the natural gradient to adapt the mutation distribution. Unlike previous methods based on natural gradients, eNES uses a fast algorithm to calculate the inverse of the exact Fisher information matrix, thus increasing both robustness and performance of its evolution gradient estimation, even in higher dimensions. Additional novel aspects of eNES include optimal fitness baselines and importance mixing (a procedure for updating the population with very few fitness evaluations). The algorithm yields competitive results on both unimodal and multimodal benchmarks.
△ Less
Submitted 26 September, 2012;
originally announced September 2012.
-
Object Recognition with Multi-Scale Pyramidal Pooling Networks
Authors:
Jonathan Masci,
Ueli Meier,
Gabriel Fricout,
Jürgen Schmidhuber
Abstract:
We present a Multi-Scale Pyramidal Pooling Network, featuring a novel pyramidal pooling layer at multiple scales and a novel encoding layer. Thanks to the former the network does not require all images of a given classification task to be of equal size. The encoding layer improves generalisation performance in comparison to similar neural network architectures, especially when training data is sca…
▽ More
We present a Multi-Scale Pyramidal Pooling Network, featuring a novel pyramidal pooling layer at multiple scales and a novel encoding layer. Thanks to the former the network does not require all images of a given classification task to be of equal size. The encoding layer improves generalisation performance in comparison to similar neural network architectures, especially when training data is scarce. We evaluate and compare our system to convolutional neural networks and state-of-the-art computer vision methods on various benchmark datasets. We also present results on industrial steel defect classification, where existing architectures are not applicable because of the constraint on equally sized input images. The proposed architecture can be seen as a fully supervised hierarchical bag-of-features extension that is trained online and can be fine-tuned for any given task.
△ Less
Submitted 7 July, 2012;
originally announced July 2012.
-
Multimodal similarity-preserving hashing
Authors:
Jonathan Masci,
Michael M. Bronstein,
Alexander A. Bronstein,
Jürgen Schmidhuber
Abstract:
We introduce an efficient computational framework for hashing data belonging to multiple modalities into a single representation space where they become mutually comparable. The proposed approach is based on a novel coupled siamese neural network architecture and allows unified treatment of intra- and inter-modality similarity learning. Unlike existing cross-modality similarity learning approaches…
▽ More
We introduce an efficient computational framework for hashing data belonging to multiple modalities into a single representation space where they become mutually comparable. The proposed approach is based on a novel coupled siamese neural network architecture and allows unified treatment of intra- and inter-modality similarity learning. Unlike existing cross-modality similarity learning approaches, our hashing functions are not limited to binarized linear projections and can assume arbitrarily complex forms. We show experimentally that our method significantly outperforms state-of-the-art hashing approaches on multimedia retrieval tasks.
△ Less
Submitted 6 July, 2012;
originally announced July 2012.
-
On the Size of the Online Kernel Sparsification Dictionary
Authors:
Yi Sun,
Faustino Gomez,
Juergen Schmidhuber
Abstract:
We analyze the size of the dictionary constructed from online kernel sparsification, using a novel formula that expresses the expected determinant of the kernel Gram matrix in terms of the eigenvalues of the covariance operator. Using this formula, we are able to connect the cardinality of the dictionary with the eigen-decay of the covariance operator. In particular, we show that under certain tec…
▽ More
We analyze the size of the dictionary constructed from online kernel sparsification, using a novel formula that expresses the expected determinant of the kernel Gram matrix in terms of the eigenvalues of the covariance operator. Using this formula, we are able to connect the cardinality of the dictionary with the eigen-decay of the covariance operator. In particular, we show that under certain technical conditions, the size of the dictionary will always grow sub-linearly in the number of data points, and, as a consequence, the kernel linear regressor constructed from the resulting dictionary is consistent.
△ Less
Submitted 18 June, 2012;
originally announced June 2012.
-
Multi-column Deep Neural Networks for Image Classification
Authors:
Dan Cireşan,
Ueli Meier,
Juergen Schmidhuber
Abstract:
Traditional methods of computer vision and machine learning cannot match human performance on tasks such as the recognition of handwritten digits or traffic signs. Our biologically plausible deep artificial neural network architectures can. Small (often minimal) receptive fields of convolutional winner-take-all neurons yield large network depth, resulting in roughly as many sparsely connected neur…
▽ More
Traditional methods of computer vision and machine learning cannot match human performance on tasks such as the recognition of handwritten digits or traffic signs. Our biologically plausible deep artificial neural network architectures can. Small (often minimal) receptive fields of convolutional winner-take-all neurons yield large network depth, resulting in roughly as many sparsely connected neural layers as found in mammals between retina and visual cortex. Only winner neurons are trained. Several deep neural columns become experts on inputs preprocessed in different ways; their predictions are averaged. Graphics cards allow for fast training. On the very competitive MNIST handwriting benchmark, our method is the first to achieve near-human performance. On a traffic sign recognition benchmark it outperforms humans by a factor of two. We also improve the state-of-the-art on a plethora of common image classification benchmarks.
△ Less
Submitted 13 February, 2012;
originally announced February 2012.
-
T-Learning
Authors:
Vincent Graziano,
Faustino Gomez,
Mark Ring,
Juergen Schmidhuber
Abstract:
Traditional Reinforcement Learning (RL) has focused on problems involving many states and few actions, such as simple grid worlds. Most real world problems, however, are of the opposite type, Involving Few relevant states and many actions. For example, to return home from a conference, humans identify only few subgoal states such as lobby, taxi, airport etc. Each valid behavior connecting two such…
▽ More
Traditional Reinforcement Learning (RL) has focused on problems involving many states and few actions, such as simple grid worlds. Most real world problems, however, are of the opposite type, Involving Few relevant states and many actions. For example, to return home from a conference, humans identify only few subgoal states such as lobby, taxi, airport etc. Each valid behavior connecting two such states can be viewed as an action, and there are trillions of them. Assuming the subgoal identification problem is already solved, the quality of any RL method---in real-world settings---depends less on how well it scales with the number of states than on how well it scales with the number of actions. This is where our new method T-Learning excels, by evaluating the relatively few possible transits from one state to another in a policy-independent way, rather than a huge number of state-action pairs, or states in traditional policy-dependent ways. Illustrative experiments demonstrate that performance improvements of T-Learning over Q-learning can be arbitrarily large.
△ Less
Submitted 31 December, 2011;
originally announced January 2012.
-
Descriptor learning for omnidirectional image matching
Authors:
Jonathan Masci,
Davide Migliore,
Michael M. Bronstein,
Jürgen Schmidhuber
Abstract:
Feature matching in omnidirectional vision systems is a challenging problem, mainly because complicated optical systems make the theoretical modelling of invariance and construction of invariant feature descriptors hard or even impossible. In this paper, we propose learning invariant descriptors using a training set of similar and dissimilar descriptor pairs. We use the similarity-preserving hashi…
▽ More
Feature matching in omnidirectional vision systems is a challenging problem, mainly because complicated optical systems make the theoretical modelling of invariance and construction of invariant feature descriptors hard or even impossible. In this paper, we propose learning invariant descriptors using a training set of similar and dissimilar descriptor pairs. We use the similarity-preserving hashing framework, in which we are trying to map the descriptor data to the Hamming space preserving the descriptor similarity on the training set. A neural network is used to solve the underlying optimization problem. Our approach outperforms not only straightforward descriptor matching, but also state-of-the-art similarity-preserving hashing methods.
△ Less
Submitted 29 December, 2011;
originally announced December 2011.
-
POWERPLAY: Training an Increasingly General Problem Solver by Continually Searching for the Simplest Still Unsolvable Problem
Authors:
Jürgen Schmidhuber
Abstract:
Most of computer science focuses on automatically solving given computational problems. I focus on automatically inventing or discovering problems in a way inspired by the playful behavior of animals and humans, to train a more and more general problem solver from scratch in an unsupervised fashion. Consider the infinite set of all computable descriptions of tasks with possibly computable solution…
▽ More
Most of computer science focuses on automatically solving given computational problems. I focus on automatically inventing or discovering problems in a way inspired by the playful behavior of animals and humans, to train a more and more general problem solver from scratch in an unsupervised fashion. Consider the infinite set of all computable descriptions of tasks with possibly computable solutions. The novel algorithmic framework POWERPLAY (2011) continually searches the space of possible pairs of new tasks and modifications of the current problem solver, until it finds a more powerful problem solver that provably solves all previously learned tasks plus the new one, while the unmodified predecessor does not. Wow-effects are achieved by continually making previously learned skills more efficient such that they require less time and space. New skills may (partially) re-use previously learned skills. POWERPLAY's search orders candidate pairs of tasks and solver modifications by their conditional computational (time & space) complexity, given the stored experience so far. The new task and its corresponding task-solving skill are those first found and validated. The computational costs of validating new tasks need not grow with task repertoire size. POWERPLAY's ongoing search for novelty keeps breaking the generalization abilities of its present solver. This is related to Goedel's sequence of increasingly powerful formal theories based on adding formerly unprovable statements to the axioms without affecting previously provable theorems. The continually increasing repertoire of problem solving procedures can be exploited by a parallel search for solutions to additional externally posed tasks. POWERPLAY may be viewed as a greedy but practical implementation of basic principles of creativity. A first experimental analysis can be found in separate papers [53,54].
△ Less
Submitted 4 November, 2012; v1 submitted 22 December, 2011;
originally announced December 2011.
-
Incremental Slow Feature Analysis: Adaptive and Episodic Learning from High-Dimensional Input Streams
Authors:
Varun Raj Kompella,
Matthew Luciw,
Juergen Schmidhuber
Abstract:
Slow Feature Analysis (SFA) extracts features representing the underlying causes of changes within a temporally coherent high-dimensional raw sensory input signal. Our novel incremental version of SFA (IncSFA) combines incremental Principal Components Analysis and Minor Components Analysis. Unlike standard batch-based SFA, IncSFA adapts along with non-stationary environments, is amenable to episod…
▽ More
Slow Feature Analysis (SFA) extracts features representing the underlying causes of changes within a temporally coherent high-dimensional raw sensory input signal. Our novel incremental version of SFA (IncSFA) combines incremental Principal Components Analysis and Minor Components Analysis. Unlike standard batch-based SFA, IncSFA adapts along with non-stationary environments, is amenable to episodic training, is not corrupted by outliers, and is covariance-free. These properties make IncSFA a generally useful unsupervised preprocessor for autonomous learning agents and robots. In IncSFA, the CCIPCA and MCA updates take the form of Hebbian and anti-Hebbian updating, extending the biological plausibility of SFA. In both single node and deep network versions, IncSFA learns to encode its input streams (such as high-dimensional video) by informative slow features representing meaningful abstract environmental properties. It can handle cases where batch SFA fails.
△ Less
Submitted 9 December, 2011;
originally announced December 2011.
-
Measuring Intelligence through Games
Authors:
Tom Schaul,
Julian Togelius,
Jürgen Schmidhuber
Abstract:
Artificial general intelligence (AGI) refers to research aimed at tackling the full problem of artificial intelligence, that is, create truly intelligent agents. This sets it apart from most AI research which aims at solving relatively narrow domains, such as character recognition, motion planning, or increasing player satisfaction in games. But how do we know when an agent is truly intelligent? A…
▽ More
Artificial general intelligence (AGI) refers to research aimed at tackling the full problem of artificial intelligence, that is, create truly intelligent agents. This sets it apart from most AI research which aims at solving relatively narrow domains, such as character recognition, motion planning, or increasing player satisfaction in games. But how do we know when an agent is truly intelligent? A common point of reference in the AGI community is Legg and Hutter's formal definition of universal intelligence, which has the appeal of simplicity and generality but is unfortunately incomputable. Games of various kinds are commonly used as benchmarks for "narrow" AI research, as they are considered to have many important properties. We argue that many of these properties carry over to the testing of general intelligence as well. We then sketch how such testing could practically be carried out. The central part of this sketch is an extension of universal intelligence to deal with finite time, and the use of sampling of the space of games expressed in a suitably biased game description language.
△ Less
Submitted 6 September, 2011;
originally announced September 2011.
-
Natural Evolution Strategies
Authors:
Daan Wierstra,
Tom Schaul,
Tobias Glasmachers,
Yi Sun,
Jürgen Schmidhuber
Abstract:
This paper presents Natural Evolution Strategies (NES), a recent family of algorithms that constitute a more principled approach to black-box optimization than established evolutionary algorithms. NES maintains a parameterized distribution on the set of solution candidates, and the natural gradient is used to update the distribution's parameters in the direction of higher expected fitness. We intr…
▽ More
This paper presents Natural Evolution Strategies (NES), a recent family of algorithms that constitute a more principled approach to black-box optimization than established evolutionary algorithms. NES maintains a parameterized distribution on the set of solution candidates, and the natural gradient is used to update the distribution's parameters in the direction of higher expected fitness. We introduce a collection of techniques that address issues of convergence, robustness, sample complexity, computational complexity and sensitivity to hyperparameters. This paper explores a number of implementations of the NES family, ranging from general-purpose multi-variate normal distributions to heavy-tailed and separable distributions tailored towards global optimization and search in high dimensional spaces, respectively. Experimental results show best published performance on various standard benchmarks, as well as competitive performance on others.
△ Less
Submitted 22 June, 2011;
originally announced June 2011.
-
A Linear Time Natural Evolution Strategy for Non-Separable Functions
Authors:
Yi Sun,
Faustino Gomez,
Tom Schaul,
Juergen Schmidhuber
Abstract:
We present a novel Natural Evolution Strategy (NES) variant, the Rank-One NES (R1-NES), which uses a low rank approximation of the search distribution covariance matrix. The algorithm allows computation of the natural gradient with cost linear in the dimensionality of the parameter space, and excels in solving high-dimensional non-separable problems, including the best result to date on the Rosenb…
▽ More
We present a novel Natural Evolution Strategy (NES) variant, the Rank-One NES (R1-NES), which uses a low rank approximation of the search distribution covariance matrix. The algorithm allows computation of the natural gradient with cost linear in the dimensionality of the parameter space, and excels in solving high-dimensional non-separable problems, including the best result to date on the Rosenbrock function (512 dimensions).
△ Less
Submitted 13 June, 2011; v1 submitted 10 June, 2011;
originally announced June 2011.
-
Planning to Be Surprised: Optimal Bayesian Exploration in Dynamic Environments
Authors:
Yi Sun,
Faustino Gomez,
Juergen Schmidhuber
Abstract:
To maximize its success, an AGI typically needs to explore its initially unknown world. Is there an optimal way of doing so? Here we derive an affirmative answer for a broad class of environments.
To maximize its success, an AGI typically needs to explore its initially unknown world. Is there an optimal way of doing so? Here we derive an affirmative answer for a broad class of environments.
△ Less
Submitted 29 March, 2011;
originally announced March 2011.
-
Handwritten Digit Recognition with a Committee of Deep Neural Nets on GPUs
Authors:
Dan C. Cireşan,
Ueli Meier,
Luca M. Gambardella,
Jürgen Schmidhuber
Abstract:
The competitive MNIST handwritten digit recognition benchmark has a long history of broken records since 1998. The most recent substantial improvement by others dates back 7 years (error rate 0.4%) . Recently we were able to significantly improve this result, using graphics cards to greatly speed up training of simple but deep MLPs, which achieved 0.35%, outperforming all the previous more complex…
▽ More
The competitive MNIST handwritten digit recognition benchmark has a long history of broken records since 1998. The most recent substantial improvement by others dates back 7 years (error rate 0.4%) . Recently we were able to significantly improve this result, using graphics cards to greatly speed up training of simple but deep MLPs, which achieved 0.35%, outperforming all the previous more complex methods. Here we report another substantial improvement: 0.31% obtained using a committee of MLPs.
△ Less
Submitted 23 March, 2011;
originally announced March 2011.
-
High-Performance Neural Networks for Visual Object Classification
Authors:
Dan C. Cireşan,
Ueli Meier,
Jonathan Masci,
Luca M. Gambardella,
Jürgen Schmidhuber
Abstract:
We present a fast, fully parameterizable GPU implementation of Convolutional Neural Network variants. Our feature extractors are neither carefully designed nor pre-wired, but rather learned in a supervised way. Our deep hierarchical architectures achieve the best published results on benchmarks for object classification (NORB, CIFAR10) and handwritten digit recognition (MNIST), with error rates of…
▽ More
We present a fast, fully parameterizable GPU implementation of Convolutional Neural Network variants. Our feature extractors are neither carefully designed nor pre-wired, but rather learned in a supervised way. Our deep hierarchical architectures achieve the best published results on benchmarks for object classification (NORB, CIFAR10) and handwritten digit recognition (MNIST), with error rates of 2.53%, 19.51%, 0.35%, respectively. Deep nets trained by simple back-propagation perform better than more shallow ones. Learning is surprisingly rapid. NORB is completely trained within five epochs. Test error rates on MNIST drop to 2.42%, 0.97% and 0.48% after 1, 3 and 17 epochs, respectively.
△ Less
Submitted 1 February, 2011;
originally announced February 2011.
-
Evolution of National Nobel Prize Shares in the 20th Century
Authors:
Juergen Schmidhuber
Abstract:
We analyze the evolution of cumulative national shares of Nobel Prizes since 1901, properly taking into account that most prizes were divided among several laureates. We rank by citizenship at the moment of the award, and by country of birth. Surprisingly, graphs of this type have not been published before, even though they powerfully illustrate the century's migration patterns (brain drains and g…
▽ More
We analyze the evolution of cumulative national shares of Nobel Prizes since 1901, properly taking into account that most prizes were divided among several laureates. We rank by citizenship at the moment of the award, and by country of birth. Surprisingly, graphs of this type have not been published before, even though they powerfully illustrate the century's migration patterns (brain drains and gains) in the sciences and other fields.
△ Less
Submitted 14 September, 2010;
originally announced September 2010.
-
Deep Big Simple Neural Nets Excel on Handwritten Digit Recognition
Authors:
Dan Claudiu Ciresan,
Ueli Meier,
Luca Maria Gambardella,
Juergen Schmidhuber
Abstract:
Good old on-line back-propagation for plain multi-layer perceptrons yields a very low 0.35% error rate on the famous MNIST handwritten digits benchmark. All we need to achieve this best result so far are many hidden layers, many neurons per layer, numerous deformed training images, and graphics cards to greatly speed up learning.
Good old on-line back-propagation for plain multi-layer perceptrons yields a very low 0.35% error rate on the famous MNIST handwritten digits benchmark. All we need to achieve this best result so far are many hidden layers, many neurons per layer, numerous deformed training images, and graphics cards to greatly speed up learning.
△ Less
Submitted 1 March, 2010;
originally announced March 2010.
-
Driven by Compression Progress: A Simple Principle Explains Essential Aspects of Subjective Beauty, Novelty, Surprise, Interestingness, Attention, Curiosity, Creativity, Art, Science, Music, Jokes
Authors:
Juergen Schmidhuber
Abstract:
I argue that data becomes temporarily interesting by itself to some self-improving, but computationally limited, subjective observer once he learns to predict or compress the data in a better way, thus making it subjectively simpler and more beautiful. Curiosity is the desire to create or discover more non-random, non-arbitrary, regular data that is novel and surprising not in the traditional se…
▽ More
I argue that data becomes temporarily interesting by itself to some self-improving, but computationally limited, subjective observer once he learns to predict or compress the data in a better way, thus making it subjectively simpler and more beautiful. Curiosity is the desire to create or discover more non-random, non-arbitrary, regular data that is novel and surprising not in the traditional sense of Boltzmann and Shannon but in the sense that it allows for compression progress because its regularity was not yet known. This drive maximizes interestingness, the first derivative of subjective beauty or compressibility, that is, the steepness of the learning curve. It motivates exploring infants, pure mathematicians, composers, artists, dancers, comedians, yourself, and (since 1990) artificial systems.
△ Less
Submitted 15 April, 2009; v1 submitted 23 December, 2008;
originally announced December 2008.
-
Algorithm Selection as a Bandit Problem with Unbounded Losses
Authors:
Matteo Gagliolo,
Juergen Schmidhuber
Abstract:
Algorithm selection is typically based on models of algorithm performance, learned during a separate offline training sequence, which can be prohibitively expensive. In recent work, we adopted an online approach, in which a performance model is iteratively updated and used to guide selection on a sequence of problem instances. The resulting exploration-exploitation trade-off was represented as a…
▽ More
Algorithm selection is typically based on models of algorithm performance, learned during a separate offline training sequence, which can be prohibitively expensive. In recent work, we adopted an online approach, in which a performance model is iteratively updated and used to guide selection on a sequence of problem instances. The resulting exploration-exploitation trade-off was represented as a bandit problem with expert advice, using an existing solver for this game, but this required the setting of an arbitrary bound on algorithm runtimes, thus invalidating the optimal regret of the solver. In this paper, we propose a simpler framework for representing algorithm selection as a bandit problem, with partial information, and an unknown bound on losses. We adapt an existing solver to this game, proving a bound on its expected regret, which holds also for the resulting algorithm selection technique. We present preliminary experiments with a set of SAT solvers on a mixed SAT-UNSAT benchmark.
△ Less
Submitted 9 July, 2008;
originally announced July 2008.
-
Phoneme recognition in TIMIT with BLSTM-CTC
Authors:
Santiago Fernández,
Alex Graves,
Juergen Schmidhuber
Abstract:
We compare the performance of a recurrent neural network with the best results published so far on phoneme recognition in the TIMIT database. These published results have been obtained with a combination of classifiers. However, in this paper we apply a single recurrent neural network to the same task. Our recurrent neural network attains an error rate of 24.6%. This result is not significantly…
▽ More
We compare the performance of a recurrent neural network with the best results published so far on phoneme recognition in the TIMIT database. These published results have been obtained with a combination of classifiers. However, in this paper we apply a single recurrent neural network to the same task. Our recurrent neural network attains an error rate of 24.6%. This result is not significantly different from that obtained by the other best methods, but they rely on a combination of classifiers for achieving comparable performance.
△ Less
Submitted 21 April, 2008;
originally announced April 2008.
-
Simple Algorithmic Principles of Discovery, Subjective Beauty, Selective Attention, Curiosity & Creativity
Authors:
Juergen Schmidhuber
Abstract:
I postulate that human or other intelligent agents function or should function as follows. They store all sensory observations as they come - the data is holy. At any time, given some agent's current coding capabilities, part of the data is compressible by a short and hopefully fast program / description / explanation / world model. In the agent's subjective eyes, such data is more regular and m…
▽ More
I postulate that human or other intelligent agents function or should function as follows. They store all sensory observations as they come - the data is holy. At any time, given some agent's current coding capabilities, part of the data is compressible by a short and hopefully fast program / description / explanation / world model. In the agent's subjective eyes, such data is more regular and more "beautiful" than other data. It is well-known that knowledge of regularity and repeatability may improve the agent's ability to plan actions leading to external rewards. In absence of such rewards, however, known beauty is boring. Then "interestingness" becomes the first derivative of subjective beauty: as the learning agent improves its compression algorithm, formerly apparently random data parts become subjectively more regular and beautiful. Such progress in compressibility is measured and maximized by the curiosity drive: create action sequences that extend the observation history and yield previously unknown / unpredictable but quickly learnable algorithmic regularity. We discuss how all of the above can be naturally implemented on computers, through an extension of passive unsupervised learning to the case of active data selection: we reward a general reinforcement learner (with access to the adaptive compressor) for actions that improve the subjective compressibility of the growing data. An unusually large breakthrough in compressibility deserves the name "discovery". The "creativity" of artists, dancers, musicians, pure mathematicians can be viewed as a by-product of this principle. Several qualitative examples support this hypothesis.
△ Less
Submitted 5 September, 2007;
originally announced September 2007.
-
Using Data Compressors to Construct Rank Tests
Authors:
Daniil Ryabko,
Juergen Schmidhuber
Abstract:
Nonparametric rank tests for homogeneity and component independence are proposed, which are based on data compressors. For homogeneity testing the idea is to compress the binary string obtained by ordering the two joint samples and writing 0 if the element is from the first sample and 1 if it is from the second sample and breaking ties by randomization (extension to the case of multiple samples…
▽ More
Nonparametric rank tests for homogeneity and component independence are proposed, which are based on data compressors. For homogeneity testing the idea is to compress the binary string obtained by ordering the two joint samples and writing 0 if the element is from the first sample and 1 if it is from the second sample and breaking ties by randomization (extension to the case of multiple samples is straightforward). $H_0$ should be rejected if the string is compressed (to a certain degree) and accepted otherwise. We show that such a test obtained from an ideal data compressor is valid against all alternatives. Component independence is reduced to homogeneity testing by constructing two samples, one of which is the first half of the original and the other is the second half with one of the components randomly permuted.
△ Less
Submitted 5 September, 2007;
originally announced September 2007.
-
2006: Celebrating 75 years of AI - History and Outlook: the Next 25 Years
Authors:
Juergen Schmidhuber
Abstract:
When Kurt Goedel layed the foundations of theoretical computer science in 1931, he also introduced essential concepts of the theory of Artificial Intelligence (AI). Although much of subsequent AI research has focused on heuristics, which still play a major role in many practical AI applications, in the new millennium AI theory has finally become a full-fledged formal science, with important opti…
▽ More
When Kurt Goedel layed the foundations of theoretical computer science in 1931, he also introduced essential concepts of the theory of Artificial Intelligence (AI). Although much of subsequent AI research has focused on heuristics, which still play a major role in many practical AI applications, in the new millennium AI theory has finally become a full-fledged formal science, with important optimality results for embodied agents living in unknown environments, obtained through a combination of theory a la Goedel and probability theory. Here we look back at important milestones of AI history, mention essential recent results, and speculate about what we may expect from the next 25 years, emphasizing the significance of the ongoing dramatic hardware speedups, and discussing Goedel-inspired, self-referential, self-improving universal problem solvers.
△ Less
Submitted 31 August, 2007;
originally announced August 2007.
-
Multi-Dimensional Recurrent Neural Networks
Authors:
Alex Graves,
Santiago Fernandez,
Juergen Schmidhuber
Abstract:
Recurrent neural networks (RNNs) have proved effective at one dimensional sequence learning tasks, such as speech and online handwriting recognition. Some of the properties that make RNNs suitable for such tasks, for example robustness to input warping, and the ability to access contextual information, are also desirable in multidimensional domains. However, there has so far been no direct way o…
▽ More
Recurrent neural networks (RNNs) have proved effective at one dimensional sequence learning tasks, such as speech and online handwriting recognition. Some of the properties that make RNNs suitable for such tasks, for example robustness to input warping, and the ability to access contextual information, are also desirable in multidimensional domains. However, there has so far been no direct way of applying RNNs to data with more than one spatio-temporal dimension. This paper introduces multi-dimensional recurrent neural networks (MDRNNs), thereby extending the potential applicability of RNNs to vision, video processing, medical imaging and many other areas, while avoiding the scaling problems that have plagued other multi-dimensional models. Experimental results are provided for two image segmentation tasks.
△ Less
Submitted 14 May, 2007;
originally announced May 2007.
-
Algorithmic Complexity Bounds on Future Prediction Errors
Authors:
A. Chernov,
M. Hutter,
J. Schmidhuber
Abstract:
We bound the future loss when predicting any (computably) stochastic sequence online. Solomonoff finitely bounded the total deviation of his universal predictor $M$ from the true distribution $mu$ by the algorithmic complexity of $mu$. Here we assume we are at a time $t>1$ and already observed $x=x_1...x_t$. We bound the future prediction performance on $x_{t+1}x_{t+2}...$ by a new variant of al…
▽ More
We bound the future loss when predicting any (computably) stochastic sequence online. Solomonoff finitely bounded the total deviation of his universal predictor $M$ from the true distribution $mu$ by the algorithmic complexity of $mu$. Here we assume we are at a time $t>1$ and already observed $x=x_1...x_t$. We bound the future prediction performance on $x_{t+1}x_{t+2}...$ by a new variant of algorithmic complexity of $mu$ given $x$, plus the complexity of the randomness deficiency of $x$. The new complexity is monotone in its condition in the sense that this complexity can only decrease if the condition is prolonged. We also briefly discuss potential generalizations to Bayesian model classes and to classification problems.
△ Less
Submitted 19 January, 2007;
originally announced January 2007.
-
New Millennium AI and the Convergence of History
Authors:
Juergen Schmidhuber
Abstract:
Artificial Intelligence (AI) has recently become a real formal science: the new millennium brought the first mathematically sound, asymptotically optimal, universal problem solvers, providing a new, rigorous foundation for the previously largely heuristic field of General AI and embedded agents. At the same time there has been rapid progress in practical methods for learning true sequence-proces…
▽ More
Artificial Intelligence (AI) has recently become a real formal science: the new millennium brought the first mathematically sound, asymptotically optimal, universal problem solvers, providing a new, rigorous foundation for the previously largely heuristic field of General AI and embedded agents. At the same time there has been rapid progress in practical methods for learning true sequence-processing programs, as opposed to traditional methods limited to stationary pattern association. Here we will briefly review some of the new results, and speculate about future developments, pointing out that the time intervals between the most notable events in over 40,000 years or 2^9 lifetimes of human history have sped up exponentially, apparently converging to zero within the next few decades. Or is this impression just a by-product of the way humans allocate memory space to past events?
△ Less
Submitted 29 June, 2006; v1 submitted 19 June, 2006;
originally announced June 2006.
-
Metric State Space Reinforcement Learning for a Vision-Capable Mobile Robot
Authors:
Viktor Zhumatiy,
Faustino Gomez,
Marcus Hutter,
Juergen Schmidhuber
Abstract:
We address the problem of autonomously learning controllers for vision-capable mobile robots. We extend McCallum's (1995) Nearest-Sequence Memory algorithm to allow for general metrics over state-action trajectories. We demonstrate the feasibility of our approach by successfully running our algorithm on a real mobile robot. The algorithm is novel and unique in that it (a) explores the environmen…
▽ More
We address the problem of autonomously learning controllers for vision-capable mobile robots. We extend McCallum's (1995) Nearest-Sequence Memory algorithm to allow for general metrics over state-action trajectories. We demonstrate the feasibility of our approach by successfully running our algorithm on a real mobile robot. The algorithm is novel and unique in that it (a) explores the environment and learns directly on a mobile robot without using a hand-made computer model as an intermediate step, (b) does not require manual discretization of the sensor input space, (c) works in piecewise continuous perceptual spaces, and (d) copes with partial observability. Together this allows learning from much less experience compared to previous methods.
△ Less
Submitted 7 March, 2006;
originally announced March 2006.
-
Evolino for recurrent support vector machines
Authors:
Juergen Schmidhuber,
Matteo Gagliolo,
Daan Wierstra,
Faustino Gomez
Abstract:
Traditional Support Vector Machines (SVMs) need pre-wired finite time windows to predict and classify time series. They do not have an internal state necessary to deal with sequences involving arbitrary long-term dependencies. Here we introduce a new class of recurrent, truly sequential SVM-like devices with internal adaptive states, trained by a novel method called EVOlution of systems with KEr…
▽ More
Traditional Support Vector Machines (SVMs) need pre-wired finite time windows to predict and classify time series. They do not have an internal state necessary to deal with sequences involving arbitrary long-term dependencies. Here we introduce a new class of recurrent, truly sequential SVM-like devices with internal adaptive states, trained by a novel method called EVOlution of systems with KErnel-based outputs (Evoke), an instance of the recent Evolino class of methods. Evoke evolves recurrent neural networks to detect and represent temporal dependencies while using quadratic programming/support vector regression to produce precise outputs. Evoke is the first SVM-based mechanism learning to classify a context-sensitive language. It also outperforms recent state-of-the-art gradient-based recurrent neural networks (RNNs) on various time series prediction tasks.
△ Less
Submitted 15 December, 2005;
originally announced December 2005.
-
Goedel Machines: Self-Referential Universal Problem Solvers Making Provably Optimal Self-Improvements
Authors:
Juergen Schmidhuber
Abstract:
We present the first class of mathematically rigorous, general, fully self-referential, self-improving, optimally efficient problem solvers. Inspired by Kurt Goedel's celebrated self-referential formulas (1931), such a problem solver rewrites any part of its own code as soon as it has found a proof that the rewrite is useful, where the problem-dependent utility function and the hardware and the…
▽ More
We present the first class of mathematically rigorous, general, fully self-referential, self-improving, optimally efficient problem solvers. Inspired by Kurt Goedel's celebrated self-referential formulas (1931), such a problem solver rewrites any part of its own code as soon as it has found a proof that the rewrite is useful, where the problem-dependent utility function and the hardware and the entire initial code are described by axioms encoded in an initial proof searcher which is also part of the initial code. The searcher systematically and efficiently tests computable proof techniques (programs whose outputs are proofs) until it finds a provably useful, computable self-rewrite. We show that such a self-rewrite is globally optimal - no local maxima! - since the code first had to prove that it is not useful to continue the proof search for alternative self-rewrites. Unlike previous non-self-referential methods based on hardwired proof searchers, ours not only boasts an optimal order of complexity but can optimally reduce any slowdowns hidden by the O()-notation, provided the utility of such speed-ups is provable at all.
△ Less
Submitted 17 December, 2006; v1 submitted 25 September, 2003;
originally announced September 2003.
-
The New AI: General & Sound & Relevant for Physics
Authors:
Juergen Schmidhuber
Abstract:
Most traditional artificial intelligence (AI) systems of the past 50 years are either very limited, or based on heuristics, or both. The new millennium, however, has brought substantial progress in the field of theoretically optimal and practically feasible algorithms for prediction, search, inductive inference based on Occam's razor, problem solving, decision making, and reinforcement learning…
▽ More
Most traditional artificial intelligence (AI) systems of the past 50 years are either very limited, or based on heuristics, or both. The new millennium, however, has brought substantial progress in the field of theoretically optimal and practically feasible algorithms for prediction, search, inductive inference based on Occam's razor, problem solving, decision making, and reinforcement learning in environments of a very general type. Since inductive inference is at the heart of all inductive sciences, some of the results are relevant not only for AI and computer science but also for physics, provoking nontraditional predictions based on Zuse's thesis of the computer-generated universe.
△ Less
Submitted 27 November, 2003; v1 submitted 10 February, 2003;
originally announced February 2003.
-
Optimal Ordered Problem Solver
Authors:
Juergen Schmidhuber
Abstract:
We present a novel, general, optimally fast, incremental way of searching for a universal algorithm that solves each task in a sequence of tasks. The Optimal Ordered Problem Solver (OOPS) continually organizes and exploits previously found solutions to earlier tasks, efficiently searching not only the space of domain-specific algorithms, but also the space of search algorithms. Essentially we ex…
▽ More
We present a novel, general, optimally fast, incremental way of searching for a universal algorithm that solves each task in a sequence of tasks. The Optimal Ordered Problem Solver (OOPS) continually organizes and exploits previously found solutions to earlier tasks, efficiently searching not only the space of domain-specific algorithms, but also the space of search algorithms. Essentially we extend the principles of optimal nonincremental universal search to build an incremental universal learner that is able to improve itself through experience. In illustrative experiments, our self-improver becomes the first general system that learns to solve all n disk Towers of Hanoi tasks (solution size 2^n-1) for n up to 30, profiting from previously solved, simpler tasks involving samples of a simple context free language.
△ Less
Submitted 23 December, 2002; v1 submitted 31 July, 2002;
originally announced July 2002.
-
Gradient-based Reinforcement Planning in Policy-Search Methods
Authors:
Ivo Kwee,
Marcus Hutter,
Juergen Schmidhuber
Abstract:
We introduce a learning method called ``gradient-based reinforcement planning'' (GREP). Unlike traditional DP methods that improve their policy backwards in time, GREP is a gradient-based method that plans ahead and improves its policy before it actually acts in the environment. We derive formulas for the exact policy gradient that maximizes the expected future reward and confirm our ideas with…
▽ More
We introduce a learning method called ``gradient-based reinforcement planning'' (GREP). Unlike traditional DP methods that improve their policy backwards in time, GREP is a gradient-based method that plans ahead and improves its policy before it actually acts in the environment. We derive formulas for the exact policy gradient that maximizes the expected future reward and confirm our ideas with numerical experiments.
△ Less
Submitted 28 November, 2001;
originally announced November 2001.