Highway Networks

Authors: Rupesh Kumar Srivastava, Klaus Greff, Jürgen Schmidhuber

Abstract: There is plenty of theoretical and empirical evidence that depth of neural networks is a crucial ingredient for their success. However, network training becomes more difficult with increasing depth and training of very deep networks remains an open problem. In this extended abstract, we introduce a new architecture designed to ease gradient-based training of very deep networks. We refer to networks with this architecture as highway networks, since they allow unimpeded information flow across several layers on "information highways". The architecture is characterized by the use of gating units which learn to regulate the flow of information through a network. Highway networks with hundreds of layers can be trained directly using stochastic gradient descent and with a variety of activation functions, opening up the possibility of studying extremely deep and efficient architectures.

Submitted 3 November, 2015; v1 submitted 2 May, 2015; originally announced May 2015.

Comments: 6 pages, 2 figures. Presented at ICML 2015 Deep Learning workshop. Full paper is at arXiv:1507.06228

MSC Class: 68T01 ACM Class: I.2.6; G.1.6
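
Illustrative note: the gating idea in the abstract above can be sketched in a few lines of NumPy. This is a minimal sketch, not the authors' code. It assumes the commonly cited highway formulation y = H(x) * T(x) + x * (1 - T(x)), a tanh transform, and equal input and output widths; the names `highway_layer`, `W_h`, `W_t`, `b_h`, and `b_t` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, W_h, b_h, W_t, b_t):
    """One highway layer (assumed form): y = H(x) * T(x) + x * (1 - T(x))."""
    h = np.tanh(x @ W_h + b_h)      # candidate transform H(x)
    t = sigmoid(x @ W_t + b_t)      # transform gate T(x), values in (0, 1)
    return t * h + (1.0 - t) * x    # the carry gate is taken to be 1 - T(x)

# Toy usage: a strongly negative gate bias keeps the gate almost closed,
# so the layer passes its input through nearly unchanged.
rng = np.random.default_rng(0)
dim = 4
x = rng.normal(size=(2, dim))
W_h = rng.normal(scale=0.1, size=(dim, dim))
W_t = rng.normal(scale=0.1, size=(dim, dim))
b_h = np.zeros(dim)
b_t = np.full(dim, -20.0)
y = highway_layer(x, W_h, b_h, W_t, b_t)
```

With the gate nearly closed the layer behaves like an identity mapping, which is one way to read the "information highway" picture: many such layers can be stacked without the signal being forced through every nonlinearity.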

arXiv:1503.04069 [pdf, other]

doi 10.1109/TNNLS.2016.2582924

LSTM: A Search Space Odyssey

Authors: Klaus Greff, Rupesh Kumar Srivastava, Jan Koutník, Bas R. Steunebrink, Jürgen Schmidhuber

Abstract: Several variants of the Long Short-Term Memory (LSTM) architecture for recurrent neural networks have been proposed since its inception in 1995. In recent years, these networks have become the state-of-the-art models for a variety of machine learning problems. This has led to a renewed interest in understanding the role and utility of various computational components of typical LSTM variants. In this paper, we present the first large-scale analysis of eight LSTM variants on three representative tasks: speech recognition, handwriting recognition, and polyphonic music modeling. The hyperparameters of all LSTM variants for each task were optimized separately using random search, and their importance was assessed using the powerful fANOVA framework. In total, we summarize the results of 5400 experimental runs ($\approx 15$ years of CPU time), which makes our study the largest of its kind on LSTM networks. Our results show that none of the variants can improve upon the standard LSTM architecture significantly, and demonstrate the forget gate and the output activation function to be its most critical components. We further observe that the studied hyperparameters are virtually independent and derive guidelines for their efficient adjustment.

Submitted 4 October, 2017; v1 submitted 13 March, 2015; originally announced March 2015.

Comments: 12 pages, 6 figures

MSC Class: 68T10 ACM Class: I.2.6; I.2.7; I.5.1; H.5.5

Journal ref: IEEE Transactions on Neural Networks and Learning Systems (Volume: 28, Issue: 10, Oct. 2017), Pages: 2222-2232

arXiv:1411.5825 [pdf]

doi 10.1016/j.media.2014.11.010

Assessment of algorithms for mitosis detection in breast cancer histopathology images

Authors: Mitko Veta, Paul J. van Diest, Stefan M. Willems, Haibo Wang, Anant Madabhushi, Angel Cruz-Roa, Fabio Gonzalez, Anders B. L. Larsen, Jacob S. Vestergaard, Anders B. Dahl, Dan C. Cireşan, Jürgen Schmidhuber, Alessandro Giusti, Luca M. Gambardella, F. Boray Tek, Thomas Walter, Ching-Wei Wang, Satoshi Kondo, Bogdan J. Matuszewski, Frederic Precioso, Violet Snell, Josef Kittler, Teofilo E. de Campos, Adnan M. Khan, Nasir M. Rajpoot, et al. (4 additional authors not shown)

Abstract: The proliferative activity of breast tumors, which is routinely estimated by counting of mitotic figures in hematoxylin and eosin stained histology sections, is considered to be one of the most important prognostic markers. However, mitosis counting is laborious, subjective and may suffer from low inter-observer agreement. With the wider acceptance of whole slide images in pathology labs, automatic image analysis has been proposed as a potential solution for these issues. In this paper, the results from the Assessment of Mitosis Detection Algorithms 2013 (AMIDA13) challenge are described. The challenge was based on a data set consisting of 12 training and 11 testing subjects, with more than one thousand annotated mitotic figures by multiple observers. Short descriptions and results from the evaluation of eleven methods are presented. The top performing method has an error rate that is comparable to the inter-observer agreement among pathologists.

Submitted 21 November, 2014; originally announced November 2014.

Comments: 23 pages, 5 figures, accepted for publication in the journal Medical Image Analysis

arXiv:1410.1165 [pdf, other]

Understanding Locally Competitive Networks

Authors: Rupesh Kumar Srivastava, Jonathan Masci, Faustino Gomez, Jürgen Schmidhuber

Abstract: Recently proposed neural network activation functions such as rectified linear, maxout, and local winner-take-all have allowed for faster and more effective training of deep neural architectures on large and complex datasets. The common trait among these functions is that they implement local competition between small groups of computational units within a layer, so that only part of the network is activated for any given input pattern. In this paper, we attempt to visualize and understand this self-modularization, and suggest a unified explanation for the beneficial properties of such networks. We also show how our insights can be directly useful for efficiently performing retrieval over large datasets using neural networks.

Submitted 8 April, 2015; v1 submitted 5 October, 2014; originally announced October 2014.

Comments: 9 pages + 2 supplementary, Accepted to ICLR 2015 Conference track

MSC Class: 68T30; 68T10 ACM Class: I.2.6

arXiv:1407.3068 [pdf, ps, other]

Deep Networks with Internal Selective Attention through Feedback Connections

Authors: Marijn Stollenga, Jonathan Masci, Faustino Gomez, Juergen Schmidhuber

Abstract: Traditional convolutional neural networks (CNN) are stationary and feedforward. They neither change their parameters during evaluation nor use feedback from higher to lower layers. Real brains, however, do. So does our Deep Attention Selective Network (dasNet) architecture. DasNet's feedback structure can dynamically alter its convolutional filter sensitivities during classification. It harnesses the power of sequential processing to improve classification performance, by allowing the network to iteratively focus its internal attention on some of its convolutional filters. Feedback is trained through direct policy search in a huge million-dimensional parameter space, through scalable natural evolution strategies (SNES). On the CIFAR-10 and CIFAR-100 datasets, dasNet outperforms the previous state-of-the-art model.

Submitted 28 July, 2014; v1 submitted 11 July, 2014; originally announced July 2014.

Comments: 13 pages, 3 figures

MSC Class: 68T45

arXiv:1404.7828 [pdf, other]

doi 10.1016/j.neunet.2014.09.003

Deep Learning in Neural Networks: An Overview

Authors: Juergen Schmidhuber

Abstract: In recent years, deep artificial neural networks (including recurrent ones) have won numerous contests in pattern recognition and machine learning. This historical survey compactly summarises relevant work, much of it from the previous millennium. Shallow and deep learners are distinguished by the depth of their credit assignment paths, which are chains of possibly learnable, causal links between actions and effects. I review deep supervised learning (also recapitulating the history of backpropagation), unsupervised learning, reinforcement learning & evolutionary computation, and indirect search for short programs encoding deep and large networks.

Submitted 8 October, 2014; v1 submitted 30 April, 2014; originally announced April 2014.

Comments: 88 pages, 888 references

Report number: Technical Report IDSIA-03-14

Journal ref: Neural Networks, Vol 61, pp 85-117, Jan 2015

arXiv:1402.3511 [pdf, other]

A Clockwork RNN

Authors: Jan Koutník, Klaus Greff, Faustino Gomez, Jürgen Schmidhuber

Abstract: Sequence prediction and classification are ubiquitous and challenging problems in machine learning that can require identifying complex dependencies between temporally distant inputs. Recurrent Neural Networks (RNNs) have the ability, in theory, to cope with these temporal dependencies by virtue of the short-term memory implemented by their recurrent (feedback) connections. However, in practice they are difficult to train successfully when long-term memory is required. This paper introduces a simple, yet powerful modification to the standard RNN architecture, the Clockwork RNN (CW-RNN), in which the hidden layer is partitioned into separate modules, each processing inputs at its own temporal granularity, making computations only at its prescribed clock rate. Rather than making the standard RNN models more complex, CW-RNN reduces the number of RNN parameters, improves the performance significantly in the tasks tested, and speeds up the network evaluation. The network is demonstrated in preliminary experiments involving two tasks: audio signal generation and TIMIT spoken word classification, where it outperforms both RNN and LSTM networks.

Submitted 14 February, 2014; originally announced February 2014.

arXiv:1312.6764 [pdf]

Bounded Recursive Self-Improvement

Authors: E. Nivel, K. R. Thórisson, B. R. Steunebrink, H. Dindo, G. Pezzulo, M. Rodriguez, C. Hernandez, D. Ognibene, J. Schmidhuber, R. Sanz, H. P. Helgason, A. Chella, G. K. Jonsson

Abstract: We have designed a machine that becomes increasingly better at behaving in underspecified circumstances, in a goal-directed way, on the job, by modeling itself and its environment as experience accumulates. Based on principles of autocatalysis, endogeny, and reflectivity, the work provides an architectural blueprint for constructing systems with high levels of operational autonomy in underspecified circumstances, starting from a small seed. Through value-driven dynamic priority scheduling controlling the parallel execution of a vast number of reasoning threads, the system achieves recursive self-improvement after it leaves the lab, within the boundaries imposed by its designers. A prototype system has been implemented and demonstrated to learn a complex real-world task, real-time multimodal dialogue with humans, by on-line observation. Our work presents solutions to several challenges that must be solved for achieving artificial general intelligence.

Submitted 24 December, 2013; originally announced December 2013.

Report number: RUTR-SCS13006

arXiv:1312.5548 [pdf, other]

My First Deep Learning System of 1991 + Deep Learning Timeline 1962-2013

Authors: Jürgen Schmidhuber

Abstract: Deep Learning has attracted significant attention in recent years. Here I present a brief overview of my first Deep Learner of 1991, and its historic context, with a timeline of Deep Learning highlights.

Submitted 19 December, 2013; originally announced December 2013.

Comments: 11 pages. As a machine learning researcher I am obsessed with proper credit assignment. This draft is the result of an experiment in rapid massive open online peer review. Since 20 September 2013, subsequent revisions published under http://www.deeplearning.me have absorbed many suggestions for improvements by experts

arXiv:1309.0261 [pdf, other]

Multi-Column Deep Neural Networks for Offline Handwritten Chinese Character Classification

Authors: Dan Cireşan, Jürgen Schmidhuber

Abstract: Our Multi-Column Deep Neural Networks achieve best known recognition rates on Chinese characters from the ICDAR 2011 and 2013 offline handwriting competitions, approaching human performance.

Submitted 1 September, 2013; originally announced September 2013.

Comments: 5 pages, 1 figure, IDSIA tech report

Report number: IDSIA-05-13

arXiv:1305.0423 [pdf, other]

Testing Hypotheses by Regularized Maximum Mean Discrepancy

Authors: Somayeh Danafar, Paola M. V. Rancoita, Tobias Glasmachers, Kevin Whittingstall, Juergen Schmidhuber

Abstract: Do two data samples come from different distributions? Recent studies of this fundamental problem focused on embedding probability distributions into sufficiently rich characteristic Reproducing Kernel Hilbert Spaces (RKHSs), to compare distributions by the distance between their embeddings. We show that Regularized Maximum Mean Discrepancy (RMMD), our novel measure for kernel-based hypothesis testing, yields substantial improvements even when sample sizes are small, and excels at hypothesis tests involving multiple comparisons with power control. We derive asymptotic distributions under the null and alternative hypotheses, and assess power control. Outstanding results are obtained on: challenging EEG data, MNIST, the Berkley Covertype, and the Flare-Solar dataset.

Submitted 2 May, 2013; originally announced May 2013.

arXiv:1302.1700 [pdf, other]

Fast Image Scanning with Deep Max-Pooling Convolutional Neural Networks

Authors: Alessandro Giusti, Dan C. Cireşan, Jonathan Masci, Luca M. Gambardella, Jürgen Schmidhuber

Abstract: Deep Neural Networks now excel at image classification, detection and segmentation. When used to scan images by means of a sliding window, however, their high computational complexity can bring even the most powerful hardware to its knees. We show how dynamic programming can speed up the process by orders of magnitude, even when max-pooling layers are present.

Submitted 7 February, 2013; originally announced February 2013.

Comments: 11 pages, 2 figures, 3 tables, 21 references, submitted to ICIP 2013

Report number: IDSIA-01-13

Journal ref: International Conference on Image Processing (ICIP) 2013, Melbourne

arXiv:1302.1690 [pdf, other]

A Fast Learning Algorithm for Image Segmentation with Max-Pooling Convolutional Networks

Authors: Jonathan Masci, Alessandro Giusti, Dan Cireşan, Gabriel Fricout, Jürgen Schmidhuber

Abstract: We present a fast algorithm for training MaxPooling Convolutional Networks to segment images. This type of network yields record-breaking performance in a variety of tasks, but is normally trained on a computationally expensive patch-by-patch basis. Our new method processes each training image in a single pass, which is vastly more efficient. We validate the approach in different scenarios and report a 1500-fold speed-up. In an application to automated steel defect detection and segmentation, we obtain excellent performance with short training times.

Submitted 7 February, 2013; originally announced February 2013.

arXiv:1212.6521 [pdf, other]

A Frequency-Domain Encoding for Neuroevolution

Authors: Jan Koutník, Juergen Schmidhuber, Faustino Gomez

Abstract: Neuroevolution has yet to scale up to complex reinforcement learning tasks that require large networks. Networks with many inputs (e.g. raw video) imply a very high dimensional search space if encoded directly. Indirect methods use a more compact genotype representation that is transformed into networks of potentially arbitrary size. In this paper, we present an indirect method where networks are encoded by a set of Fourier coefficients which are transformed into network weight matrices via an inverse Fourier-type transform. Because there often exist network solutions whose weight matrices contain regularity (i.e. adjacent weights are correlated), the number of coefficients required to represent these networks in the frequency domain is much smaller than the number of weights (in the same way that natural images can be compressed by ignoring high-frequency components). This "compressed" encoding is compared to the direct approach where search is conducted in the weight space on the high-dimensional octopus arm task. The results show that representing networks in the frequency domain can reduce the search-space dimensionality by as much as two orders of magnitude, both accelerating convergence and yielding more general solutions.

Submitted 28 December, 2012; originally announced December 2012.
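
Illustrative note: the decoding step described in the abstract above (a few frequency coefficients expanded into a full weight matrix) can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes a 2-D discrete cosine transform as the "Fourier-type" transform and places the genome in the low-frequency corner of the coefficient matrix. The names `dct_matrix`, `decode_weights`, and `genome` are hypothetical.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix C of size n x n; its inverse is C.T."""
    k = np.arange(n)[:, None]   # frequency index (rows)
    i = np.arange(n)[None, :]   # spatial index (columns)
    C = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    C[0, :] /= np.sqrt(2.0)     # rescale the DC row so that C @ C.T == I
    return C

def decode_weights(coeffs, shape):
    """Expand a small block of low-frequency coefficients into a weight matrix."""
    rows, cols = shape
    F = np.zeros(shape)
    kr, kc = coeffs.shape
    F[:kr, :kc] = coeffs        # all higher frequencies stay at zero
    # Inverse 2-D DCT: W = C_rows^T F C_cols
    return dct_matrix(rows).T @ F @ dct_matrix(cols)

# Toy usage: 4 searchable numbers describe an 8 x 16 matrix (128 weights).
genome = np.array([[1.0, 0.5],
                   [-0.3, 0.0]])
W = decode_weights(genome, (8, 16))
```

Because only low frequencies are non-zero, neighbouring entries of `W` vary smoothly, which matches the abstract's remark that adjacent weights are often correlated. A search method would then operate on `genome` rather than on `W` directly.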

arXiv:1212.2546 [pdf, other]

A Learning Framework for Morphological Operators using Counter-Harmonic Mean

Authors: Jonathan Masci, Jesús Angulo, Jürgen Schmidhuber

Abstract: We present a novel framework for learning morphological operators using counter-harmonic mean. It combines concepts from morphology and convolutional neural networks. A thorough experimental validation analyzes basic morphological operators dilation and erosion, opening and closing, as well as the much more complex top-hat transform, for which we report a real-world application from the steel industry. Using online learning and stochastic gradient descent, our system learns both the structuring element and the composition of operators. It scales well to large datasets and online settings.

Submitted 11 December, 2012; originally announced December 2012.

Comments: Submitted to ISMM'13
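
Illustrative note: the counter-harmonic mean named in the abstract above interpolates between the minimum and maximum of a set of positive values, which is why it can stand in for erosion and dilation. The sketch below shows only this scalar property, not the paper's learning framework; it assumes the standard unweighted definition sum(v^(p+1)) / sum(v^p) for strictly positive inputs, and the function name is hypothetical.

```python
import numpy as np

def counter_harmonic_mean(values, p):
    """Counter-harmonic mean of order p for strictly positive values."""
    v = np.asarray(values, dtype=float)
    return np.sum(v ** (p + 1)) / np.sum(v ** p)

window = [1.0, 2.0, 4.0]                        # toy pixel neighbourhood
mean_like = counter_harmonic_mean(window, 0)    # p = 0 gives the arithmetic mean
dilation_like = counter_harmonic_mean(window, 50)   # large p approaches max
erosion_like = counter_harmonic_mean(window, -50)   # very negative p approaches min
```

Since the expression is differentiable in `p`, the order can in principle be treated as a trainable parameter, which is consistent with the abstract's statement that the operators are learned by stochastic gradient descent.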

arXiv:1210.8385 [pdf, other]

First Experiments with PowerPlay

Authors: Rupesh Kumar Srivastava, Bas R. Steunebrink, Jürgen Schmidhuber

Abstract: Like a scientist or a playing child, PowerPlay not only learns new skills to solve given problems, but also invents new interesting problems by itself. By design, it continually comes up with the fastest to find, initially novel, but eventually solvable tasks. It also continually simplifies or compresses or speeds up solutions to previous tasks. Here we describe first experiments with PowerPlay. A self-delimiting recurrent neural network SLIM RNN is used as a general computational problem solving architecture. Its connection weights can encode arbitrary, self-delimiting, halting or non-halting programs affecting both environment (through effectors) and internal states encoding abstractions of event sequences. Our PowerPlay-driven SLIM RNN learns to become an increasingly general solver of self-invented problems, continually adding new problem solving procedures to its growing skill repertoire. Extending a recent conference paper, we identify interesting, emerging, developmental stages of our open-ended system. We also show how it automatically self-modularizes, frequently re-using code for previously invented skills, always trying to invent novel tasks that can be quickly validated because they do not require too many weight changes affecting too many previous tasks.

Submitted 31 October, 2012; originally announced October 2012.

Comments: 13 pages, 6 figures. Extends preliminary work presented at ICDL-EpiRob 2012

arXiv:1210.0118 [pdf, ps, other]

Self-Delimiting Neural Networks

Authors: Juergen Schmidhuber

Abstract: Self-delimiting (SLIM) programs are a central concept of theoretical computer science, particularly algorithmic information & probability theory, and asymptotically optimal program search (AOPS). To apply AOPS to (possibly recurrent) neural networks (NNs), I introduce SLIM NNs. Neurons of a typical SLIM NN have threshold activation functions. During a computational episode, activations are spreading from input neurons through the SLIM NN until the computation activates a special halt neuron. Weights of the NN's used connections define its program. Halting programs form a prefix code. The reset of the initial NN state does not cost more than the latest program execution. Since prefixes of SLIM programs influence their suffixes (weight changes occurring early in an episode influence which weights are considered later), SLIM NN learning algorithms (LAs) should execute weight changes online during activation spreading. This can be achieved by applying AOPS to growing SLIM NNs. To efficiently teach a SLIM NN to solve many tasks, such as correctly classifying many different patterns, or solving many different robot control tasks, each connection keeps a list of tasks it is used for. The lists may be efficiently updated during training. To evaluate the overall effect of currently tested weight changes, a SLIM NN LA needs to re-test performance only on the efficiently computable union of tasks potentially affected by the current weight changes. Future SLIM NNs will be implemented on 3-dimensional brain-like multi-processor hardware. Their LAs will minimize task-specific total wire length of used connections, to encourage efficient solutions of subtasks by subsets of neurons that are physically close. The novel class of SLIM NN LAs is currently being probed in ongoing experiments to be reported in separate papers.

Submitted 29 September, 2012; originally announced October 2012.

Comments: 15 pages

Report number: IDSIA-08-12

arXiv:1209.6048 [pdf, other]

Improving the Asymptotic Performance of Markov Chain Monte-Carlo by Inserting Vortices

Authors: Yi Sun, Faustino Gomez, Juergen Schmidhuber

Abstract: We present a new way of converting a reversible finite Markov chain into a non-reversible one, with a theoretical guarantee that the asymptotic variance of the MCMC estimator based on the non-reversible chain is reduced. The method is applicable to any reversible chain whose states are not connected through a tree, and can be interpreted graphically as inserting vortices into the state transition graph. Our result confirms that non-reversible chains are fundamentally better than reversible ones in terms of asymptotic performance, and suggests interesting directions for further improving MCMC.

Submitted 26 September, 2012; originally announced September 2012.

Comments: Published in NIPS 2010

arXiv:1209.5853 [pdf, other]

Efficient Natural Evolution Strategies

Authors: Yi Sun, Daan Wierstra, Tom Schaul, Juergen Schmidhuber

Abstract: Efficient Natural Evolution Strategies (eNES) is a novel alternative to conventional evolutionary algorithms, using the natural gradient to adapt the mutation distribution. Unlike previous methods based on natural gradients, eNES uses a fast algorithm to calculate the inverse of the exact Fisher information matrix, thus increasing both robustness and performance of its evolution gradient estimation, even in higher dimensions. Additional novel aspects of eNES include optimal fitness baselines and importance mixing (a procedure for updating the population with very few fitness evaluations). The algorithm yields competitive results on both unimodal and multimodal benchmarks.

Submitted 26 September, 2012; originally announced September 2012.

Comments: Published in GECCO'2009

arXiv:1207.1765 [pdf, other]

Object Recognition with Multi-Scale Pyramidal Pooling Networks

Authors: Jonathan Masci, Ueli Meier, Gabriel Fricout, Jürgen Schmidhuber

Abstract: We present a Multi-Scale Pyramidal Pooling Network, featuring a novel pyramidal pooling layer at multiple scales and a novel encoding layer. Thanks to the former the network does not require all images of a given classification task to be of equal size. The encoding layer improves generalisation performance in comparison to similar neural network architectures, especially when training data is scarce. We evaluate and compare our system to convolutional neural networks and state-of-the-art computer vision methods on various benchmark datasets. We also present results on industrial steel defect classification, where existing architectures are not applicable because of the constraint on equally sized input images. The proposed architecture can be seen as a fully supervised hierarchical bag-of-features extension that is trained online and can be fine-tuned for any given task.

Submitted 7 July, 2012; originally announced July 2012.

arXiv:1207.1522 [pdf, other]

Multimodal similarity-preserving hashing

Authors: Jonathan Masci, Michael M. Bronstein, Alexander A. Bronstein, Jürgen Schmidhuber

Abstract: We introduce an efficient computational framework for hashing data belonging to multiple modalities into a single representation space where they become mutually comparable. The proposed approach is based on a novel coupled siamese neural network architecture and allows unified treatment of intra- and inter-modality similarity learning. Unlike existing cross-modality similarity learning approaches, our hashing functions are not limited to binarized linear projections and can assume arbitrarily complex forms. We show experimentally that our method significantly outperforms state-of-the-art hashing approaches on multimedia retrieval tasks.

Submitted 6 July, 2012; originally announced July 2012.

arXiv:1206.4623 [pdf]

On the Size of the Online Kernel Sparsification Dictionary

Authors: Yi Sun, Faustino Gomez, Juergen Schmidhuber

Abstract: We analyze the size of the dictionary constructed from online kernel sparsification, using a novel formula that expresses the expected determinant of the kernel Gram matrix in terms of the eigenvalues of the covariance operator. Using this formula, we are able to connect the cardinality of the dictionary with the eigen-decay of the covariance operator. In particular, we show that under certain technical conditions, the size of the dictionary will always grow sub-linearly in the number of data points, and, as a consequence, the kernel linear regressor constructed from the resulting dictionary is consistent.

Submitted 18 June, 2012; originally announced June 2012.

Comments: ICML2012

arXiv:1202.2745 [pdf, other]

Multi-column Deep Neural Networks for Image Classification

Authors: Dan Cireşan, Ueli Meier, Juergen Schmidhuber

Abstract: Traditional methods of computer vision and machine learning cannot match human performance on tasks such as the recognition of handwritten digits or traffic signs. Our biologically plausible deep artificial neural network architectures can. Small (often minimal) receptive fields of convolutional winner-take-all neurons yield large network depth, resulting in roughly as many sparsely connected neural layers as found in mammals between retina and visual cortex. Only winner neurons are trained. Several deep neural columns become experts on inputs preprocessed in different ways; their predictions are averaged. Graphics cards allow for fast training. On the very competitive MNIST handwriting benchmark, our method is the first to achieve near-human performance. On a traffic sign recognition benchmark it outperforms humans by a factor of two. We also improve the state-of-the-art on a plethora of common image classification benchmarks.

Submitted 13 February, 2012; originally announced February 2012.

Comments: 20 pages, 14 figures, 8 tables

Report number: IDSIA-04-12

Journal ref: CVPR 2012, p. 3642-3649
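
The abstract above states that the predictions of several columns are averaged. A minimal sketch of that averaging step follows, assuming each column outputs a class-probability vector of equal length. The function names and any numbers passed to them are hypothetical and not taken from the paper.

```python
def average_columns(column_probs):
    """Average per-class probabilities predicted by several columns.

    `column_probs` holds one probability vector per column; all vectors must
    have the same length (one entry per class).
    """
    n_columns = len(column_probs)
    n_classes = len(column_probs[0])
    return [sum(col[k] for col in column_probs) / n_columns
            for k in range(n_classes)]


def predict(column_probs):
    """Return the index of the class with the highest averaged probability."""
    averaged = average_columns(column_probs)
    return max(range(len(averaged)), key=averaged.__getitem__)
```

For example, given three columns with made-up outputs `[0.6, 0.4, 0.0]`, `[0.2, 0.7, 0.1]` and `[0.3, 0.5, 0.2]`, the first column alone would choose class 0, while the averaged prediction chooses class 1.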

arXiv:1201.0292 [pdf, other]

T-Learning

Authors: Vincent Graziano, Faustino Gomez, Mark Ring, Juergen Schmidhuber

Abstract: Traditional Reinforcement Learning (RL) has focused on problems involving many states and few actions, such as simple grid worlds. Most real world problems, however, are of the opposite type, involving few relevant states and many actions. For example, to return home from a conference, humans identify only a few subgoal states such as lobby, taxi, airport etc. Each valid behavior connecting two such states can be viewed as an action, and there are trillions of them. Assuming the subgoal identification problem is already solved, the quality of any RL method, in real-world settings, depends less on how well it scales with the number of states than on how well it scales with the number of actions. This is where our new method T-Learning excels, by evaluating the relatively few possible transits from one state to another in a policy-independent way, rather than a huge number of state-action pairs, or states in traditional policy-dependent ways. Illustrative experiments demonstrate that performance improvements of T-Learning over Q-learning can be arbitrarily large.

Submitted 31 December, 2011; originally announced January 2012.

arXiv:1112.6291 [pdf, other]

Descriptor learning for omnidirectional image matching

Authors: Jonathan Masci, Davide Migliore, Michael M. Bronstein, Jürgen Schmidhuber

Abstract: Feature matching in omnidirectional vision systems is a challenging problem, mainly because complicated optical systems make the theoretical modelling of invariance and construction of invariant feature descriptors hard or even impossible. In this paper, we propose learning invariant descriptors using a training set of similar and dissimilar descriptor pairs. We use the similarity-preserving hashing framework, in which we are trying to map the descriptor data to the Hamming space preserving the descriptor similarity on the training set. A neural network is used to solve the underlying optimization problem. Our approach outperforms not only straightforward descriptor matching, but also state-of-the-art similarity-preserving hashing methods.

Submitted 29 December, 2011; originally announced December 2011.

arXiv:1112.5309 [pdf, ps, other]

POWERPLAY: Training an Increasingly General Problem Solver by Continually Searching for the Simplest Still Unsolvable Problem

Authors: Jürgen Schmidhuber

Abstract: Most of computer science focuses on automatically solving given computational problems. I focus on automatically inventing or discovering problems in a way inspired by the playful behavior of animals and humans, to train a more and more general problem solver from scratch in an unsupervised fashion. Consider the infinite set of all computable descriptions of tasks with possibly computable solutions. The novel algorithmic framework POWERPLAY (2011) continually searches the space of possible pairs of new tasks and modifications of the current problem solver, until it finds a more powerful problem solver that provably solves all previously learned tasks plus the new one, while the unmodified predecessor does not. Wow-effects are achieved by continually making previously learned skills more efficient such that they require less time and space. New skills may (partially) re-use previously learned skills. POWERPLAY's search orders candidate pairs of tasks and solver modifications by their conditional computational (time & space) complexity, given the stored experience so far. The new task and its corresponding task-solving skill are those first found and validated. The computational costs of validating new tasks need not grow with task repertoire size. POWERPLAY's ongoing search for novelty keeps breaking the generalization abilities of its present solver. This is related to Goedel's sequence of increasingly powerful formal theories based on adding formerly unprovable statements to the axioms without affecting previously provable theorems. The continually increasing repertoire of problem solving procedures can be exploited by a parallel search for solutions to additional externally posed tasks. POWERPLAY may be viewed as a greedy but practical implementation of basic principles of creativity. A first experimental analysis can be found in separate papers [53,54].

Submitted 4 November, 2012; v1 submitted 22 December, 2011; originally announced December 2011.

Comments: 21 pages, additional connections to previous work, references to first experiments with POWERPLAY

arXiv:1112.2113 [pdf, other]

Incremental Slow Feature Analysis: Adaptive and Episodic Learning from High-Dimensional Input Streams

Authors: Varun Raj Kompella, Matthew Luciw, Juergen Schmidhuber

Abstract: Slow Feature Analysis (SFA) extracts features representing the underlying causes of changes within a temporally coherent high-dimensional raw sensory input signal. Our novel incremental version of SFA (IncSFA) combines incremental Principal Components Analysis and Minor Components Analysis. Unlike standard batch-based SFA, IncSFA adapts along with non-stationary environments, is amenable to episodic training, is not corrupted by outliers, and is covariance-free. These properties make IncSFA a generally useful unsupervised preprocessor for autonomous learning agents and robots. In IncSFA, the CCIPCA and MCA updates take the form of Hebbian and anti-Hebbian updating, extending the biological plausibility of SFA. In both single node and deep network versions, IncSFA learns to encode its input streams (such as high-dimensional video) by informative slow features representing meaningful abstract environmental properties. It can handle cases where batch SFA fails.

Submitted 9 December, 2011; originally announced December 2011.

Journal ref: Neural Computation, 2012, Vol. 24, No. 11, Pages 2994-3024

arXiv:1109.1314 [pdf, ps, other]

Measuring Intelligence through Games

Authors: Tom Schaul, Julian Togelius, Jürgen Schmidhuber

Abstract: Artificial general intelligence (AGI) refers to research aimed at tackling the full problem of artificial intelligence, that is, creating truly intelligent agents. This sets it apart from most AI research which aims at solving relatively narrow domains, such as character recognition, motion planning, or increasing player satisfaction in games. But how do we know when an agent is truly intelligent? A common point of reference in the AGI community is Legg and Hutter's formal definition of universal intelligence, which has the appeal of simplicity and generality but is unfortunately incomputable. Games of various kinds are commonly used as benchmarks for "narrow" AI research, as they are considered to have many important properties. We argue that many of these properties carry over to the testing of general intelligence as well. We then sketch how such testing could practically be carried out. The central part of this sketch is an extension of universal intelligence to deal with finite time, and the use of sampling of the space of games expressed in a suitably biased game description language.

Submitted 6 September, 2011; originally announced September 2011.

arXiv:1106.4487 [pdf, ps, other]

Natural Evolution Strategies

Authors: Daan Wierstra, Tom Schaul, Tobias Glasmachers, Yi Sun, Jürgen Schmidhuber

Abstract: This paper presents Natural Evolution Strategies (NES), a recent family of algorithms that constitute a more principled approach to black-box optimization than established evolutionary algorithms. NES maintains a parameterized distribution on the set of solution candidates, and the natural gradient is used to update the distribution's parameters in the direction of higher expected fitness. We introduce a collection of techniques that address issues of convergence, robustness, sample complexity, computational complexity and sensitivity to hyperparameters. This paper explores a number of implementations of the NES family, ranging from general-purpose multi-variate normal distributions to heavy-tailed and separable distributions tailored towards global optimization and search in high dimensional spaces, respectively. Experimental results show best published performance on various standard benchmarks, as well as competitive performance on others.

Submitted 22 June, 2011; originally announced June 2011.
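
The loop described in the abstract above (sample candidates from a parameterized distribution, then move its parameters along a natural-gradient estimate) can be sketched for the separable Gaussian case. This is a minimal illustration under stated assumptions, not the authors' reference implementation: the rank-based utilities, the learning rates, the population size, and the name `snes_minimize` are choices made here, and the sketch minimizes rather than maximizes.

```python
import math
import random


def snes_minimize(f, mu, sigma, iterations=200, population=20, seed=0):
    """Minimize `f` with a separable Gaussian search distribution.

    `mu` and `sigma` are lists holding the initial per-coordinate mean and
    standard deviation. Returns the final mean.
    """
    rng = random.Random(seed)
    dim = len(mu)
    mu, sigma = list(mu), list(sigma)
    eta_mu = 1.0  # assumed learning rate for the mean
    eta_sigma = (3.0 + math.log(dim)) / (5.0 * math.sqrt(dim))  # assumed default

    # Rank-based utilities: the best sample gets the largest weight; sum is 0.
    raw = [max(0.0, math.log(population / 2.0 + 1.0) - math.log(rank))
           for rank in range(1, population + 1)]
    total = sum(raw)
    utilities = [r / total - 1.0 / population for r in raw]

    for _ in range(iterations):
        # Sample standard-normal noise, then map it through (mu, sigma).
        noise = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                 for _ in range(population)]
        candidates = [[mu[i] + sigma[i] * s[i] for i in range(dim)]
                      for s in noise]
        # Sort sample indices from best (lowest f) to worst.
        order = sorted(range(population), key=lambda k: f(candidates[k]))

        for i in range(dim):
            grad_mu = sum(u * noise[k][i] for u, k in zip(utilities, order))
            grad_sigma = sum(u * (noise[k][i] ** 2 - 1.0)
                             for u, k in zip(utilities, order))
            mu[i] += eta_mu * sigma[i] * grad_mu
            sigma[i] *= math.exp(0.5 * eta_sigma * grad_sigma)
    return mu
```

A call such as `snes_minimize(lambda z: sum(v * v for v in z), [3.0] * 5, [1.0] * 5)` runs the sketch on a five-dimensional sphere function. Because each coordinate keeps its own `sigma`, the cost per update is linear in the dimension, which is the trade-off separable variants make against modelling correlations.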

arXiv:1106.1998 [pdf, other]

A Linear Time Natural Evolution Strategy for Non-Separable Functions

Authors: Yi Sun, Faustino Gomez, Tom Schaul, Juergen Schmidhuber

Abstract: We present a novel Natural Evolution Strategy (NES) variant, the Rank-One NES (R1-NES), which uses a low rank approximation of the search distribution covariance matrix. The algorithm allows computation of the natural gradient with cost linear in the dimensionality of the parameter space, and excels in solving high-dimensional non-separable problems, including the best result to date on the Rosenbrock function (512 dimensions).

Submitted 13 June, 2011; v1 submitted 10 June, 2011; originally announced June 2011.

arXiv:1103.5708 [pdf, other]

Planning to Be Surprised: Optimal Bayesian Exploration in Dynamic Environments

Authors: Yi Sun, Faustino Gomez, Juergen Schmidhuber

Abstract: To maximize its success, an AGI typically needs to explore its initially unknown world. Is there an optimal way of doing so? Here we derive an affirmative answer for a broad class of environments.

Submitted 29 March, 2011; originally announced March 2011.

arXiv:1103.4487 [pdf, other]

Handwritten Digit Recognition with a Committee of Deep Neural Nets on GPUs

Authors: Dan C. Cireşan, Ueli Meier, Luca M. Gambardella, Jürgen Schmidhuber

Abstract: The competitive MNIST handwritten digit recognition benchmark has a long history of broken records since 1998. The most recent substantial improvement by others dates back 7 years (error rate 0.4%). Recently we were able to significantly improve this result, using graphics cards to greatly speed up training of simple but deep MLPs, which achieved 0.35%, outperforming all the previous more complex methods. Here we report another substantial improvement: 0.31% obtained using a committee of MLPs.

Submitted 23 March, 2011; originally announced March 2011.

Comments: 9 pages, 4 figures, 3 tables

Report number: IDSIA-03-11

arXiv:1102.0183 [pdf, other]

High-Performance Neural Networks for Visual Object Classification

Authors: Dan C. Cireşan, Ueli Meier, Jonathan Masci, Luca M. Gambardella, Jürgen Schmidhuber

Abstract: We present a fast, fully parameterizable GPU implementation of Convolutional Neural Network variants. Our feature extractors are neither carefully designed nor pre-wired, but rather learned in a supervised way. Our deep hierarchical architectures achieve the best published results on benchmarks for object classification (NORB, CIFAR10) and handwritten digit recognition (MNIST), with error rates of 2.53%, 19.51%, 0.35%, respectively. Deep nets trained by simple back-propagation perform better than more shallow ones. Learning is surprisingly rapid. NORB is completely trained within five epochs. Test error rates on MNIST drop to 2.42%, 0.97% and 0.48% after 1, 3 and 17 epochs, respectively.

Submitted 1 February, 2011; originally announced February 2011.

Comments: 12 pages, 2 figures, 5 tables

Report number: IDSIA 1-11

arXiv:1009.2634 [pdf, other]

Evolution of National Nobel Prize Shares in the 20th Century

Authors: Juergen Schmidhuber

Abstract: We analyze the evolution of cumulative national shares of Nobel Prizes since 1901, properly taking into account that most prizes were divided among several laureates. We rank by citizenship at the moment of the award, and by country of birth. Surprisingly, graphs of this type have not been published before, even though they powerfully illustrate the century's migration patterns (brain drains and gains) in the sciences and other fields.

Submitted 14 September, 2010; originally announced September 2010.

Comments: 19 pages, 17 figures

arXiv:1003.0358 [pdf, other]

doi 10.1162/NECO_a_00052

Deep Big Simple Neural Nets Excel on Handwritten Digit Recognition

Authors: Dan Claudiu Ciresan, Ueli Meier, Luca Maria Gambardella, Juergen Schmidhuber

Abstract: Good old on-line back-propagation for plain multi-layer perceptrons yields a very low 0.35% error rate on the famous MNIST handwritten digits benchmark. All we need to achieve this best result so far are many hidden layers, many neurons per layer, numerous deformed training images, and graphics cards to greatly speed up learning.

Submitted 1 March, 2010; originally announced March 2010.

Comments: 14 pages, 2 figures, 4 listings

Journal ref: Neural Computation, Volume 22, Number 12, December 2010

arXiv:0812.4360 [pdf, ps, other]

Driven by Compression Progress: A Simple Principle Explains Essential Aspects of Subjective Beauty, Novelty, Surprise, Interestingness, Attention, Curiosity, Creativity, Art, Science, Music, Jokes

Authors: Juergen Schmidhuber

Abstract: I argue that data becomes temporarily interesting by itself to some self-improving, but computationally limited, subjective observer once he learns to predict or compress the data in a better way, thus making it subjectively simpler and more beautiful. Curiosity is the desire to create or discover more non-random, non-arbitrary, regular data that is novel and surprising not in the traditional sense of Boltzmann and Shannon but in the sense that it allows for compression progress because its regularity was not yet known. This drive maximizes interestingness, the first derivative of subjective beauty or compressibility, that is, the steepness of the learning curve. It motivates exploring infants, pure mathematicians, composers, artists, dancers, comedians, yourself, and (since 1990) artificial systems.

Submitted 15 April, 2009; v1 submitted 23 December, 2008; originally announced December 2008.

Comments: 35 pages, 3 figures, based on KES 2008 keynote and ALT 2007 / DS 2007 joint invited lecture

Journal ref: Short version: J. Schmidhuber. Simple Algorithmic Theory of Subjective Beauty, Novelty, Surprise, Interestingness, Attention, Curiosity, Creativity, Art, Science, Music, Jokes. Journal of SICE 48(1), 21-32, 2009

arXiv:0807.1494 [pdf, ps, other]

doi 10.1007/978-3-642-13800-3_7

Algorithm Selection as a Bandit Problem with Unbounded Losses

Authors: Matteo Gagliolo, Juergen Schmidhuber

Abstract: Algorithm selection is typically based on models of algorithm performance, learned during a separate offline training sequence, which can be prohibitively expensive. In recent work, we adopted an online approach, in which a performance model is iteratively updated and used to guide selection on a sequence of problem instances. The resulting exploration-exploitation trade-off was represented as a bandit problem with expert advice, using an existing solver for this game, but this required the setting of an arbitrary bound on algorithm runtimes, thus invalidating the optimal regret of the solver. In this paper, we propose a simpler framework for representing algorithm selection as a bandit problem, with partial information, and an unknown bound on losses. We adapt an existing solver to this game, proving a bound on its expected regret, which holds also for the resulting algorithm selection technique. We present preliminary experiments with a set of SAT solvers on a mixed SAT-UNSAT benchmark.

Submitted 9 July, 2008; originally announced July 2008.

Comments: 15 pages, 2 figures

Report number: IDSIA-07-08

ACM Class: F.2.2; G.3; I.1.2; I.2.6; I.2.8

arXiv:0804.3269 [pdf, ps, other]

Phoneme recognition in TIMIT with BLSTM-CTC

Authors: Santiago Fernández, Alex Graves, Juergen Schmidhuber

Abstract: We compare the performance of a recurrent neural network with the best results published so far on phoneme recognition in the TIMIT database. These published results have been obtained with a combination of classifiers. However, in this paper we apply a single recurrent neural network to the same task. Our recurrent neural network attains an error rate of 24.6%. This result is not significantly different from that obtained by the other best methods, but they rely on a combination of classifiers for achieving comparable performance.

Submitted 21 April, 2008; originally announced April 2008.

Comments: 8 pages

Report number: IDSIA-04-08

ACM Class: I.2.7; I.5.4

arXiv:0709.0674 [pdf, ps, other]

Simple Algorithmic Principles of Discovery, Subjective Beauty, Selective Attention, Curiosity & Creativity

Authors: Juergen Schmidhuber

Abstract: I postulate that human or other intelligent agents function or should function as follows. They store all sensory observations as they come - the data is holy. At any time, given some agent's current coding capabilities, part of the data is compressible by a short and hopefully fast program / description / explanation / world model. In the agent's subjective eyes, such data is more regular and more "beautiful" than other data. It is well-known that knowledge of regularity and repeatability may improve the agent's ability to plan actions leading to external rewards. In absence of such rewards, however, known beauty is boring. Then "interestingness" becomes the first derivative of subjective beauty: as the learning agent improves its compression algorithm, formerly apparently random data parts become subjectively more regular and beautiful. Such progress in compressibility is measured and maximized by the curiosity drive: create action sequences that extend the observation history and yield previously unknown / unpredictable but quickly learnable algorithmic regularity. We discuss how all of the above can be naturally implemented on computers, through an extension of passive unsupervised learning to the case of active data selection: we reward a general reinforcement learner (with access to the adaptive compressor) for actions that improve the subjective compressibility of the growing data. An unusually large breakthrough in compressibility deserves the name "discovery". The "creativity" of artists, dancers, musicians, pure mathematicians can be viewed as a by-product of this principle. Several qualitative examples support this hypothesis.

Submitted 5 September, 2007; originally announced September 2007.

Comments: 15 pages, 3 highly compressible low-complexity drawings. Joint Invited Lecture for Algorithmic Learning Theory (ALT 2007) and Discovery Science (DS 2007), Sendai, Japan, 2007

ACM Class: I.2.0

arXiv:0709.0670 [pdf, ps, other]

Using Data Compressors to Construct Rank Tests

Authors: Daniil Ryabko, Juergen Schmidhuber

Abstract: Nonparametric rank tests for homogeneity and component independence are proposed, which are based on data compressors. For homogeneity testing the idea is to compress the binary string obtained by ordering the two joint samples and writing 0 if the element is from the first sample and 1 if it is from the second sample and breaking ties by randomization (extension to the case of multiple samples is straightforward). $H_0$ should be rejected if the string is compressed (to a certain degree) and accepted otherwise. We show that such a test obtained from an ideal data compressor is valid against all alternatives. Component independence is reduced to homogeneity testing by constructing two samples, one of which is the first half of the original and the other is the second half with one of the components randomly permuted.

Submitted 5 September, 2007; originally announced September 2007.

Journal ref: Applied Mathematics Letters, 22:7, 1029-1032, 2009
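
The homogeneity test outlined in the abstract above can be sketched with an off-the-shelf compressor. This is an illustration under assumptions, not the paper's procedure: `zlib` stands in for the ideal data compressor, the rejection threshold (null code length minus log2(1/alpha)) is an assumed form based on a counting argument, and all function names are hypothetical. Because `zlib` adds fixed overhead, the sketch is only meaningful for samples with at least a few thousand elements.

```python
import math
import random
import zlib


def rank_string(sample_a, sample_b, rng):
    """Order the pooled samples; emit 0 for sample_a items and 1 for sample_b."""
    # A random secondary key breaks ties between equal values by randomization.
    pooled = [(v, rng.random(), 0) for v in sample_a]
    pooled += [(v, rng.random(), 1) for v in sample_b]
    pooled.sort()
    return [label for _, _, label in pooled]


def compressed_bits(bits):
    """Pack the 0/1 list into bytes and return the zlib output size in bits."""
    padded = bits + [0] * (-len(bits) % 8)
    packed = bytes(
        int("".join(map(str, padded[i:i + 8])), 2)
        for i in range(0, len(padded), 8)
    )
    return 8 * len(zlib.compress(packed, 9))


def reject_homogeneity(sample_a, sample_b, alpha=0.01, seed=0):
    """Reject H_0 when the rank string compresses well below its null length."""
    rng = random.Random(seed)
    bits = rank_string(sample_a, sample_b, rng)
    # Under H_0 every arrangement of the labels is equally likely, so an ideal
    # code needs about log2 C(n_a + n_b, n_a) bits (assumed threshold form).
    null_bits = math.log2(math.comb(len(sample_a) + len(sample_b), len(sample_a)))
    return compressed_bits(bits) <= null_bits - math.log2(1.0 / alpha)
```

Two samples drawn from the same distribution give a rank string that looks random and does not compress, so the sketch keeps $H_0$. Two well-separated samples give a long run of 0s followed by a long run of 1s, which compresses to a few bytes and leads to rejection.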

arXiv:0708.4311 [pdf, ps, other]

2006: Celebrating 75 years of AI - History and Outlook: the Next 25 Years

Authors: Juergen Schmidhuber

Abstract: When Kurt Goedel laid the foundations of theoretical computer science in 1931, he also introduced essential concepts of the theory of Artificial Intelligence (AI). Although much of subsequent AI research has focused on heuristics, which still play a major role in many practical AI applications, in the new millennium AI theory has finally become a full-fledged formal science, with important optimality results for embodied agents living in unknown environments, obtained through a combination of theory a la Goedel and probability theory. Here we look back at important milestones of AI history, mention essential recent results, and speculate about what we may expect from the next 25 years, emphasizing the significance of the ongoing dramatic hardware speedups, and discussing Goedel-inspired, self-referential, self-improving universal problem solvers.

Submitted 31 August, 2007; originally announced August 2007.

Comments: 14 pages; preprint of invited contribution to the Proceedings of the ``50th Anniversary Summit of Artificial Intelligence'' at Monte Verita, Ascona, Switzerland, 9-14 July 2006

ACM Class: I.2.0

arXiv:0705.2011 [pdf, other]

Multi-Dimensional Recurrent Neural Networks

Authors: Alex Graves, Santiago Fernandez, Juergen Schmidhuber

Abstract: Recurrent neural networks (RNNs) have proved effective at one dimensional sequence learning tasks, such as speech and online handwriting recognition. Some of the properties that make RNNs suitable for such tasks, for example robustness to input warping, and the ability to access contextual information, are also desirable in multidimensional domains. However, there has so far been no direct way of applying RNNs to data with more than one spatio-temporal dimension. This paper introduces multi-dimensional recurrent neural networks (MDRNNs), thereby extending the potential applicability of RNNs to vision, video processing, medical imaging and many other areas, while avoiding the scaling problems that have plagued other multi-dimensional models. Experimental results are provided for two image segmentation tasks.

Submitted 14 May, 2007; originally announced May 2007.

Comments: 10 pages, 10 figures

Report number: 04-07

arXiv:cs/0701120 [pdf, ps, other]

doi 10.1016/j.ic.2006.10.004

Algorithmic Complexity Bounds on Future Prediction Errors

Authors: A. Chernov, M. Hutter, J. Schmidhuber

Abstract: We bound the future loss when predicting any (computably) stochastic sequence online. Solomonoff finitely bounded the total deviation of his universal predictor $M$ from the true distribution $\mu$ by the algorithmic complexity of $\mu$. Here we assume we are at a time $t>1$ and already observed $x=x_1...x_t$. We bound the future prediction performance on $x_{t+1}x_{t+2}...$ by a new variant of algorithmic complexity of $\mu$ given $x$, plus the complexity of the randomness deficiency of $x$. The new complexity is monotone in its condition in the sense that this complexity can only decrease if the condition is prolonged. We also briefly discuss potential generalizations to Bayesian model classes and to classification problems.

Submitted 19 January, 2007; originally announced January 2007.

Comments: 21 pages

Journal ref: Information and Computation, Vol. 205, Nr. 2 (2007) 242-261

arXiv:cs/0606081 [pdf, ps, other]

New Millennium AI and the Convergence of History

Authors: Juergen Schmidhuber

Abstract: Artificial Intelligence (AI) has recently become a real formal science: the new millennium brought the first mathematically sound, asymptotically optimal, universal problem solvers, providing a new, rigorous foundation for the previously largely heuristic field of General AI and embedded agents. At the same time there has been rapid progress in practical methods for learning true sequence-processing programs, as opposed to traditional methods limited to stationary pattern association. Here we will briefly review some of the new results, and speculate about future developments, pointing out that the time intervals between the most notable events in over 40,000 years or 2^9 lifetimes of human history have sped up exponentially, apparently converging to zero within the next few decades. Or is this impression just a by-product of the way humans allocate memory space to past events?

Submitted 29 June, 2006; v1 submitted 19 June, 2006; originally announced June 2006.

Comments: Speed Prior: clarification / 15 pages, to appear in "Challenges to Computational Intelligence"

Report number: IDSIA-14-06 ACM Class: I.2

arXiv:cs/0603023 [pdf, ps, other]

Metric State Space Reinforcement Learning for a Vision-Capable Mobile Robot

Authors: Viktor Zhumatiy, Faustino Gomez, Marcus Hutter, Juergen Schmidhuber

Abstract: We address the problem of autonomously learning controllers for vision-capable mobile robots. We extend McCallum's (1995) Nearest-Sequence Memory algorithm to allow for general metrics over state-action trajectories. We demonstrate the feasibility of our approach by successfully running our algorithm on a real mobile robot. The algorithm is novel and unique in that it (a) explores the environment and learns directly on a mobile robot without using a hand-made computer model as an intermediate step, (b) does not require manual discretization of the sensor input space, (c) works in piecewise continuous perceptual spaces, and (d) copes with partial observability. Together this allows learning from much less experience compared to previous methods.

Submitted 7 March, 2006; originally announced March 2006.

Comments: 14 pages, 8 figures

Report number: IDSIA-05-06

Journal ref: Proc. 9th International Conf. on Intelligent Autonomous Systems (IAS 2006) pages 272-281

arXiv:cs/0512062 [pdf, ps, other]

Evolino for recurrent support vector machines

Authors: Juergen Schmidhuber, Matteo Gagliolo, Daan Wierstra, Faustino Gomez

Abstract: Traditional Support Vector Machines (SVMs) need pre-wired finite time windows to predict and classify time series. They do not have an internal state necessary to deal with sequences involving arbitrary long-term dependencies. Here we introduce a new class of recurrent, truly sequential SVM-like devices with internal adaptive states, trained by a novel method called EVOlution of systems with KErnel-based outputs (Evoke), an instance of the recent Evolino class of methods. Evoke evolves recurrent neural networks to detect and represent temporal dependencies while using quadratic programming/support vector regression to produce precise outputs. Evoke is the first SVM-based mechanism learning to classify a context-sensitive language. It also outperforms recent state-of-the-art gradient-based recurrent neural networks (RNNs) on various time series prediction tasks.

Submitted 15 December, 2005; originally announced December 2005.

Comments: 10 pages, 2 figures

Report number: IDSIA-19-05 version 2.0 ACM Class: F.1.1; I.2.6

arXiv:cs/0309048 [pdf, ps, other]

Goedel Machines: Self-Referential Universal Problem Solvers Making Provably Optimal Self-Improvements

Authors: Juergen Schmidhuber

Abstract: We present the first class of mathematically rigorous, general, fully self-referential, self-improving, optimally efficient problem solvers. Inspired by Kurt Goedel's celebrated self-referential formulas (1931), such a problem solver rewrites any part of its own code as soon as it has found a proof that the rewrite is useful, where the problem-dependent utility function and the hardware and the entire initial code are described by axioms encoded in an initial proof searcher which is also part of the initial code. The searcher systematically and efficiently tests computable proof techniques (programs whose outputs are proofs) until it finds a provably useful, computable self-rewrite. We show that such a self-rewrite is globally optimal - no local maxima! - since the code first had to prove that it is not useful to continue the proof search for alternative self-rewrites. Unlike previous non-self-referential methods based on hardwired proof searchers, ours not only boasts an optimal order of complexity but can optimally reduce any slowdowns hidden by the O()-notation, provided the utility of such speed-ups is provable at all.

Submitted 17 December, 2006; v1 submitted 25 September, 2003; originally announced September 2003.

Comments: 29 pages, 1 figure, minor improvements, updated references

Report number: IDSIA-19-03 ACM Class: F.4.1

Journal ref: Variants published in "Adaptive Agents and Multi-Agent Systems II", LNCS 3394, p. 1-23, Springer, 2005: ISBN 978-3-540-25260-3; as well as in Proc. ICANN 2005, LNCS 3697, p. 223-233, Springer, 2005 (plenary talk); as well as in "Artificial General Intelligence", Series: Cognitive Technologies, Springer, 2006: ISBN-13: 978-3-540-23733-4

arXiv:cs/0302012 [pdf, ps, other]

The New AI: General & Sound & Relevant for Physics

Authors: Juergen Schmidhuber

Abstract: Most traditional artificial intelligence (AI) systems of the past 50 years are either very limited, or based on heuristics, or both. The new millennium, however, has brought substantial progress in the field of theoretically optimal and practically feasible algorithms for prediction, search, inductive inference based on Occam's razor, problem solving, decision making, and reinforcement learning in environments of a very general type. Since inductive inference is at the heart of all inductive sciences, some of the results are relevant not only for AI and computer science but also for physics, provoking nontraditional predictions based on Zuse's thesis of the computer-generated universe.

Submitted 27 November, 2003; v1 submitted 10 February, 2003; originally announced February 2003.

Comments: 23 pages, updated refs, added Goedel machine overview, corrected computing history timeline. To appear in B. Goertzel and C. Pennachin, eds.: Artificial General Intelligence

Report number: TR IDSIA-04-03 ACM Class: I.2

arXiv:cs/0207097 [pdf, ps, other]

Optimal Ordered Problem Solver

Authors: Juergen Schmidhuber

Abstract: We present a novel, general, optimally fast, incremental way of searching for a universal algorithm that solves each task in a sequence of tasks. The Optimal Ordered Problem Solver (OOPS) continually organizes and exploits previously found solutions to earlier tasks, efficiently searching not only the space of domain-specific algorithms, but also the space of search algorithms. Essentially we extend the principles of optimal nonincremental universal search to build an incremental universal learner that is able to improve itself through experience. In illustrative experiments, our self-improver becomes the first general system that learns to solve all n-disk Towers of Hanoi tasks (solution size 2^n-1) for n up to 30, profiting from previously solved, simpler tasks involving samples of a simple context-free language.

Submitted 23 December, 2002; v1 submitted 31 July, 2002; originally announced July 2002.

Comments: 43 pages, 2 figures, short version at NIPS 2002 (added 1 figure and references; streamlined presentation)

Report number: IDSIA-12-02 ACM Class: I.2.2; I.2.6; I.2.8

Journal ref: Machine Learning, 54, 211-254, 2004.

arXiv:cs/0111060 [pdf, ps, other]

Gradient-based Reinforcement Planning in Policy-Search Methods

Authors: Ivo Kwee, Marcus Hutter, Juergen Schmidhuber

Abstract: We introduce a learning method called "gradient-based reinforcement planning" (GREP). Unlike traditional DP methods that improve their policy backwards in time, GREP is a gradient-based method that plans ahead and improves its policy before it actually acts in the environment. We derive formulas for the exact policy gradient that maximizes the expected future reward and confirm our ideas with numerical experiments.

Submitted 28 November, 2001; originally announced November 2001.

Comments: This is an extended version of the paper presented at the EWRL 2001 in Utrecht (The Netherlands)

Report number: 14-01 ACM Class: I.2; I.2.6; I.2.8

Showing 101–150 of 153 results for author: Schmidhuber, J