Convolutional networks for images, speech, and time series

Y LeCun, Y Bengio - The handbook of brain theory and neural networks, 1995 - Citeseer
The ability of multilayer back-propagation networks to learn complex, high-dimensional, nonlinear mappings from large collections of examples makes them obvious candidates for image recognition or speech recognition tasks (see PATTERN RECOGNITION AND NEURAL NETWORKS). In the traditional model of pattern recognition, a hand-designed feature extractor gathers relevant information from the input and eliminates irrelevant variabilities. A trainable classifier then categorizes the resulting feature vectors (or strings of symbols) into classes. In this scheme, standard, fully-connected multilayer networks can be used as classifiers. A potentially more interesting scheme is to eliminate the feature extractor, feeding the network with "raw" inputs (e.g., normalized images), and to rely on backpropagation to turn the first few layers into an appropriate feature extractor. While this can be done with an ordinary fully connected feed-forward network with some success for tasks such as character recognition, there are problems.
Firstly, typical images, or spectral representations of spoken words, are large, often with several hundred variables. A fully-connected first layer with, say, a few hundred hidden units would already contain several tens of thousands of weights. Overfitting problems may occur if training data is scarce. In addition, the memory requirement for that many weights may rule out certain hardware implementations. But the main deficiency of unstructured nets for image or speech applications is that they have no built-in invariance with respect to translations, or …
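The weight count quoted above for a fully-connected first layer can be checked with a short sketch. The sizes used here (a 20 x 20 input and 100 hidden units) and the function name are illustrative assumptions, not figures from the paper; they are chosen only to fall within "several hundred variables" and roughly the stated number of hidden units.

```python
def fully_connected_param_count(n_inputs: int, n_hidden: int,
                                include_bias: bool = True) -> int:
    """Count trainable parameters in one fully connected layer."""
    # One weight for every (input variable, hidden unit) pair.
    weights = n_inputs * n_hidden
    # Optionally one bias per hidden unit.
    biases = n_hidden if include_bias else 0
    return weights + biases


# Hypothetical sizes, for illustration only (not taken from the paper).
image_height, image_width = 20, 20
n_inputs = image_height * image_width   # 400 input variables
n_hidden = 100                          # hidden units in the first layer

n_weights = fully_connected_param_count(n_inputs, n_hidden, include_bias=False)
# Memory needed if each weight were stored as a 4-byte float (an assumption).
memory_bytes = n_weights * 4

print(f"inputs: {n_inputs}, hidden units: {n_hidden}")
print(f"weights in first layer: {n_weights}")
print(f"approximate storage: {memory_bytes} bytes")
```

Under these assumed sizes the first layer alone holds 40,000 weights, and the count grows with the product of input size and hidden-unit count, which is why both overfitting and memory become concerns for larger inputs.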