INFORMATION THEORY AND THE BRAIN
Information Theory and the Brain deals with a new and expanding area of
neuroscience which provides a framework for understanding neuronal proces-
sing. It is derived from a conference held in Newquay, UK, where a handful of
scientists from around the world met to discuss the topic. This book begins
with an introduction to the basic concepts of information theory and then
illustrates these concepts with examples from research over the last 40 years.
Throughout the book, the contributors highlight current research from four
different areas: (1) biological networks, including a review of information
theory based on models of the retina, understanding the operation of the insect
retina in terms of energy efficiency, and the relationship of image statistics and
image coding; (2) information theory and artificial networks, including inde-
pendent component-based networks and models of the emergence of orienta-
tion and ocular dominance maps; (3) information theory and psychology,
including clarity of speech models, information theory and connectionist mod-
els, and models of information theory and resource allocation; (4) formal
analysis, including chapters on modelling the hippocampus, stochastic reso-
nance, and measuring information density. Each part includes an introduction
and glossary covering basic concepts.
This book will appeal to graduate students and researchers in neuroscience
as well as computer scientists and cognitive scientists. Neuroscientists inter-
ested in any aspect of neural networks or information processing will find this a
very useful addition to the current literature in this rapidly growing field.
Edited by
ROLAND BADDELEY
University of Sussex
PETER HANCOCK
University of Stirling
PETER FOLDIAK
University of St. Andrews
CAMBRIDGE
UNIVERSITY PRESS
CAMBRIDGE UNIVERSITY PRESS
Cambridge, New York, Melbourne, Madrid, Cape Town, Singapore, Sao Paulo, Delhi
Published in the United States of America by Cambridge University Press, New York
www.cambridge.org
Information on this title: www.cambridge.org/9780521631976
A catalogue record for this publication is available from the British Library
Introductory Information Theory and the Brain
ROLAND BADDELEY
1.1 Introduction
Learning and using a new technique always takes time. Even if the question
initially seems very straightforward, inevitably technicalities rudely intrude.
Therefore before a researcher decides to use the methods information theory
provides, it is worth finding out if this set of tools is appropriate for the
task in hand.
In this chapter I will therefore provide only a few important formulae and
no rigorous mathematical proofs (Cover and Thomas (1991) is excellent in
this respect). Neither will I provide simple "how to" recipes (for the psychol-
ogist, even after nearly 40 years, Attneave (1959) is still a good introduction).
Instead, it is hoped to provide a non-mathematical introduction to the basic
concepts and, using examples from the literature, show the kind of questions
information theory can be used to address. If, after reading this and the
following chapters, the reader decides that the methods are inappropriate,
he will have saved time. If, on the other hand, the methods seem potentially
useful, it is hoped that this chapter provides a simplistic overview that will
alleviate the growing pains.
Information Theory and the Brain, edited by Roland Baddeley, Peter Hancock, and Peter Foldiak.
Copyright © 1999 Cambridge University Press. All rights reserved.
The "amount of information" is exactly the same concept that we talked about for
years under the name "variance". [Miller, 1956]
The technical meaning of "information" is not radically different from the everyday
meaning; it is merely more precise. [Attneave, 1959]
The mutual information I(X; Y) is the relative entropy between the joint distribution
p(x, y) and the product distribution p(x)p(y), i.e.,
I(X; Y) = \sum_{x,y} p(x, y) \log \frac{p(x, y)}{p(x)p(y)}. [Cover and Thomas, 1991]
Entropy
The first important aspect to quantify is how "uncertain" we are about the
input we have before we measure it. There is much less to communicate
about the page numbers in a two-page pamphlet than in the Encyclopedia
Britannica and, as the measure of this initial uncertainty, entropy measures
how many yes/no questions would be required on average to guess the state
of the world. Given that all pages are equally likely, the number of yes/no
questions required to guess the page flipped to in a two-page pamphlet would
be 1, and hence this would have an entropy (uncertainty) of 1 bit. For a 1024
(2^10) page book, 10 yes/no questions are required on average and the entropy
would be 10 bits. For a one-page book, you would not even need to ask a
question, so it would have 0 bits of entropy. As well as the number of
questions required to guess a signal, the entropy also measures the smallest
possible size that the information could be compressed to.
H = \og2N (1.1)
where N is the number of possible states of the world, and log2 means that
the logarithm is to the base 2.¹ Simply put, the more pages in a book, the
more yes/no questions required to identify the page and the higher the
entropy. But rather than work in a measuring system based on "number of
pages", we work with logarithms. The reason for this is simply that in many
cases we will be dealing with multiple events. If the "page flipper" flips twice,
the number of possible combinations of pages would be N × N (the
numbers of states multiply). If instead we use logarithms, then the entropy of
two-page flips will simply be the sum of the individual entropies (if the
number of states multiply, their logarithms add). Addition is simpler than
multiplication so by working with logs, we make subsequent calculations
much simpler (we also make the numbers much more manageable; an
entropy of 25 bits is more memorable than a system of 33,554,432 states).
When all states of the world are not equally likely, then compression is
possible and fewer questions need (on average) to be asked to identify an
input. People often are biased page flippers, flipping more often to the middle
pages. A clever compression algorithm, or a wise asker of questions can use
this information to take, on average, fewer questions to identify the given
page. One of the main results of information theory is that given knowledge
of the probability of all events, the minimum number of questions on average
required to identify a given event (and the smallest size to which the data can be
compressed) is given by:

H = \sum_{x} p(x) \log_2 \frac{1}{p(x)} \qquad (1.2)
where p(x) is the probability of event x. If all events are equally likely, this
reduces to equation 1.1. In all cases the value of equation 1.2 will always be
equal to (if all states are equally likely), or less than (if the probabilities are
not equal) the entropy as calculated using equation 1.1. This leads us to call a
distribution where all states are equally likely a maximum entropy distribu-
tion, a property we will come back to later in Section 1.5.
¹ Logarithms to the base 2 are often used since this makes the "number of yes/no questions"
interpretation possible. Sometimes, for mathematical convenience, natural logarithms are used and
the resulting measurements are then expressed in nats. The conversion is simple, with
1 bit = log_e 2 nats ≈ 0.693 nats.
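As a concrete illustration of equations 1.1 and 1.2, the short Python sketch below computes the entropy of a discrete distribution; the example distributions are invented purely for illustration.

```python
import numpy as np

def entropy(p, base=2):
    """Entropy of a discrete distribution (equation 1.2), in bits by default."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                        # events with zero probability contribute nothing
    return float(np.sum(p * np.log(1.0 / p)) / np.log(base))

# A fair 1024-page book: log2(1024) = 10 bits, as in equation 1.1.
print(entropy(np.ones(1024) / 1024))         # -> 10.0

# A biased "page flipper" needs fewer yes/no questions on average.
print(entropy([0.5, 0.25, 0.125, 0.125]))    # -> 1.75
```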
Information
So entropy is intuitively a measure of (the logarithm of) the number of states
the world could be in. If, after measuring the world, this uncertainty is
decreased (it can never be increased), then the amount of decrease tells us
how much we have learned. Therefore, the information is defined as the
difference between the uncertainty before and after making a measurement.
Using the probability theory notation of P(X|Y) to indicate the probability
of X given knowledge of Y (conditional on), the mutual information
I(X; Y) between a measurement X and the input Y can be defined as:

I(X; Y) = H(X) - H(X|Y) \qquad (1.3)
With a bit of mathematical manipulation, we can also get the following
definitions, where H(X, Y) is the entropy of all combinations of inputs and
outputs (the joint distribution):

I(X; Y) = H(X) - H(X|Y) \quad (a)
        = H(Y) - H(Y|X) \quad (b) \qquad (1.4)
        = H(X) + H(Y) - H(X, Y) \quad (c)
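A minimal numerical sketch of these definitions, using form (c) of equation 1.4 on a small joint probability table (the tables themselves are invented examples):

```python
import numpy as np

def mutual_information(p_xy, base=2):
    """I(X; Y) from a joint probability table p_xy[i, j] = p(x_i, y_j),
    computed as H(X) + H(Y) - H(X, Y), form (c) of equation 1.4."""
    p_xy = np.asarray(p_xy, dtype=float)

    def H(p):
        p = p[p > 0]
        return -np.sum(p * np.log(p)) / np.log(base)

    return H(p_xy.sum(axis=1)) + H(p_xy.sum(axis=0)) - H(p_xy.ravel())

# Noiseless binary channel: the output copies the input, so I(X;Y) = H(X) = 1 bit.
print(mutual_information([[0.5, 0.0],
                          [0.0, 0.5]]))      # -> 1.0

# Useless channel: the output is independent of the input, so I(X;Y) = 0 bits.
print(mutual_information([[0.25, 0.25],
                          [0.25, 0.25]]))    # -> 0.0
```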
[Figure 1.1: subject-generated "random" letter strings, together with the entropy formula H(X) = \sum_x P(x) \log 1/P(x).]
Figure 1.1. The most straightforward method to calculate entropy or mutual informa-
tion is direct estimation of the probability distributions (after Baddeley, 1956). One case
where this is appropriate is in using the entropy of subjects' random number generation
ability as a measure of cognitive load. The subject is asked to generate random digit
sequences in time with a metronome, either as the only task, or while simultaneously
performing a task such as card sorting. Depending on the difficulty of the other task and
the speed of generation, the "randomness" of the digits will decrease. The simplest way
to estimate entropy is to estimate the probability of different letters. Using this measure
of entropy, redundancy (entropy/maximum entropy) decreases linearly with generation
time, and also with the difficulty of the other task. This has subsequently proved a very
effective measure of cognitive load.
[Figure 1.2: panels A-C show tone sets of increasing number spanning 100 Hz to 10,000 Hz; panel D plots transmitted information against input information.]
Figure 1.2. Estimating the "channel capacity" for tone discrimination (after Pollack,
1952, 1953). The subject is presented with a number of tones and asked to assign
numeric labels to them. Given only three tones (A), the subject has almost perfect
performance, but as the number of tones increases (B), performance rapidly deteriorates.
This is not primarily an early sensory constraint, as performance is similar when the
tones are tightly grouped (C). One way to analyse such data is to plot the transmitted
information as a function of the number of input stimuli (D). As can be seen, up until
about 2.5 bits, all the available information is transmitted, but when the input informa-
tion is above 2.5 bits, the excess information is lost. This limited capacity has been found
for many tasks and was of great interest in the 1960s.
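The transmitted information plotted in panel D of Figure 1.2 can be estimated directly from a stimulus-response confusion matrix. The sketch below uses hypothetical counts for a three-tone experiment; note that this naive estimator is biased upwards when counts are small, a point returned to in the discussion of data requirements below.

```python
import numpy as np

def transmitted_information(confusion_counts):
    """Mutual information (bits) between stimulus and response, estimated
    from a confusion matrix of raw trial counts."""
    counts = np.asarray(confusion_counts, dtype=float)
    p = counts / counts.sum()                  # joint distribution p(stimulus, response)
    p_s = p.sum(axis=1, keepdims=True)         # stimulus marginal
    p_r = p.sum(axis=0, keepdims=True)         # response marginal
    nz = p > 0
    return float(np.sum(p[nz] * np.log2(p[nz] / (p_s @ p_r)[nz])))

# Hypothetical counts: three tones, labelled almost perfectly (close to log2(3) = 1.58 bits).
print(transmitted_information([[48, 2, 0],
                               [3, 45, 2],
                               [0, 4, 46]]))
```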
Continuous Distributions
Given that the data are discrete, and we have enough data, then simply
estimating probability distributions presents few conceptual problems.
Unfortunately if we have continuous variables such as membrane potentials,
or reaction times, then we have a problem. While the entropy of a discrete
probability distribution is finite, the entropy of any continuous variable is
infinite. One easy way to see this is that using a single real number between 0
and 1, we could very simply code the entire Encyclopedia Britannica. The first
two digits after the decimal place could represent the first letter; the second
two digits could represent the second letter, and so on. Given no constraint
on accuracy, this means that the entropy of a continuous variable is infinite.
Before giving up hope, it should be remembered that mutual information
as specified by equation 1.4 is the difference between two entropies. It turns
out that as long as there is some noise in the system (H(X|Y) > 0), then the
difference between these two infinite entropies is finite. This makes the role of
noise vital in any information theory measurement of continuous variables.
One particular case is if both the signal and noise are Gaussian (i.e.
normally) distributed. In this case the mutual information between the signal
(s) and the noise-corrupted version (s_n) is simply:

I(s; s_n) = \frac{1}{2} \log_2\left(1 + \frac{\sigma^2_{signal}}{\sigma^2_{noise}}\right) \qquad (1.5)

where \sigma^2_{signal} is the variance of the signal, and \sigma^2_{noise} is the variance of the noise.
This has the expected characteristics: the larger the signal relative to the noise,
the larger the amount of information transmitted; a doubling of the signal will
result in an approximately 1 bit increase in information transmission; and the
information transmitted will be independent of the unit of measurement.
It is important to note that the above expression is only valid when both
the signal and noise are Gaussian. While this is often a reasonable and
testable assumption because of the central limit theorem (basically, the
more things we add, usually the more Gaussian the system becomes), it is
still only an estimate and can underestimate the information (if the signal is
more Gaussian than the noise) or overestimate the information (if the noise is
more Gaussian than the signal).
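Equation 1.5 is simple enough to evaluate directly. A small sketch, with arbitrary illustrative variances:

```python
import numpy as np

def gaussian_information(var_signal, var_noise):
    """Mutual information (bits) between a Gaussian signal and its
    noise-corrupted version, equation 1.5."""
    return 0.5 * np.log2(1.0 + var_signal / var_noise)

print(gaussian_information(1.0, 1.0))     # SNR = 1            -> 0.5 bits
print(gaussian_information(4.0, 1.0))     # doubled amplitude  -> ~1.16 bits
print(gaussian_information(400.0, 1.0))   # at high SNR, doubling the amplitude again...
print(gaussian_information(1600.0, 1.0))  # ...adds almost exactly 1 bit
```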
A second problem concerns correlated signals. Often a signal will have
structure - for instance, it could vary only slowly over time. Alternatively,
we could have multiple measurements. If all these measurements are inde-
pendent, then the situation is simple - the entropies and mutual informations
simply add. If, on the other hand, the variables are correlated across time,
then some method is required to take these correlations into account. In an
extreme case if all the measurements were identical in both signal and noise,
the information from one such measurement would be the same as the com-
bined information from all; it is important to deal with these effects of
correlation in some way.
Perhaps the most common way to deal with this "correlated measure-
ments" problem is to transform the signal to the Fourier domain. This
method is used in a number of papers in this volume and the underlying
logic is described in Figure 1.3.
[Figure 1.3: (A) the original slowly varying signal as a function of time; (B) its decomposition into sines and cosines of different frequencies; (C) the amplitude of each frequency component.]
Figure 1.3. Taking into account correlations in data by transforming to a new repre-
sentation. (A) shows a signal varying slowly as a function of time. Because the voltages
at different time steps are correlated, it is not possible to treat each time step as inde-
pendent and work out the information as the sum of the information values at different
time steps. One way to approach this problem is to transform the signal to a new
representation where all components are now uncorrelated. If the signal is Gaussian,
transforming to a Fourier series representation has this property. Here we represent the
original signal (A) as a sum of sines and cosines of different frequencies (B). While the
individual time measurements are correlated, if the signal is Gaussian, the amounts of
the individual Fourier components (C) will be uncorrelated. Therefore the mutual information
for the whole signal will simply be the sum of the information values for the individual
frequencies (and these can be calculated using equation 1.5).
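A toy numerical sketch of this Fourier-domain bookkeeping: a correlated Gaussian signal is generated by smoothing white noise (the smoothing kernel and noise level are invented for illustration), the per-frequency signal and noise powers are estimated, and the per-frequency contributions of equation 1.5 are summed.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials, n_samples = 2000, 64

# Invented correlated Gaussian signal: white noise smoothed by an exponential kernel.
white = rng.standard_normal((n_trials, n_samples))
kernel = np.exp(-np.arange(n_samples) / 5.0)
signal = np.fft.irfft(np.fft.rfft(white, axis=1) * np.fft.rfft(kernel), n_samples, axis=1)
noise = 0.5 * rng.standard_normal((n_trials, n_samples))   # invented white measurement noise

# Per-frequency signal and noise power: the Fourier components are (nearly) uncorrelated.
S = np.mean(np.abs(np.fft.rfft(signal, axis=1)) ** 2, axis=0)
N = np.mean(np.abs(np.fft.rfft(noise, axis=1)) ** 2, axis=0)

# Total information: the sum of independent per-frequency contributions (equation 1.5).
bits_per_segment = np.sum(0.5 * np.log2(1.0 + S / N))
print(bits_per_segment, "bits per 64-sample segment")
```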
The Fourier transform method always uses the same representation (in
terms of sines and cosines) independent of the data. In some cases, especially
when we do not have that much data, it may be more useful to choose a
representation which still has the uncorrelated property of the Fourier com-
ponents, but is optimised to represent a particular data set. One plausible
candidate for such a method is principal components analysis. Here a new set
of measurements, based on linear transformation of the original data, is used
to describe the data. The first component is the linear combination of the
original measurements that captures the maximum amount of variance. The
second component is formed by a linear combination of the original mea-
surements that captures as much of the variance as possible while being
orthogonal to the first component (and hence independent of the first com-
ponent if the signal is Gaussian). Further components can be constructed in a
similar manner. The main advantage over a Fourier-based representation is
that more of the signal can be described using fewer descriptors and thus less
data is required to estimate the characteristics of the signal and noise.
Methods based on principal-component-based representations of spike
trains have been applied to calculating the information transmitted by cor-
tical neurons (Richmond and Optican, 1990).
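A minimal sketch of this principal-components step, using invented Poisson "spike count" data rather than real recordings: the projections onto the leading components are mutually uncorrelated, so under the Gaussian assumption their information contributions can be treated separately.

```python
import numpy as np

def principal_components(responses, n_components):
    """Project each response vector (e.g. binned spike counts) onto the leading
    principal components; the resulting coordinates are mutually uncorrelated."""
    centred = responses - responses.mean(axis=0)
    cov = np.cov(centred, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)             # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]   # largest variance first
    return centred @ eigvecs[:, order]

# Invented "spike count" data: 500 trials, 20 time bins each.
rng = np.random.default_rng(1)
data = rng.poisson(5.0, size=(500, 20)).astype(float)
projections = principal_components(data, 3)
print(np.round(np.corrcoef(projections, rowvar=False), 3))   # approximately the identity
```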
All the above methods rely on an assumption of Gaussian nature of the
signal, and if this is not true and there exist non-linear relationships between
the inputs and outputs, methods based on Fourier analysis or principal
components analysis can only give rather inaccurate estimates. One method
that can be applied in this case is to use a non-linear compression method to
generate a compressed representation before performing the information
estimation (see Figure 1.4).
[Figure 1.4: (A) a linear auto-encoder network with n input units, h linear coding units and n output units; (B) its non-linear generalisation, with additional layers of non-linear units between input, bottleneck and output.]
Figure 1.4. Using non-linear compression techniques for generating compact represen-
tations of data. Linear principal components analysis can be performed using the neural
network shown in (A) where a copy of the input is used as the target output. On
convergence, the weights from the n input units to the h coding units will span the
same space as the first h principal components and, given that the input is Gaussian,
the coding units will be a good representation of the signal. If, on the other hand, there
is non-Gaussian non-linear structure in the signals, this approach may not be optimal.
One possible approach to dealing with such non-linearity is to use a compression-based
algorithm to create a non-linear compressed representation of the signals. This can be
done using the non-linear generalisation of the simple network to allow non-linearities
in processing (shown in (B)). Again the network is trained to recreate its input from its
output, while transmitting the information through a bottleneck, but this time the data
is allowed to be transformed using an arbitrary non-linearity before coding. If there are
significant non-linearities in the data, the representation provided by the bottleneck
units may provide a better representation of the input than a principal-components-
based representation. (After Fotheringhame and Baddeley, 1997.)
Table 1.1. Estimating the entropy of English using an intelligent predictor (after Shannon,
1951).
T H E R E   I S   N O   R E V E R S E
1 1 1 5 1 1 2 1 1 2 1 1 15 1 17 1 1 1 2
O N   A   M O T O R C Y C L E
1 3 2 1 2 2 7 1 1 1 1 4 1 1 1 1
Above is a short passage of text. Underneath each letter is the number of guesses required by a
person to guess that letter based only on knowledge of the previous letters. If the letters were
completely random (maximum entropy and no redundancy), the best predictor would take on
average 27/2 guesses (26 letters and a space) for every letter. If, on the other hand, there is complete
predictability, then a predictor would require only one guess per letter. English is between
these two extremes and, using this method, Shannon estimated an entropy per letter of between 1.6
and 0.6 bits per letter. This contrasts with log_2 27 ≈ 4.76 bits if every letter was equally likely and
independent. Technical details can be found in Shannon (1951) and Attneave (1959).
[Figure 1.5: panels show the Walsh pattern stimuli, the recorded neuron and its spike train (B, C), the neural network producing a prediction of the input (D), and the set of Walsh patterns (E).]
Figure 1.5. Estimating neuronal information transfer rate using a neural network based
predictor (after Heller et al., 1995). A collection of 32 4x4 Walsh patterns (and their
contrast reversed versions) (A) were presented to awake Rhesus Macaque monkeys, and
the spike trains generated by neurons in V1 and IT recorded (B and C). Using differ-
ently coded versions of these spike trains as input, a neural network (D) was trained
using the back-propagation algorithm to predict which Walsh pattern was presented.
Intuitively, if the spike train contains a lot of information about the input, then an
accurate prediction is possible, while if there is very little information then the spike
train will not allow accurate prediction of the input. Notice that (1) the calculated
information will be very dependent on the choice (and number) of stimuli, and (2)
even though we are using a predictor, implicitly we are still estimating probability
distributions and hence we require large amounts of data to accurately estimate the
information. Using this method, it was claimed that the neurons only transmitted small
amounts of information (~ 0.5 bits), and that this information was contained not in the
exact timing of the spikes, but in a local "rate".
average firing rate, vectors representing the presence and absence of spikes,
various low-pass-filtered versions of the spike train, etc). These codified spike
trains were used to train a neural network to predict the visual stimulus that
was presented when the neurons generated these spikes. The accuracy of
these predictions, given some assumptions, can again be used to estimate
the mutual information between the visual input and the differently coded
spike trains. For these neurons and stimuli, the information trans-
mission is relatively small (≈ 0.5 bits s⁻¹).
[Figure 1.6: (A) the Bodleian Library declaration ("I hereby undertake not to remove from the library, or to mark, deface, or injure in any way, any volume, document, or other object belonging to it or in its custody; not to bring into the Library or kindle ..."); (B) entropies and cross entropies are estimated using compression-algorithm techniques and the languages are clustered using the cross entropies as distances; (C) part of the resulting tree: Basque, Manx (Celtic), English, Dutch, German, Italian, Spanish.]
Figure 1.6. Estimating entropies and cross entropies using compression-based techni-
ques. The declaration of the Bodleian Library (Oxford) has been translated into more
than 50 languages (A). The entropy of these letter sequences can be estimated using the
size of a compressed version of the statement. If the code book derived by the algorithm
for one language is used to code another language, the size of the code book will reflect
the cross entropy (B). Hierarchical minimum distance cluster analysis, using these cross
entropies as distances, can then be applied to these data (a small subset of the resulting
tree is shown (C)). This method can produce an automatic taxonomy of languages, and
has been shown to correspond very closely to those derived using more traditional
linguistic analysis (Juola, P., personal communication).
time mean that only the earliest algorithms simply performed compression,
but the concept behind later algorithms is essentially the same.)
More recently, this compression approach to entropy estimation has been
applied to automatically calculating linguistic taxonomies (Figure 1.6). The
entropy was calculated using a modified compression algorithm based on
Farach et al. (1995). Cross entropy was estimated using the compressed
length when the code book derived for one language was used to compress
another. Though methods based on compression have not been commonly
used in the theoretical neuroscience community (but see Redlich, 1993), they
provide at least interesting possibilities.
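A rough sketch of the compression idea using a general-purpose compressor (zlib here is only a stand-in for the modified algorithm of Farach et al., and the file names are hypothetical): the compressed size bounds the entropy, and the extra bits needed to compress one text after another approximate a cross entropy.

```python
import zlib

def bits_per_character(text):
    """Crude upper bound on the entropy per character, from the compressed size."""
    return 8.0 * len(zlib.compress(text.encode("utf-8"), 9)) / len(text)

def cross_bits_per_character(reference, text):
    """Rough cross-entropy estimate: the extra bits needed to code `text` when the
    compressor has already processed `reference` (a stand-in for reusing one
    language's code book on another)."""
    ref = reference.encode("utf-8")
    both = len(zlib.compress(ref + text.encode("utf-8"), 9))
    alone = len(zlib.compress(ref, 9))
    return 8.0 * (both - alone) / len(text)

# Hypothetical file names holding two translations of the same declaration.
english = open("declaration_en.txt").read()
dutch = open("declaration_nl.txt").read()
print(bits_per_character(english))
print(cross_bits_per_character(english, dutch))   # smaller for more closely related languages
```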
know very little more. The entropy (and hence the maximum amount of
information transmission) is maximal when the uncertainty is maximal,
and this occurs when both alternatives are equally likely. In this case we
want questions where "yes" has the same probability as "no". For instance
a question such as "Is it in the first or second half of the book?" will generally
tell you more than "Is it page 2?". The entropy as a function of probability is
shown for a yes/no system (binary channel) in Figure 1.8.
When there are more possible signalling states than true and false, the
constraints become much more important. Figure 1.9 shows three of the
simplest cases of constraints and the nature of the outputs (if we have no
noise) that will maximise information transmission. It is interesting to note
that the spike trains of neurons are exponentially distributed as shown in
Figure 1.9(C), consistent with maximal information transmission subject to
an average firing rate constraint (Baddeley et al., 1997).
The Huge Data Requirement. Possibly the greatest problem with information
theory is its requirement for vast amounts of data if the results are to tell us
more about the data than about the assumptions used to calculate its value.
As mentioned in Section 1.4, estimating the probability of every three-letter
combination in English would require sufficient data to estimate 19,683 dif-
ferent probabilities. While this may actually be possible given the large num-
ber of books available electronically, to get a better approximation to
English, (say, eight-letter combinations), the amount of data required
Figure 1.8. The entropy of a binary random (Bernoulli) variable is a function of its
probability and maximum when its probability is 0.5 (when it has an entropy of 1 bit).
Intuitively, if a measurement is always false (or always true) then we are not uncertain of
its value. If instead it is true as often as not, then the uncertainty, and hence the entropy,
is maximised.
Does the Receiver Know About the Input? Information theory makes some
strong assumptions about the system. In particular it assumes that the recei-
ver knows everything about the statistics of the input, and that these statistics
do not change over time (that the system is stationary). This assumption of
stationarity is often particularly unrealistic.
1.7 Conclusion
In this chapter it was hoped to convey an intuitive feel for the core concepts
of information theory: entropy and information. These concepts themselves
are straightforward, and a number of ways of applying them to calculate
information transmission in real systems were described. Such examples are
intended to guide the reader towards the domains that in the past have
proved amenable to information theoretic techniques. In particular it is
argued that some aspects of cortical computation can be understood in the
context of maximisation of transmitted information. The following chapters
contain a large number of further examples and, in combination with Cover
and Thomas (1991) and Rieke et al. (1997), it is hoped that the reader will
find this book helpful as a starting point in exploring how information theory
can be applied to new problem domains.
PART ONE
Biological Networks
Glossary
ATP Adenosine triphosphate, the basic molecule involved in the Krebs cycle and
therefore involved in most biological metabolic activity. It therefore constitutes
a good biological measure of energy consumption in contrast to a physical
measure such as calories.
Autocorrelation The spatial autocorrelation refers to the expected correlation
across a set of images, of the image intensity of any two pixels as a function
of distance and orientation. Often for convenience, one-dimensional slices are
used to describe how the correlation between two pixels decays as a function
of distance. It can in some cases be most simply calculated using Fourier-
transform-based techniques.
Bispectrum A generalisation of the power spectrum that, as well as capturing
the pairwise correlations between inputs, also captures three-way correlations.
It is therefore useful as a numerical technique for calculating the higher-order
regularities in natural images.
Channels A concept from the psychophysics of vision, where the outputs of a
number of neurally homogeneous mechanisms are grouped together for con-
venience. Particularly influential is the idea that vision can be understood in
terms of a number of independent spatial channels, each conveying informa-
tion about an image at different spatial scales. Not to be confused with the
standard information theory concept of a channel.
Difference of Gaussians (DoG) A simple numerical approximation to the recep-
tive field properties of retinal ganglion cells, and a key filter in a number of
computational approaches to vision. The spatial profile of the filter consists of
the difference between a narrow and high-amplitude Gaussian and a wide and
low-amplitude Gaussian, and has provided a reasonable model for physiolo-
gical data.
Factorial coding The concept that a good representation is one where all fea-
tures are completely independent. Given this, the probability of any combination
of features is simply the product of the probabilities of the individual features.
Sparse coding A code where a given input is signalled by the activity of a very
small number of "features" out of a potentially much larger number.
Problems and Solutions in Early Visual Processing
BRIAN G. BURTON
2.1 Introduction
Part of the function of the neuron is communication. Neurons must com-
municate voltage signals to one another through their connections (synapses)
in order to coordinate their control of an animal's behaviour. It is for this
reason that information theory (Shannon and Weaver, 1949) represents a
promising framework in which to study the design of natural neural systems.
Nowhere is this more so than in the early stages of vision, involving the
retina, and in the vertebrate, the lateral geniculate nucleus and the primary
visual cortex. Not only are early visual systems well characterised physiolo-
gically, but we are also able to identify the ultimate "signal" (the visual
image) that is being transmitted and the constraints which are imposed on
its transmission. This allows us to suggest sensible objectives for early vision
which are open to direct testing. For example, in the vertebrate, the optic
nerve may be thought of as a limited-capacity channel. The number of gang-
lion cells projecting axons in the optic nerve is many times less than the
number of photoreceptors on the retina (Sterling, 1990). We might therefore
propose that one goal of retinal processing is to package information as
efficiently as possible so that as little as possible is lost (Barlow, 1961a).
Important to this argument is that we do not assume the retina is making
judgements concerning the relative values of different image components to
higher processing (Atick, 1992b). Information theory is a mathematical the-
ory of communication. It considers the goal of faithful and efficient transmis-
sion of a defined signal within a set of data. The more narrowly we need to
define this signal, the more certain we must be that this definition is correct
for information theory to be of use. Therefore, if we start making a priori
assumptions about what features of the image are relevant for the animal's
needs, then we can be less confident in our conclusions. Fortunately, whilst
specialisation may be true of higher visual processing, in many species this is
probably not true for the retina. It is usually assumed that the early visual
system is designed to be flexible and to transmit as much of the image as
possible. This means that we may define two goals for early visual processing,
namely, noise reduction and redundancy reduction. We wish to suppress
noise so that a larger number of discriminable signals may be transmitted
by a single neuron and we wish to remove redundancy so that the full
representational potential of the system is realised.
These objectives are firmly rooted in information theory and we will see
that computational strategies for achieving them predict behaviour which
matches closely to that seen in early vision. I start with an examination of
the fly compound eye as this illustrates well the problems associated with
noise and possible solutions (see also Laughlin et al., Chapter 3 this volume).
It should become clear how noise and redundancy are interrelated. However,
most theoretical work on redundancy has concentrated on the vertebrate
visual system about which there is more contention. Inevitably, the debate
concerns the structure of the input, that is, the statistics of natural images.
This defines the redundancy and therefore the precise information theoretic
criteria that should be adopted in visual processing. It is this issue upon
which I wish to focus, with particular emphasis on spatial redundancy.
dence rates are high enough that we can use the Gaussian approximation to
the Poisson distribution, we may use Shannon's (Shannon and Weaver, 1949)
equation to define channel capacity. For an analogue neuron (such as a
photoreceptor), subject to Gaussian distributed noise, the SNR affects capacity
(in bits s⁻¹) as follows:

C = \int \log_2\left(1 + \frac{S(v)}{N(v)}\right) dv \qquad (2.1)

where S(v) and N(v) are the (temporal) power spectral densities of the optimum
driving stimulus and the noise respectively.
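Assuming equation 2.1 takes the standard Shannon form above, it can be evaluated numerically from measured (or, as here, invented) power spectra:

```python
import numpy as np

def analogue_capacity(freqs_hz, signal_psd, noise_psd):
    """Trapezoidal integration of C = integral of log2(1 + S(v)/N(v)) dv, in bits/s."""
    y = np.log2(1.0 + np.asarray(signal_psd) / np.asarray(noise_psd))
    x = np.asarray(freqs_hz)
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

# Invented spectra: signal power falling with temporal frequency, flat noise floor.
f = np.linspace(1.0, 500.0, 1000)         # Hz
signal = 100.0 / f                        # arbitrary units
noise = np.full_like(f, 0.5)
print(analogue_capacity(f, signal, noise), "bits/s")
```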
There are a number of ways in which the insect retina may cope with the
problem of input noise. Where the metabolic costs are justified, the length of
photoreceptors and hence the number of phototransduction units may be
increased to maximise quantum catch (Laughlin and McGinness, 1978).
Alternatively, at low light intensities, it may be beneficial to trade off tem-
poral resolving power for SNR to make optimum use of the neuron's limited
dynamic range. It has recently been found that the power spectrum of nat-
ural, time-varying images follows an inverse relationship with temporal fre-
quency (Dong and Atick, 1995a). Because noise power spectra are flat (van
Hateren, 1992a), this means that SNR declines with frequency. There is
therefore no advantage in transmitting high temporal frequencies at low
light intensity when signal cannot be distinguished from noise. Instead, the
retina may safely discard these to improve SNR at low frequencies and
maximise information rate. In the fly, this strategy is exemplified by the
second-order interneurons, the large monopolar cells (LMCs) which become
low-pass temporal filters at low SNR (Laughlin, 1994, rev.).
The problem of noise is not just limited to extrinsic noise.
Phototransduction, for example, is a quantum process and is inherently
noisy. More generally, however, synaptic transmission is a major source of
intrinsic noise and we wish to find ways in which synaptic SNR may be
improved. Based on very few assumptions, Laughlin et al. (1987) proposed
a simple model for the graded synaptic transmission between photoreceptors
and LMCs which predicts that synaptic SNR is directly proportional to
synaptic voltage gain. More precisely, if b describes the sensitivity of trans-
mitter release to presynaptic voltage and determines the maximum voltage
gain achievable across the synapse, then:
(2.2)
where AR is the change in receptor potential being signalled and T is the
present level of transmitter release (another Poisson process). That is, by
amplifying the receptor signal through b, it is possible to improve synaptic
SNR. However, because LMCs are under the same range constraints as
photoreceptors, such amplification may only be achieved through transient
response properties and LMCs are phasic (Figure 2.1a). This is related to
redundancy reduction, since transmitting a signal that is not changing (a
tonic response) would not convey any information yet would use up the cell's
dynamic range. Furthermore, by amplifying the signal at the very earliest
stage of processing, the signal becomes more robust to noise corruption at
later stages. It may also be significant that this amplification occurs before
the generation of spikes (LMCs show graded responses). De Ruyter van
Steveninck and Laughlin (1996b) determined the optimum stimuli for driving
LMCs and found that their information capacity can reach five times that of
spiking neurons (see also Juusola and French, 1997). If signal amplification
were to take place after the first generation of spikes, this would not only be
energetically inefficient but might result in unnecessary loss of information.
[Figure 2.1: (a) receptor and LMC responses to a sustained stimulus; (b) the distribution of natural contrasts (roughly -0.5 to 0.5), its cumulative probability, and the matched LMC contrast-response curve.]
Figure 2.1. Responses of fly retinal LMC cells, (a) Response to tonic stimulation. While
photoreceptors show tonic activity in response to a sustained stimulus, LMC interneur-
ons show phasic response. This allows signal amplification and protection against noise.
Note, LMCs may be hyperpolarising because transmission is by electrotonus, not spike
generation. (From Laughlin et al., 1987, with permission from The Royal Society.)
(b) Matched coding. Natural images have a characteristic distribution of contrasts
(top) and corresponding cumulative probability (middle). The LMC synapse matches
its output to this cumulative probability curve to maximise information transmission
(bottom). (From Laughlin, 1987, with permission from Elsevier Science.)
As will be detailed later, there are two types of redundancy that should be
removed from a cell's response. Besides removing temporal correlations, the
cell should also utilise its different response levels with equal frequency. For a
channel with a limited range, a uniform distribution of outputs is the one
with the most entropy and therefore the one that may realise channel capa-
city. This principle too is demonstrated by the photoreceptor-LMC synapse.
Laughlin (1981) measured the relative frequencies of different levels of con-
trast in the fly's natural environment under daylight conditions and com-
pared the resulting histogram with the responses of LMCs to the range of
contrasts recorded. Remarkably, the input-output relationship of the LMC
followed the cumulative distribution observed in natural contrasts, just what
is predicted for entropy maximisation (Figure 2.1b). This behaviour may also
be seen as allowing the fly to discriminate between small changes in contrast
where they are most frequent, since the highest synaptic gain corresponds
with the modal contrast.
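A small sketch of this matching principle: for a range-limited, noise-free channel the information-maximising input-output curve is the cumulative probability distribution of the inputs, so that all response levels are used equally often. The contrast sample below is invented, not Laughlin's measured histogram.

```python
import numpy as np

# Invented sample of contrasts standing in for the measured natural histogram.
rng = np.random.default_rng(2)
contrasts = rng.normal(0.0, 0.2, 10000)

# The optimal input-output curve is the cumulative probability of the inputs.
levels = np.sort(contrasts)
cumulative = np.arange(1, len(levels) + 1) / len(levels)

def predicted_response(c, v_max=1.0):
    """Response rises with the cumulative probability of contrasts below c,
    so every response level is used equally often."""
    return v_max * np.interp(c, levels, cumulative)

responses = predicted_response(contrasts)
print(np.histogram(responses, bins=10, range=(0.0, 1.0))[0])   # roughly equal counts per bin
```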
In summary, the work on insects, and in particular, the fly, has shown how
cellular properties may be exquisitely designed to meet information theoretic
criteria of efficiency. Indeed, LMC responses at different light intensities may
be predicted with striking accuracy merely on the assumption that the retina
is attempting to maximise information transmission through a channel of
limited dynamic range (van Hateren, 1992a). This is most clearly demon-
strated by the correspondence between the images recorded in the fly retina
and those predicted by theory (van Hateren, 1992b). In particular, the fly has
illustrated the problems associated with noise. However, it should be pointed
out that the design principles identified in flies may also be seen in the
vertebrate eye (Sterling, 1990). With the exception of ganglion cells (the
output neurons), the vertebrate retina also comprises almost exclusively
non-spiking neurons and one of its main functions appears to be to protect
against noise by eliminating redundant or noisy signal components and
boosting the remainder. For example, phasic retinal interneurons are argu-
ably performing the same function as the LMC and the slower responses of
photoreceptors at low light intensities may be an adaptation to low SNR. In
addition, the well-known centre-surround antagonistic receptive fields (RFs)
of ganglion cells first appear in the receptors themselves (Baylor et al., 1971).
This allows spatially redundant components to be removed (examined in
more detail below) and for the information-carrying elements to be amplified
before noise corruption at the first feed-forward synapse. Finally, there exist
on and off ganglion cells. This not only effectively increases the dynamic
range of the system and allows greater amplification of input signals, but
also provides equally reliable transmission of all contrasts. Because spike
generation is subject to Poisson noise, a single spiking cell which responds
monotonically to contrast will have a low SNR at one end of its input range
where its output is low. In a two-cell system, however, in which distinct cells
where I(x) is the light intensity at position x, and ⟨·⟩ indicates averaging over
the ensemble of examples. In Fourier terms, this is expressed by the power
spectral density (Bendant and Piersol, 1986). When this is determined, the
relationship between signal power and spatial frequency, f, follows a distinct
1/|f|^2 law (Burton and Moorhead, 1987; Field, 1987). If T[\cdot] represents the
Fourier transformation, and L(f) the Fourier transform of I(x), then

\langle |L(f)|^2 \rangle \propto \frac{1}{|f|^2} \qquad (2.4)
That is, as with temporal frequencies, there is less "energy" in the "signal" at
high frequencies and hence a lower SNR. More interestingly, such a relation-
ship signifies scale invariance. That is, the image appears the same at all spatial scales.
Figure 2.2. Comparison between collective coding and predictive coding. (a,b) Collective
coding, (a) The ganglion cell RF is constructed by weighting the inputs from the sur-
rounding m photoreceptors according to their autocorrelation coefficients, r. (b) The
optimum RF profile (O), shown here across the diagonal of the RF, is found to be dome
shaped. This gives greater SNR than either a flat (F) or exponential (E) weighting
function. (From Tsukomoto et al., 1990, with permission from the author.)
(c) Predictive coding. At high SNR (top), the inhibitory surround of a model ganglion
cell is restricted. As SNR is lowered, the surround becomes more diffuse (middle) and
eventually subtracts an unweighted average of local image intensity (bottom). (From
Srinivasan et al., 1982, with permission from the Royal Society.) While collective coding
explains the centre of the RF, predictive coding explains the surround. However, neither
explain both.
although visual acuity drops off with eccentricity, the eye is designed to
obtain equally reliable signals from all parts of the retinal image.
The collective coding model is instructive. It shows how natural statistics
and the statistical independence of noise may be used to improve system
performance. However, whilst collective coding provides an appreciation
for the form of the RF across its centre, it does not satisfactorily address
[Figure 2.3: panels plot contrast sensitivity against spatial frequency (cycles/degree) on logarithmic axes.]
Figure 2.3. Decorrelation in the retina, (a) Match with psychophysical experiments. For
a certain parameter regime, the predictions of Atick and Redlich (curves) fit very well
with psychophysical data (points), (b) Signal whitening. When the contrast sensitivity
curves (left) obtained from psychophysical experiments at high luminosity are multiplied
by the amplitude spectrum of natural images, the curve becomes flat at low frequencies.
This indicates that ganglion cells are indeed attempting to decorrelate their input. (From
Atick and Redlich, 1992, with permission from MIT Press Journals.)
R = 1 - \frac{I(y; x)}{C(y)} \qquad (2.7)
where I(y; x) is the mutual information between the output of the channel, y,
and the input, x, and C(y) is the capacity, the maximum of I(y; x). Consider
the case when there is no input noise, no dimensionality reduction and input
follows Gaussian statistics. If processing is described by the linear transfor-
mation, A, then the value of C(y) is given by:
C(y) = \frac{1}{2} \log \frac{\left| A R A^{T} + \langle n_c^2 \rangle I \right|}{\left| \langle n_c^2 \rangle I \right|} \qquad (2.8)

where R is the autocorrelation matrix of the input and n_c is channel noise
(note the similarity with equation 2.1). Now, if the output variances, \langle y_i^2 \rangle
(the diagonal terms of A R A^{T} + \langle n_c^2 \rangle I), are fixed, then by the inequality
|M| \le \prod_i (M)_{ii}, I(y; x) may only equal the capacity when all the entries of
A R A^{T}, except those on the diagonal, are zero. That is, redundancy is
removed only when the output is decorrelated.
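The inequality |M| ≤ ∏_i (M)_ii (Hadamard's inequality) is easy to check numerically: for a fixed diagonal of output variances, the determinant, and hence the capacity of equation 2.8, is largest when the off-diagonal terms vanish. A small sketch with an invented covariance:

```python
import numpy as np

a = 0.6   # invented correlation between output channels
correlated = np.array([[1.0, a, a],
                       [a, 1.0, a],
                       [a, a, 1.0]])
decorrelated = np.eye(3)     # same output variances, off-diagonal terms removed

# Hadamard's inequality: |M| <= product of the diagonal entries, with equality
# only when M is diagonal, so the log-determinant (and the capacity in
# equation 2.8) is maximised by a decorrelated output.
print(np.linalg.det(correlated), np.prod(np.diag(correlated)))      # 0.352 < 1.0
print(np.linalg.det(decorrelated), np.prod(np.diag(decorrelated)))  # 1.0 = 1.0
```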
Besides this desirable feature, decorrelation also represents the first step
towards achieving a factorial code. In the limit, this would require that the
outputs of neurons were statistically independent regardless of the probabil-
ity distribution of inputs. That is, the probability that a particular level of
activity is observed in a given neuron is independent of the activity observed
at other neurons: P(y_1, y_2, \ldots, y_n) = \prod_i P(y_i).
TAMIL MAN.
What a scene! I had now time to look around a little. All round
the little lake, thronging the steps and the sides in the great glare of
the torches, were hundreds of men and boys, barebodied, barehead
and barefoot, but with white loin-cloths—all in a state of great
excitement—not religious so much as spectacular, as at the
commencement of a theatrical performance, myself and companion
about the only persons clothed,—except that in a corner and forming
a pretty mass of color were a few women and girls, of the poorer
class of Tamils, but brightly dressed, with nose-rings and ear-rings
profusely ornamented. On the water, brilliant in scarlet and gold and
blue, was floating the sacred canopy, surrounded by musicians
yelling on their various horns, in the front of which—with the priest
standing between them—sat two little naked boys holding small
torches; while overhead through the leaves of plentiful coco-nut and
banana palms overhanging the tank, in the dim blue sky among
gorgeous cloud-outlines just discernible, shone the goddess of night,
the cause of all this commotion.
Such a blowing up of trumpets in the full moon! For the first
time I gathered some clear idea of what the ancient festivals were
like. Here was a boy blowing two pipes at the same time, exactly as
in the Greek bas-reliefs. There was a man droning a deep bourdon
on a reed instrument, with cheeks puffed into pouches with long-
sustained effort of blowing; to him was attached a shrill flageolet
player—the two together giving much the effect of Highland
bagpipes. Then there were the tomtoms, whose stretched skins
produce quite musical and bell-like though monotonous sounds; and
lastly two old men jingling cymbals and at the same time blowing
their terrible chank-horns or conches. These chanks are much used
in Buddhist and Hindu temples. They are large whorled sea-shells of
the whelk shape, such as sometimes ornament our mantels. The
apex of the spiral is cut away and a mouthpiece cemented in its
place, through which the instrument can be blown like a horn. If
then the fingers be used to partly cover and vary the mouth of the
shell, and at the same time the shell be vibrated to and fro in the air
—what with its natural convolutions and these added complications,
the most ear-rending and diabolically wavy bewildering and hollow
sounds can be produced, such as might surely infect the most
callous worshiper with a proper faith in the supernatural.
The temper of the crowd too helped one to understand the old
religious attitude. It was thoroughly whole-hearted—I cannot think
of any other word. There was no piety—in our sense of the word—or
very little, observable. They were just thoroughly enjoying
themselves—a little excited no doubt by chanks and divine
possibilities generally, but not subdued by awe; talking freely to each
other in low tones, or even indulging occasionally—the younger ones
—in a little bear-fighting; at the same time proud of the spectacle
and the presence of the divinity, heart and soul in the ceremony, and
anxious to lend hands as torch-bearers or image-bearers, or in any
way, to its successful issue. It is this temper which the wise men say
is encouraged and purposely cultivated by the ceremonial institutions
of Hinduism. The temple services are made to cover, as far as may
be, the whole ground of life, and to provide the pleasures of the
theatre, the art-gallery, the music hall and the concert-room in one.
People attracted by these spectacles—which are very numerous and
very varied in character, according to the different feasts—presently
remain to inquire into their meaning. Some like the music, others the
bright colors. Many men come at first merely to witness the dancing
of the nautch girls, but afterwards and insensibly are drawn into
spheres of more spiritual influence. Even the children find plenty to
attract them, and the temple becomes their familiar resort from early
life.
The theory is that all the ceremonies have inner and mystic
meanings—which meanings in due time are declared to those who
are fit—and that thus the temple institutions and ceremonies
constitute a great ladder by which men can rise at last to those inner
truths which lie beyond all formulas and are contained in no creed.
Such is the theory, but like all theories it requires large deductions
before acceptance. That such theory was one of the formative
influences of the Hindu ceremonial, and that the latter embodies
here and there important esoteric truths descending from Vedic
times, I hardly doubt; but on the other hand, time, custom and
neglect, different streams of tradition blending and blurring each
other, reforms and a thousand influences have—as in all such cases
—produced a total concrete result which no one theory can account
for or coordinate.
Such were some of my thoughts as I watched the crowd around
me. They too were not uninterested in watching me. The
appearance of an Englishman under such circumstances was
perhaps a little unusual and scores of black eyes were turned
inquiringly in my direction; but covered as I was by the authority of
my companion no one seemed to resent my presence. A few I
thought looked shocked, but the most seemed rather pleased, as if
proud that a spectacle so brilliant and impressive should be
witnessed by a stranger—besides there were two or three among
the crowd whom I happened to have met before and spoken with,
and whose friendly glances made me feel at home.
Meanwhile the gyrating raft had completed two or three voyages
round the little piece of water. Each time it returned to the shore
fresh offerings were made to the god, the bell was rung again, a
moment of hushed adoration followed, and then with fresh strains of
mystic music a new start for the deep took place. What the inner
signification of these voyages might be I had not and have not the
faintest idea; it is possible even that no one present knew. At the
same time I do not doubt that the drama was originally instituted in
order to commemorate some actual event or to symbolise some
doctrine. On each voyage a hymn was sung or recited. On the first
voyage the Brahman priest declaimed a hymn from the Vedas—a
hymn that may have been written 3,000 years ago—nor was there
anything in the whole scene which appeared to me discordant with
the notion that the clock had been put back 3,000 years (though of
course the actual new departure in the Brahmanical rites which we
call Hinduism does not date back anything like so far as that). On
the second voyage a Tamil hymn was sung by one of the youths
trained in the temples for this purpose; and on the third voyage
another Tamil hymn, with interludes of the most ecstatic
caterwauling from chanks and bagpipes! The remainder of the
voyages I did not witness, as my conductor now took me to visit the
interior of the temple.
That is, as far as it was permissible to penetrate. For the
Brahman priests who regulate these things, with far-sighted policy
make it one of their most stringent rules that the laity shall not have
access beyond a short distance into the temple, and heathen like
myself are of course confined to the mere forecourts. Thus the
people feel more awe and sanctity with regard to the holy place
itself and the priests who fearlessly tread within than they do with
regard to anything else connected with their religion.
Having passed the porch, we found ourselves in a kind of
entrance hall with one or two rows of columns supporting a flat
wooden roof—the walls adorned with the usual rude paintings of
various events in Siva’s earthly career. On the right was a kind of
shrine with a dancing figure of the god in relief—the perpetual dance
of creation; but unlike some of the larger temples, in which there is
often most elaborate and costly stonework, everything here was of
the plainest, and there was hardly anything in the way of sculpture
to be seen. Out of this forecourt opened a succession of chambers
into which one might not enter; but the dwindling lights placed in
each served to show distance after distance. In the extreme
chamber farthest removed from the door, by which alone daylight
enters—the rest of the interior being illumined night and day with
artificial lights—is placed, surrounded by lamps, the most sacred
object, the lingam. This of course was too far off to be discerned—
and indeed it is, except on occasions, kept covered—but it appears
that instead of being a rude image of the male organ (such as is
frequently seen in the outer courts of these temples), the thing is a
certain white stone, blue-veined and of an egg-shape, which is
mysteriously fished up—if the gods so will it—from the depths of the
river Nerbudda, and only thence. It stands in the temple in the
hollow of another oval-shaped object which represents the female
yoni; and the two together, embleming Siva and Sakti, stand for the
sexual energy which pervades creation.
Thus the worship of sex is found to lie at the root of the present
Hinduism, as it does at the root of nearly all the primitive religions of
the world. Yet it would be a mistake to conclude that such worship is
a mere deification of material functions. Whenever it may have been
that the Vedic prophets descending from Northern lands into India
first discovered within themselves that capacity of spiritual ecstasy
which has made them even down to to-day one of the greatest
religious forces in the world, it is certain that they found (as indeed
many of the mediæval Christian seers at a later time also found) that
this ecstasy had a certain similarity to the sexual rapture. In their
hands therefore the rude, phallic worships, which their predecessors
had with true instinct celebrated, came to have a new meaning; and
sex itself, the most important of earthly functions, came to derive an
even greater importance from its relation to the one supreme and
heavenly fact, that of the soul’s union with God.
In the middle line of all Hindu temples, between the lingam and
the door, are placed two other very sacred objects—the couchant
bull Nandi and an upright ornamented pole, the Kampam, or as it is
sometimes called, the flagstaff. In this case the bull was about four
feet in length, carved in one block of stone, which from continual
anointing by pious worshipers had become quite black and lustrous
on the surface. In the great temple at Tanjore there is a bull twenty
feet long cut from a single block of syenite, and similar bull-images
are to be found in great numbers in these temples, and of all sizes
down to a foot in length, and in any accessible situation are sure to
be black and shining with oil. In Tamil the word pasu signifies both
ox—i.e. the domesticated ox—and the soul. Siva is frequently
represented as riding on a bull; and the animal represents the
human soul which has become subject and affiliated to the god. As
to the flagstaff, it was very plain, and appeared to be merely a
wooden pole nine inches or so thick, slightly ornamented, and
painted a dull red color. In the well-known temple at Mádura the
kampam is made of teak plated with gold, and is encircled with
certain rings at intervals, and at the top three horizontal arms
project, with little bell-like tassels hanging from them. This curious
object has, it is said, a physiological meaning, and represents a
nerve which passes up the median line of the body from the genital
organs to the brain (? the great sympathetic). Indeed the whole
disposition of the parts in these temples is supposed (as of course
also in the Christian Churches) to represent the human body, and so
also the universe of which the human body is only the miniature. I
do not feel myself in a position however to judge how far these
correspondences are exact. The inner chambers in this particular
temple were, as far as I could see, very plain and unornamented.
On coming out again into the open space in front of the porch,
my attention was directed to some low buildings which formed the
priests’ quarters. Two priests were attached to the temple, and a
separate cottage was intended for any traveling priest or lay
benefactor who might want accommodation within the precincts.
And now the second act of the sacred drama was commencing.
The god, having performed a sufficient number of excursions on the
tank, was being carried back with ceremony to the space in front of
the porch—where for some time had been standing, on portable
platforms made of poles, three strange animal figures of more than
life-size—a bull, a peacock, and a black creature somewhat
resembling a hog, but I do not know what it was meant for. On the
back of the bull, which was evidently itself in an amatory and excited
mood, Siva and Sakti were placed; on the hog-like animal was
mounted another bejewelled figure—that of Ganésa, Siva’s son; and
on the peacock again the figure of his other son, Soubramánya.
Camphor flame was again offered, and then a lot of stalwart and
enthusiastic worshipers seized the poles, and mounting the
platforms on their shoulders set themselves to form a procession
round the temple on the grassy space between it and the outer wall.
The musicians as usual went first, then came the dancing girls, and
then after an interval of twenty or thirty yards the three animals
abreast of each other on their platforms, and bearing their
respective gods upon their backs. At this point we mingled with the
crowd and were lost among the worshipers. And now again I was
reminded of representations of antique religious processions. The
people, going in front or following behind, or partly filling the space
in front of the gods—though leaving a lane clear in the middle—were
evidently getting elated and excited. They swayed their arms, took
hands or rested them on each other’s bodies, and danced rather
than walked along; sometimes their shouts mixed with the music;
the tall torches swayed to and fro, flaring to the sky and distilling
burning drops on naked backs in a way which did not lessen the
excitement; the smell of hot coco-nut oil mingling with that of
humanity made the air sultry; and the great leaves of bananas and
other palms leaning over and glistening with the double lights of
moon and torch flames gave a weird and tropical beauty to the
scene.² In this rampant way the procession moved for a few yards,
the men wrestling and sweating under the weight of the god-
images, which according to orthodox ideas are always made of an
alloy of the five metals known to the ancients—an alloy called
panchaloka—and are certainly immensely heavy; and then it came to
a stop. The bearers rested their poles on strong crutches carried for
the purpose, and while they took breath the turn of the nautch girls
came.
² Mrs. Speir, in her Life in Ancient India, p.
374, says that we first hear of Siva worship
about b.c. 300, and that it is described by
Megasthenes as “celebrated in tumultuous
festivals, the worshippers anointing their bodies,
wearing crowns of flowers, and sounding bells
and cymbals. From which,” she adds, “the
Greeks conjectured that Siva worship was
derived from Bacchus or Dionysos, and carried to
the East in the traditionary expedition which
Bacchus made in company with Hercules.”
NAUTCH GIRL.
³ Or those ascribed to him.