Submitted to the British Machine Vision Conference 2004
Autonomous learning of perceptual categories and
symbolic protocols from audio-visual input
C. J. Needham, D. R. Magee, V. Devin
P. Santos, A. G. Cohn and D. C. Hogg
School of Computing
The University of Leeds
Leeds, LS2 9JT, UK
vision@comp.leeds.ac.uk
http://www.comp.leeds.ac.uk/vision
Abstract
The development of cognitive vision systems that autonomously learn how
to interact with their environment through input from sensors is a major challenge for the Computer Vision, Machine Learning and Artificial Intelligence
communities. This paper presents a framework in which a symbolic inference engine is integrated with a perceptual system. The rules of the inference
engine are learned from audio-visual observation of the world, to form an interactive perceptual agent that can respond suitably to its environment. From
the video and audio input streams, interesting objects are identified using an
attention mechanism. Unsupervised clustering of the objects into perceptual
categories generates a symbolic data stream. Inductive Logic Programming
is then used to generalise rules from this symbolic data. Although in principle the framework could be applicable to a wide range of domains (perhaps
of most interest to the automatic programming of robots), it is demonstrated
here in simple game-playing scenarios. First the agent observes humans playing a game, and then attempts to play using the learned perceptual categories
and symbolic protocols.
1 Introduction
The perceived world may be thought of as existing on two levels: the sensory level in
which meaning must be extracted from patterns in continuous observations, and the conceptual level in which the relationships between discrete concepts are represented and
evaluated. Making the link between these two levels is key to the development of artificial cognitive systems that can exhibit human-level qualities of perception, learning and
interaction. This is essentially the classic AI problem of “Symbol Grounding” [11]. The
ultimate aim of our work is fully autonomous learning of both continuous models, representing object properties, and symbolic models of temporal events, defining the implicit
temporal protocols present in many structured visual scenes.
Much work has been carried out in the separate areas of pattern recognition and model
building in continuous data (see for example [5]) and symbolic learning in various domains such as robotics/navigation [3], bioinformatics [21] and language [13]. Several
earlier systems have linked low-level video analysis systems with high-level (symbolic)
event analysis in an end-to-end system, such as the work of Siskind [20] that uses a handcrafted symbolic model of ‘Pickup’ and ‘Put-down’ events. This is extended in [6] to include a supervised symbolic event learning module, in which examples of particular event
types are presented to the learner. Moore and Essa [17] present a system for recognising
temporal events from video of the card game ‘blackjack’. In that work, multiple low-level
continuous temporal models (Hidden Markov Models), and object models (templates) are
learned using a supervised procedure, and activity is recognised using a hand defined
Stochastic Context-Free Grammar. A similar approach is used by Ivanov and Bobick [12]
for gesture recognition and surveillance scenarios. However, none of these systems is
capable of autonomous (unsupervised) learning of both continuous patterns and symbolic
concepts. The motivation behind our research is to learn both low-level continuous object models and high-level symbolic models from data in an arbitrary scenario with no
human interaction. Systems capable of unsupervised learning of both continuous models
of image patches and grammar-like (spatial) relations between image patches have been
presented by the static image analysis community (e.g. [1]). These involve the use of
general (non-scene specific) background knowledge of the type of relations that may be
important (e.g. near, far, left-of, etc.). It is our aim to develop conceptually similar approaches for the analysis of dynamic video data. These would be similar to the grammars
used in [17, 12], which are currently hand defined. A perception-action learning approach
will be employed in order to learn an agent that can interact with the world. In the work
of [7], the link between visual perception of action and generation of the same action is
learned for a humanoid robot performing simple tasks. Initially the action is performed by
a human. The robot subsequently learns to mimic this action by experimentation. It can
then copy an action observed at a later time. Such learning would be a valuable addition
to our framework.
The contexts we envisage also require audio analysis and generation. Speech recognition and production software could be used to perform this task, normally requiring
supervised learning [4, 8, 19]; however an unsupervised approach to this task is favoured
to fit in with the philosophy of learning a cognitive agent. Such an approach also has the
advantage that participants can make non-word utterances (e.g. animal noises) or make
sounds using objects or instruments.
The proposed framework consists of three elements: an attention mechanism, unsupervised learning of perceptual categories (audible and visual), and symbolic learning of
temporal protocols. Figure 1 provides an overview of the learning phase of the framework. Egocentric learning is carried out, meaning the constructed models are based on
the behaviour of an agent with respect to the scenario, rather than being holistic models
of the complete scenario. Models of the protocols of the activities performed are learned,
as opposed to abstract descriptive rules (or even strategies), in order to easily drive the
behaviour of a synthetic agent that can interact with the real world in a near-natural way,
which is our aim.
Figure 1: Overview of the learning framework
1.1 Game playing
The domain of game-playing has been chosen as our application domain, since it is rich in
spatio-temporal rule-based protocols and it may be argued that many real-world social interaction scenarios may be modelled as games [10]. We have used the framework to learn
the objects, utterances, events and protocols involved in various simple games including
a version of “Snap”, played with dice, and a version of the game “Paper, Scissors, Stone”
played with cards. A typical setup is shown in Figure 1, and typical video input sequences
are shown in Figure 2. Descriptions of the two games are:
Snap. A simple, single-player game played with two dice, based on the card game snap. The two dice are rolled, one at a time. If the two dice show the same face, the player shouts “snap” and utters the instruction “collect-two”. Both dice are picked up. Otherwise the player utters “collect-one”, and the die showing the lowest-value face is picked up. Before rolling, the player utters the instruction “roll-two” or “roll-one”, depending on whether there is a die already on the table.
Paper, Scissors, Stone (PSS). Two players simultaneously select one of the object
cards. Paper beats (wraps) stone, scissors beats (cuts) paper, and stone beats (blunts)
scissors. For simplicity, our version of this game is played with picture cards rather than hand gestures. Utterances (‘I win’, ‘I lose’, ‘draw’ and ‘go’) are spoken by the player who is to be replaced by a synthetic agent.
2 Autonomous learning
The framework divides learning into three parts: attention, learning perceptual categories
and learning symbolic protocols. Figure 1 provides an overview of the learning phase.
To facilitate autonomous (fully unsupervised) learning, a spatio-temporal attention mechanism is required to determine ‘where’ and ‘when’ significant object occurrences and
interactions take place within the input video stream of the scenario to be learned from.
We are interested in learning protocols which depend upon different properties of objects.
In different situations, or games, different properties may be important. Such properties
may be texture, shape, colour, position, etc. For each object identified by the attention
mechanism a feature vector describing each property is extracted. Clusters are formed for
each property separately using a clustering algorithm. Classifying models are then built
using cluster membership as supervision. These models allow novel objects (identified
by the attention mechanism) to be assigned a class label for each property (texture, position, etc.). This symbolic stream is combined with the vocal utterances issued by the
player(s) participating in the game, which are extracted and clustered from the audio signal
in a similar unsupervised manner. The symbolic stream is used as input for symbolic
learning (generalisation) based on the Progol Inductive Logic Programming system [18].
The output of the continuous classification methods can be presented in such a way that
instances of concepts such as equality, transitivity, symmetry, etc. may be generalised, in
addition to generalisations about the protocols of temporal change. Advantages of Progol’s learning approach are that learning can be performed using positive examples only,
and that even with noisy data (such as imperfect clustering/classification, or occasional
missing/additional objects), interesting rules are constructed.
The vocal utterances may either take the form of passive reactions (e.g. “snap”), or
active statements of intent (e.g. “roll-one”). The latter generates an implicit link between
the vocal utterance and the subsequent action in the data stream. Our high-level system
can learn this link, and thus an agent based on the learned model can generate these
utterances as a command to actively participate in its environment. It should be noted that
conceptually the framework does not limit the perception and generation of action to vocal
utterances; however a link is required between the perception and generation of individual
agent actions for learned models to be used in an interactive agent. Vocal utterances are
a good example of an action that can be perceived and generated without specialised
hardware. It was for this reason they were chosen in this example implementation.
2.1 Spatio-temporal attention for object localisation
Video streams of dynamic scenes contain huge quantities of data, much of which is irrelevant to scene learning and interpretation. An attention mechanism is required to identify
‘interesting’ parts of the stream, in terms of spatial location (‘where’) and temporal location (‘when’). For autonomous learning, models or heuristics are required to determine
what is of interest, and what is not. Such models could be based on motion, novelty, high
(or low) degree of spatial variation, or a number of other factors. It is highly likely that
no single factor could provide a generic attention mechanism for learning and interpretation in all scenarios. It is our view that attention from multiple cues is required for fully
generic learning.
For our implementation in the game-playing domain, an attention mechanism which
can identify salient areas of space and time is necessary. For this reason motion has
been chosen, as it is straightforward to work with. The spatial aspect of our attention
mechanism is based around a generic blob tracker [15] that works on the principle of
multi-modal (Gaussian mixture) background modelling, and foreground pixel grouping.
This identifies the centroid location, bounding box and pixel segmentation of any separable moving objects in the scene in each frame of the video sequence. The temporal aspect
of our attention mechanism identifies key-frames where there is qualitatively zero motion
for a number of frames (typically 3), which are preceded by a number of frames (typically
3) containing significant motion.
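
As a concrete illustration, the following is a minimal sketch of this temporal attention rule. It assumes motion is summarised as a single per-frame score (e.g. the number of foreground pixels reported by the blob tracker); the thresholds are illustrative, not the values used in our implementation.

from typing import List

def detect_keyframes(motion: List[float],
                     still_thresh: float = 1.0,
                     move_thresh: float = 50.0,
                     n_still: int = 3,
                     n_moving: int = 3) -> List[int]:
    # A key-frame is the first of n_still consecutive near-zero-motion
    # frames directly preceded by n_moving frames of significant motion.
    keyframes = []
    for t in range(n_moving, len(motion) - n_still + 1):
        still = all(m < still_thresh for m in motion[t:t + n_still])
        moving = all(m > move_thresh for m in motion[t - n_moving:t])
        if still and moving:
            keyframes.append(t)
    return keyframes

# A burst of motion (e.g. dice being rolled) followed by stillness:
print(detect_keyframes([0, 80, 90, 75, 60, 0.2, 0.1, 0.0, 0.0]))  # -> [5]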
2.2 Continuous object learning and classification
In autonomous learning it is not in general possible to know a priori what types of visual
(and other) object properties are important in determining object context within a dynamic
scene. For this reason the use of multiple (in fact large numbers of) features such as
colour, texture, shape, position, etc. is proposed. We group sets of features together into
hand-defined semantic groups representing texture, position, etc.1 In this way (initial)
feature selection within these semantic groups is performed during continuous learning,
and feature selection and context identification between the groups is performed during
the symbolic learning stage.
For each semantic group, a set of example feature vectors is partitioned into classes
using a graph partitioning method (an extension of [22]), which also acts as a feature
selection method within the semantic group (full details appear in [16]). The number of
clusters is chosen automatically based on a cluster compactness heuristic.
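
To make the selection step concrete, the sketch below chooses the number of clusters automatically. Note two simplifications: k-means stands in for the graph-partitioning method of [22, 16], and the compactness score shown (mean within-cluster distance divided by mean between-centre distance) is one plausible heuristic rather than the exact criterion of [16].

import numpy as np
from sklearn.cluster import KMeans

def partition_with_auto_k(X: np.ndarray, k_max: int = 10, seed: int = 0):
    # Try k = 2..k_max and keep the partition with the best (lowest)
    # compactness score.
    best = None
    for k in range(2, k_max + 1):
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        centres = km.cluster_centers_
        within = np.mean(np.linalg.norm(X - centres[km.labels_], axis=1))
        between = np.mean([np.linalg.norm(a - b)
                           for i, a in enumerate(centres)
                           for b in centres[i + 1:]])
        score = within / between
        if best is None or score < best[0]:
            best = (score, k, km.labels_)
    return best[1], best[2]  # chosen number of clusters, cluster labels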
Once a set of examples is partitioned, the partitions may be used as supervision for
a conventional supervised statistical learning algorithm such as a Multi-Layer Perceptron, Radial Basis Function or Vector Quantisation based nearest neighbour classifier (the
latter is used in our implementation). This allows for the construction of models that encapsulate the information from the clustering in such a way that they can be easily and
efficiently applied to novel data. These models are used to generate training data suitable
for symbolic learning. For each object identified by the attention mechanism, a (symbolic) property is associated with it for each semantic group. Figure 2 shows example
‘static’ frames and corresponding symbolic data streams.
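
As a sketch of the classification stage, the class below uses the cluster labels as supervision and represents each class by a single codebook vector (its mean); a fuller vector-quantisation codebook could hold several vectors per class. One such classifier per semantic group maps a novel object to the symbols (tex3, pos0, ...) appearing in the streams of Figure 2.

import numpy as np

class VQClassifier:
    # Nearest-codebook-vector classification; cluster membership from the
    # unsupervised partitioning acts as the supervision signal.
    def fit(self, X: np.ndarray, labels: np.ndarray) -> "VQClassifier":
        self.classes_ = np.unique(labels)
        self.codebook_ = np.stack([X[labels == c].mean(axis=0)
                                   for c in self.classes_])
        return self

    def predict(self, X: np.ndarray) -> np.ndarray:
        # Distance from every sample to every codebook vector.
        d = np.linalg.norm(X[:, None, :] - self.codebook_[None, :, :], axis=2)
        return self.classes_[d.argmin(axis=1)]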
(a) Snap:

state([[tex3,pos0],[tex1,pos1]],t318).      % gloss: collect-one
action(utter,[word6,word3],t318).
time(t318).
successor(t310,t318).

state([[tex3,pos0]],t323).                  % gloss: roll-one
action(utter,[word1,word3],t323).
time(t323).
successor(t318,t323).

state([[tex3,pos0],[tex3,pos1]],t330).      % gloss: snap-collect-two
action(utter,[word4,word5,word2],t330).
time(t330).
successor(t323,t330).

state([],t338).                             % gloss: roll-two
action(utter,[word1,word2],t338).
time(t338).
successor(t330,t338).

(b) Paper-Scissors-Stone:

state([],t513).                             % gloss: play
action(utter,word4,t513).
time(t513).
successor(t510,t513).

state([[tex1,pos1],[tex1,pos0]],t520).      % gloss: win
action(utter,word3,t520).
time(t520).
successor(t513,t520).

state([],t524).                             % gloss: play
action(utter,word4,t524).
time(t524).
successor(t520,t524).

state([[tex1,pos0],[tex2,pos1]],t529).      % gloss: draw
action(utter,word1,t529).
time(t529).
successor(t524,t529).

Figure 2: Example audio-visual input and symbolic representation (the original figure pairs each block with a video key-frame, omitted here; the gloss gives the utterance spoken at that time step)
1. This work uses a 96D rotationally invariant texture description vector (based on the statistics of banks of Gabor wavelets and other related convolution-based operations), and a 2D position vector only.
2.3 Attention, learning and classification for audio
The attention mechanism for the audio input is based on the energy of the signal. Non-overlapping windows are formed, each containing 512 samples, which is the power of 2 (needed for the Fourier transform) closest in duration to a frame of video. (The audio is sampled at 8172 Hz.) The energy of each window is calculated as the sum of the absolute values of the
samples. The start of an utterance is detected when the energy of a window is greater than
a fixed threshold. Then, each utterance can be represented as a sequence of consecutive
windows for which the energy is over the threshold.
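
A minimal sketch of this detector follows; the energy threshold is illustrative (a fixed threshold is used in practice, but its value is not reproduced here).

import numpy as np

WINDOW = 512            # samples per non-overlapping window
ENERGY_THRESH = 500.0   # illustrative; fixed threshold in practice

def detect_utterances(signal: np.ndarray) -> list:
    # Each utterance is a maximal run of consecutive windows whose energy
    # (sum of absolute sample values) exceeds the threshold.
    n = len(signal) // WINDOW
    windows = signal[:n * WINDOW].reshape(n, WINDOW)
    loud = np.abs(windows).sum(axis=1) > ENERGY_THRESH
    utterances, run = [], []
    for window, is_loud in zip(windows, loud):
        if is_loud:
            run.append(window)
        elif run:
            utterances.append(np.stack(run))
            run = []
    if run:
        utterances.append(np.stack(run))
    return utterances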
Spectrum analysis is performed on each detected window W_n, resulting in S_n, the absolute value of the Fourier transform of W_n. The dimensionality of each spectrum S_n is reduced2 from 512 to 17 by histogramming. Reducing the spectrum to this dimensionality makes the clustering robust to variations in the pitch of the voice [14]. Each utterance detected is then represented by a temporal sequence of L reduced-dimensionality windows, where L is chosen to be the length (in windows) of the shortest utterance in the training set. This is achieved by resampling the temporal sequence of reduced-dimensionality windows which represents an utterance. The utterances are now of identical length, and k-means clustering is performed on the set of utterances several times with
different numbers of clusters. The optimal number of clusters (C) to use is automatically
chosen such that the ratio between the mean distance of each utterance to the centre of
the closest cluster and the mean distance between all the cluster centres is minimised.
Using C clusters, each utterance of the training set is classified (nearest cluster centre) to
create a symbolic data stream as shown in Figure 2; thus an utterance may be represented
symbolically as one of word1,word2,.... This method for automatically choosing
the number of clusters does tend to over-cluster the data (too many clusters are created),
yet this is dealt with by creating equivalence classes between the utterances, as discussed
in Section 2.5.

2. The dimensionality is reduced to 17; this depends upon the rate at which the audio is sampled (8172 Hz), the fundamental pitch frequency of the human voice (on average around 275 Hz) and the size of the window used (512 samples, equivalent to windows at 16 Hz); 275/16 is approximately 17.
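
The two normalisation steps can be sketched as follows, under two assumptions: the magnitude spectrum is binned into 17 uniform bands (the exact bin boundaries are not specified here), and resampling to the common length L picks nearest-neighbour window indices. The cluster count C can then be chosen with the same style of ratio criterion as sketched in Section 2.2.

import numpy as np

def reduce_window(window: np.ndarray, n_bins: int = 17) -> np.ndarray:
    # 512 samples -> magnitude spectrum -> n_bins coarse frequency bands.
    # rfft returns the non-redundant half of the 512-point spectrum.
    spectrum = np.abs(np.fft.rfft(window))
    edges = np.linspace(0, len(spectrum), n_bins + 1).astype(int)
    return np.array([spectrum[a:b].sum()
                     for a, b in zip(edges[:-1], edges[1:])])

def resample_utterance(windows: np.ndarray, L: int) -> np.ndarray:
    # (n_windows x n_bins) -> exactly L windows, flattened for clustering.
    idx = np.round(np.linspace(0, len(windows) - 1, L)).astype(int)
    return windows[idx].ravel()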
2.4 Symbolic learning using Inductive Logic Programming
The previous sections described how models are learned that can convert continuous sensory input into a symbolic data stream in an unsupervised way. Our goal is to learn models of the spatio-temporal structure of the resulting (possibly noisy) symbolic streams, i.e. to learn a model of any implicit temporal protocols present in the scene.
Structure in such streams differs greatly from the structure learned by our lower level processes, in that the data consists of variable numbers of objects (and thus a variable length
list of state descriptions is available). In addition, concepts such as reflexivity (equality of
certain properties), symmetry and transitivity exist in the scenarios. These concepts cannot be captured by purely statistical learning methods, such as those used for low-level
learning. An inductive logic programming approach is employed, implemented using
Progol [18]. Progol allows a set of positive examples to be generalised by inductively
subsuming the data representations by more general data representations/rules (with the
aim of reducing representational complexity, without over-generalising). Crucial in any
inductive learning approach is the way in which data is represented. Progol aims to reduce
representational complexity using a search procedure. In realistic scenarios, a search of all
possible data representations is not possible, and Progol must be guided by rules (mode
declarations) that define the general form of the solution. Figure 2 shows examples of
symbolic streams, from which rules are learned. For Progol, both the input data and output generalisations are in Prolog format, which allows straightforward incorporation of
these rules into a Prolog program (see Section 2.6).
The general form of the desired solution is to generalise the “action” which occurs.
In this case, the actions are utterances. Thus in Progol’s mode declarations it is stipulated that
the generalisations must contain action(utter,Word,Time) in the head of all the
rules. Little restriction is placed on the form of the bodies of the rules. The bodies capture
the important features that must be present in the symbolic perceptual data stream in order
for an action to be performed. Examples of the rules learned for the snap game are shown
in Figure 3.
2.5 Building equivalence classes of utterances
The generalisation rules found by Progol are used to construct equivalence classes among
utterances, since the method for utterance clustering is prone to cluster into more than the
true number of clusters as mentioned above. The procedure for generating equivalence
classes is based on the hypothesis that rules with similar bodies (encoding the perceptual
inputs) are related to equivalent utterances in the rule heads (the action outputs). There are
many possible ways of defining similarity in logic programs [9]. In this work, however,
similarity is understood as classical unification of terms.3
To construct equivalence classes, firstly, every pair of input rules whose heads are of
the form action(utter,word,time) is checked to see whether their bodies unify.
Clauses with empty bodies are discarded as they do not provide any evidence for equivalence. For every pair of clause heads whose bodies unify, a predicate equiv/2 is created, stating the hypothesis of equivalence between the utterances in their arguments.
For instance, let action(utter,w_i,t_x) and action(utter,w_j,t_y) be two clause heads with unifying bodies; then the predicate equiv(w_i,w_j) represents the hypothesis of equivalence between utterances w_i and w_j. Equivalence classes are created by taking the transitive closure of the relation equiv/2.

3. Informally, two terms are unifiable if they have at least one common instance [2].
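
The closure itself is a small union-find computation; a minimal sketch follows (the word labels in the example are hypothetical):

def equivalence_classes(equiv_pairs):
    # Merge pairwise equiv(w_i, w_j) hypotheses into equivalence classes
    # by taking the transitive closure of the relation.
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in equiv_pairs:
        parent[find(a)] = find(b)
    classes = {}
    for w in parent:
        classes.setdefault(find(w), set()).add(w)
    return list(classes.values())

# e.g. unifying bodies yielded equiv(word2,word5) and equiv(word5,word7):
print(equivalence_classes([("word2", "word5"), ("word5", "word7")]))
# -> [{'word2', 'word5', 'word7'}]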
2.6 Inference engine for agent behaviour generation
The symbolic protocols learned by the Progol program are used by a Prolog program to
form an inference engine that can be used to drive an interactive cognitive agent that can
participate in its environment. With a small amount of additional housekeeping Prolog
code this program has been made to take its input from the lower level systems using
network sockets, and output its results (via a socket) to an utterance synthesis module,
which simply replays an automatically extracted audio clip of the appropriate response
(the one closest to the cluster centre). An audio-visual response from a virtual participant
could be used here (and sometimes is); however this adds nothing to the science presented.
In our system, a human participant is required to follow the instructions uttered by the
synthetic agent (as there is currently no robotic element to our system).
Currently the rules produced by Progol (ordered from most specific to most general if
necessary4) directly form part of a Prolog program. We impose a limit of a single action
generation per time step in the (automatic) formulation of this program. We are working
on a rule interpreter which can handle a wider range of scenarios (multiple simultaneous
actions, non-deterministic/stochastic outcomes, etc.); however this is not necessary for
the scenarios presented in this paper.

4. In the case that the body of one rule is a specialisation of another, the most general rule is moved below the most specific one in the ordering (if this is not already the case). This may be determined automatically using a subsumption check on each pair of rule bodies. Otherwise rule ordering is as output by Progol.
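
To make the behaviour-generation loop concrete, the sketch below re-implements the rule application step outside Prolog (an illustrative re-implementation, not the engine actually used). Rules are tried in order with the most specific first where necessary, pattern variables (written ‘?x’) must bind consistently, which captures the equal-texture snap case of Figure 3, and at most one action is generated per time step.

from itertools import permutations

def _bind(pattern, objs) -> bool:
    # Bind one ordering of scene objects against the body pattern;
    # '?x' terms are variables that must take a single consistent value.
    binding = {}
    for pat_obj, obj in zip(pattern, objs):
        for p, v in zip(pat_obj, obj):
            if p.startswith("?"):
                if binding.setdefault(p, v) != v:
                    return False
            elif p != v:
                return False
    return True

def select_action(rules, state):
    # Fire the first rule whose body matches the observed state.
    for utterance, pattern in rules:
        if len(pattern) == len(state) and any(
                _bind(pattern, objs) for objs in permutations(state)):
            return utterance
    return None

# The Snap rule-set of Figure 3, transcribed with '?' variables:
SNAP_RULES = [
    (["w4", "w3"], []),                                  # roll-two
    (["w4", "w1"], [["?b", "?c"]]),                      # roll-one
    (["w5", "w2", "w3"], [["?b", "?c"], ["?b", "?d"]]),  # snap
    (["w2", "w1"], [["?b", "?c"], ["?d", "?e"]]),        # collect-one
]
print(select_action(SNAP_RULES, [["tex3", "pos0"], ["tex3", "pos1"]]))
# -> ['w5', 'w2', 'w3'], i.e. snap-collect-two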
3 Evaluation and results
All experiments have been performed from live audio-visual input. Firstly a period of
‘training’ is undertaken during which perceptual categories are learned: features are extracted and object/utterance models are constructed. All object/utterance descriptions are
classified to produce a symbolic data stream. The amount of training data depends upon
the activity being observed. A set of generalisations is learned using Progol, and these
are read into the Prolog inference engine module of our perceptual system. Once object,
utterance, event and protocol models have been learned, the autonomous agent can respond to visual input in a game-playing phase. Figure 3 shows a rule-set which perfectly
and concisely represents the protocol of the game snap. It can be seen from the snap
rule (the third rule) that the concept of property equality (reflexivity) can be used in the
generalisation of the training data. In this example, the two Bs indicate that the texture
categories of the two objects must be the same for a snap to occur.
action(utter,[w4,w3],A) :- state([],A).
action(utter,[w4,w1],A) :- state([[B,C]],A).
action(utter,[w5,w2,w3],A) :- state([[B,C],[B,D]],A).
action(utter,[w2,w1],A) :- state([[B,C],[D,E]],A).
Figure 3: Example of a set of symbolic rules for the Snap game. Actions are sequences of utterances, where wi denotes an utterance equivalence class. (w1 = one, w2 = collect, w3 = two, w4 = roll, w5 = snap)
Two sets of objects have been used in the games. Snap is played with two dice, from which six texture categories are formed, one for each die face. The attention mechanism for key-frame identification and object detection works with over 98% accuracy (minimal noise), very occasionally tracking two dice as one when they are rolled very close to each other. Empirical evaluation of the dice classification scheme has shown it to achieve over 90% correct classification of a single die, and over 80% correct classification of two dice in a scene. The cards used in paper-scissors-stone are rather different to each other, and over
99% classification has been achieved throughout trials. Classification of the audio signal
into utterances works 99% correctly, with typically only one or two utterances misclassified out of a set of two hundred, although this is into a greater number of clusters than
the minimum possible (for which equivalence classes are formed after rule generalisation,
see Section 2.5). Forcing clustering into the ‘true’ number of clusters using this method was found to give poor classification. The construction of equivalence classes among utterances forms classes in which the members of each class correspond to the same vocal utterance. Not every word_i is assigned to an equivalence class, for example when all the action rules with word_i in the head have no bodies.

Game   #events in training   % correct   Comment
snap        36      50    Little data - learned roll-two and collect-one
snap        74      97    Learned the protocol, but snap-collect instead of snap-collect-two
snap       106      97    Learned the protocol, but collect-two instead of snap-collect-two
PSS        108      92    Learned all but two cases (out of six) of ‘lose’, and one draw
PSS        158      94    Learned an additional incorrect ‘lose’, and missing a draw
PSS        248      97    Learned all rules, though also an additional incorrect ‘lose’ rule

Figure 4: Evaluation results
Figure 4 describes the results of learning the protocol of both games from real world
data, using ILP. Each game has been learned using three sizes of training data sets (events
are equivalent to time steps). The ‘% correct’ column refers to the theoretical percentage
of time that the perceptual agent provides the correct response from the learned symbolic
rules. Since the agent is embodied in a live system, only a subset of possibilities would be
presented in an empirical evaluation, which could produce unreliable results. Live testing
of the system produces similar results to those presented, with the occasional misclassification of visual objects leading to an incorrect response. Video of the learning and execution phases of the cognitive agent can be viewed at: http://www.xxx.xx.xx/xxxx.
The ILP generalisations degrade gracefully with noise, or when there is little data. Less-general rules are lost, rather than the entire process failing, as more noise is introduced.
This is essential for future work involving incremental and iterative learning.
4 Discussion and conclusions
A framework for the autonomous learning of perceptual categories and symbolic protocols has been presented. It has been demonstrated that a set of object, utterance and
temporal protocol models can be learned autonomously, that may be used to drive a cognitive agent that can interact in a natural (human-like) way with the real world. The
symbolic representation used is explicitly grounded to the sensor data, since both the perceptual categories and symbolic protocols have been learned from audio and video input.
Although our synthetic agent has no robotic capability, it can issue vocal instructions and
participate in simple games. The combination of low-level statistical object models with
higher level symbolic models has been shown to be a powerful paradigm.
5 Acknowledgements
This work was funded by the European Union, as part of the CogVis project.
References
[1] S. Aksoy, C. Tusk, K. Koperski, and G. Marchisio. Scene modeling and image mining with a visual grammar. In Frontiers of Remote Sensing Information Processing,
pages 35–62. World Scientific, 2003.
[2] K. R. Apt. From logic programming to Prolog. Prentice-Hall, Inc., 1996.
[3] C. Bryant, S. Muggleton, C. Page, and M. Sternberg. Combining active learning
with inductive logic programming to close the loop in machine learning. In Proc.
AISB Symposium on AI and Scientific Creativity, 1999.
[4] Ronald A. Cole, Joseph Mariani, Hans Uszkoreit, Annie Zaenen, and Victor Zue.
Survey of the State of the Art in Human Language Technology. Cambridge University Press, 1996.
[5] R. Duda, P. Hart, and D. Stork. Pattern Classification. Wiley, 2000.
[6] A. Fern, R. Givan, and J. Siskind. Specific-to-general learning for temporal events
with application to learning event definitions from video. Journal of Artificial Intelligence Research, 17:379–449, 2002.
[7] P. Fitzpatrick, G. Metta, L. Natale, S. Rao, and G. Sandini. Learning about objects
through action - initial steps towards artificial cognition. In Proc. IEEE International
Conference on Robotics and Automation, volume 3, pages 3140–3145, 2003.
J. L. Flanagan and L. R. Rabiner. Speech Synthesis. Dowden, Hutchinson and
Ross, Inc., 1973.
[9] F. Formato, G. Gerla, and M. I. Sessa. Similarity-based unification. Fundamenta
Informaticae, 40:393–414, 2000.
[10] S. Hargreaves-Heap and Y. Varoufakis. Game Theory, A Critical Introduction. Routledge, 1995.
[11] S. Harnad. The symbol grounding problem. Physica D, 42:335–346, 1990.
[12] Y. Ivanov and A. Bobick. Recognition of visual activities and interactions by
stochastic parsing. IEEE Transactions on Pattern Analysis and Machine Intelligence,
22(8):852–872, 2000.
[13] D. Kazakov and S. Dobnik. Inductive learning of lexical semantics with typed unification grammars. In Oxford Working Papers in Linguistics, Philology, and Phonetics, 2003.
[14] Eric Keller. Fundamentals of speech synthesis and speech recognition: basic concepts, state-of-the-art and future challenges. John Wiley and Sons Ltd., 1994.
[15] D. R. Magee. Tracking multiple vehicles using foreground, background and motion
models. Image and Vision Computing, 20(8):581–594, 2004.
[16] D. R. Magee, D. C. Hogg, and A. G. Cohn. Autonomous object learning using multiple feature clusterings in dynamic scenarios. Technical Report School of Computing
Research Report 2003.15, University of Leeds, UK, 2003.
[17] D. Moore and I. Essa. Recognizing multitasked activities from video using stochastic context-free grammar. In Proc. AAAI National Conf. on AI, 2002.
[18] S. Muggleton. Inverse entailment and Progol. New Generation Computing, Special
issue on Inductive Logic Programming, 13(3-4):245–286, 1995.
[19] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in
speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.
[20] J.M. Siskind. Visual event classification via force dynamics. In Proc. AAAI National
Conference on AI, pages 149–155, 2000.
[21] M. Sternberg, R. King, R. Lewis, and S. Muggleton. Application of machine learning to structural molecular biology. Philosophical Transactions of the Royal Society
B, 344:365–371, 1994.
[22] A. Strehl and J. Ghosh. Cluster ensembles - a knowledge reuse framework for
combining multiple partitions. Journal of Machine Learning Research, 3:583–617,
2002.