Primates constantly explore their surroundings via saccadic eye movements that bring different parts of an image into high resolution. In addition to exploring new regions in the visual field, primates also make frequent return fixations, revisiting previously foveated locations. We systematically studied a total of 44,328 return fixations out of 217,440 fixations. Return fixations were ubiquitous across different behavioral tasks, in monkeys and humans, both when subjects viewed static images and when subjects performed natural behaviors. Return fixation locations were consistent across subjects, tended to occur within short temporal offsets, and typically followed a 180-degree turn in saccadic direction. To understand the origin of return fixations, we propose a proof-of-principle, biologically-inspired and image-computable neural network model. The model combines five key modules: an image feature extractor, bottom-up saliency cues, task-relevant visual features, finite inhibiti...
In recent years, multi-modal transformers have shown significant progress in Vision-Language tasks, such as Visual Question Answering (VQA), outperforming previous architectures by a considerable margin. This improvement in VQA is often attributed to the rich interactions between vision and language streams. In this work, we investigate the efficacy of co-attention transformer layers in helping the network focus on relevant regions while answering the question. We generate visual attention maps using the question-conditioned image attention scores in these co-attention layers. We evaluate the effect of the following critical components on visual attention of a state-of-the-art VQA model: (i) number of object region proposals, (ii) question part of speech (POS) tags, (iii) question semantics, (iv) number of co-attention layers, and (v) answer accuracy. We compare the neural network attention maps against human attention maps both qualitatively and quantitatively. Our findings indicat...
Cognitive control involves flexibly combining multiple sensory inputs with task-dependent goals during decision making. Several tasks have been proposed to examine cognitive control, including Stroop, Eriksen-Flanker, and the Multi-source interference task. Because these tasks have been studied independently, it remains unclear whether the neural signatures of cognitive control reflect abstract control mechanisms or specific combinations of sensory and behavioral aspects of each task. To address this question, here we recorded invasive neurophysiological signals from 16 subjects and directly compared the three tasks against each other. Neural activity patterns in the theta and high-gamma frequency bands differed between incongruent and congruent conditions, revealing strong modulation by conflicting task demands. These neural signals were specific to each task, generalizing within a task but not across tasks. These results highlight the complex interplay between sensory inputs, moto...
2016 Annual Conference on Information Science and Systems (CISS), 2016
Episodic memories constitute the essence of our recollections and are formed by autobiographical experiences and contextual knowledge. Memories are rich and detailed, yet at the same time they can be malleable and inaccurate. The contents that end up being remembered are the result of filtering incoming sensory inputs in the context of previous knowledge. Here we asked whether the quintessentially subjective process of memory construction could be predicted by a supervised machine learning approach based exclusively on content information. We considered audiovisual segments from a movie as a proxy for real-life memory formation and built a quantitative model to explain psychophysics data evaluating recognition memory. The inputs to the model included audiovisual information (e.g. presence of specific characters, objects, voices and sounds), scene information (e.g. location, presence or absence of action) and emotional valence information. The machine-learning model could predict memory formation in single trials both for group averages and individual subjects with an accuracy of up to 80% using solely stimulus content properties. These results provide a quantitative and predictive model that links sensory perception and emotional attributes to memory formation. Furthermore, the results demonstrate that a computational model can make sophisticated inferences about a cognitive process that involves selective filtering and subjective interpretation.
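The content-based prediction described in this abstract can be sketched as a supervised classifier over per-segment features. The feature names, toy labels, and logistic-regression choice below are illustrative assumptions, not the paper's actual predictors or model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical content features per movie segment (illustrative only):
# character present, speech present, action present, emotional valence.
n = 200
X = rng.random((n, 4))
# Toy ground truth: segments with action and strong valence are remembered.
y = (0.8 * X[:, 2] + 0.6 * X[:, 3] + 0.1 * rng.standard_normal(n) > 0.7).astype(float)

# Minimal logistic regression trained by batch gradient descent.
w = np.zeros(4)
b = 0.0
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted recall probability
    w -= 1.0 * (X.T @ (p - y) / n)
    b -= 1.0 * np.mean(p - y)

p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
acc = float(np.mean((p > 0.5) == y))
```

On this separable-up-to-noise toy problem the classifier recovers most labels from content features alone, mirroring the single-trial prediction setup in spirit.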
Only a small fraction of the mammalian genome codes for messenger RNAs destined to be translated into proteins, and it is generally assumed that a large portion of transcribed sequences--including introns and several classes of noncoding RNAs (ncRNAs)--do not give rise to peptide products. A systematic examination of translation and physiological regulation of ncRNAs has not been conducted. Here we use computational methods to identify the products of non-canonical translation in mouse neurons by analysing unannotated transcripts in combination with proteomic data. This study supports the existence of non-canonical translation products from both intragenic and extragenic genomic regions, including peptides derived from antisense transcripts and introns. Moreover, the studied novel translation products exhibit temporal regulation similar to that of proteins known to be involved in neuronal activity processes. These observations highlight a potentially large and complex set of biologi...
Proceedings of the National Academy of Sciences, 2004
The tissue-specific pattern of mRNA expression can indicate important clues about gene function. High-density oligonucleotide arrays offer the opportunity to examine patterns of gene expression on a genome scale. Toward this end, we have designed custom arrays that interrogate the expression of the vast majority of protein-encoding human and mouse genes and have used them to profile a panel of 79 human and 61 mouse tissues. The resulting data set provides the expression patterns for thousands of predicted genes, as well as known and poorly characterized genes, from mice and humans. We have explored this data set for global trends in gene expression, evaluated commonly used lines of evidence in gene prediction methodologies, and investigated patterns indicative of chromosomal organization of transcription. We describe hundreds of regions of correlated transcription and show that some are subject to both tissue and parental allele-specific expression, suggesting a link between spatial...
The ability of primates to recognize visual objects is believed to be based on a series of transformations of visual information that occur as signals travel down the ventral pathway from primary visual cortex (V1) through inferior temporal cortex (IT) and ultimately on to prefrontal cortex (PFC) where decision making circuitry is believed to reside. Thus, a key component to understanding how object recognition occurs is to determine how visual information is represented in these different cortical areas, as well as how these ...
Supervised learning in artificial neural networks typically relies on backpropagation, where the weights are updated based on the error-function gradients and sequentially propagated from the output layer to the input layer. Although this approach has proven effective in a wide domain of applications, it lacks biological plausibility in many regards, including the weight symmetry problem, the dependence of learning on non-local signals, the freezing of neural activity during error propagation, and the update locking problem. Alternative training schemes - such as sign symmetry, feedback alignment, and direct feedback alignment - have been introduced, but invariably rely on a backward pass that hinders the possibility of solving all the issues simultaneously. Here, we propose to replace the backward pass with a second forward pass in which the input signal is modulated based on the error of the network. We show that this novel learning rule comprehensively addresses all the above-men...
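The core idea, replacing the backward pass with an error-modulated second forward pass, can be sketched as follows. This is a minimal one-sample, one-hidden-layer illustration of the general scheme; the fixed random projection `F`, the learning rates, and the exact update rules are assumptions for the sketch, not the paper's specification:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy dimensions (illustrative, not from the paper).
n_in, n_hid, n_out = 8, 16, 2
W1 = rng.standard_normal((n_in, n_hid)) * 0.1
W2 = rng.standard_normal((n_hid, n_out)) * 0.1
F = rng.standard_normal((n_out, n_in)) * 0.1   # fixed random error projection
lr = 0.05

def forward(x):
    h = np.maximum(0.0, x @ W1)   # ReLU hidden layer
    return h, h @ W2

x = rng.standard_normal(n_in)
target = np.array([1.0, 0.0])

for _ in range(200):
    h1, y1 = forward(x)
    e = y1 - target                   # output error from the first pass
    # Second forward pass: same network, input modulated by projected error.
    h2, _ = forward(x + e @ F)
    # Updates use only quantities available during the two forward passes:
    # no gradients are propagated backward through the network.
    W1 += lr * np.outer(x, h1 - h2)
    W2 += lr * np.outer(h2, -e)

_, y_final = forward(x)
err = float(np.sum((y_final - target) ** 2))
```

Even this crude variant drives the output toward the target on a single sample, which is the point of the sketch: all learning signals arrive via forward passes.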
Children benefit from lift-the-flap books by taking on an active role in guessing what is behind the flap based on the context. In this paper, we introduce lift-the-flap games for computational models. The task is to reason about the scene context and infer what the target behind the flap is in a natural image. Context reasoning is critical in many computer vision applications, such as object recognition and semantic segmentation. To tackle this problem, we propose an object-centered graph representing the scene configuration of the image where each node corresponds to a group of objects belonging to the same category. To infer the target's class label, we introduce an object-centered graph network model consisting of two sub-networks. The classification sub-network takes the complete graph as input and outputs a classification vector assigning the probability for each class. The reinforcement learning sub-network exploits the class label dependencies and learns the joint probab...
Saccadic eye movements allow animals to bring different parts of an image into high resolution. During free viewing, inhibition of return incentivizes exploration by discouraging previously visited locations. Despite this inhibition, here we show that subjects make frequent return fixations. We systematically studied a total of 44,328 return fixations out of 217,440 fixations across different tasks, in monkeys and humans, and in static images or egocentric videos. The ubiquitous return fixations were consistent across subjects, tended to occur within short offsets, and were characterized by longer duration than non-return fixations. The locations of return fixations corresponded to image areas of higher saliency and higher similarity to the sought target during visual search tasks. We propose a biologically-inspired computational model that capitalizes on a deep convolutional neural network for object recognition to predict a sequence of fixations. Given an input image, the model co...
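The interplay between saliency-driven selection and a finite, decaying inhibition of return can be sketched with a toy fixation generator. The random saliency map, grid size, and decay constant below are illustrative assumptions standing in for the model's CNN-derived saliency; the point is that because inhibition wears off, return fixations emerge naturally:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy saliency map standing in for the CNN saliency/feature modules.
H, W = 20, 20
saliency = rng.random((H, W))

# Finite inhibition of return: suppressed locations recover over time,
# which is what permits return fixations. Decay value is illustrative.
inhibition = np.ones((H, W))
decay = 0.7

fixations = []
for _ in range(30):
    # Winner-take-all on the inhibited saliency map.
    idx = np.unravel_index(np.argmax(saliency * inhibition), (H, W))
    fixations.append(idx)
    inhibition[idx] = 0.0                          # suppress just-visited location
    inhibition = 1.0 - decay * (1.0 - inhibition)  # inhibition decays away

# A return fixation revisits a previously fixated location.
n_return = sum(1 for i, f in enumerate(fixations) if f in fixations[:i])
```

With permanent inhibition (`decay = 1.0` recovery removed), the scanpath would never revisit a location; with finite inhibition, high-saliency locations win the competition again once their suppression fades.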
Deep convolutional neural networks are generally regarded as robust function approximators. So far, this intuition is based on perturbations to external stimuli such as the images to be classified. Here we explore the robustness of convolutional neural networks to perturbations to the internal weights and architecture of the network itself. We show that convolutional networks are surprisingly robust to a number of internal perturbations in the higher convolutional layers but the bottom convolutional layers are much more fragile. For instance, AlexNet shows less than a 30% decrease in classification performance when randomly removing over 70% of weight connections in the top convolutional or dense layers but performance is almost at chance with the same perturbation in the first convolutional layer. Finally, we suggest further investigations which could continue to inform the robustness of convolutional networks to internal perturbations.
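The internal perturbation used here, randomly removing a fraction of weight connections, can be sketched as a masking operation applied to one layer's weight matrix. The helper name and the stand-in weight shape are illustrative; in the actual experiments this would be applied to AlexNet layers and followed by re-measuring classification accuracy:

```python
import numpy as np

rng = np.random.default_rng(3)

def ablate(weights, fraction, rng):
    """Zero out a random fraction of a layer's weight entries,
    simulating removal of those connections."""
    mask = rng.random(weights.shape) >= fraction
    return weights * mask

# Illustrative stand-in for one layer's weights.
W = rng.standard_normal((64, 64))
W_pruned = ablate(W, 0.7, rng)   # remove ~70% of connections

sparsity = float(np.mean(W_pruned == 0.0))
# The experiment then compares accuracy after pruning each layer in turn.
```

Applying the same mask fraction to different layers and comparing the resulting accuracy drop is what distinguishes the fragile early layers from the robust later ones.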
BEWARE: These are preliminary notes. In the future, they will become part of a textbook on Visual Object Recognition. Inferior temporal cortex (ITC) is the highest echelon within the visual stream concerned with processing visual shape information. As such, one may expect that some of the key properties of visual perception may be encoded in the activity of ensembles of neurons in ITC. The story of how inferior temporal cortex became accepted and described as a visual area is a rather interesting one; we encourage readers to consult (Gross, 1994) for a lucid historical discussion. Imagine that you are interested in finding out the functions and properties of a given brain area, say inferior temporal cortex (ITC) within the primate ventral visual stream. As we have discussed before (Chapter 4), part of the answer to this question may come from lesion studies. Bilateral lesions to ITC cause severe impairment in visual object recognition in macaque monkeys and several human object ag...
While great strides have been made in using deep learning algorithms to solve supervised learning tasks, the problem of unsupervised learning - leveraging unlabeled examples to learn about the structure of a domain - remains a difficult unsolved challenge. Here, we explore prediction of future frames in a video sequence as an unsupervised learning rule for learning about the structure of the visual world. We describe a predictive neural network ("PredNet") architecture that is inspired by the concept of "predictive coding" from the neuroscience literature. These networks learn to predict future frames in a video sequence, with each layer in the network making local predictions and only forwarding deviations from those predictions to subsequent network layers. We show that these networks are able to robustly learn to predict the movement of synthetic (rendered) objects, and that in doing so, the networks learn internal representations that are useful for decoding ...
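The key architectural move, each layer forwarding only the deviation between its prediction and its input, can be sketched in a few lines. Splitting the error into positive and negative rectified populations is a common predictive-coding convention used by PredNet; everything else here is a deliberate simplification of the full architecture:

```python
import numpy as np

def layer_step(frame, prediction):
    """Forward only the prediction error, split into positive and
    negative rectified populations (a simplified PredNet-style layer)."""
    error = frame - prediction
    return np.concatenate([np.maximum(error, 0.0), np.maximum(-error, 0.0)])

frame = np.array([1.0, 0.5, -0.2])
prediction = np.array([0.8, 0.5, 0.1])
err_signal = layer_step(frame, prediction)
# A perfect prediction yields an all-zero error signal: nothing propagates,
# so higher layers only receive the unexplained parts of the input.
```

This sparsification of the feedforward signal is what lets prediction serve as an unsupervised training objective: learning reduces the forwarded error.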
Stream learning refers to the ability to acquire and transfer knowledge across a continuous stream of data without forgetting and without repeated passes over the data. A common way to avoid catastrophic forgetting is to intersperse new examples with replays of old examples stored as image pixels or reproduced by generative models. Here, we considered stream learning in image classification tasks and proposed a novel hypotheses-driven Augmented Memory Network, which efficiently consolidates previous knowledge with a limited number of hypotheses in the augmented memory and replays relevant hypotheses to avoid catastrophic forgetting. The advantages of hypothesis-driven replay over image pixel replay and generative replay are two-fold. First, hypothesis-based knowledge consolidation avoids redundant information in the image pixel space and makes memory usage more efficient. Second, hypotheses in the augmented memory can be re-used for learning new tasks, improving generalization and t...
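The replay baseline that the abstract contrasts against can be sketched as a fixed-capacity buffer interleaved with the stream. This shows pixel-style replay with reservoir sampling; the paper's contribution is to store compact hypotheses instead of raw examples, which this sketch does not implement:

```python
import random

class ReplayBuffer:
    """Fixed-size memory of past examples; old items are replayed
    alongside new ones to mitigate catastrophic forgetting."""
    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, item):
        # Reservoir sampling keeps a uniform sample of the whole stream.
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(item)
        else:
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = item

    def sample(self, k):
        return self.rng.sample(self.items, min(k, len(self.items)))

buf = ReplayBuffer(capacity=10)
for x in range(100):                 # one pass over the stream, no revisits
    batch = [x] + buf.sample(3)      # interleave new example with replays
    buf.add(x)
```

Replacing raw items with learned hypotheses keeps the same replay loop while shrinking the memory footprint, which is the efficiency argument made above.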
BEWARE: These are only highly preliminary notes. In the future, they will become part of a textbook on "Visual Object Recognition". In the meantime, please interpret with caution. Feedback is welcome. Why is vision difficult? Visual object recognition is essential for most everyday tasks including navigation, reading and socialization. It is therefore not much of a strain to conceive that the expansion of visual cortex has played a significant role in the evolution of mammals in general and primates in particular. The evolution of enhanced algorithms for recognizing patterns based on visual input is likely to have yielded a significant increase in adaptive value through improvement in navigation, recognition of danger and food as well as social interactions. In contrast to tactile and to some extent even auditory inputs, visual signals provide information from far away and large areas. While olfactory signals can also propagate long distances, the speed of propagation ...
The ability to predict future states of the environment is a central pillar of intelligence. At its core, effective prediction requires an internal model of the world and an understanding of the rules by which the world changes. Here, we explore the internal models developed by deep neural networks trained using a loss based on predicting future frames in synthetic video sequences, using a CNN-LSTM-deCNN framework. We first show that this architecture can achieve excellent performance in visual sequence prediction tasks, including state-of-the-art performance in a standard 'bouncing balls' dataset (Sutskever et al., 2009). Using a weighted mean-squared error and adversarial loss (Goodfellow et al., 2014), the same architecture successfully extrapolates out-of-the-plane rotations of computer-generated faces. Furthermore, despite being trained end-to-end to predict only pixel-level information, our Predictive Generative Networks learn a representation of the latent structure o...
Papers by Gabriel Kreiman