Faces and Voices Processing in Human and Primate Brains: Rhythmic and Multimodal Mechanisms Underlying the Evolution and Development of Speech

Maëva Michon 1,2*, José Zamorano-Abramson 3 and Francisco Aboitiz 1

1 Laboratory for Cognitive and Evolutionary Neuroscience, Department of Psychiatry, Faculty of Medicine, Interdisciplinary Center for Neuroscience, Pontificia Universidad Católica de Chile, Santiago, Chile; 2 Centro de Estudios en Neurociencia Humana y Neuropsicología, Facultad de Psicología, Universidad Diego Portales, Santiago, Chile; 3 Centro de Investigación en Complejidad Social, Facultad de Gobierno, Universidad del Desarrollo, Santiago, Chile

REVIEW published 30 March 2022, doi: 10.3389/fpsyg.2022.829083. Edited by: Junru Wu, East China Normal University, China. Reviewed by: Takenobu Murakami, Fukushima Medical University, Japan; Koen de Reus, Vrije Universiteit Brussel, Belgium. Correspondence: Maëva Michon, mmichon@uc.cl. Specialty section: Language Sciences, Frontiers in Psychology. Received: 04 December 2021; Accepted: 07 March 2022; Published: 30 March 2022. Citation: Michon M, Zamorano-Abramson J and Aboitiz F (2022) Faces and Voices Processing in Human and Primate Brains: Rhythmic and Multimodal Mechanisms Underlying the Evolution and Development of Speech. Front. Psychol. 13:829083.

While influential works since the 1970s have widely assumed that imitation is an innate skill in both human and non-human primate neonates, recent empirical studies and meta-analyses have challenged this view, indicating other forms of reward-based learning as relevant factors in the development of social behavior. The translation of visual input into matching motor output that underlies imitation abilities instead seems to develop along with social interactions and sensorimotor experience during infancy and childhood. Recently, a new visual stream has been identified in both human and non-human primate brains, updating the dual visual stream model. This third pathway is thought to be specialized for dynamic aspects of social perception such as eye-gaze and facial expression and, crucially, for audiovisual integration of speech. Here, we review empirical studies addressing an understudied but crucial aspect of speech and communication, namely the processing of visual orofacial cues (i.e., the perception of a speaker's lip and tongue movements) and their integration with vocal auditory cues. Throughout this review, we offer new insights from our understanding of speech as the product of the evolution and development of a rhythmic and multimodal organization of sensorimotor brain networks, supporting volitional motor control of the upper vocal tract and audiovisual face-voice integration.

Keywords: visual speech, multimodal integration, imitation, primate social brain, speech evolution, speech development, audiovisual speech, face-voice integration

INTRODUCTION

This review aims to integrate seemingly disparate evidence for different kinds of communicative behaviors (i.e., imitation, speech and lip-smacking) in humans and non-human primates (NHPs). Accounting for recently proposed anatomo-functional networks involved in primates' social interactions, we attempt to provide new avenues for understanding how speech might have arisen from phylogenetically conserved multimodal and rhythmic neural properties.
We first address long-standing issues in the field of neonatal imitation research in both human and monkey newborns. In line with recent findings, we propose that rather than being exclusively innate, imitative behaviors are largely scaffolded by sensorimotor development and domain-general associative learning of multimodal information. Importantly, we argue that the development of these early abilities is largely supported by socially rewarding interactions with others. By means of these interactions, infants begin to associate what is seen (visual input) with what is heard (auditory input) and performed (motor output), and to learn the sensory consequences of their own and others' actions. The evidence reviewed in section "Cross-Species Developmental Trajectories of Multimodal Integration" suggests that this socially guided and domain-general associative learning of multimodal information begins within the first year of life and could support the perceptual attunement for native auditory and visual speech. Once the perceptual system has narrowed in favor of the native stimuli present in their environment, infants can extract the regularities of their linguistic input and learn the multimodal associations between auditory (how it sounds), visual (how it is pronounced) and articulatory (how to pronounce it) aspects of their native language.

Then, we introduce the third visual pathway, a stream that was recently proposed to update the well-established model of the dual visual pathways and which is thought to be specialized for dynamic aspects of social perception. More specifically, the third visual pathway was shown to run laterally from V1 to the anterior temporal region along the superior temporal sulcus (STS) and to respond preferentially to biological movements of faces and bodies. The proponents of the third visual pathway report evidence supporting the involvement of the STS in higher-order social cognition, such as the recognition and understanding of others' intentions and goals based on their actions and behaviors, including grasping movements, eye-gaze direction and facial expressions. Interestingly, the posterior portion of the STS is known to respond both to orofacial movements (i.e., speaking faces) and to voices, making this region an ideal candidate to support the integration of faces and voices during audiovisual speech perception.

We begin the last section by reviewing the strikingly similar rhythmic patterns of human speech and monkey lip-smacking. Namely, these human and NHP communicative behaviors are highly rhythmic and produced at a particular rate within the theta frequency band. Remarkably, the synchronization of voices and mouth movements has been documented not only during human speech production but also during monkey lip-smacking, where the acoustic envelope of vocalizations couples with inter-lip distance, both oscillating rhythmically around 4-to-5 Hz. This synchronization was recently documented in chimpanzees and marmoset monkeys, indicating that these coupled oscillations may have been crucial for the emergence of speech and must have evolved early in the primate lineage. In section "Volitional Control of the Vocal Tract," we emphasize an important evolutionary adaptation of the structural connectivity of a cortico-subcortical network supporting the cognitive control of the vocal tract, which could have progressively allowed finer control over speech sound production.
More specifically, the greater control over complex sequences of oral and vocal articulation that characterizes human speech compared to monkey vocalizations could have been strengthened during evolution by more robust and direct connections between the laryngeal motor cortex and the brainstem nuclei controlling volitional vocal fold vibration as well as lip and tongue movements. Finally, we report evidence of cross-species similarities and differences in developmental trajectories for audiovisual speech perception. Namely, during the first year of life infants show a progressive specialization of auditory (phonemes, vocalizations) and visual (faces, speaking mouths) systems for the discrimination of native input, at the cost of non-native input. This developmental pattern is known as "perceptual narrowing" and has been described in both human and NHP infants with analogous timing. Interestingly, however, although human and monkey infants exhibit a similar interest in the eyes, monkey infants have been shown to pay less attention to the mouth, a region of others' faces that conveys critical visual communicative cues, which facilitate the auditory processing of communicative vocal behaviors and foster expressive language development.

HOMO IMITANS? METHODOLOGICAL AND THEORETICAL CONTROVERSIES

Do Humans Imitate From Birth?

In psychological science, imitation is understood as the ability to copy the topography of a behavior (e.g., body movements, vocal or facial expressions) observed in a third person or agent (Heyes, 2021). However, researchers distinguish several forms of imitation that may differ in the complexity of their cognitive underpinnings (Zentall, 2012). Accurate imitation requires the imitator to generate a correspondence between what is seen or heard and what is performed. In other words, crossmodal associations are needed to map the visual or auditory information provided by the model onto a matching motor sequence. The main problem raised by imitation is how these sensorimotor associations are established and by means of which neurocognitive mechanisms. This problem is known as the "correspondence problem" and is still vividly debated in the scientific community.

Since the late 1970s, influential works have argued that the ability to imitate is already present in neonates as young as 2-to-3 weeks old, who successfully imitate facial gestures such as tongue or lip protrusion and mouth-opening (see Figure 1A; Meltzoff and Moore, 1977, 1997; Meltzoff, 1988). These results led to the popular idea of an innate, hardwired module for imitation, and human infants began to be considered "Homo imitans" (Meltzoff, 1988). Although debated for several decades, it was not until recently that neonatal imitation became one of the most controversial phenomena in the field of developmental cognitive science (Kennedy-Costantini et al., 2017; Heyes et al., 2020; Davis et al., 2021). The skepticism around the idea that imitation is in our genes arose with several studies showing that neonates elicit facial gestures in response to different kinds of stimuli (Jones, 2017; Keven and Akins, 2017).
For example, 4-week-old infants were as likely to elicit tongue protrusion when listening to music or seeing flashlights as when observing a model performing tongue protrusion (Jones, 1996, 2006), suggesting that the production of such gestures is not specifically intended to be imitative. More crucially, a recent longitudinal study involving more than 100 newborns failed to find evidence of imitation for any of the 9 action-types tested at 1, 3, 6 and 9 weeks of life, using the same method as the inaugural works of the 1970s (Oostenbroek et al., 2016). This year, a meta-analysis of 336 effect sizes (Davis et al., 2021) shed serious doubts on the reliability of the evidence supporting the notion of Homo imitans. Its authors demonstrated that the results of neonatal imitation research present an important heterogeneity that cannot be explained by methodological factors but is rather modulated by a "researcher affiliation" effect, with some laboratories being more likely to report larger effects. Finally, it is possible that publication bias in the field has increased the propensity for positive results to be published and negative ones to remain unpublished (Ferguson and Heene, 2012; Heyes, 2016; Slaughter, 2021).

FIGURE 1 | (A) Human and (B) chimpanzee neonates imitating orofacial gestures (left panel: tongue protrusion; middle panel: mouth-opening; right panel: lip protrusion) [(A) reprinted with permission from Meltzoff and Moore (1977) and (B) reprinted with permission from Myowa-Yamakoshi et al. (2004)]. (C) A twenty-eight-week gestational age fetus producing aerodigestive stereotypies [reprinted with permission from Kurjak et al. (2004)].

In-Born Module for Imitation or Sensorimotor Development?

Similar developmental trajectories of imitation have been documented for humans and chimpanzees. Several studies have shown that the tongue protrusion imitation observed during the first few weeks after birth in both species progressively disappears around the end of the second month of life (Abravanel and Sigafoos, 1984; Myowa-Yamakoshi et al., 2004; Subiaul, 2010; Jones, 2017). Some authors advocating for neonatal imitation explain that this decrease in the incidence of orofacial imitation is "probably due to the maturation of the cortical mechanisms inhibiting unwanted movements that follows the development of the organization of motor control [. . .] and reappears at an older age in terms of intentional imitation" (Rizzolatti and Fogassi, 2016, p. 382). Although it is unclear whether imitation is present from birth, it is undeniable that this faculty develops within the first years of life. An alternative explanation, formulated by detractors of neonatal imitation and to which we are more inclined, proposes that imitative behaviors require sensorimotor learning, which instead starts to emerge at the end of the first year and extends over infancy and childhood (Jones, 2017; Slaughter, 2021). In a recent article that received more than 20 peer commentaries (most of which agreed that the evidence for neonatal imitation is unreliable), Keven and Akins (2017) proposed that the orofacial gestures observed in neonatal imitation research, specifically tongue protrusion and mouth opening, are in fact motor stereotypies associated with perinatal aerodigestive development in mammals. These stereotypies begin during gestation and last until the respiratory and swallowing systems begin to prepare for the introduction of solid food, around month 3. As depicted in Figure 1C, ultrasound images of fetuses have shown that a variety of the orofacial gestures discussed above are already consolidated at approximately 28 weeks of gestational age (De Vries et al., 1984; D'Elia et al., 2001; Hata et al., 2005). Since these gestures are spontaneously produced both in the womb (without any model) and in perinatal life but disappear around 3 months, neonatal imitation could represent an epiphenomenon better explained by sensorimotor development. Crucially for the purpose of the current review, Keven and Akins (2017) also proposed that perinatal stereotypic gestures participate in the acquisition of orofacial motor control that, in turn, may support not only the swallowing of solid food but also the motor biomechanics for speech-like sound production emerging by month 3 (also see Choi et al., 2017; Mayer et al., 2017).

Do Non-human Primate Neonates Imitate?

Whether or not the neonatal imitation abilities observed in human infants are present in NHPs has been particularly challenging to demonstrate with robust results. A study conducted on two chimpanzee neonates younger than a week of age revealed that they were able to imitate different types of human orofacial gestures (see Figure 1B). The authors claimed that, because of their very young age, the chimpanzees had had very few opportunities for learning visuomotor associations, suggesting that they "are born with the ability to match visually perceived oral gestures with a proprioceptive motor scheme" (Myowa-Yamakoshi et al., 2004). Similarly, Ferrari et al. (2006) tested a group of 21 infant rhesus macaques at the ages of 1, 3, 7 and 14 days and reported imitative behaviors for 2 of the 6 actions tested, namely lip-smacking and tongue protrusion. It is noteworthy, however, that these two oral gestures were imitated only at 3 days of age, neither earlier nor later (Ferrari et al., 2006). Around the same time that the concept of Homo imitans began to be severely questioned, a re-analysis of the data on neonatal imitation in rhesus macaques revealed no supporting evidence. Redshaw (2019) claimed that the gold-standard cross-target approach, which checks that gestures are exhibited specifically in response to the matching modeled action, is not correctly implemented in most studies of the phenomenon. Importantly, he re-analyzed the dataset of the 163 individuals ever tested to date using cross-target analysis and demonstrated that matching tongue protrusion and lip-smacking responses in macaque neonates were not produced at levels greater than chance (Redshaw, 2019). For instance, lip-smacking was produced at the same odds in response to observed lip-smacking and to observed mouth-opening. Much like the unspecific tongue-protrusion behaviors of human neonates in response to the same action, to music or to flashlights, this finding speaks against the possibility that such gestures are actually imitative. Although the debate is far from being settled (Meltzoff et al., 2018, 2019; Oostenbroek et al., 2018), the controversy at the heart of the field has strongly challenged the existence of neonatal imitation abilities in both humans and NHPs.

Imitation, Mirror Neurons and Communication

An increasing number of studies using causal (transcranial magnetic stimulation; TMS) and lesion methodologies demonstrate that brain areas typically displaying mirror properties are involved in imitation. It has been shown that inhibitory repetitive TMS of the inferior frontal gyrus (IFG) specifically impairs imitative behaviors (Heiser et al., 2003; Catmur et al., 2009) and that excitatory stimulation of the same area improves vocal imitation (Restle et al., 2012). Other mirror neuron areas of the precentral gyrus and inferior parietal region are thought to be implicated as well (Binder et al., 2017; Reader et al., 2018). Similar to the debated innateness of imitation, the origins of mirror neurons have been the object of an intense nature vs. nurture debate. Importantly, the proponents of the mirror neuron theory take neonatal imitation as evidence for the presence of mirror properties from birth and suggest that they are part of an innate system for action-perception (Simpson et al., 2014). On the other hand, according to those who argue that imitation emerges later during infancy, "neurons acquire their mirror properties through sensorimotor learning" (Heyes and Catmur, 2022).

Mirror neurons were originally observed when visuomotor neurons in the monkey premotor cortex were found to fire not only when a monkey executed a grasping task but also when it observed the researcher performing this grasping behavior (di Pellegrino et al., 1992). While for methodological reasons there is little direct evidence for mirror neurons in humans, a mirror system has been proposed to be involved in the simulation of others' behaviors, providing a "view from the inside" of the observed conduct (Rizzolatti and Craighero, 2004; Rizzolatti and Sinigaglia, 2008). After these findings, mirror neurons were proposed by some authors to represent the neural mechanism involved in imitation skills (Cross et al., 2009; Iacoboni, 2009). Nonetheless, it remains unclear whether mirror neurons emerge from some modular, inherited mechanism in which the other's behavior is somehow represented in the mirror neuron system, or whether they result from domain-general processes like associative learning. One view is that grasping mirror neurons participate in hand visuomotor control, which by associative mechanisms may extend to the observation of others' hands besides one's own (Oztop and Arbib, 2002; Kilner et al., 2007). Once their function has been extended to the observation of others' behaviors besides one's own, the motor programs become modulated by the observed behaviors, resulting in progressive imitation. As opposed to the representational view, this perspective provides a mechanistic interpretation of mirror neuron mechanisms based on known processes of neuronal plasticity and development (Aboitiz, 2017, 2018b).

Mirror neurons have also been proposed to play an important role in communication and social cognition in both humans and NHPs. Specular activity between interacting individuals is thought to be a mechanism contributing to the formation of social bonds, especially between caregivers and their offspring. Observations of mother-child dyads, for instance, revealed that mothers actually imitate their infants' facial gestures and vocalizations to a greater extent than infants imitate their parents (Jones, 2006; Athari et al., 2021). Parental imitative behaviors offer a form of reward-based learning for infants that may reinforce the elaboration of early learned associations between self-generated motor sequences and the resulting perceptual outcomes (visual outcomes for imitative facial gestures, but also auditory outcomes for vocal imitation) in the other person. Crucially, until they are exposed to real mirrors, infants have no visual feedback of their own face when gesturing (unlike for their arm and leg movements) and could therefore use caregivers' imitations as "social mirrors" to gain insight into crossmodal mapping (Ray and Heyes, 2011). In sum, based on the evidence reviewed above, we argue that imitation as well as speech are social abilities that develop during infancy alongside sensorimotor systems and require associative learning of multimodal input. The purpose of the following sections of this review is to emphasize the importance of these crossmodal associations between what is performed, what is seen and what is heard (motor-visual-auditory) for the evolution and development of human speech.

A BRAIN NETWORK FOR DYNAMIC FACES AND VOICES PERCEPTION

A Third Visual Pathway?

Forty years ago, Ungerleider and Mishkin (1982) showed that the primate visual cortex is organized in two streams. A decade later, Goodale and Milner (1992) demonstrated a similar dual organization in the human brain, with a dorsal and a ventral pathway distinguishable both anatomically and functionally. The dorsal stream, also known as the "where and how" stream, projects from early visual cortices and reaches the prefrontal cortex running along the parietal lobe. This stream was proposed to underlie the processing of visual information about objects' spatial location and the execution of actions related to these objects. The ventral stream, also known as the "what" stream, runs from early visual cortices toward the inferior temporal lobe and is widely thought to support object identification (e.g., animals, cars, faces). The two-visual-pathways model has not only been one of the most influential models of visual system organization in the brain, but it has also influenced important models of auditory cortical processing (Kaas and Hackett, 1999; Romanski et al., 1999; Romanski, 2007), attentional networks (Corbetta and Shulman, 2002) and the neurobiology of language (Hickok and Poeppel, 2004, 2007), in which dorsal and ventral streams are described according to their "where and how" and "what" functions, respectively. In the particular case of language processing in the brain, the dorsal pathway is proposed to connect posterior superior regions of the temporal lobe with the frontal cortex, allowing the mapping of speech sounds onto the orofacial articulatory sequences required to produce these sounds. The ventral pathway, connecting posterior to anterior areas of the middle and inferior temporal gyri, is believed to support the mapping of speech sounds onto linguistic meaning (Hickok and Poeppel, 2004).

Recently, Leslie G. Ungerleider, who first reported the dual organization of visual processing in the primate cortex (Ungerleider and Mishkin, 1982), and David Pitcher reported compelling evidence for the existence of a third visual pathway and claimed that the two-visual-pathways model needs to be updated (Pitcher and Ungerleider, 2020). Reviewing evidence from fMRI, TMS, lesion, tracer and tractography studies, they proposed that this third visual pathway is anatomically and functionally segregated from the existing dorsal and ventral streams, projecting along the lateral part of both human and NHP brains and specialized for social perception. Originating in the primary visual cortex (V1), the third pathway sends projections into the posterior and anterior portions of the superior temporal sulcus (pSTS and aSTS, respectively) via area V5/MT (see Figure 2), an area well known for its responsiveness to visual motion. In both monkeys and humans, the aSTS displays selective responses to moving but not to static faces and bodies (Zhang et al., 2020), a functional characteristic that differs from the face areas of the ventral stream (which include the occipital and fusiform face areas, supporting a more static and structural identification of faces). Altogether, the evidence reported by the authors emphasizes the role of this lateral pathway in the processing of a wide range of socially relevant visual cues and, by extension, in higher-order social perception. For instance, based on the eye-gaze direction or hand movements of our interlocutors, humans are able to generate predictions about their goals and intentions. In other words, the existence of a third visual pathway specialized for the perception of facial and corporal dynamics may have supported the human brain's readiness for social interactions.

Within the pSTS, the anterior portion that prefers mouth movements also responds strongly to voices, whereas more posterior portions respond to eye-movements but not to voices (Zhu and Beauchamp, 2017; Rennig and Beauchamp, 2018). The latter suggests that vocal sounds and the orofacial movements that produce them are integrated in the anterior pSTS. In line with this functional specialization, a recent study reported a homologous representation of conspecific vocalizations in the bilateral auditory cortices of humans and macaques; more specifically, this temporal voice area is located in the anterior temporal lobe, dorsal to the STS (Bodin et al., 2021). It is noteworthy that, before the third visual pathway for social perception was formally proposed, neurobiological models of audiovisual speech processing had already included the left MT/V5 and pSTS as critical areas (Bernstein and Liebenthal, 2014; Beauchamp, 2016; Hickok et al., 2018). Additionally, the STS has been proposed to be critical for semantic processing, serving as an interface between the auditory component of speech perception and the visual recognition system, providing a substrate for the representation of content words and scenes containing schemas of agents and objects (Aboitiz, 2018a).

FIGURE 2 | Updated version of the visual streams model: the ventral and dorsal pathways are represented by the blue and yellow arrows, respectively, and the third visual pathway proposed by Pitcher and Ungerleider (2020) is depicted in green. While these authors emphasized the role of the third pathway in the right hemisphere (left panel), in this article we focus on its functions in the left hemisphere (right panel).

A Possible Function for the Third Visual Pathway in the Left Hemisphere

Although great emphasis was placed on the right STS, the authors were more elusive with respect to the role of the third visual pathway in the left hemisphere.
In fact, they leave the following questions open: “Is the third pathway lateralized to the right hemisphere in humans? If so, what are the visual functions of the left STS and what is the role of speech?” (Pitcher and Ungerleider, 2020). Here, we advocate for the existence of a third visual pathway for social perception in the left hemisphere and review evidence of the special role of STS for the evolution of multimodal integration of speech. Decades of research on the STS have consistently demonstrated that it supports the audiovisual integration of faces and voices. Neuronal populations of the macaque STS have been shown to respond to both auditory and visual stimuli, especially when the heard vocalizations matched the seen mouth movements. Interestingly, this pattern of responses for face/voice perception has been observed in the right (Perrodin et al., 2014) and the left hemisphere (Ghazanfar et al., 2005, 2008). More recently, in a study using single neuron recordings of face patches in macaques’ left (n = 3) and right (n = 1) hemisphere, Khandhadia et al. (2021) reported greater responses to audiovisual stimuli in the face patch AF (in the aSTS) with respect to AM (in the undersurface of the temporal lobe). These results are consistent with the functional distinction between a lateral visual pathway specialized in social perception of moving faces and a ventral pathway dedicated to more static, structural and unimodal aspects of face processing. In humans, both right and left STS have been reported to process communicative facial and vocal cues, with preferential responses to audiovisual face-voice stimuli and no responses to manual gestures (Deen et al., 2020). Other fMRI studies have reported that different areas of the pSTS are responsive to mouth and eye movements (Puce et al., 1998). Interestingly, only the anterior portion that prefers mouth-movements elicited strong responses to voices, contrasting with the posterior portion who responded Frontiers in Psychology | www.frontiersin.org EVOLUTION AND DEVELOPMENT OF MULTIMODAL INTEGRATION IN THE PRIMATE BRAIN The Rhythmic Evolution of Communication: From Lip-Smacking to Human-Speech Rhythm Speech is produced rhythmically and its temporal structure remains stable across languages, within the range of 2-to-7 Hz with a notable peak in the theta frequency band between 4 and 5 Hz (Poeppel and Assaneo, 2020). Interestingly, the spectral frequency of the speech envelope corresponds to the rate of syllable production (Park et al., 2016). In turn, the acoustic envelopes of speech and orofacial speech movements seem to be tightly time-locked, both modulated in the 2-to-7 Hz 6 March 2022 | Volume 13 | Article 829083 Michon et al. Rhythmic and Multimodal Brains for Speech exploits the statistical regularities present in the audiovisual input to improve speech perception/comprehension (Figure 3). There is now a growing body of studies revealing a similar temporal structure is present in NHP communication. Primates’ vocalizations and communicative calls have been shown to synchronize with the rhythm of facial expressions, such as mouth opening/closing during lip-smacking behaviors. This synchronization between vocalizations and lips movements has been reported in marmosets, macaque rhesus monkeys and chimpanzees. 
Critically for evolutionary accounts of audiovisual speech perception, it appears to be phase-locked in the theta band frequency, matching the syllable production rate observed in humans at approximately 4 Hz (Ghazanfar et al., 2013; Ghazanfar and Takahashi, 2014a,b; Gustison and Bergman, 2017; Pereira et al., 2020; Risueno-Segovia and Hage, 2020). The NHP brain is highly tuned to facial expressions accompanying affiliative calls and, similar to humans, take advantage of orofacial visual cues to speed up auditory processing and to enhance the perception of vocalizations in noisy environments (Chandrasekaran et al., 2011, 2013). Interestingly, the neural mechanisms underlying these behavioral advantages seem to be similar across species, reflected by reduced or suppressed responses in auditory neurons for multimodal compared to unimodal auditory perception (Ghazanfar and Lemus, 2010; Kayser et al., 2010). Altogether the evidence reviewed above demonstrates that both human and NHPs communicate rhythmically, producing coordinated vocalizations and orofacial gestures around 4–5 Hz. Their neural oscillations synchronize to this frequency and take benefit from the consistency of audiovisual regularities in voice onset and mouth opening co-occurrence. Noticeably, the syllable production rate observed across all human languages is already present in the marmoset lip-smacking, suggesting that rhythmic communication may have evolved early in the primate lineage. frequency range. Chandrasekaran et al. (2009) have measured and correlated the speech envelope with the area of mouth opening associated to spontaneous production in English and French audiovisual speech datasets. Their analysis revealed robust correlations between inter-lip distance and speech sounds amplitude but also a consistent interval of 100-to-300 ms between the onset of visual speech (the initial, visible lip movements) and the onset of the corresponding speech sound. This mouth/voice orchestration suggests that, before the brain proceeds with multimodal speech processing, stable and redundant temporal information are already embedded in the audiovisual speech stream itself (Chandrasekaran et al., 2009). During face-to-face conversations, humans take advantage of visual information provided by the speaker’s mouth movements to facilitate speech comprehension, especially when the surrounding environment is noisy (Sumby and Pollack, 1954; also see Crosse et al., 2015). Recent studies have begun to uncover the underlying mechanisms of audiovisual integration in the human brain. Electrophysiological recordings have reported that visual speech speeds up the processing of auditory speech (Van Wassenhove et al., 2005) and allows crossmodal predictions (Michon et al., 2020). This temporal facilitation is consistently reflected by shorter latencies and lower amplitudes of the auditory components N1 and P2 [see Baart (2016) for a critical review]. Interestingly, the facilitation effect and crossmodal predictions are more pronounced for those visual speech cues with salient places of articulation in the upper vocal tract (e.g., bilabial consonant-vowel/ba/) with respect to those produced in the lower vocal tract which are visually less salient (e.g., velar consonant-vowel/ga/). The analysis of oscillatory brain activity has also offered critical insights with respect to audiovisual integration and crossmodal predictions. 
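The mouth-voice coordination and the 100-to-300 ms visual lead described above lend themselves to a simple signal-level illustration. The sketch below is a minimal, hypothetical example (not the pipeline used in the cited studies): it band-limits a synthetic lip-aperture trace and acoustic envelope to the 2-7 Hz range with scipy.signal and estimates their lag by cross-correlation. The sampling rate, the 150 ms lead and all signals are assumptions chosen only for demonstration.

```python
# Illustrative sketch: estimate the lag between mouth opening and the speech
# envelope from synthetic, theta-band signals (not real speech data).
import numpy as np
from scipy.signal import butter, filtfilt, correlate

fs = 100.0                      # assumed sampling rate (Hz) shared by both series
n = int(60 * fs)                # one minute of synthetic "speech"
rng = np.random.default_rng(0)

def bandpass(x, fs, lo=2.0, hi=7.0, order=4):
    """Restrict a signal to the 2-7 Hz range typical of the speech envelope."""
    b, a = butter(order, [lo / (fs / 2), hi / (fs / 2)], btype="band")
    return filtfilt(b, a, x)

# Quasi-rhythmic mouth-opening trace (band-limited noise in the theta range);
# the acoustic envelope is a noisy copy delayed by 150 ms, mimicking the
# 100-300 ms visual lead reported for natural speech.
mouth_opening = bandpass(rng.standard_normal(n), fs)
true_lag = int(0.15 * fs)
speech_envelope = np.roll(mouth_opening, true_lag) + 0.3 * bandpass(rng.standard_normal(n), fs)

env = speech_envelope - speech_envelope.mean()
lips = mouth_opening - mouth_opening.mean()

# Cross-correlate to find the lag at which mouth opening best predicts the envelope.
xcorr = correlate(env, lips, mode="full")
lags = np.arange(-n + 1, n) / fs
best_lag = lags[np.argmax(xcorr)]
r = np.corrcoef(np.roll(lips, int(best_lag * fs)), env)[0, 1]

print(f"estimated visual-to-auditory lag: {best_lag * 1000:.0f} ms, peak correlation r = {r:.2f}")
```

On this toy input the recovered lag is close to the simulated 150 ms lead, which is the qualitative pattern the behavioral literature reports for real audiovisual speech.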
Using magnetoencephalography, Park and collaborators demonstrated that the perception of speaking lips entrains visual cortex oscillations and modulates the activity of the auditory cortex (Park et al., 2016). In line with these results, a recent study using intracortical recordings reported that neurons of the auditory cortex track the temporal dynamics of visual speech cues based on their phase of oscillations (Mégevand et al., 2020). Another intracortical study found a sub-additive effect in which responses to audiovisual speech were weaker compared to auditory speech only in the left posterior superior temporal gyrus, suggesting that visual speech optimizes auditory processing efficiency (Metzger et al., 2020). Importantly, a partial coherence between the left motor region oscillations and lip movements rate have also been identified that directly predicted the participants performance on comprehension, suggesting that motor cortex could facilitate the integration of audiovisual speech through predictive coding and active sensing (Park et al., 2016, 2018). Several recent studies have proposed that visual cortex entrainment to rhythmic lip motion modulates the responses of auditory cortex via theta phase synchronization (Crosse et al., 2015; Zoefel, 2021; see Figure 3), including when visual speech only is presented (Bourguignon et al., 2020; Biau et al., 2021). Human speech is rhythmic and multimodal; our voices and mouth movements are temporally coordinated when we speak and the oscillatory activity of our brain couples with and Frontiers in Psychology | www.frontiersin.org Volitional Control of the Vocal Tract Additionally to their analogous rhythmic patterns, the production of human speech and primate lip-smacking involves a common cortical network including the IFG, the ventrolateral and dorsomedial prefrontal cortex (vlPFC and dmPFC) in humans and NHPs (Rizzolatti and Craighero, 2004; Petrides, 2005; García et al., 2014; Neubert et al., 2014). These shared anatomico-functional properties across species are in line with previous cytoarchitectonic studies establishing the vlPFC as the NHP homolog of Broca’s area, both structures being responsible for the initiation of vocal communicative behaviors (Petrides and Pandya, 2002; Petrides et al., 2005). In macaques, cognitive control required to produce volitional vocalizations has been shown to consistently recruit the IFG (Gavrilov et al., 2017; Loh et al., 2017; Shepherd and Freiwald, 2018). Other studies, using single neuron recordings, confirmed that the vlPFC elicits dedicated responses during volitional initiation of vocalizations (Hage and Nieder, 2013; Gavrilov and Nieder, 2021). Recent research in humans indicates that left vlPFC and premotor cortex also supports the control of voluntary orofacial movements (Loh et al., 2020; Maffei et al., 2020). This evidence suggests that the inferior frontal region has an ancestral role for orofacial (lip-smacking) and vocal (affiliative 7 March 2022 | Volume 13 | Article 829083 Michon et al. Rhythmic and Multimodal Brains for Speech FIGURE 3 | Rhythmic properties of audiovisual speech and cortical oscillations. Visual and auditory speech cues as well as neural oscillations are depicted in green and yellow, respectively. The blue array in the bottom panel indicates feedforward modulation of auditory cortex responses via theta synchronization to visual cortex oscillatory rhythm. and communicative behaviors but extends to perception as well. 
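As a rough intuition for the lip-to-cortex entrainment findings summarized above, one can ask whether the spectral coherence between a lip-aperture signal and a cortical time course peaks in the theta band. The following is a hedged, synthetic sketch only (toy signals fed to scipy.signal.coherence); real MEG or intracortical analyses involve preprocessing, source estimation and statistics that are not shown, and the "cortical" trace here is simulated, not recorded.

```python
# Illustrative sketch: theta-band coherence between a lip-aperture rhythm and a
# simulated cortical signal that partially tracks it with a short delay.
import numpy as np
from scipy.signal import coherence

fs = 200.0
t = np.arange(0.0, 60.0, 1.0 / fs)
rng = np.random.default_rng(1)

# Lip aperture oscillating near the syllabic rate (~4.5 Hz) plus noise.
lip_aperture = np.sin(2 * np.pi * 4.5 * t) + 0.5 * rng.standard_normal(t.size)

# Simulated "auditory cortex" trace: partly entrained to the lip rhythm with a
# 100 ms delay, partly ongoing activity unrelated to the stimulus.
delay = int(0.1 * fs)
cortical = 0.6 * np.roll(lip_aperture, delay) + 1.0 * rng.standard_normal(t.size)

# Welch-based coherence; 4 s segments give 0.25 Hz frequency resolution.
f, coh = coherence(lip_aperture, cortical, fs=fs, nperseg=int(4 * fs))
theta = (f >= 2) & (f <= 7)
print(f"peak coherence {coh[theta].max():.2f} at {f[theta][np.argmax(coh[theta])]:.1f} Hz")
```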
In humans as in NHPs, the visual and auditory ventral pathways project axonal terminals into vlPFC (Romanski, 2007; Hage and Nieder, 2016). In line with this structural overlap, a neural population was found in the vlPFC of rhesus monkeys that responds to the perception of both conspecific faces and vocalizations (Sugihara et al., 2006; Romanski, 2012; Diehl and Romanski, 2013) and is also recruited when monkeys produced vocalizations (Hage and Nieder, 2015). Moreover, a recent study using electric stimulation combined with fMRI revealed a common effective connectivity between auditory cortex and vlPFC in human and monkey brains (Rocchi et al., 2021). These results turn the vlPFC into a phylogenetically conserved trimodal region for the integration of audiovisual and motoric aspects of calls) control in NHP communication, which could be regarded as a phylogenetic precursor of human speech control. Because human vocalizations are much more complex, it was argued for decades that primate lip-smacking and orofacial communication could not have served as an evolutionary building block of human speech. More recently however, accounting for the above-mentioned evidence, emerging theories are advocating for a common evolutionary origin of vocal-facial communicative gestures that could have arisen well before the hominin radiation (Aboitiz and García, 1997; Morrill et al., 2012; Ghazanfar, 2013; Ghazanfar and Takahashi, 2014a,b; Shepherd and Freiwald, 2018; Michon et al., 2019; Brown et al., 2021). Importantly, the phylogenetic role of vlPFC is not limited to the control of orofacial effectors for the production of speech Frontiers in Psychology | www.frontiersin.org 8 March 2022 | Volume 13 | Article 829083 Michon et al. Rhythmic and Multimodal Brains for Speech conserved anatomical and functional features, we argue that the vlPFC plays a critical role in the integration of audiovisual and motoric aspects of communication and may have contributed to the emergence of human speech. Nevertheless, important crossspecies differences have been documented in the connectivity between LMC and brainstem nuclei, specifically the connections to the ambiguous nucleus are more robust and direct in human brains compared to NHP brains. This difference of connectivity strength could explain why human speech has evolved toward more complex vocal and orofacial sequences compared to NHP lip-smacking (Brown et al., 2021). communication that may have contributed to the emergence of human speech (Michon et al., 2019). It is noteworthy that the synchronization of speaking mouths and voices around the 4.5 Hz has been proposed to emerge as a consequence of an intrinsic speech-motor rhythm observed in humans (Assaneo and Poeppel, 2018). In other words, mouth movements and vocalizations couple around the same frequency band because they both represent the sensory consequences of complex sequences of the orofacial effectors and vocal tract movements, which are produced at this particular rhythm. Using principal component analysis to investigate the joint variation of facial and vocal movements, a recent study combining videos of human faces articulating speech and MRI sequences of the speaker’s vocal tract has shown that sufficient information is available in the configuration of a speaking face to recover the full configuration of the vocal tract (Scholes et al., 2020). 
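The face-to-vocal-tract recovery reported by Scholes et al. (2020) rests on the idea that facial and vocal tract measurements covary through a small number of shared articulatory dimensions. The toy sketch below illustrates that logic only: it generates synthetic "face" and "vocal tract" features from a common latent state, then uses PCA plus a linear mapping to recover the vocal tract configuration from the face. All variable names, dimensions and data are assumptions for illustration, not the published method, which used real video and vocal tract MRI.

```python
# Toy sketch: recover vocal tract configuration from facial features via PCA +
# linear regression, assuming both are readouts of a shared articulatory state.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
n_frames, n_latent = 2000, 5            # frames of "speech", shared articulatory dims
latent = rng.standard_normal((n_frames, n_latent))

# Face (e.g., lip landmarks) and vocal tract (e.g., MRI contour points) are both
# noisy linear readouts of the same underlying articulatory state.
face = latent @ rng.standard_normal((n_latent, 40)) + 0.1 * rng.standard_normal((n_frames, 40))
tract = latent @ rng.standard_normal((n_latent, 60)) + 0.1 * rng.standard_normal((n_frames, 60))

train, test = slice(0, 1500), slice(1500, None)

# Reduce the facial measurements to a few principal components, then map those
# components onto the vocal tract configuration with a linear model.
pca = PCA(n_components=n_latent).fit(face[train])
reg = LinearRegression().fit(pca.transform(face[train]), tract[train])

pred = reg.predict(pca.transform(face[test]))
ss_res = ((tract[test] - pred) ** 2).sum()
ss_tot = ((tract[test] - tract[test].mean(axis=0)) ** 2).sum()
print(f"vocal tract variance explained from the face alone: {1 - ss_res / ss_tot:.2f}")
```

The point of the sketch is the design choice, not the numbers: when facial and vocal tract dynamics share low-dimensional structure, a model fit on paired data can reconstruct the hidden articulators from the visible face, which is the premise behind treating visible speech as informative about the whole vocal tract.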
The part of the face that contributes the most to the recovery of vocal tract configuration are those parts who are required to produce speech sounds (e.g., upper and lower lips for bilabial phonemes or the back of the tongue for velar phonemes). In humans, the LMC is thought to be located in the primary motor cortex, more specifically in area 4 and to have direct monosynaptic projections to the ambiguous nucleus, the seat of laryngeal motoneurons in the brainstem controlling the vibration of the vocal cords. In NHPs by contrast, the LMC is located in the area 6 of the premotor cortex and connects to laryngeal motoneurons only indirectly via interneurons of the reticular formation (Simonyan and Horwitz, 2011; Simonyan, 2014). Additionally, tractography analyses have revealed that human LMC connectivity with somatosensory and inferior parietal cortices are strongly enhanced compared to its NHP homolog (Kumar et al., 2016). The latter suggests that the evolution of LMC connectivity with both brainstem nuclei and temporoparietal cortex may have contributed to a greater control over the vocal tract for volitional vocalizations and to higherorder sensorimotor coordination in response to social perception demands, respectively. Recently, both anatomic and functional research have proposed a division of the human LMC into a dorsal and a ventral portion (Belyk et al., 2021; Hickok et al., 2021). The dorsal laryngeal motor cortex (dLMC) has been shown to be causally involved in the control of laryngeal muscles involved in voluntary vocalizations and vocal pitch modulations used to convey meaning in human speech production (Dichter et al., 2018). The dLMC shows greater connectivity and a consistent role in laryngeal motor control whereas the ventral one has fewer projections, suggesting that it could be part of the premotor cortex as NHPs’ LMC (Dichter et al., 2018; Eichert et al., 2020). Even though it was recently associated with verbal fluency in individuals who stutter (Neef et al., 2021) and with respiration coordination for vocal-motor control (Belyk et al., 2021), the function of the ventral LMC remains mostly unknown. In sum, the evidence reviewed in this section indicates that humans and NHPs present structural and functional homologies for the volitional control of the vocal tract in the vlPFC. Crucially, in addition to its role in vocal production, this region also responds to the perception of vocalizations and orofacial movements in both species. According to its phylogenetically Frontiers in Psychology | www.frontiersin.org Cross-Species Developmental Trajectories of Multimodal Integration One of the first multimodal associations that an infant must learn is the matching between her caregivers’ faces and voices. During their first months of life, human infants are capable to discriminate a wide variety of non-native stimuli but loose this ability by the end of the first year. This counterintuitive developmental pattern of perception is known as perceptual narrowing and has been described for speech sounds, faces (Kuhl et al., 2006; Krasotkina et al., 2021) and music (Hannon and Trehub, 2005). For instance, 6-to-8 but not 10-to-12 months old English infants were capable to discriminate non-native phonemic contrasts (Werker and Tees, 1984). At a similar developmental timing, the same phenomenon occurs for nonnative faces, including faces from different races (Kelly et al., 2007) or species (Pascalis et al., 2002). 
Interestingly, the visual discrimination of speech is also subject to a perceptual narrowing between 6 and 11 months of age (Pons et al., 2009). An accepted interpretation of this regression in the perception of non-native stimuli propose that the visual and auditory systems are progressively tuning in favor of the particular input infants are exposed to (i.e., native faces and speech sounds). The refinement of perception for conspecific’s voices and faces is thought to optimize the processing of the relevant information used within one’s native social group (Lewkowicz and Ghazanfar, 2009). As mentioned above, monkey lip-smacking and human speech converge on a ∼5 Hz rhythm but they were also demonstrated to share homologous developmental mechanisms strongly supporting “the idea that human speech rhythm evolved from the rhythmic facial expressions of our primate ancestors” (Morrill et al., 2012, p.3). In both NHPs and humans, environmental variables seem to foster the development of social perception skills. Dahl et al. (2013) investigated the development of face perception in a colony of captive young and older chimpanzees with lifelong exposure to non-conspecific faces (human scientists) and showed that younger apes discriminate conspecific faces better than human faces, but older apes elicited the opposite pattern, discriminating better human than conspecific faces. The results suggest the existence of early mechanisms that favor perception tuning toward native-species stimuli and of late mechanisms that narrow the perceptual system along with the critical information of the faces frequently encountered in the environment (for older captive monkeys, human faces). Controlling for genetics, perinatal experience 9 March 2022 | Volume 13 | Article 829083 Michon et al. Rhythmic and Multimodal Brains for Speech Noticeably, within their first year of life, infants of both species show a progressive attunement for the processing of native or species-specific visual (faces) and auditory (vocalizations) social stimuli. and growth, a study conducted on infant marmoset twins who were exposed to different amount of social reinforcement demonstrated that infants receiving more contingent parental feedback show an increased rate of vocal development with respect to their twins who were provided less contingent feedback (Takahashi et al., 2017). Another example of the role of experience are human infants raised in bilingual environment, who exhibit a prolonged perceptual narrowing (Werker and Hensch, 2015). Bilingually raised infants were able to discriminate non-native speech sounds, which age-matched monolingual infants were no longer able to discriminate (Petitto et al., 2012; ByersHeinlein and Fennell, 2014; also see Kuhl et al., 2003). This influence of linguistic exposure has also been reported for visual discrimination of speech (Weikum et al., 2007; Sebastián-Gallés et al., 2012). It is known that around the sixth month, when infants start babbling, they start to spend more time looking at the part of the face that conveys linguistic information (i.e., the mouth) and that visual attention returns to the eyes around the end of the first year when they have formed their native phonological repertoire (Lewkowicz and Hansen-Tift, 2012). 
Bilingual infants attend more to the mouth than to the eyes of a speaking face from an earlier age and for a longer period of time, taking advantage of the multimodal input to support the acquisition of their two languages and respective phonological repertoires (Pons et al., 2015). It is noteworthy that the additional linguistic information provided by lip movements has recently been demonstrated to foster expressive language skills during the second half of the first year (Tsang et al., 2018) and to improve the learning and recognition of novel words in 24-month-old monolingual and bilingual toddlers (Weatherhead et al., 2021). Interestingly, this preferential orientation of visual attention toward the mouth has been reported in adults as well; when exposed to a non-native second language, adults attend more to the speaking mouth, independently of their level of proficiency (Birulés et al., 2020). Adjusting for between-species differences in developmental timescale, a recent study compared infant rhesus macaques' and human infants' face processing strategies, revealing a highly similar U-shaped pattern of changes in visual engagement with the eyes of unfamiliar conspecifics. However, it also showed that human infants visually engage with the mouth to a greater extent than macaque infants do, suggesting that the process of language acquisition may require an increased reliance on the information conveyed by orofacial movements (Wang et al., 2020). Using functional near-infrared spectroscopy, Altvater-Mackensen and Grossmann (2016) reported that 6-month-old infants who prefer to look at speakers' mouths exhibit enhanced responses in the left inferior frontal cortex compared to infants who prefer the eyes of a speaker. In accordance with the functions of the IFG discussed above (see section "Volitional Control of the Vocal Tract"), the authors conclude that this region plays a crucial role in multimodal association during native language attunement (Altvater-Mackensen and Grossmann, 2016). Taken together, the evidence supports the idea that, despite some differences in rate due to their heterochronous neural development, humans and NHPs share similar developmental trajectories for the multimodal integration of social stimuli.

DISCUSSION

The current review addresses the rhythmic and multimodal aspects of communication and the brain mechanisms that could have scaffolded human brain readiness for social interactions during evolution. Particular emphasis was placed on the importance of sensorimotor development, of domain-general associative learning of multimodal information, and of socially rewarding interactions for the development of communicative behaviors such as imitation and speech during infancy. On the other hand, we integrated recent evidence on anatomical and functional homologies and differences between the human and non-human primate social brain, specifically for the perceptual processing of dynamic social cues (such as voices and faces) and for the volitional control of the vocal tract. We propose to synthesize the findings of this review around five questions that, in our view, contribute to a better understanding of the domain-general mechanisms and properties of the primate brain underlying the evolution and development of speech.

Inborn Module for Imitation or Sensorimotor Development?

We began this review by addressing the controversies surrounding the longstanding theory of neonatal imitation in humans and NHPs.
Recent data re-analyses and meta-analyses have raised serious concerns about the reliability of the gold-standard methods used in neonatal imitation research. As a consequence, the idea of a Homo imitans endowed with innate imitative abilities has been strongly challenged. Alternatively, imitation may rely on crossmodal associations of sensorimotor information (e.g., visuomotor associations for facial imitation and audiomotor associations for vocal imitation). The evidence surveyed here from developmental psychology, comparative neuroanatomy, and cognitive neuroscience indicates that human imitation and language are the result of brain adaptations shaped predominantly by cultural evolution. Rather than an exclusively innate skill, the evidence reviewed points toward imitation as an ability that develops during infancy and childhood, supported by the maturation of sensorimotor brain networks and by domain-general associative learning of multimodal information, both fostered by socially rewarding interactions.

What Is the Role of the Mirror Neuron System for Imitation and Communication?

Iacoboni and Dapretto (2006) proposed a neural circuit for imitation that includes the pSTS, where visual input is processed and sent to the inferior parietal lobule, which is concerned with the motoric aspects of the action and projects into the IFG and ventral premotor cortex, where the goal of the action is recognized. Importantly, they also claim the existence of "efference copies of motor imitative commands that are sent back to the STS to allow matching between the sensory predictions of imitative motor plans and the visual description of the observed action" (Iacoboni and Dapretto, 2006). This network represents a suitable candidate for coordinating the processing of visual information and the execution of the corresponding motor sequence required for the imitation of facial expressions, such as lip or tongue protrusion. It is noteworthy that the areas involved in this circuit widely overlap with well-established regions of the mirror neuron system. The findings of the current review point toward a substantial role of the mirror properties of these brain areas in supporting the learning of multimodal associations.

Does the Third Visual Pathway in the Language-Dominant Hemisphere Play a Role for Audiovisual Integration of Speech?
As discussed in the third section of this review, recent evidence suggests that the pSTS is part of a third visual pathway that plays a critical role in social perception. Since it is specialized for the processing of biological movements in both humans and NHPs, this area seems highly suited for gesture and facial expression imitation. In the left, language-dominant hemisphere, neural populations of the pSTS preferentially respond to both orofacial movements and vocalizations. For instance, regions that respond preferentially to mouths (vs. eyes) also fire in response to conspecific voices. In line with the latter, the temporal voice area identified in both humans and NHPs has a privileged location in the anterior temporal lobe, dorsal to the STS. Although empirical studies are still needed to properly address this hypothesis, we suggest that the anatomical and functional characteristics of the third visual pathway in the left hemisphere make it a well-suited circuit to support the audiovisual integration of speech and lip-smacking. Future research in this field should investigate brain activity lateralization during the processing of speaking faces in the other regions of the third visual pathway, namely, early visual areas (V1 and MT/V5) as well as the aSTS.

What Can Species-Specific Sensory Development Tell Us About the Evolutionary Origins of Speech?

We showed that, despite some differences due to the timing of their neural development, humans and NHPs share similar developmental trajectories for the multimodal integration of social stimuli. Noticeably, within their first year of life, infants of both species show a progressive attunement for the processing of native or species-specific visual (faces) and auditory (vocalizations) social stimuli. Importantly, this perceptual narrowing is highly influenced by environmental variables, such as an enriched linguistic environment or the contingency of parental feedback, supporting the notion that early multimodal association learning is mediated by engagement in socially relevant and rewarding interactions. In turn, since infants who dedicate greater attentional resources to the mouth (vs. the eyes) of a speaker show greater expressive language development, we argue that the visual speech cues offered by speakers' mouth movements are an important part of the linguistic input during infancy and childhood, benefiting both language perception and production. Crucially, the prolonged wearing of opaque facemasks by caregivers in nurseries and by pre-school teachers in the context of the current global pandemic may have adverse consequences for infants' language acquisition, especially for those with language learning impairments, since visual speech cues are no longer accessible when a speaker wears an opaque mask. Finally, as mentioned at the end of the last section, human infants visually engage with the mouth to a greater extent than macaque infants do, suggesting that an increased reliance on the information conveyed by orofacial movements is required for language acquisition relative to lip-smacking, which involves less complex articulatory sequences and vocalizations than human speech.

What Can Cross-Species Homologies and Differences Tell Us About the Evolutionary Origins of Speech?

In the last section of this review, we offer insights about the phylogenetic evolution and ontogenetic development of the multimodal integration of speech, accounting for cross-species homologies and differences in brain anatomy, function and developmental trajectories.
We first reviewed evidence for a common evolutionary rhythm in humans' and NHPs' production of orofacial and vocal behaviors, phase-locked in the theta frequency band with a peak around 4-to-5 Hz. It was suggested that this synchronization of visual (faces) and auditory (voices) cues during social communication emerges as a result of an intrinsic motor-speech rhythm imposed by a common generator, namely the vocal tract. Then, we surveyed humans' and NHPs' structural and functional homologies for the volitional control of the vocal tract in the vlPFC. Crucially, this region also responds to the perception of vocalizations and orofacial movements in both species, converting the vlPFC into a potential phylogenetically conserved trimodal region for the integration of the audiovisual and motoric aspects of communication that may have contributed to the emergence of human speech (Aboitiz and García, 1997). Important cross-species differences have been documented, however, in the pattern of connectivity between the LMC and brainstem nuclei. More specifically, the connections with those nuclei that control the muscles engaged in vocal fold vibration and orofacial movements are more direct and robust in human brains compared to NHP brains. The strengthening of this structural connectivity across species evolution may have contributed to the development of the finer vocal and orofacial motor control required for both imitation and speech production.

AUTHOR CONTRIBUTIONS

MM took the lead in writing the manuscript. JZ-A and FA provided critical feedback and helped shape the theoretical analysis and the manuscript's text. All authors contributed to the planning of the manuscript.

FUNDING

This research was supported by a post-doctoral fellowship to MM (Grant No. 3201057), by a regular Fondecyt grant to FA (Grant No. 1210659), and by an initiation grant to JZ-A (Grant No. 11201224) from the Agencia Nacional de Investigación y Desarrollo (ANID) of the Chilean government.

ACKNOWLEDGMENTS

We would like to thank the Agencia Nacional de Investigación y Desarrollo (ANID) of the Chilean government for funding this research.

REFERENCES

Byers-Heinlein, K., and Fennell, C. T. (2014). Perceptual narrowing in the context of increased variation: insights from bilingual infants. Dev. Psychobiol. 56, 274–291. doi: 10.1002/dev.21167
Catmur, C., Walsh, V., and Heyes, C. (2009). Associative sequence learning: the role of experience in the development of imitation and the mirror system. Philos. Trans. R. Soc. B Biol. Sci. 364, 2369–2380. doi: 10.1098/rstb.2009.0048
Chandrasekaran, C., Lemus, L., and Ghazanfar, A. A. (2013). Dynamic faces speed up the onset of auditory cortical spiking responses during vocal detection. Proc. Natl. Acad. Sci. U S A. 110, E4668–E4677. doi: 10.1073/pnas.1312518110
Chandrasekaran, C., Lemus, L., Trubanova, A., Gondan, M., and Ghazanfar, A. A. (2011). Monkeys and humans share a common computation for face/voice integration. PLoS Comp. Biol. 7:e1002165. doi: 10.1371/journal.pcbi.1002165
Chandrasekaran, C., Trubanova, A., Stillittano, S., Caplier, A., and Ghazanfar, A. A. (2009). The natural statistics of audiovisual speech. PLoS Comp. Biol. 5:e1000436. doi: 10.1371/journal.pcbi.1000436
Choi, D., Kandhadai, P., Danielson, D. K., Bruderer, A. G., and Werker, J. F. (2017). Does early motor development contribute to speech perception? Behav. Brain Sci. 40:e388. doi: 10.1017/S0140525X16001308
Corbetta, M., and Shulman, G. L. (2002). Control of goal-directed and stimulus-driven attention in the brain. Nat. Rev. Neurosci. 3, 201–215. doi: 10.1038/nrn755
Cross, E. S., Kraemer, D. J., Hamilton, A. F. D. C., Kelley, W. M., and Grafton, S. T. (2009). Sensitivity of the action observation network to physical and observational learning. Cereb. Cortex 19, 315–326. doi: 10.1093/cercor/bhn083
Crosse, M. J., Butler, J. S., and Lalor, E. C. (2015). Congruent visual speech enhances cortical entrainment to continuous auditory speech in noise-free conditions. J. Neurosci. 35, 14195–14204. doi: 10.1523/JNEUROSCI.1829-15.2015
Dahl, C. D., Rasch, M. J., Tomonaga, M., and Adachi, I. (2013). Developmental processes in face perception. Sci. Rep. 3:1044.
Davis, J., Redshaw, J., Suddendorf, T., Nielsen, M., Kennedy-Costantini, S., Oostenbroek, J., et al. (2021). Does neonatal imitation exist? Insights from a meta-analysis of 336 effect sizes. Perspect. Psychol. Sci. 16:1745691620959834. doi: 10.1177/1745691620959834
Deen, B., Saxe, R., and Kanwisher, N. (2020). Processing communicative facial and vocal cues in the superior temporal sulcus. Neuroimage 221:117191. doi: 10.1016/j.neuroimage.2020.117191
De Vries, J. I., Visser, G. H. A., and Prechtl, H. F. (1984). Fetal motility in the first half of pregnancy. Clin. Dev. Med. 94, 46–64.
D'Elia, A., Pighetti, M., Moccia, G., and Santangelo, N. (2001). Spontaneous motor activity in normal fetuses. Early Hum. Dev. 65, 139–147. doi: 10.1016/s0378-3782(01)00224-9
di Pellegrino, G., Fadiga, L., Fogassi, L., Gallese, V., and Rizzolatti, G. (1992). Understanding motor events: a neurophysiological study. Exp. Brain Res. 91, 176–180. doi: 10.1007/BF00230027
Dichter, B. K., Breshears, J. D., Leonard, M. K., and Chang, E. F. (2018). The control of vocal pitch in human laryngeal motor cortex. Cell 174, 21–31. doi: 10.1016/j.cell.2018.05.016
Diehl, M. M., and Romanski, L. M. (2013). "Processing and integration of faces and vocalizations in the primate prefrontal cortex," in Integrating Face and Voice in Person Perception, eds P. Belin, S. Campanella, and T. Ethofer (New York, NY: Springer).
Aboitiz, F. (2017). A Brain For Speech. A View from Evolutionary Neuroanatomy. New York, NY: Palgrave Macmillan.
Aboitiz, F. (2018b). Voice, gesture and working memory in the emergence of speech. Interaction Studies 19, 70–85. doi: 10.1075/bct.112.06abo
Aboitiz, F. (2018a). A brain for speech. Evolutionary continuity in primate and human auditory-vocal processing. Front. Neurosci. 12:174. doi: 10.3389/fnins.2018.00174
Aboitiz, F., and García, R. (1997). The evolutionary origin of the language areas in the human brain: a neuroanatomical perspective. Brain Res. Rev. 25, 381–396. doi: 10.1016/s0165-0173(97)00053-2
Abravanel, E., and Sigafoos, A. D. (1984). Exploring the presence of imitation during early infancy. Child Dev. 55, 381–392. doi: 10.2307/1129950
Altvater-Mackensen, N., and Grossmann, T. (2016). The role of left inferior frontal cortex during audiovisual speech perception in infants. NeuroImage 133, 14–20. doi: 10.1016/j.neuroimage.2016.02.061
Assaneo, M. F., and Poeppel, D. (2018). The coupling between auditory and motor cortices is rate-restricted: evidence for an intrinsic speech-motor rhythm. Sci. Adv. 4:eaao3842. doi: 10.1126/sciadv.aao3842
Athari, P., Dey, R., and Rvachew, S. (2021). Vocal imitation between mothers and infants. Infant Behav. Dev. 63:101531. doi: 10.1016/j.infbeh.2021.101531
Baart, M. (2016). Quantifying lip-read-induced suppression and facilitation of the auditory N1 and P2 reveals peak enhancements and delays. Psychophysiology 53, 1295–1306. doi: 10.1111/psyp.12683
Beauchamp, M. S. (2016). "Audiovisual speech integration: neural substrates and behavior," in Neurobiology of Language, eds G. Hickok and S. L. Small (Cambridge, MA: Academic Press), 515–526. doi: 10.1016/b978-0-12-407794-2.00042-0
Belyk, M., Brown, R., Beal, D. S., Roebroeck, A., McGettigan, C., Guldner, S., et al. (2021). Human larynx motor cortices coordinate respiration for vocal-motor control. NeuroImage 239:118326. doi: 10.1016/j.neuroimage.2021.118326
Bernstein, L. E., and Liebenthal, E. (2014). Neural pathways for visual speech perception. Front. Neurosci. 8:386. doi: 10.3389/fnins.2014.00386
Biau, E., Wang, D., Park, H., Jensen, O., and Hanslmayr, S. (2021). Auditory detection is modulated by theta phase of silent lip movements. Curr. Res. Neurobiol. 2:100014. doi: 10.1016/j.crneur.2021.100014
Binder, E., Dovern, A., Hesse, M. D., Ebke, M., Karbe, H., Saliger, J., et al. (2017). Lesion evidence for a human mirror neuron system. Cortex 90, 125–137. doi: 10.1016/j.cortex.2017.02.008
Birulés, J., Bosch, L., Pons, F., and Lewkowicz, D. J. (2020). Highly proficient L2 speakers still need to attend to a talker's mouth when processing L2 speech. Lang. Cogn. Neurosci. 35, 1314–1325. doi: 10.1080/23273798.2020.1762905
Bodin, C., Trapeau, R., Nazarian, B., Sein, J., Degiovanni, X., Baurberg, J., et al. (2021). Functionally homologous representation of vocalizations in the auditory cortex of humans and macaques. Curr. Biol. 31, 4839–4844.e4. doi: 10.1016/j.cub.2021.08.043
Bourguignon, M., Baart, M., Kapnoula, E. C., and Molinaro, N. (2020). Lip-reading enables the brain to synthesize auditory features of unknown silent speech. J. Neurosci. 40, 1053–1065. doi: 10.1523/JNEUROSCI.1101-19.2019
Brown, S., Yuan, Y., and Belyk, M. (2021). Evolution of the speech-ready brain: the voice/jaw connection in the human motor cortex. J. Comp. Neurol. 529, 1018–1028. doi: 10.1002/cne.24997
Eichert, N., Papp, D., Mars, R. B., and Watkins, K. E. (2020). Mapping human laryngeal motor cortex during vocalization. Cereb. Cortex 30, 6254–6269. doi: 10.1093/cercor/bhaa182
Ferguson, C. J., and Heene, M. (2012). A vast graveyard of undead theories: publication bias and psychological science's aversion to the null. Perspect. Psychol. Sci. 7, 555–561. doi: 10.1177/1745691612459059
Ferrari, P. F., Visalberghi, E., Paukner, A., Fogassi, L., Ruggiero, A., and Suomi, S. J. (2006). Neonatal imitation in rhesus macaques. PLoS Biol. 4:e302. doi: 10.1371/journal.pbio.0040302
García, R. R., Zamorano, F., and Aboitiz, F. (2014). From imitation to meaning: circuit plasticity and the acquisition of a conventionalized semantics. Front. Hum. Neurosci. 8:605. doi: 10.3389/fnhum.2014.00605
Gavrilov, N., and Nieder, A. (2021). Distinct neural networks for the volitional control of vocal and manual actions in the monkey homologue of Broca's area. eLife 10:e62797. doi: 10.7554/eLife.62797
Gavrilov, N., Hage, S. R., and Nieder, A. (2017). Functional specialization of the primate frontal lobe during cognitive control of vocalizations. Cell Rep. 21, 2393–2406. doi: 10.1016/j.celrep.2017.10.107
Ghazanfar, A. A. (2013). Multisensory vocal communication in primates and the evolution of rhythmic speech. Behav. Ecol. Sociobiol. 67, 1441–1448. doi: 10.1007/s00265-013-1491-z
Ghazanfar, A. A., and Lemus, L. (2010). Multisensory integration: vision boosts information through suppression in auditory cortex. Curr. Biol. 20, R22–R23. doi: 10.1016/j.cub.2009.11.046
Ghazanfar, A. A., and Takahashi, D. Y. (2014a). Facial expressions and the evolution of the speech rhythm. J. Cogn. Neurosci. 26, 1196–1207. doi: 10.1162/jocn_a_00575
Ghazanfar, A. A., and Takahashi, D. Y. (2014b). The evolution of speech: vision, rhythm, cooperation. Trends Cogn. Sci. 18, 543–553. doi: 10.1016/j.tics.2014.06.004
Ghazanfar, A. A., Chandrasekaran, C., and Logothetis, N. K. (2008). Interactions between the superior temporal sulcus and auditory cortex mediate dynamic face/voice integration in rhesus monkeys. J. Neurosci. 28, 4457–4469. doi: 10.1523/JNEUROSCI.0541-08.2008
Ghazanfar, A. A., Maier, J. X., Hoffman, K. L., and Logothetis, N. K. (2005). Multisensory integration of dynamic faces and voices in rhesus monkey auditory cortex. J. Neurosci. 25, 5004–5012. doi: 10.1523/JNEUROSCI.0799-05.2005
Ghazanfar, A. A., Morrill, R. J., and Kayser, C. (2013). Monkeys are perceptually tuned to facial expressions that exhibit a theta-like speech rhythm. Proc. Natl. Acad. Sci. U S A. 110, 1959–1963. doi: 10.1073/pnas.1214956110
Goodale, M. A., and Milner, A. D. (1992). Separate visual pathways for perception and action. Trends Neurosci. 15, 20–25. doi: 10.1016/0166-2236(92)90344-8
Gustison, M. L., and Bergman, T. J. (2017). Divergent acoustic properties of gelada and baboon vocalizations and their implications for the evolution of human speech. J. Lang. Evol. 2, 20–36. doi: 10.1093/jole/lzx015
Hage, S. R., and Nieder, A. (2013). Single neurons in monkey prefrontal cortex encode volitional initiation of vocalizations. Nat. Commun. 4:2409. doi: 10.1038/ncomms3409
Hage, S. R., and Nieder, A. (2015). Audio-vocal interaction in single neurons of the monkey ventrolateral prefrontal cortex. J. Neurosci. 35, 7030–7040. doi: 10.1523/JNEUROSCI.2371-14.2015
Hage, S. R., and Nieder, A. (2016). Dual neural network model for the evolution of speech and language. Trends Neurosci. 39, 813–829. doi: 10.1016/j.tins.2016.10.006
Hannon, E. E., and Trehub, S. E. (2005). Tuning in to musical rhythms: infants learn more readily than adults. Proc. Natl. Acad. Sci. U S A. 102, 12639–12643. doi: 10.1073/pnas.0504254102
Hata, T., Kanenishi, K., Akiyama, M., Tanaka, H., and Kimura, K. (2005). Real-time 3-D sonographic observation of fetal facial expression. J. Obstetrics Gynaecol. Res. 31, 337–340. doi: 10.1111/j.1447-0756.2005.00298.x
Heiser, M., Iacoboni, M., Maeda, F., Marcus, J., and Mazziotta, J. C. (2003). The essential role of Broca's area in imitation. Eur. J. Neurosci. 17, 1123–1128. doi: 10.1046/j.1460-9568.2003.02530.x
Heyes, C. (2016). Homo imitans? Seven reasons why imitation couldn't possibly be associative. Philos. Trans. R. Soc. B: Biol. Sci. 371:20150069. doi: 10.1098/rstb.2015.0069
Heyes, C. (2021). Imitation and culture: what gives? Mind Lang. 1–22. doi: 10.1111/mila.12388
Heyes, C., and Catmur, C. (2022). What happened to mirror neurons? Perspect. Psychol. Sci. 17, 153–168. doi: 10.1177/1745691621990638
Heyes, C., Chater, N., and Dwyer, D. M. (2020). Sinking in: the peripheral Baldwinisation of human cognition. Trends Cogn. Sci. 24, 884–899. doi: 10.1016/j.tics.2020.08.006
Hickok, G., and Poeppel, D. (2004). Dorsal and ventral streams: a framework for understanding aspects of the functional anatomy of language. Cognition 92, 67–99. doi: 10.1016/j.cognition.2003.10.011
Hickok, G., and Poeppel, D. (2007). The cortical organization of speech processing. Nat. Rev. Neurosci. 8, 393–402. doi: 10.1038/nrn2113
Hickok, G., Rogalsky, C., Matchin, W., Basilakos, A., Cai, J., Pillay, S., et al. (2018). Neural networks supporting audiovisual integration for speech: a large-scale lesion study. Cortex 103, 360–371. doi: 10.1016/j.cortex.2018.03.030
Hickok, G., Venezia, J., and Teghipco, A. (2021). Beyond Broca: neural architecture and evolution of a dual motor speech coordination system. PsyArXiv [Preprints]. doi: 10.31234/osf.io/tewna
Iacoboni, M. (2009). Imitation, empathy, and mirror neurons. Annu. Rev. Psychol. 60, 653–670. doi: 10.1146/annurev.psych.60.110707.163604
Iacoboni, M., and Dapretto, M. (2006). The mirror neuron system and the consequences of its dysfunction. Nat. Rev. Neurosci. 7, 942–951. doi: 10.1038/nrn2024
Jones, S. (2017). Can newborn infants imitate? Wiley Interdisciplinary Rev. Cogn. Sci. 8:e1410. doi: 10.1002/wcs.1410
Jones, S. S. (1996). Imitation or exploration? Young infants' matching of adults' oral gestures. Child Dev. 67, 1952–1969. doi: 10.2307/1131603
Jones, S. S. (2006). Exploration or imitation? The effect of music on 4-week-old infants' tongue protrusions. Infant Behav. Dev. 29, 126–130. doi: 10.1016/j.infbeh.2005.08.004
Kaas, J. H., and Hackett, T. A. (1999). 'What' and 'where' processing in auditory cortex. Nat. Neurosci. 2, 1045–1047. doi: 10.1038/15967
Kayser, C., Logothetis, N. K., and Panzeri, S. (2010). Visual enhancement of the information representation in auditory cortex. Curr. Biol. 20, 19–24. doi: 10.1016/j.cub.2009.10.068
Kelly, D. J., Quinn, P. C., Slater, A. M., Lee, K., Ge, L., and Pascalis, O. (2007). The other-race effect develops during infancy: evidence of perceptual narrowing. Psychol. Sci. 18, 1084–1089. doi: 10.1111/j.1467-9280.2007.02029.x
Kennedy-Costantini, S., Oostenbroek, J., Suddendorf, T., Nielsen, M., Redshaw, J., Davis, J., et al. (2017). There is no compelling evidence that human neonates imitate. Behav. Brain Sci. 40:e392. doi: 10.1017/S0140525X16001898
Keven, N., and Akins, K. A. (2017). Neonatal imitation in context: sensorimotor development in the perinatal period. Behav. Brain Sci. 40:e381. doi: 10.1017/S0140525X16000911
Khandhadia, A. P., Murphy, A. P., Romanski, L. M., Bizley, J. K., and Leopold, D. A. (2021). Audiovisual integration in macaque face patch neurons. Curr. Biol. 31, 1826–1835. doi: 10.1016/j.cub.2021.01.102
Kilner, J. M., Friston, K. J., and Frith, C. D. (2007). Predictive coding: an account of the mirror neuron system. Cogn. Proc. 8, 159–166. doi: 10.1007/s10339-007-0170-2
Krasotkina, A., Götz, A., Höhle, B., and Schwarzer, G. (2021). Perceptual narrowing in face- and speech-perception domains in infancy: a longitudinal approach. Infant Behav. Dev. 64:101607. doi: 10.1016/j.infbeh.2021.101607
Kuhl, P. K., Stevens, E., Hayashi, A., Deguchi, T., Kiritani, S., and Iverson, P. (2006). Infants show a facilitation effect for native language phonetic perception between 6 and 12 months. Dev. Sci. 9, F13–F21. doi: 10.1111/j.1467-7687.2006.00468.x
Kuhl, P. K., Tsao, F. M., and Liu, H. M. (2003). Foreign-language experience in infancy: effects of short-term exposure and social interaction on phonetic learning. Proc. Natl. Acad. Sci. U S A. 100, 9096–9101. doi: 10.1073/pnas.1532872100
Kumar, V., Croxson, P. L., and Simonyan, K. (2016). Structural organization of the laryngeal motor cortical network and its implication for evolution of speech production. J. Neurosci. 36, 4170–4181. doi: 10.1523/JNEUROSCI.3914-15.2016
Kurjak, A., Stanojevic, M., Andonotopo, W., Salihagic-Kadic, A., Carrera, J. M., and Azumendi, G. (2004). Behavioral pattern continuity from prenatal to postnatal life: a study by four-dimensional (4D) ultrasonography. J. Perinat. Med. 32, 346–353. doi: 10.1515/JPM.2004.065
Oostenbroek, J., Suddendorf, T., Nielsen, M., Redshaw, J., Kennedy-Costantini, S., Davis, J., et al. (2016). Comprehensive longitudinal study challenges the existence of neonatal imitation in humans. Curr. Biol. 26, 1334–1338. doi: 10.1016/j.cub.2016.03.047
Oztop, E., and Arbib, M. A. (2002). Schema design and implementation of the grasp-related mirror neuron system. Biol. Cybern. 87, 116–140. doi: 10.1007/s00422-002-0318-1
Park, H., Ince, R. A., Schyns, P. G., Thut, G., and Gross, J. (2018). Representational interactions during audiovisual speech entrainment: redundancy in left posterior superior temporal gyrus and synergy in left motor cortex. PLoS Biol. 16:e2006558. doi: 10.1371/journal.pbio.2006558
Park, H., Kayser, C., Thut, G., and Gross, J. (2016). Lip movements entrain the observers' low-frequency brain oscillations to facilitate speech intelligibility. eLife 5:e14521. doi: 10.7554/eLife.14521
Pascalis, O., De Haan, M., and Nelson, C. A. (2002). Is face processing species-specific during the first year of life? Science 296, 1321–1323. doi: 10.1126/science.1070223
Pereira, A. S., Kavanagh, E., Hobaiter, C., Slocombe, K. E., and Lameira, A. R. (2020). Chimpanzee lip-smacks confirm primate continuity for speech-rhythm evolution. Biol. Lett. 16:20200232. doi: 10.1098/rsbl.2020.0232
Perrodin, C., Kayser, C., Logothetis, N. K., and Petkov, C. I. (2014). Auditory and visual modulation of temporal lobe neurons in voice-sensitive and association cortices. J. Neurosci. 34, 2524–2537. doi: 10.1523/JNEUROSCI.2805-13.2014
Petitto, L. A., Berens, M. S., Kovelman, I., Dubins, M. H., Jasinska, K., and Shalinsky, M. (2012). The "Perceptual Wedge Hypothesis" as the basis for bilingual babies' phonetic processing advantage: new insights from fNIRS brain imaging. Brain Lang. 121, 130–143. doi: 10.1016/j.bandl.2011.05.003
Petrides, M. (2005). Lateral prefrontal cortex: architectonic and functional organization. Philos. Trans. R. Soc. B: Biol. Sci. 360, 781–795. doi: 10.1098/rstb.2005.1631
Petrides, M., and Pandya, D. N. (2002). Comparative cytoarchitectonic analysis of the human and the macaque ventrolateral prefrontal cortex and corticocortical connection patterns in the monkey. Eur. J. Neurosci. 16, 291–310. doi: 10.1046/j.1460-9568.2001.02090.x
Petrides, M., Cadoret, G., and Mackey, S. (2005). Orofacial somatomotor responses in the macaque monkey homologue of Broca's area. Nature 435, 1235–1238. doi: 10.1038/nature03628
Pitcher, D., and Ungerleider, L. G. (2020). Evidence for a third visual pathway specialized for social perception. Trends Cogn. Sci. 25, 100–110. doi: 10.1016/j.tics.2020.11.006
Poeppel, D., and Assaneo, M. F. (2020). Speech rhythms and their neural foundations. Nat. Rev. Neurosci. 21, 322–334. doi: 10.1038/s41583-020-0304-4
Pons, F., Bosch, L., and Lewkowicz, D. J. (2015). Bilingualism modulates infants' selective attention to the mouth of a talking face. Psychol. Sci. 26, 490–498. doi: 10.1177/0956797614568320
Pons, F., Lewkowicz, D. J., Soto-Faraco, S., and Sebastián-Gallés, N. (2009). Narrowing of intersensory speech perception in infancy. Proc. Natl. Acad. Sci. U S A. 106, 10598–10602. doi: 10.1073/pnas.0904134106
Puce, A., Allison, T., Bentin, S., Gore, J. C., and McCarthy, G. (1998). Temporal cortex activation in humans viewing eye and mouth movements. J. Neurosci. 18, 2188–2199. doi: 10.1523/jneurosci.18-06-02188.1998
Ray, E., and Heyes, C. (2011). Imitation in infancy: the wealth of the stimulus. Dev. Sci. 14, 92–105. doi: 10.1111/j.1467-7687.2010.00961.x
Reader, A. T., Royce, B. P., Marsh, J. E., Chivers, K. J., and Holmes, N. P. (2018). Repetitive transcranial magnetic stimulation reveals a role for the left inferior parietal lobule in matching observed kinematics during imitation. Eur. J. Neurosci. 47, 918–928. doi: 10.1111/ejn.13886
Redshaw, J. (2019). Re-analysis of data reveals no evidence for neonatal imitation in rhesus macaques. Biol. Lett. 15:20190342. doi: 10.1098/rsbl.2019.0342
Rennig, J., and Beauchamp, M. S. (2018). Free viewing of talking faces reveals mouth and eye preferring regions of the human superior temporal sulcus. Neuroimage 183, 25–36. doi: 10.1016/j.neuroimage.2018.08.008
Restle, J., Murakami, T., and Ziemann, U. (2012). Facilitation of speech repetition accuracy by theta burst stimulation of the left posterior inferior frontal gyrus. Neuropsychologia 50, 2026–2031. doi: 10.1016/j.neuropsychologia.2012.05.001
Risueno-Segovia, C., and Hage, S. R. (2020). Theta synchronization of phonatory and articulatory systems in marmoset monkey vocal production. Curr. Biol. 30, 4276–4283. doi: 10.1016/j.cub.2020.08.019
Lewkowicz, D. J., and Ghazanfar, A. A. (2009). The emergence of multisensory systems through perceptual narrowing. Trends Cogn. Sci. 13, 470–478. doi: 10.1016/j.tics.2009.08.004
Lewkowicz, D. J., and Hansen-Tift, A. M. (2012). Infants deploy selective attention to the mouth of a talking face when learning speech. Proc. Natl. Acad. Sci. U S A. 109, 1431–1436. doi: 10.1073/pnas.1114783109
Loh, K. K., Petrides, M., Hopkins, W. D., Procyk, E., and Amiez, C. (2017). Cognitive control of vocalizations in the primate ventrolateral-dorsomedial frontal (VLF-DMF) brain network. Neurosci. Biobehav. Rev. 82, 32–44. doi: 10.1016/j.neubiorev.2016.12.001
Loh, K. K., Procyk, E., Neveu, R., Lamberton, F., Hopkins, W. D., Petrides, M., et al. (2020). Cognitive control of orofacial motor and vocal responses in the ventrolateral and dorsomedial human frontal cortex. Proc. Natl. Acad. Sci. U S A. 117, 4994–5005. doi: 10.1073/pnas.1916459117
Maffei, V., Indovina, I., Mazzarella, E., Giusti, M. A., Macaluso, E., Lacquaniti, F., et al. (2020). Sensitivity of occipito-temporal cortex, premotor and Broca's areas to visible speech gestures in a familiar language. PLoS One 15:e0234695. doi: 10.1371/journal.pone.0234695
Mayer, C., Roewer-Despres, F., Stavness, I., and Gick, B. (2017). Do innate stereotypies serve as a basis for swallowing and learned speech movements? Behav. Brain Sci. 40:e395. doi: 10.1017/S0140525X16001928
Mégevand, P., Mercier, M. R., Groppe, D. M., Golumbic, E. Z., Mesgarani, N., Beauchamp, M. S., et al. (2020). Crossmodal phase reset and evoked responses provide complementary mechanisms for the influence of visual speech in auditory cortex. J. Neurosci. 40, 8530–8542. doi: 10.1523/JNEUROSCI.0555-20.2020
Meltzoff, A. N. (1988). Imitation, objects, tools, and the rudiments of language in human ontogeny. Hum. Evol. 3, 45–64. doi: 10.1007/BF02436590
Meltzoff, A. N., and Moore, M. K. (1977). Imitation of facial and manual gestures by human neonates. Science 198, 75–78. doi: 10.1126/science.198.4312.75
Meltzoff, A. N., and Moore, M. K. (1997). Explaining facial imitation: a theoretical model. Infant Child Dev. 6, 179–192. doi: 10.1002/(SICI)1099-0917(199709/12)6:3/4<179::AID-EDP157>3.0.CO;2-R
Meltzoff, A. N., Murray, L., Simpson, E., Heimann, M., Nagy, E., Nadel, J., et al. (2018). Re-examination of Oostenbroek et al. (2016): evidence for neonatal imitation of tongue protrusion. Dev. Sci. 21:e12609. doi: 10.1111/desc.12609
Meltzoff, A. N., Murray, L., Simpson, E., Heimann, M., Nagy, E., Nadel, J., et al. (2019). Eliciting imitation in early infancy. Dev. Sci. 22:e12738. doi: 10.1111/desc.12738
Metzger, B. A., Magnotti, J. F., Wang, Z., Nesbitt, E., Karas, P. J., Yoshor, D., et al. (2020). Responses to visual speech in human posterior superior temporal gyrus examined with iEEG deconvolution. J. Neurosci. 40, 6938–6948. doi: 10.1523/JNEUROSCI.0279-20.2020
Michon, M., Boncompte, G., and López, V. (2020). Electrophysiological dynamics of visual speech processing and the role of orofacial effectors for cross-modal predictions. Front. Hum. Neurosci. 14:538619. doi: 10.3389/fnhum.2020.538619
Michon, M., López, V., and Aboitiz, F. (2019). Origin and evolution of human speech: emergence from a trimodal auditory, visual and vocal network. Prog. Brain Res. 250, 345–371. doi: 10.1016/bs.pbr.2019.01.005
Morrill, R. J., Paukner, A., Ferrari, P. F., and Ghazanfar, A. A. (2012). Monkey lip-smacking develops like the human speech rhythm. Dev. Sci. 15, 557–568. doi: 10.1111/j.1467-7687.2012.01149.x
Myowa-Yamakoshi, M., Tomonaga, M., Tanaka, M., and Matsuzawa, T. (2004). Imitation in neonatal chimpanzees (Pan troglodytes). Dev. Sci. 7, 437–442. doi: 10.1111/j.1467-7687.2004.00364.x
Neef, N. E., Primaßin, A., Gudenberg, A. W. V., Dechent, P., Riedel, H. C., Paulus, W., et al. (2021). Two cortical representations of voice control are differentially involved in speech fluency. Brain Commun. 3:fcaa232. doi: 10.1093/braincomms/fcaa232
Neubert, F. X., Mars, R. B., Thomas, A. G., Sallet, J., and Rushworth, M. F. (2014). Comparison of human ventral frontal cortex areas for cognitive control and language with areas in monkey frontal cortex. Neuron 81, 700–713. doi: 10.1016/j.neuron.2013.11.012
Oostenbroek, J., Redshaw, J., Davis, J., Kennedy-Costantini, S., Nielsen, M., Slaughter, V., et al. (2018). Re-evaluating the neonatal imitation hypothesis. Dev. Sci. 22:e12720. doi: 10.1111/desc.12720
Tsang, T., Atagi, N., and Johnson, S. P. (2018). Selective attention to the mouth is associated with expressive language skills in monolingual and bilingual infants. J. Exp. Child Psychol. 169, 93–109. doi: 10.1016/j.jecp.2018.01.002
Ungerleider, L. G., and Mishkin, M. (1982). "Two cortical visual systems," in Analysis of Visual Behavior, ed. D. J. Ingle (Cambridge, MA: MIT Press).
Van Wassenhove, V., Grant, K. W., and Poeppel, D. (2005). Visual speech speeds up the neural processing of auditory speech. Proc. Natl. Acad. Sci. U S A. 102, 1181–1186. doi: 10.1073/pnas.0408949102
Wang, A., Payne, C., Moss, S., Jones, W. R., and Bachevalier, J. (2020). Early developmental changes in visual social engagement in infant rhesus monkeys. Dev. Cogn. Neurosci. 43:100778. doi: 10.1016/j.dcn.2020.100778
Weatherhead, D., Arredondo, M. M., Nácar Garcia, L., and Werker, J. F. (2021). The role of audiovisual speech in fast-mapping and novel word retention in monolingual and bilingual 24-month-olds. Brain Sci. 11:114. doi: 10.3390/brainsci11010114
Weikum, W. M., Vouloumanos, A., Navarra, J., Soto-Faraco, S., Sebastián-Gallés, N., and Werker, J. F. (2007). Visual language discrimination in infancy. Science 316, 1159–1159. doi: 10.1126/science.1137686
Werker, J. F., and Hensch, T. K. (2015). Critical periods in speech perception: new directions. Annu. Rev. Psychol. 66, 173–196. doi: 10.1146/annurev-psych-010814-015104
Werker, J. F., and Tees, R. C. (1984). Cross-language speech perception: evidence for perceptual reorganization during the first year of life. Infant Behav. Dev. 7, 49–63.
Zentall, T. R. (2012). Perspectives on observational learning in animals. J. Comp. Psychol. 126:114. doi: 10.1037/a0025381
Zhang, H., Japee, S., Stacy, A., Flessert, M., and Ungerleider, L. G. (2020). Anterior superior temporal sulcus is specialized for non-rigid facial motion in both monkeys and humans. Neuroimage 218:116878. doi: 10.1016/j.neuroimage.2020.116878
Zhu, L. L., and Beauchamp, M. S. (2017). Mouth and voice: a relationship between visual and auditory preference in the human superior temporal sulcus. J. Neurosci. 37, 2697–2708. doi: 10.1523/JNEUROSCI.2914-16.2017
Zoefel, B. (2021). Visual speech cues recruit neural oscillations to optimize auditory perception: ways forward for research on human communication. Curr. Res. Neurobiol. 2:100015. doi: 10.1016/j.crneur.2021.100015
Rizzolatti, G., and Craighero, L. (2004). The mirror-neuron system. Annu. Rev. Neurosci. 27, 169–192.
Rizzolatti, G., and Fogassi, L. (2016). "Evolution of mirror neuron mechanism in primates," in Evolution of Nervous Systems, ed. J. Kaas (Cambridge, MA: Academic Press).
Rizzolatti, G., and Sinigaglia, C. (2008). Mirrors in the Brain: How our Minds Share Actions and Emotions. Oxford: Oxford University Press.
Rocchi, F., Oya, H., Balezeau, F., Billig, A. J., Kocsis, Z., Jenison, R. L., et al. (2021). Common fronto-temporal effective connectivity in humans and monkeys. Neuron 109, 852–868. doi: 10.1016/j.neuron.2020.12.026
Romanski, L. M. (2007). Representation and integration of auditory and visual stimuli in the primate ventral lateral prefrontal cortex. Cereb. Cortex 17(Suppl. 1), i61–i69. doi: 10.1093/cercor/bhm099
Romanski, L. M. (2012). Integration of faces and vocalizations in ventral prefrontal cortex: implications for the evolution of audiovisual speech. Proc. Natl. Acad. Sci. U S A. 109(Suppl. 1), 10717–10724. doi: 10.1073/pnas.1204335109
Romanski, L. M., Tian, B., Fritz, J., Mishkin, M., Goldman-Rakic, P. S., and Rauschecker, J. P. (1999). Dual streams of auditory afferents target multiple domains in the primate prefrontal cortex. Nat. Neurosci. 2, 1131–1136. doi: 10.1038/16056
Scholes, C., Skipper, J. I., and Johnston, A. (2020). The interrelationship between the face and vocal tract configuration during audiovisual speech. Proc. Natl. Acad. Sci. U S A. 117, 32791–32798. doi: 10.1073/pnas.2006192117
Sebastián-Gallés, N., Albareda-Castellot, B., Weikum, W. M., and Werker, J. F. (2012). A bilingual advantage in visual language discrimination in infancy. Psychol. Sci. 23, 994–999. doi: 10.1177/0956797612436817
Shepherd, S. V., and Freiwald, W. A. (2018). Functional networks for social communication in the macaque monkey. Neuron 99, 413–420. doi: 10.1016/j.neuron.2018.06.027
Simonyan, K. (2014). The laryngeal motor cortex: its organization and connectivity. Curr. Opin. Neurobiol. 28, 15–21. doi: 10.1016/j.conb.2014.05.006
Simonyan, K., and Horwitz, B. (2011). Laryngeal motor cortex and control of speech in humans. Neuroscientist 17, 197–208. doi: 10.1177/1073858410386727
Simpson, E. A., Murray, L., Paukner, A., and Ferrari, P. F. (2014). The mirror neuron system as revealed through neonatal imitation: presence from birth, predictive power and evidence of plasticity. Philos. Trans. R. Soc. B: Biol. Sci. 369:20130289. doi: 10.1098/rstb.2013.0289
Slaughter, V. (2021). Do newborns have the ability to imitate? Trends Cogn. Sci. 25, 377–387. doi: 10.1016/j.tics.2021.02.006
Subiaul, F. (2010). Dissecting the imitation faculty: the multiple imitation mechanisms (MIM) hypothesis. Behav. Proc. 83, 222–234. doi: 10.1016/j.beproc.2009.12.002
Sugihara, T., Diltz, M. D., Averbeck, B. B., and Romanski, L. M. (2006). Integration of auditory and visual communication information in the primate ventrolateral prefrontal cortex. J. Neurosci. 26, 11138–11147. doi: 10.1523/JNEUROSCI.3550-06.2006
Sumby, W. H., and Pollack, I. (1954). Visual contribution to speech intelligibility in noise. J. Acoust. Soc. Am. 26, 212–215. doi: 10.1121/1.1907309
Takahashi, D. Y., Liao, D. A., and Ghazanfar, A. A. (2017). Vocal learning via social reinforcement by infant marmoset monkeys. Curr. Biol. 27, 1844–1852. doi: 10.1016/j.cub.2017.05.004

Conflict of Interest: The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher's Note: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Copyright © 2022 Michon, Zamorano-Abramson and Aboitiz. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.