This study demonstrates that elementary audiovisual analyses are sufficient to replicate behaviours consistent with multisensory perception in mammals. The proposed image- and sound-computable model, composed of a population of biologically plausible elementary processing units, provides a stimulus-driven framework for multisensory perception that transforms raw audiovisual input into behavioural predictions. In a large-scale simulation spanning 69 audiovisual experiments, seven behavioural tasks, and data from 534 humans, 110 rats, and two monkeys, the model started directly from pixels and audio samples and closely matched observed behaviour across a wide range of phenomena—including multisensory illusions, spatial orienting, and causal inference—with average correlations above 0.97.
We define a stimulus-computable model as one that receives input directly from the stimulus—such as raw images and sound waveforms—rather than from abstracted descriptors like lag, disparity, or reliability. Framed in Marr’s terms, stimulus-computable models operate at the algorithmic level, specifying how sensory information is represented and processed. This contrasts with computational-level models, such as Bayesian ideal observers, which define the goals of perception (e.g. maximizing reliability; Alais and Burr, 2004; Ernst and Banks, 2002) without specifying how those goals are achieved. Rather than competing with such normative accounts, the MCD provides a mechanistic substrate that could plausibly implement them. By operating directly on realistic audiovisual signals, our population model captures the richness of natural sensory input and directly addresses the problem of how biological systems represent and process multisensory information (Burge, 2020). This allows the MCD to generate precise, stimulus-specific predictions across tasks, including subtle differences in behavioural outcomes that arise from the structure of individual stimuli (see Figure 2—figure supplement 1K).
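As a concrete illustration of this interface, the sketch below shows a minimal MCD-like unit in Python; the filter form, the time constants, and the use of rectified luminance differences and the sound envelope are illustrative assumptions rather than the implementation used in this study. The point is only that the unit's inputs are raw stimulus signals (one pixel's luminance over time and the audio amplitude envelope), not abstracted descriptors.

```python
import numpy as np
from scipy.signal import lfilter

def lowpass(x, tau, fs):
    """First-order exponential low-pass filter with time constant tau (s) at sampling rate fs (Hz)."""
    alpha = 1.0 / (1.0 + tau * fs)                  # discrete smoothing factor
    return lfilter([alpha], [1.0, alpha - 1.0], x)

def elementary_unit(pixel_luminance, audio_envelope, fs, tau_v=0.1, tau_a=0.1):
    """Minimal, illustrative MCD-like unit: raw stimulus signals in, correspondence signal out."""
    lum = np.asarray(pixel_luminance, dtype=float)
    env = np.asarray(audio_envelope, dtype=float)
    v = lowpass(np.abs(np.diff(lum, prepend=lum[0])), tau_v, fs)  # visual transients of one pixel
    a = lowpass(env, tau_a, fs)                                   # smoothed sound envelope
    return v * a                                                  # high when audiovisual energy co-varies
```

A behavioural prediction is then read out from a population of such units, rather than from any hand-coded stimulus descriptor.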
The present approach naturally lends itself to generalization and testing against a broad range of tasks, stimuli, and responses—as reflected by the breadth of the experiments simulated here. Among the perceptual effects emerging from elementary signal processing, one notable example is the scaling of subjective audiovisual synchrony with sound source distance (Alais and Carlile, 2005). As sound travels slower than light, humans compensate for audio delays by adjusting subjective synchrony based on the source’s distance scaled by the speed of sound. Although this phenomenon appears to rely on explicit physics modelling, our simulations demonstrate that auditory cues embedded in the envelope (Figure 2B, left) are sufficient to scale subjective audiovisual synchrony. In a similar fashion, our simulations show that phenomena such as the McGurk illusion, the subjective timing of natural audiovisual stimuli, and saliency detection may emerge from elementary operations performed at the pixel level, bypassing the need for more sophisticated analyses such as image segmentation, lip or face tracking, or 3D reconstruction (Chandrasekaran et al., 2009). Elementary, general-purpose operations on natural stimuli can drive complex behaviour, sometimes even in the absence of advanced perceptual and cognitive contributions. Indeed, it is intriguing that a population of MCDs, a computational architecture originally proposed for motion vision in insects, can predict speech illusions in humans.
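For reference, the physical delay that subjective synchrony must absorb is simply the source distance divided by the speed of sound, light travel time being negligible; the snippet below gives its order of magnitude.

```python
SPEED_OF_SOUND = 343.0  # m/s in air at ~20 degrees C

def physical_audio_lag_ms(distance_m):
    """Delay of sound relative to light from a source at distance_m, in milliseconds."""
    return 1000.0 * distance_m / SPEED_OF_SOUND

for d in (1, 5, 10, 20, 40):
    print(f"{d:>3} m -> audio lags light by {physical_audio_lag_ms(d):6.1f} ms")
# e.g. 1 m ~ 2.9 ms, 10 m ~ 29.2 ms, 40 m ~ 116.6 ms
```

In the simulations, this distance-dependent lag is recovered from cues embedded in the sound envelope rather than from any explicit estimate of distance.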
The fact that identical low-level analyses can account for all of the 69 experiments simulated here directly addresses several open questions in multisensory research. For instance, psychometric functions for speech and non-speech stimuli often differ significantly (Vatakis et al., 2008). This has been interpreted as evidence that speech may be special and processed via dedicated mechanisms (Tuomainen et al., 2005). However, identical low-level analyses are sufficient to account for all observed responses, regardless of the stimulus type (Figure 2, Figure 2—figure supplements 1 and 2). This suggests that most of the differences in psychometric curves across classes of stimuli (e.g. speech vs. non-speech vs. clicks-&-flashes) are due to the low-level features of the stimuli themselves, not how the brain processes them. Similarly, experience and expertise also modulate multisensory perception. For example, audiovisual simultaneity judgments differ significantly between musicians and non-musicians (Lee and Noppeney, 2011) (see Figure 2—figure supplement 1C). Likewise, the McGurk illusion (Freeman et al., 2013) and subjective audiovisual timing (Petrini et al., 2009) vary over the lifespan in humans, and following pharmacological interventions in rats (Al Youzbaki et al., 2023; Schormans and Allman, 2023) (see Figure 2—figure supplement 1E and J and Figure 2—figure supplement 2F-G). Our simulations show that adjustments at the decision-making level are sufficient to account for these effects, without requiring structural or parametric changes to low-level perceptual processing across observers or conditions.
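To illustrate this division of labour, the hypothetical sketch below keeps the perceptual readout fixed and lets only a decision stage differ between groups; the logistic form and all parameter values are assumptions chosen for illustration, not the decision models fitted in the simulations.

```python
import numpy as np

def decision_stage(mcd_evidence, criterion, slope):
    """Map a fixed perceptual readout (e.g. an MCD output) onto a response probability."""
    return 1.0 / (1.0 + np.exp(-slope * (np.asarray(mcd_evidence) - criterion)))

evidence = np.linspace(0.0, 1.0, 7)                   # same perceptual output for both groups
p_group_a = decision_stage(evidence, criterion=0.40, slope=12.0)  # e.g. musicians (hypothetical values)
p_group_b = decision_stage(evidence, criterion=0.55, slope=8.0)   # e.g. non-musicians (hypothetical values)
```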
Although the same model explains responses to multisensory stimuli in humans, rats, and monkeys, the temporal constants vary across species. For example, the model for rats is tuned to temporal frequencies over four times higher than those for humans. This not only explains the differential sensitivity of humans and rats to long and short audiovisual lags, but also mirrors analogous interspecies differences in physiological rhythms, such as heart and breathing rates (Agoston, 2017). Previous research has shown that physiological arousal modulates perceptual rhythms within individuals (Legrand et al., 2018). Whether the same association between multisensory temporal tuning and physiological rhythms extends to other mammalian species remains an open question. In contrast, no major differences in the model’s spatial tuning were found between humans and macaques, possibly reflecting the close phylogenetic link between the two species.
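The interspecies difference can be pictured as a rescaling of the temporal constants of otherwise identical filters; the values below are hypothetical and only show how a fourfold shorter time constant corresponds to a roughly fourfold higher preferred temporal frequency.

```python
import math

TAU_HUMAN = 0.20             # s, hypothetical human time constant (not a fitted value)
TAU_RAT = TAU_HUMAN / 4.0    # rats: tuning to ~4x higher temporal frequencies

def cutoff_frequency_hz(tau):
    """Cut-off frequency of a first-order low-pass filter with time constant tau (s)."""
    return 1.0 / (2.0 * math.pi * tau)

print(f"human: {cutoff_frequency_hz(TAU_HUMAN):.1f} Hz, rat: {cutoff_frequency_hz(TAU_RAT):.1f} Hz")
```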
How might these computations be implemented neurally? In a recent study (Pesnot Lerousseau et al., 2022), we identified neural responses in the posterior superior temporal sulcus, superior temporal gyrus, and left superior parietal gyrus that tracked the output of an MCD model during audiovisual temporal tasks. Participants were presented with random sequences of clicks and flashes while performing either a causality judgment or a temporal order judgment task. By applying a time-resolved encoding model to MEG data, we demonstrated that MCD dynamics aligned closely with stimulus-evoked cortical activity. The present study considerably extends the scope of the MCD framework, allowing it to process more naturalistic stimuli and to account for a broader range of behaviours—including cue combination, attentional orienting, and gaze-based decisions. This expansion opens the door to new neurophysiological investigations into the implementation of multisensory integration. For instance, the dynamic, spatially distributed population responses generated by the MCD (see videos) can be directly compared with neural population activity recorded using techniques such as ECoG, Neuropixels, or high-density fMRI—similar to previous efforts that linked the Bayesian Causal Inference model to neural responses during audiovisual spatial integration (Rohe et al., 2019; Aller and Noppeney, 2019; Rohe and Noppeney, 2015). Such comparisons may help bridge algorithmic and implementational levels of analysis, offering concrete hypotheses about how audiovisual correspondence detection and integration are instantiated in the brain.
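In that spirit, a time-resolved encoding analysis can be sketched as a lagged regression of each neural channel on the model output; the ridge regression below is a generic stand-in (with arbitrary lag count and regularisation), not the exact MEG pipeline of the cited study.

```python
import numpy as np
from sklearn.linear_model import Ridge

def lagged_design(model_output, n_lags):
    """Design matrix whose columns are time-lagged copies of the model output."""
    x = np.asarray(model_output, dtype=float)
    X = np.zeros((len(x), n_lags))
    for k in range(n_lags):
        X[k:, k] = x[:len(x) - k]
    return X

def fit_encoding_model(model_output, neural_channel, n_lags=50, alpha=1.0):
    """Ridge regression of one neural time course (e.g. an MEG sensor) on lagged MCD dynamics."""
    X = lagged_design(model_output, n_lags)
    return Ridge(alpha=alpha).fit(X, np.asarray(neural_channel, dtype=float))
```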
An informative outcome of our simulations is the model’s ability to predict spontaneous gaze direction in response to naturalistic audiovisual stimuli. Saliency, the property by which some elements in a display stand out and attract observers’ attention and gaze, is a popular concept in both cognitive and computer sciences (Itti et al., 1998). In computer vision, saliency models are usually complex and rely on advanced signal processing and semantic knowledge—typically with tens of millions of parameters (Chen et al., 2023; Coutrot, 2025). Despite successfully predicting gaze behaviour, current audiovisual saliency models are often computationally expensive, and the resulting maps are hard to interpret and inevitably affected by the datasets used for training (Adebayo et al., 2023). In contrast, our model detects saliency ‘out of the box’, without any free parameters, operating purely at the level of individual pixels. The elementary nature of the operations performed by a population of MCDs returns saliency maps that are easy to interpret: salient points are those with high audiovisual correlation. By grounding multisensory integration and saliency detection in biologically plausible computations, our study offers a new tool for machine perception and robotics to handle multimodal inputs in a more human-like way, while also improving system accountability.
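A stripped-down version of this computation, reduced to a single correlation value per pixel rather than the full dynamic population response, makes the absence of free parameters explicit; the reduction itself is ours and is only meant to convey the principle.

```python
import numpy as np

def av_saliency_map(frames, envelope):
    """
    frames:   (T, H, W) luminance video resampled to the envelope rate
    envelope: (T,) audio amplitude envelope
    Returns an (H, W) saliency map: the correlation between each pixel's
    visual transients and the acoustic transients.
    """
    v = np.abs(np.diff(frames, axis=0))               # visual transients, (T-1, H, W)
    a = np.abs(np.diff(envelope))                     # acoustic transients, (T-1,)
    v = (v - v.mean(axis=0)) / (v.std(axis=0) + 1e-9)
    a = (a - a.mean()) / (a.std() + 1e-9)
    return np.tensordot(a, v, axes=(0, 0)) / len(a)   # Pearson-like correlation per pixel
```

Pixels whose luminance changes track the sound envelope receive high values, which is exactly the sense in which the resulting maps are easy to interpret.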
This framework also provides a solution for self-supervised and unsupervised audiovisual learning in multimodal machine perception. A key challenge when handling raw audiovisual data is solving the causal inference problem—determining whether signals from different modalities are causally related or not (Körding et al., 2007). Models in machine perception often depend on large, labelled datasets for training. In this context, a biomimetic module that handles saliency maps, audiovisual correspondence detection, and multimodal fusion can drive self-supervised learning through simulated observers, thereby reducing the dependency on labelled data (Shahabaz and Sarkar, 2024; Arandjelovic and Zisserman, 2017; Ngiam et al., 2011). Furthermore, the simplicity of our population-based model provides a computationally efficient alternative for real-time multisensory integration in applications such as robotics, AR/VR, and other low-latency systems.
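For example, correspondence scores produced by such a module could serve as pseudo-labels for self-supervised training; the thresholding rule below is purely illustrative and is not part of the model.

```python
import numpy as np

def correspondence_pseudo_labels(correspondence_scores, threshold=None):
    """
    Convert per-clip audiovisual correspondence scores (e.g. mean MCD correlation)
    into pseudo-labels for self-supervised training:
    1 = likely common cause, 0 = likely unrelated.
    """
    scores = np.asarray(correspondence_scores, dtype=float)
    if threshold is None:
        threshold = np.median(scores)                 # illustrative choice of threshold
    return (scores > threshold).astype(int)
```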
Although a population of MCDs can explain when phenomena such as the McGurk illusion occur, it does not explain the process of phoneme categorization that ultimately determines what syllable is perceived (Magnotti and Beauchamp, 2017). More generally, it is well known that cognitive and affective factors modulate our responses to multisensory stimuli (Stein, 2012). In particular, the model does not currently incorporate linguistic mechanisms or top-down predictive processes, which play a central role in audiovisual speech perception—such as the integration of complementary articulatory features, lexical expectations, or syntactic constraints (Campbell, 2008; Peelle and Sommers, 2015; Summerfield, 1987; Tye-Murray et al., 2007). While a purely low-level model does not directly address these issues, the modularity of our approach makes it possible to extend the system to include high-level perceptual, cognitive, and affective factors. What is more, although this study focused on audiovisual integration in mammals, the same approach can be naturally extended to other instances of sensory integration (e.g. visuo- and audio-tactile) and animal classes. A possible extension of the model for trimodal integration is included in Figure 4—figure supplement 1.
Besides simulating behavioural responses, a stimulus-computable approach necessarily makes explicit all the intermediate steps of sensory information processing. This opens the system to inspection at all of its levels, thereby allowing for direct comparisons with neurophysiology (Pesnot Lerousseau et al., 2022). In insect motion vision, this transparency made it possible for the Hassenstein-Reichardt detector to act as a searchlight linking computation, behaviour, and physiology at the scale of individual cells (Serbe et al., 2016). Being based on formally identical computational principles (Parise and Ernst, 2016), the present approach holds the same potential for multisensory perception.