We exist in a complex, dynamically changing sensory environment. Vertebrates, including humans, have evolved sensory organs that transduce relevant sources of physical information, such as light and changes in air pressure, into patterns of neural activity that support perception (vision and audition) and adaptive behaviour. Such activity patterns are noisy and often ambiguous, due to a combination of external (environmental) and internal (transduction) factors. Critically, information from the different sensory modalities can be highly correlated because it is often elicited by a common external source or event. For example, the sight and sound of a hammer hitting a nail produce a single, unified perceptual experience, as does the sight of a person’s lips moving as we hear their voice. To improve the reliability of neural representations, the brain leverages these sensory relationships by combining information in a process referred to as multisensory integration. Such processes enhance perception, e.g., by making it easier to understand a person’s speech in a noisy setting by watching their lip movements (Sumby and Pollack, 1954).

Multisensory integration of audiovisual cues improves performance across a range of behavioural outcomes, including detection accuracy (Bolognini et al., 2005; Frassinetti et al., 2002; Lovelace et al., 2003), response speed (Arieh and Marks, 2008; Cappe et al., 2009; Colonius and Diederich, 2004; Rach and Diederich, 2006; Senkowski et al., 2011), and saccade speed and accuracy (Corneil et al., 2002; Van Wanrooij et al., 2009). Successful integration requires the constituent stimuli to occur at approximately the same place and time (Leone and McCourt, 2015). The degree to which behavioural performance is improved follows the principles of maximum likelihood estimation (MLE), wherein sensory information from each modality is weighted and integrated according to its relative reliability (Alais and Burr, 2004; Ernst and Banks, 2002; although other processing schemes have also been identified; Rideaux and Welchman, 2018). As such, behavioural performance that matches MLE predictions is often seen as a benchmark of successful, optimal integration of relevant unisensory cues.
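The MLE scheme described above can be sketched numerically. In this minimal illustration, each cue is weighted by its reliability (inverse variance), so the fused estimate is more precise than either cue alone; the stimulus values and noise levels below are illustrative assumptions, not data from the cited studies.

```python
def mle_combine(est_a, var_a, est_v, var_v):
    """Reliability-weighted (MLE) fusion of two unisensory estimates.

    Each cue is weighted by its reliability (1 / variance); the fused
    estimate has lower variance than either unisensory estimate.
    """
    w_a = (1 / var_a) / (1 / var_a + 1 / var_v)
    w_v = 1 - w_a
    est_av = w_a * est_a + w_v * est_v
    var_av = (var_a * var_v) / (var_a + var_v)
    return est_av, var_av

# Hypothetical example: a reliable visual cue and a noisier auditory cue
# reporting the azimuth (in degrees) of a common audiovisual event.
est_av, var_av = mle_combine(est_a=10.0, var_a=4.0, est_v=8.0, var_v=1.0)
# The fused estimate (8.4) lies closer to the more reliable visual cue,
# and the fused variance (0.8) is below both unisensory variances.
```

Note that this combination rule is linear in the unisensory estimates, a point that becomes important when comparing behavioural and neural definitions of super-additivity below.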

The ubiquity of behavioural enhancements for audiovisual stimuli suggests there are fundamental neural mechanisms that facilitate improved precision. Recordings from single multisensory (audiovisual) neurons within cat superior colliculus have revealed the principle of inverse effectiveness, whereby the relative enhancement of the response to audiovisual stimuli is greater when the constituent unisensory stimuli are only weakly effective (Corneil et al., 2002; Meredith and Stein, 1983). Depending on the intensity of the integrated stimuli, the neural response can be super-additive (the multisensory response is greater than the sum of the unisensory responses), additive (equal to the sum of the unisensory responses), or sub-additive (less than the sum of the unisensory responses; see Stein and Stanford, 2008). Inverse effectiveness has also been observed in human behavioural experiments, with low-intensity audiovisual stimuli eliciting greater multisensory enhancements in response precision than those of high intensity (Colonius and Diederich, 2004; Corneil et al., 2002; Rach and Diederich, 2006; Rach et al., 2011).
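The additivity criterion amounts to comparing the multisensory response against the sum of the unisensory responses. The sketch below makes this explicit; the spike counts and the tolerance band around strict additivity are hypothetical, chosen purely for illustration.

```python
def classify_additivity(resp_av, resp_a, resp_v, tol=0.05):
    """Label a multisensory response as super-additive, additive, or
    sub-additive relative to the summed unisensory responses.

    `tol` is an illustrative tolerance band around strict additivity;
    real analyses would use a statistical test instead.
    """
    total = resp_a + resp_v
    if resp_av > total * (1 + tol):
        return "super-additive"
    if resp_av < total * (1 - tol):
        return "sub-additive"
    return "additive"

# Hypothetical spike counts: weak unisensory responses combining into a
# response that exceeds their sum (the inverse-effectiveness regime).
label = classify_additivity(resp_av=30, resp_a=8, resp_v=10)
# label == "super-additive"
```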

Neuroimaging methods, such as electroencephalography (EEG) and functional magnetic resonance imaging (fMRI), have been used to investigate neural population-level audiovisual integration in humans. These studies have typically applied an additive criterion to quantify multisensory integration, wherein successful integration is marked by a non-linear enhancement of audiovisual responses relative to unisensory responses (Besle et al., 2004). The super- or sub-additive nature of this enhancement, however, is often inconsistent. In fMRI, neural super-additivity in blood-oxygen-level-dependent (BOLD) responses to audiovisual stimuli has been found in a variety of regions, primarily the superior temporal sulcus (STS; Calvert et al., 2000; Calvert et al., 2001; Stevenson et al., 2007; Stevenson and James, 2009; Werner and Noppeney, 2010; Werner and Noppeney, 2011). However, other studies have failed to replicate audiovisual super-additivity in the STS (Joassin et al., 2011; Porada et al., 2021; Ross et al., 2022; Venezia et al., 2015) or have found sub-additive responses (see Scheliga et al., 2023, for review). As such, some have argued that BOLD responses are not sensitive enough to adequately characterise super-additive audiovisual interactions within populations of neurons (Beauchamp, 2005; James et al., 2012; Laurienti et al., 2005). In EEG, meanwhile, the evoked response to an audiovisual stimulus typically conforms to a sub-additive principle (Cappe et al., 2010; Fort et al., 2002; Giard and Peronnet, 1999; Murray et al., 2016; Puce et al., 2007; Stekelenburg and Vroomen, 2007; Teder-Sälejärvi et al., 2002; Vroomen and Stekelenburg, 2010). However, when the principle of inverse effectiveness is considered and relatively weak stimuli are presented together, there has been some evidence for super-additive responses (Senkowski et al., 2011).

It is important to consider the differences in how super-additivity is classified between neural and behavioural measures. At the level of single neurons, super-additivity is defined as a non-linear response enhancement, with the multisensory response exceeding the sum of the unisensory responses. In behaviour, meanwhile, it has been observed that the performance improvement from combining two senses is close to what is expected from optimal integration of information across the senses (Alais and Burr, 2004; Stanford and Stein, 2007). Critically, behavioural enhancement of this kind does not require non-linearity in the neural response, but can arise from a reliability-weighted average of sensory information. In short, behavioural performance that conforms to MLE is not necessarily indicative of neural super-additivity, and the MLE model can be considered a linear baseline for multisensory integration.

While behavioural outcomes for multisensory stimuli can be predicted by MLE, and single neuron responses follow the principles of inverse effectiveness and super-additivity, among others (Rideaux et al., 2021), how audiovisual super-additivity manifests within populations of neurons is comparatively unclear given the mixed findings from relevant fMRI and EEG studies. This uncertainty may be due to biophysical limitations of human neuroimaging techniques, but it may also be related to the analytic approaches used to study these recordings. For instance, super-additive responses to audiovisual stimuli in EEG studies are often reported from very small electrode clusters (Molholm et al., 2002; Senkowski et al., 2011; Talsma et al., 2007), suggesting that neural super-additivity in humans may be highly specific. However, information encoded by the brain can be represented as increased activity in some areas, accompanied by decreased activity in others, so simplifying complex neural responses to the average rise and fall of activity in specific sensors may obscure relevant multivariate patterns of activity evoked by a stimulus.

Inverted encoding is a multivariate analytic method that can reveal how sensory information is encoded within the brain by recovering patterns of neural activity associated with different stimulus features. This method has been successfully used in fMRI, EEG, and magnetoencephalography studies to characterise the neural representations of a range of stimulus features, including colour (Brouwer and Heeger, 2009), spatial location (Bednar and Lalor, 2020; Robinson et al., 2021), and orientation (Brouwer and Heeger, 2011; Harrison et al., 2023; Kok et al., 2017). A multivariate approach may capture potential non-linear enhancements associated with audiovisual responses and thus could reveal super-additive interactions that would otherwise be hidden within the brain’s univariate responses. The sensitivity of inverted encoding analyses to multivariate neural patterns may provide insight into how audiovisual information is processed and integrated at the population level.
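The logic of inverted encoding can be sketched with synthetic data: an encoding model maps hypothetical feature-tuned channels to sensor activity, its weights are estimated by least squares, and the model is then inverted to recover channel responses from new sensor data. The channel basis, noise level, and dimensions below are illustrative assumptions, not the parameters of the studies cited above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_chan, n_sensors, n_train = 8, 32, 200
centers = np.linspace(0, 2 * np.pi, n_chan, endpoint=False)

def channel_responses(theta):
    """Idealized circular tuning: half-wave-rectified raised-cosine channels."""
    return np.maximum(0, np.cos(theta[:, None] - centers[None, :])) ** 5

# Synthetic training data: sensor activity = channel responses @ weights + noise.
true_w = rng.normal(size=(n_chan, n_sensors))
theta_train = rng.uniform(0, 2 * np.pi, n_train)
C_train = channel_responses(theta_train)               # trials x channels
B_train = C_train @ true_w + 0.1 * rng.normal(size=(n_train, n_sensors))

# Step 1 (encoding): estimate the weights W by least squares, B ≈ C W.
W_hat, *_ = np.linalg.lstsq(C_train, B_train, rcond=None)

# Step 2 (inversion): recover channel responses for held-out data,
# C_test ≈ B_test Wᵀ (W Wᵀ)⁻¹.
theta_test = np.array([np.pi / 2])
B_test = channel_responses(theta_test) @ true_w
C_hat = B_test @ W_hat.T @ np.linalg.inv(W_hat @ W_hat.T)

# The recovered profile peaks at the channel tuned nearest to pi/2,
# i.e., the model recovers which feature value evoked the sensor pattern.
```

In practice the recovered channel profiles, rather than any single sensor's amplitude, carry the decoding signal, which is why this approach can be sensitive to stimulus information that univariate measures miss.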

In the present study, we investigated neural super-additivity in human audiovisual sensory processing using inverted encoding of EEG responses during a task in which participants spatially localised visual, auditory, and audiovisual stimuli. In a separate behavioural experiment, we monitored response accuracy to characterise behavioural improvements for audiovisual relative to unisensory stimuli. Although there was no evidence for super-additivity in response to audiovisual stimuli within univariate ERPs, we observed a reliable non-linear enhancement of multivariate decoding performance at ~180 ms following stimulus onset when auditory and visual stimuli were presented concurrently rather than alone. These findings suggest that population-level super-additive multisensory neural responses are present within multivariate patterns of activity rather than univariate evoked responses.