All experimental procedures involving animals were conducted in accordance with local laws and approved by the relevant institutional ethics committees. Approvals were granted by the Animal Welfare Ethical Review Body of University College London, under licences P1DB285D8, PCC4A4ECE and PD867676F, issued by the UK Home Office. Experiments conducted at Princeton University were approved under licence 1876-20 by the Institutional Animal Care and Use Committee (IACUC). At Cold Spring Harbor Laboratory, approvals were granted under licences 1411117 and 19.5 by the institutional IACUC. The University of California at Los Angeles granted approval through IACUC licence 2020-121-TR-00. Additional approvals were obtained from the University Animal Welfare Committee of New York University (licence 18-1502), the IACUC at the University of Washington (licence 4461-01), the IACUC at the University of California, Berkeley (licence AUP-2016-06-8860-1) and the Portuguese Veterinary General Board (DGAV) for experiments conducted at the Champalimaud Foundation (licence 0421/0000/0000/2019).

Animals

Mice were housed under a 12–12-h light–dark cycle (normal or inverted depending on the laboratory) with food and water available ad libitum, except during behavioural training days. Electrophysiological recordings and behavioural training were performed during either the dark or light phase of the cycle depending on the laboratory. The data from n = 139 adult mice (C57BL/6; 94 male and 45 female, obtained from either Jackson Laboratory or Charles River) were used in this study. Mice were aged 13–178 weeks (mean 44.96 weeks, median 27.0 weeks) and weighed 16.1–35.7 g (mean 23.9 g, median 23.84 g) on the day of electrophysiological recordings. We did not attempt to standardize other variables such as temperature, humidity and environmental sound, but we regularly documented and measured them23.

Headbar implant surgery

A detailed account of the surgical methods for the headbar implant is provided in appendix 1 of ref. 23. In brief, mice were anaesthetized with isoflurane and head-fixed in a stereotaxic frame. The fur was then shaved from the scalp, and the scalp and underlying periosteum were removed. Once the skull was exposed, Bregma and Lambda were marked. The head was positioned along the anterior–posterior and left–right axes using stereotaxic coordinates. The headbar was then placed in one of three stereotactically defined locations and cemented (Super-Bond C&B) in place. Future craniotomy positions were marked on the skull relative to Bregma. The exposed skull was then covered with cement and clear UV-curing glue (Norland Optical Adhesives).

Materials and apparatus

For detailed parts lists and installation instructions for the training rigs, see appendix 3 of ref. 23; for the electrophysiology rigs, see appendix 1 of ref. 25.

Each laboratory installed a standardized electrophysiological rig, which differed slightly from the apparatus used during behavioural training23. The structure of the rig was constructed from Thorlabs parts and was placed on an air table (Newport, M-VIS3036-SG2-325A) surrounded by a custom acoustic cabinet. A static headbar fixation clamp and a 3D-printed mouse holder were used to hold a mouse such that its forepaws rested on the steering wheel (86652 and 32019, Lego)23. Silicone tubing controlled by a pinch valve (225P011-21, NResearch) was used to deliver water rewards to the mouse. Visual stimuli were displayed on an LCD screen (LP097QX1, LG). To measure the timing of changes in the visual stimulus, a patch of pixels on the LCD screen flipped between white and black at every stimulus change, and this flip was captured with a photodiode (Bpod Frame2TTL, Sanworks). Ambient temperature, humidity and barometric air pressure were measured using a Bpod Ambient module (Sanworks), and the wheel position was monitored with a rotary encoder (05.2400.1122.1024, Kubler). Videos of mice were recorded from three angles (left, right and body) with USB cameras (CM3-U3-13Y3M-CS, Point Grey) sampling at 60, 150 and 30 Hz, respectively (for details, see appendix 1 of ref. 25). A custom speaker (Hardware Team of the Champalimaud Foundation for the Unknown, v.1.1) was used to play task-related sounds, and an ultrasonic microphone (Ultramic UM200K, Dodotronic) was used to record ambient noise from the rig. All task-related data were coordinated using a Bpod State Machine (Sanworks). The task logic was programmed in Python, and the visual stimulus presentation and video capture were handled by Bonsai93 and the BonVision package94.

Neural recordings were made using Neuropixels probes, either v.1.0 (3A or 3B2, n = 109 and n = 586 insertions, respectively) or v.2.4 (n = 4 insertions) (Imec13), which were advanced into the brain using a micromanipulator (Sensapex, uMp-4). Typically, the probes were tilted at a 15° angle from the vertical line. Data were acquired using an FPGA (for 3A probes) or PXI (for 3B and 1.0 probes, National Instruments) system using SpikeGLX, and stored on a PC.

Habituation, training and experimental protocol

For a detailed protocol on animal training, see the methods in refs. 23,25. In brief, at the beginning of each trial, the mouse was required to not move the wheel for a quiescence period of 400–700 ms. After the quiescence period, a visual stimulus (Gabor patch) appeared on either the left or right (±35° azimuth) of the screen, with a contrast randomly sampled from a predefined set (100, 25, 12.5, 6 or 0%). A 100-ms tone (5-kHz sine wave) was played at stimulus onset. Mice had 60 s to move the wheel and make a response. Stimuli were yoked to the rotation of the response wheel, such that a 1-mm movement of the wheel moved the stimulus by 4 visual degrees. A response was registered if the centre of the stimulus crossed the ±35° azimuth line from its original position. If the mouse correctly moved the stimulus 35° to the centre of the screen, it immediately received a 3-μl reward; if it incorrectly moved the stimulus 35° away from the centre, it received a time out. If the mouse responded incorrectly or failed to reach either threshold within the 60-s window, a white-noise burst was played for 500 ms and the inter-trial interval was set to between 1 and 1.5 s. In trials for which the visual stimulus contrast was set to 0%, the mouse had to respond as for any other trial by turning the wheel in the correct direction (assigned according to the statistics of the prevailing block) to receive a reward, but the mouse was not able to perceive whether the stimulus was presented on the left or right side of the screen. The mouse also received feedback (noise burst or reward) on 0% contrast trials.

Each session started with 90 trials in which the probability of a visual stimulus appearing on the left or right side was equal. Specifically, the 100%, 25%, 12.5% and 6% contrast trials were each presented 10 times on each side, and the 0% contrast was presented 10 times in total (that is, the ratio of the 100, 25, 12.5, 6 and 0% contrasts was set at 2:2:2:2:1). The side (and thus correct movement) for the 0% contrast trials was chosen randomly between the right and left with equal probability. This initial block of 90 trials is referred to as the unbiased block (50:50).

After the unbiased block, trials were presented in biased blocks: in right-bias blocks, stimuli appeared on the right on 80% of the trials, whereas in left-bias blocks, stimuli appeared on the right on 20% of the trials. The ratio of the contrasts remained as above (2:2:2:2:1). Whether the first biased block in a session was left or right was randomly chosen, and blocks were then alternated. The length of a block was drawn from an exponential distribution with scale parameter of 60 trials, but truncated to lie between 20 and 100 trials.
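As a concrete illustration, a minimal sketch (our own, with hypothetical names; not IBL library code) of this generative process is given below. The same process is what generates the 'pseudo-sessions' used for significance testing later in the Methods.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_block_length(scale=60, lo=20, hi=100):
    # Exponential with scale 60 trials, truncated to lie in [20, 100].
    while True:
        n = int(rng.exponential(scale))
        if lo <= n <= hi:
            return n

def generate_session_sides(n_trials=700):
    # 90 unbiased trials (50:50), then alternating 80:20 biased blocks.
    sides = [rng.choice([-1, 1], size=90)]          # -1 = left, +1 = right
    p_right = rng.choice([0.2, 0.8])                # first biased block
    n = 90
    while n < n_trials:
        length = sample_block_length()
        sides.append(rng.choice([-1, 1], size=length,
                                p=[1 - p_right, p_right]))
        p_right = 1 - p_right                       # alternate the bias
        n += length
    return np.concatenate(sides)[:n_trials]
```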

The automated shaping protocol for training23 involved two collections of sessions. In the first collection, the animals started performing a version of the task without biased blocks and were progressively introduced to harder stimuli with weaker contrasts as they became more competent. They also experienced a debiasing protocol, which was intended to dissuade them from persisting with just one of the choices. Once they were performing sufficiently well on all non-zero contrasts, they were faced with the biased blocks. When, in turn, performance on those was adequate (including on 0% contrast trials, which are informed by the block), they graduated to recording. Supplementary Figure 8a shows a joint histogram of the number of sessions the mice took in the first and second collections (these were not correlated). Supplementary Figure 8b shows a joint histogram of the number of sessions the mice took in the second collection and the performance during the recording sessions; these were also not correlated.

Electrophysiological recording using Neuropixels probes

For details on the craniotomy surgery, see appendix 3 of ref. 25. In brief, on the first day of electrophysiological recording, the animal was anaesthetized using isoflurane and surgically prepared. The mouse was subcutaneously administered analgesics (typically carprofen). The UV-cured glue was removed (typically using a biopsy punch (Kai Disposable Biopsy Punches, 1 mm) or a drill), exposing the skull over the planned craniotomy site (or sites). A test was made to check whether the implant could hold liquid; the bath was then grounded through either a loose or an implanted pin. One or two craniotomies (approximately 1 × 1 mm) were made over the marked locations. The dura was left intact, and the brain was lubricated with artificial cerebrospinal fluid. A moisturising sealant was applied over the dura (typically DuraGel (Cambridge NeuroTech) covered with a layer of Kwik-Cast (World Precision Instruments)). The mouse was left to recover in a heating chamber until locomotor and grooming activity had fully recovered.

Mice were head-fixed for recording after a minimum recovery period of 2 h. Once a craniotomy was made, up to four subsequent recording sessions were performed in that same craniotomy. Once the first set of craniotomies had been fully recorded from, a mouse could undergo another craniotomy surgery in accordance with the institutional licence. Up to two probes were inserted in the brain in a given session. CM-DiI (V22888, Thermo Fisher) was used to label probes for subsequent histology analyses.

Serial section two-photon imaging

Mice were given a terminal dose of pentobarbital and perfusion-fixed with PBS followed by 4% formaldehyde solution (Thermo Fisher 28908) in 0.1 M PB, pH 7.4. The whole brain was dissected and post-fixed in the same fixative for a minimum of 24 h at room temperature. Tissue samples were washed and stored in PBS at 4 °C for up to 2–3 weeks before shipment to the Sainsbury Wellcome Centre for image acquisition. For full details, see appendix 5 of ref. 25.

For imaging, brains were equilibrated with 50 mM PB solution and embedded into 5% agarose gel blocks. The brains were imaged by serial section two-photon microscopy29,95. The microscope was controlled with ScanImage Basic (Vidrio Technologies) and BakingTray, a custom software wrapper for setting up the imaging parameters96. Image tiles were assembled into 2D planes using StitchIt97. Whole brain coronal image stacks were acquired at a resolution of 4.4 × 4.4 × 25.0 μm in xyz, with a two-photon laser wavelength of 920 nm and approximately 150 mW at the sample. The microscope cut 50-μm sections but imaged two optical planes in each slice at depths of about 30 μm and 55 μm from the tissue surface. Two channels of image data were simultaneously acquired using multialkali PMTs (green at 525 ± 25 nm; red at 570 nm low pass).

Whole brain images were downsampled to 25-μm isotropic voxels and registered to the adult mouse Allen Common Coordinate Framework6 using BrainRegister98, which is an elastix-based99 registration pipeline with optimized parameters for mouse brain registration. For full details, see appendix 7 of ref. 25.

Probe track tracing and alignment

Neuropixels probe tracks were manually traced to produce a probe trajectory using Lasagna100, a Python-based image viewer equipped with a plugin tailored for this task. Traced probe track data were uploaded to an Alyx server101 (a database designed for experimental neuroscience laboratories). Neuropixels channels were then manually aligned to anatomical features along the trajectory using electrophysiological landmarks with a custom electrophysiology alignment tool102,103. For full details, see appendix 6 of ref. 25.

Spike sorting

The spike-sorting pipeline used at IBL is described in detail in ref. 28. In brief, spike sorting was performed using a modified version of the Kilosort 2.5 algorithm14. We found that it was necessary to improve the original code in several aspects (scalability, reproducibility and stability, as discussed in ref. 25); therefore, we developed an open-source Python port (the code repository is provided in ref. 104).

Inclusion criteria

We applied a set of inclusion criteria to sessions, probes and neurons to ensure data quality. Supplementary Table 1 lists the consequences of these criteria for the number of sessions and probes that passed the criteria.

Sessions and insertions

Each Neuropixels insertion was repeated in at least two laboratories, with reproducibility of outcomes across laboratories verified with extensive analyses that we have previously reported25.

Sessions were included in the data release if the mice performed at least 250 trials, with a performance of at least 90% correct on 100% contrast trials for both left and right blocks, and, to be able to analyse the feedback variable, if there were at least 3 trials with incorrect choices (after applying the trial exclusions below). Furthermore, sessions were included in the release only if they reached threshold on a collection of hardware tests (definitions are available on GitHub (https://int-brain-lab.github.io/iblenv/_autosummary/ibllib.qc.task_metrics.html)).

Insertions were excluded if the neural data failed the whole-recording, visually assessed criteria of the ‘Recording Inclusion metrics and Guidelines for Optimal Reproducibility’ (RIGOR) from ref. 25 by presenting major artefacts (see examples in ref. 28), or if the probe tract could not be recovered during the histology procedure. Furthermore, only insertions for which alignments had been resolved (see appendix 6 of ref. 25 for definitions) were used in this study.

After applying these criteria, a total of 459 sessions, 699 insertions and 621,733 neurons remained, constituting the publicly released dataset.

Trials

For the analyses presented here, trials were excluded if one of the following trial events could not be detected: choice, probabilityLeft, feedbackType, feedback times, stimOn times or firstMovement times. Trials were further excluded if the time between stimulus onset and the first movement of the wheel (the first wheel-movement time) was outside the range of 0.08–2.00 s.
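The exclusion rule amounts to a simple mask over trials; in this sketch the column names are hypothetical (chosen to mirror the event names above) and the trial table is assumed to be a pandas DataFrame.

```python
import pandas as pd

def include_trials(trials: pd.DataFrame) -> pd.Series:
    # Hypothetical column names for the required trial events.
    events = ['choice', 'probabilityLeft', 'feedbackType',
              'feedback_times', 'stimOn_times', 'firstMovement_times']
    detected = trials[events].notna().all(axis=1)
    rt = trials['firstMovement_times'] - trials['stimOn_times']
    return detected & rt.between(0.08, 2.00)   # first wheel-movement window
```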

Neurons and brain regions

Neurons generated by the spike-sorting pipeline were excluded from the analyses presented here if they failed one of the three criteria described in ref. 28 (the computed single-unit metrics of RIGOR25): amplitude > 50 μV; noise cut-off < 20 μV; and refractory period violation. Neurons that passed these criteria were termed well-isolated neurons (or often just ‘neurons’) in this study. Out of the 621,733 units collected, 75,708 were considered well-isolated neurons. Final analyses were additionally restricted to regions that were designated grey matter in the adult mouse Allen Common Coordinate Framework6, contained at least five well-isolated neurons per session and were recorded from in at least two such sessions.

Video analysis

We briefly describe the video analysis pipeline (full details can be found in ref. 105). The recording rigs contained three cameras: one called ‘left’ at full resolution (1,280 × 1,024) and 60 Hz, filming the mouse from one side; one called ‘right’, filming the mouse symmetrically from the other side at half resolution (640 × 512) and 150 Hz; and one called ‘body’ at half resolution and 30 Hz, filming the body of the mouse from above. We developed several quality-control metrics to detect raw video issues such as poor illumination (for example, when infrared light bulbs broke) or accidental misplacement of the cameras105.

We computed the motion energy (the mean across pixels of the absolute value of the difference between adjacent frames) of the whisker pad areas in the ‘left’ and ‘right’ videos (Fig. 1d). The whisker pad area was empirically defined using a rectangular bounding box anchored between the nose tip and the eye, both found using DeepLabCut106 (DLC; see more below). This metric quantifies motion in the whisker pad area and has a temporal resolution of the respective camera.
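Computed this way, motion energy reduces to a few lines of NumPy; this sketch assumes the whisker-pad crop has already been extracted as a grayscale array.

```python
import numpy as np

def motion_energy(frames: np.ndarray) -> np.ndarray:
    # frames: (n_frames, height, width) grayscale whisker-pad crop.
    diff = np.abs(np.diff(frames.astype(np.float32), axis=0))
    return diff.mean(axis=(1, 2))   # mean across pixels, per frame pair
```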

We also performed markerless pose estimation of body parts using DLC31, which is used in a fully automated pipeline in IBL (v.2.1) to track various body parts such as the paws, nose, tongue and pupil (Fig. 1d). In all analyses using DLC estimates, we excluded predictions with likelihood < 0.9. Furthermore, we developed several quality-control metrics for the DLC traces105.

RF mapping

At the end of the behavioural task session, for most of the recordings (504 out of 699 insertions), we performed a receptive field (RF) mapping experiment for 5 min. During the RF mapping phase, visual stimuli were random square pixels in a 15 × 15 grid occupying 120° of visual angle both horizontally and vertically. Each pixel could take one of three colours: white, grey or dark. Pixel colours switched randomly at the 60-Hz frame rate, such that each colour persisted for around 100 ms on average (Supplementary Fig. 5a).

To compute the RF, we identified the moments when colour switching occurred for each pixel. We defined the moments when colour brightness increased (the transitions from dark to grey and from grey to white) as the ‘on’ stimulus onset times. Similarly, we defined the moments when colour brightness decreased (the transitions from grey to dark and from white to grey) as the ‘off’ stimulus onset times. We then computed the average spike rate aligned to the on and off stimulus onsets for each pixel, from 0 to 100 ms. We defined two types of RF, on and off, as the average on and off spike rates, respectively, across pixels.

To estimate the significance of the RF, we fitted the RF with a 2D Gaussian function, then compared the variance explained with that obtained from fitting randomly shuffled RFs (200 shuffles) to compute a P value. We defined a neuron as having a significant RF if either its on or its off RF had P < 0.01.
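A minimal sketch of this shuffle test is shown below; the particular 2D Gaussian parameterization and the (n + 1)/(M + 1) P-value convention are our assumptions, as the text does not specify them.

```python
import numpy as np
from scipy.optimize import curve_fit

def gauss2d(xy, amp, x0, y0, sx, sy, base):
    x, y = xy
    return base + amp * np.exp(-((x - x0) ** 2 / (2 * sx ** 2)
                                 + (y - y0) ** 2 / (2 * sy ** 2)))

def variance_explained(rf):
    # rf: 15 x 15 map of stimulus-evoked rates for one neuron.
    x, y = np.meshgrid(np.arange(15), np.arange(15))
    xy = (x.ravel(), y.ravel())
    p0 = (rf.max() - rf.min(), 7, 7, 2, 2, rf.min())
    try:
        popt, _ = curve_fit(gauss2d, xy, rf.ravel(), p0=p0, maxfev=5000)
    except RuntimeError:
        return 0.0                      # fit failed: no variance explained
    resid = rf.ravel() - gauss2d(xy, *popt)
    return 1 - resid.var() / rf.ravel().var()

def rf_p_value(rf, n_shuffles=200, seed=0):
    rng = np.random.default_rng(seed)
    true_ve = variance_explained(rf)
    null = [variance_explained(rng.permutation(rf.ravel()).reshape(rf.shape))
            for _ in range(n_shuffles)]
    return (np.sum(np.array(null) >= true_ve) + 1) / (n_shuffles + 1)
```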

Assessing significance

In this work, we studied the neural correlates of task and behavioural variables. To assess the significance of these analyses, we needed to properly account for spurious correlations. Spurious correlations can be induced in particular by slow continuous drift in the neurophysiological recordings due to various factors, including movement of the Neuropixels probes in the brain. Such slow drifts can create temporal correlations across trials. Because standard correlation analyses assume that all samples are independent, they can produce apparently significant nonsense correlations even for signals that are completely unrelated107,108.

We generated null distributions, which we used to test the significance of our results. Specifically, we used distinct null distributions for each of the three types of variables we considered: a discrete behaviour-independent variable (the stimulus side); discrete behaviour-dependent variables (for example, reward and choice); and continuous behaviour-dependent variables (for example, wheel speed and wheel velocity). For the rest of the section, we denote the aggregated neural activity across L trials and N neurons by \(S\in {{\mathbb{R}}}^{L\times N}\), and denote the vector of scalar targets across all trials by \(C\in {{\mathbb{R}}}^{L}\).

For the discrete behaviour-independent variable, we generated the null distribution from so-called pseudo-sessions. These are sessions generated from the same generative process as the one used for the mice. This process ensured that the time series of trials in each pseudo-session shared the same summary statistics as the ones used in the experiment. We generated M (typically M = 200) pseudo-targets \({\widetilde{C}}_{i},i\in [1,M]\), and performed the given analysis on the pair \((S,{\widetilde{C}}_{i})\) and obtained a fit score \({\widetilde{F}}_{i}\). In pseudo-sessions, the neural activity S should be independent of \({\widetilde{C}}_{i}\) as the mouse did not see \({\widetilde{C}}_{i}\) but rather C. Any predictive power from \({\widetilde{C}}_{i}\) to S (or from S to \({\widetilde{C}}_{i}\)) would arise, for instance, from slow drift in S unrelated to the task itself. These pseudo-scores \({\widetilde{F}}_{i}\) were compared with the actual score F obtained from the neural analysis on \((S,C)\) to assess significance.
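In generic form, the pseudo-session test looks as follows; score_fn and sample_pseudo_target stand in for whichever analysis and generative process are being used (the session-side generator sketched earlier is one instance), and the (count + 1)/(M + 1) convention is an assumption.

```python
import numpy as np

def pseudo_session_p_value(score_fn, S, C, sample_pseudo_target, M=200):
    # score_fn(S, C) -> scalar fit score F;
    # sample_pseudo_target() -> C_tilde from the task's generative process.
    F = score_fn(S, C)
    null = np.array([score_fn(S, sample_pseudo_target()) for _ in range(M)])
    return (np.sum(null >= F) + 1) / (M + 1)
```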

For discrete behaviour-dependent variables (such as choice or reward), we could not use the pseudo-session procedure above as we did not know the underlying generative process in the mouse. We therefore used ‘synthetic’ sessions to create a null distribution. These depended on a generative model of the process governing the choices of the animals. In turn, this required a model of how the animals estimated the prior probability that the stimulus appears on the right or left side of the screen, along with a model of its response to different contrasts given this estimated prior. In a companion paper on the subjective prior24, we found that the best model of the prior across all animals uses a running average of the past actions as a subjective prior of the side of the next stimulus, which we refer to as the ‘action-kernel’ model. The subjective prior πt follows the update rule:

$${\pi }_{t+1}|{\pi }_{t},{a}_{t},\alpha =(1-\alpha )\cdot {\pi }_{t}+\alpha \cdot {\mathbb{I}}({a}_{t} > 0)$$

with \({a}_{t}\in \{-1,1\}\) (left, right) the action performed by the mouse on trial t, and α the learning rate, which we fitted on a session-by-session basis. This effectively modelled how mice use information from previous trials to build a subjective prior of where the stimulus is going to appear on the next trial. The details of how this prior is integrated with the stimulus to produce a decision policy are described in the companion paper24.
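The update rule translates directly into code; this sketch (hypothetical names) returns the full trajectory of the subjective prior over a session.

```python
import numpy as np

def action_kernel_prior(actions, alpha, pi0=0.5):
    # actions: sequence of -1 (left) / +1 (right); alpha: learning rate.
    pi = np.empty(len(actions) + 1)
    pi[0] = pi0
    for t, a in enumerate(actions):
        # pi_{t+1} = (1 - alpha) * pi_t + alpha * I(a_t > 0)
        pi[t + 1] = (1 - alpha) * pi[t] + alpha * float(a > 0)
    return pi   # pi[t] is the subjective prior P(right) before trial t
```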

We fit the parameters of this model of the mouse’s decision-making behaviour separately for each session and then created ‘synthetic’ targets \({\widetilde{C}}_{i}\) for that session by applying the model (with those fitted parameter values) to stimuli generated from pseudo-sessions to obtain time series of choice and reward. Then, as for the pseudo-sessions above, we obtained pseudo-scores \({\widetilde{F}}_{i}\) based on \((S,{\widetilde{C}}_{i})\) and assessed significance by comparing the distribution of pseudo-scores to the actual score F obtained from the neural analysis on (S, C).

For the third type of variable—continuous behaviour-dependent variables such as wheel speed—generating synthetic sessions was harder as we did not have access to a reasonable generative model of these quantities. We instead used what we call ‘imposter’ sessions, which were generated from the continuous behaviour-dependent variable from another mouse on another session. In detail, an imposter session for an original session of L trials was generated by performing the following steps:

1. Concatenating trials across all sessions analysed in this study (leaving out the session under consideration).

2. Randomly selecting a chunk of L consecutive trials from these concatenated sessions.

3. Returning the selected chunk as the imposter session.

The continuous behaviour-dependent variable could then be extracted from the imposter session. As with the pseudo-sessions and the synthetic sessions, we obtained pseudo-scores \({\widetilde{F}}_{i}\) from a collection of imposter sessions and assessed significance by comparing the distribution of pseudo-scores to the actual score F obtained from the neural analysis on \((S,C)\).
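A sketch of steps 1–3, assuming each session is represented as a list of per-trial values (for example, per-trial wheel-speed traces):

```python
import numpy as np

def imposter_session(all_sessions, exclude_idx, L, seed=0):
    # all_sessions: list of per-session lists of per-trial values;
    # exclude_idx: the session under consideration; L: its trial count.
    rng = np.random.default_rng(seed)
    pool = [trial for i, sess in enumerate(all_sessions)
            if i != exclude_idx for trial in sess]       # step 1
    start = int(rng.integers(0, len(pool) - L + 1))      # step 2
    return pool[start:start + L]                         # step 3
```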

To apply the linear shift method79,109 to compare spiking with movement variables, we first truncated both the movement and spiking time series by removing n = 20 samples from both ends of both time series. We computed the Pearson’s correlation coefficient of the central segments and compared the square of this coefficient to a null distribution obtained by repeatedly shifting the spiking time series linearly from the beginning to the end of the full behavioural time series. Significance was assessed using the approximate criterion, rejecting the null with significance α = 0.05 if the unshifted correlation was in the top α(2n + 1) of the shifted values.
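A sketch of this procedure is given below; the indexing convention for the linear shifts is our assumption, chosen so that the unshifted alignment is included among the 2n + 1 shifted values.

```python
import numpy as np

def linear_shift_test(spikes, movement, n=20, alpha=0.05):
    # Truncate n samples from each end; correlate the central segments.
    center = slice(n, len(spikes) - n)
    r2 = np.corrcoef(spikes[center], movement[center])[0, 1] ** 2
    seg_len = len(spikes) - 2 * n
    shifted = np.array([
        np.corrcoef(spikes[s:s + seg_len], movement[center])[0, 1] ** 2
        for s in range(2 * n + 1)])      # includes the unshifted case s = n
    # Approximate criterion: reject if r2 is in the top alpha * (2n + 1).
    rank_from_top = np.sum(shifted > r2)
    return rank_from_top < alpha * (2 * n + 1), r2
```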

Additional information about assessing significance for individual analyses is detailed in the analysis-specific sections below. For the decoding, single-cell and population trajectory analyses, the results come in the form of per-region P values. We used the false discovery rate (FDR) to correct for comparisons across all the regions involved in each analysis (201 for the main figures) at a level of q = 0.01. We used the Benjamini–Hochberg procedure110 as we expected substantial independence among the tests. As noted, we were not able to assess significance for the encoding analysis because of the lack of a convenient null distribution.

Overview of decoding

We performed a decoding analysis to measure how much information the activity of populations of neurons contained about task variables such as stimulus side and choice. To do this, we used cross-validated, maximum-likelihood regression with L1 regularization (to zero out the contribution from noisy neurons). The neural regressors were defined by binning the spike counts from each neuron in each session in a given region in a specific time window on each trial. The duration of the time window, the number of bins in that time window (that is, the bin size) and the trial event to which it was aligned depended on the variable that was the target of our regression (Supplementary Table 2). These factors are discussed further below; the targets comprised a variety of behavioural and task variables: stimulus side, choice, feedback, and wheel speed and velocity. Although a session may have included multiple probe insertions, we did not perform decoding on these probes separately because they are not independent. Instead, neurons in the same session and region were combined across probes for our decoding analysis. Decoding was cross-validated and compared with a null distribution to test for significance. A given region may have been recorded on multiple sessions; therefore, in the main figures (Figs. 4–7) the region P value was defined by combining session P values using Fisher’s combined probability test, and the region effect size was defined by subtracting the median of the null distribution from the decoding score and reporting the median of the resulting values across sessions. The P values for all regions were then subjected to FDR correction for multiple comparisons at q = 0.01.

Decoding target variables

Stimulus side, choice and feedback were treated as binary target variables for logistic regression. For stimulus side, trials that had zero contrast were excluded. We used the LogisticRegression module from scikit-learn111 (v.1.1.2) with 0.001 tolerance, 20,000 maximum iterations, “l1” penalty, “liblinear” solver and “fit_intercept” set to True. We balanced decoder classes by weighting samples by the inverse of the class frequency, \(1/(2{P}_{{\rm{class}}})\). Decoding performance was evaluated using the balanced accuracy of classification, which is the average of the recall probabilities for the two classes. Supplementary Figure 9 shows histograms of the regression coefficients for all the variables.
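The decoder configuration described above can be reproduced as follows (a sketch; X_train, y_train and the example regularization value are placeholders, and held-out scoring follows the cross-validation described below):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score

def make_decoder(C=1.0):
    # Settings as stated in the text (scikit-learn v1.1.2).
    return LogisticRegression(penalty='l1', solver='liblinear', tol=0.001,
                              max_iter=20000, C=C, fit_intercept=True)

def class_balanced_weights(y):
    # Weight each sample by the inverse of its class frequency, 1/(2 P_class).
    freq = np.bincount(y) / len(y)
    return 1.0 / (2.0 * freq[y])

# Usage (placeholders):
# clf = make_decoder(C=0.1).fit(X_train, y_train,
#                               sample_weight=class_balanced_weights(y_train))
# score = balanced_accuracy_score(y_test, clf.predict(X_test))
```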

Wheel values (speed and velocity) change over the course of a trial, unlike the previous decoding targets, and we therefore had to treat these target variables differently. We averaged wheel values in nonoverlapping 20-ms bins, starting 200 ms before the first wheel-movement time and ending 1,000 ms after it. Spike counts were similarly binned. The target value for a given bin (ending at time t) was decoded from spikes in a preceding (causal) window spanning W + 1 bins (ending at times t, t − 1, …, t − W). Therefore, if decoding from n neurons, there were (W + 1)n predictors of the target variable in a given bin. In practice, we used W = 10. To decode these continuous-valued targets, we performed linear regression using the Lasso module from scikit-learn111 (v.1.1.2) with 0.001 tolerance, 1,000 maximum iterations and “fit_intercept” set to True. Decoding performance was evaluated using the R2 metric.
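A sketch of the lagged design matrix and Lasso fit for one trial's binned data (hypothetical names; per-trial concatenation and cross-validation are omitted here and described next):

```python
import numpy as np
from sklearn.linear_model import Lasso

def lagged_design(spikes, W=10):
    # spikes: (n_bins, n_neurons) counts; the target in bin t is predicted
    # from bins t - W ... t, giving (W + 1) * n_neurons predictors per bin.
    T, _ = spikes.shape
    return np.stack([spikes[t - W:t + 1].ravel() for t in range(W, T)])

def fit_wheel_decoder(spikes, wheel, W=10, alpha=0.1):
    X, y = lagged_design(spikes, W), wheel[W:]
    model = Lasso(alpha=alpha, tol=0.001, max_iter=1000, fit_intercept=True)
    model.fit(X, y)
    return model   # evaluate with model.score(X_test, y_test), i.e. R^2
```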

Decoding cross-validation

We performed all decoding using nested cross-validation. Each of five outer folds was based on a training and validation set comprising 80% of the trials and a test set of the remaining 20% of trials. We selected trials at random in an interleaved manner. The training and validation set of an outer fold was itself split into five inner folds, again using an interleaved 80:20% partition. When logistic regression was performed, the folds had to be selected such that the trials used to train the decoder included at least one example of each class. Because both outer and inner folds were selected at random, it was possible that this requirement was not met. In those circumstances, we re-sampled the outer or inner folds. Likewise, we disallowed pseudo and synthetic sessions that had too few class examples. We fit regression models on the 80% training set of the inner fold using regularization coefficients (10−5, 10−4, 10−3, 10−2, 10−1, 100 and 101) for logistic regression (input parameter C in sklearn) and (10−5, 10−4, 10−3, 10−2 and 10−1) for linear regression (input parameter α in sklearn). We then used each model to predict targets on the remaining 20% of the trials of the inner fold (that is, the validation set). We repeated this procedure such that each trial in the original training and validation set of the outer fold was used once for the validation set and four times for the training set. We then took the regularization coefficient that performed best across all validation folds and retrained a regression model using all trials in the training and validation set of the outer fold. This final model was used to predict the target variable on the 20% of trials in the test set of the outer fold. We repeated the above train–validate–test procedure five times, each time holding out a different 20% of test trials such that, after the five repetitions, each trial had been included in the test set exactly once and included in the training and validation set exactly four times. The concatenation of all test set predictions, covering 100% of the trials, was used to evaluate the decoding score.
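The fold logic can be summarized in a compact sketch (the re-sampling of folds that lack both classes, the multiple runs and the per-target scoring details are omitted; make_model and grid are placeholders):

```python
import numpy as np
from sklearn.model_selection import KFold

def nested_cv_predictions(X, y, make_model, grid, seed=0):
    """Return held-out predictions covering 100% of trials."""
    preds = np.empty(len(y), dtype=float)
    outer = KFold(n_splits=5, shuffle=True, random_state=seed)
    for train, test in outer.split(X):
        inner = KFold(n_splits=5, shuffle=True, random_state=seed + 1)
        # Score each regularization coefficient on the five inner folds.
        mean_scores = [
            np.mean([make_model(g).fit(X[train][tr], y[train][tr])
                                  .score(X[train][va], y[train][va])
                     for tr, va in inner.split(X[train])])
            for g in grid]
        best = grid[int(np.argmax(mean_scores))]
        # Retrain on the full outer training set with the best coefficient.
        model = make_model(best).fit(X[train], y[train])
        preds[test] = model.predict(X[test])
    return preds

# Usage (placeholders): grid = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1, 10]
# preds = nested_cv_predictions(X, y, make_decoder, grid)
```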

We found that for some regions and sessions, the resulting decoding score was sensitive to the precise assignment of trials to different folds. Therefore, to provide additional robustness to this procedure, we repeated the full fivefold cross-validation over multiple separate runs, each of which used a different random seed for selecting the interleaved training, validation and test splits. We then took the average decoding score across all runs as the final reported decoding score. When decoding stimulus side, choice and feedback, we performed ten runs, and for decoding wheel speed and wheel velocity, we used two runs owing to the added computational burden of decoding the wheel values, which included multiple bins per trial.

To further reduce the sensitivity of decoding scores due to fold allocation, the companion prior paper24 used a minimum of 250 trials to perform decoding of a given session. We waived that requirement for the decoding analyses in this study to match the same neurons used in the other analyses. We found that relaxing this requirement only affected the significance of a small number of regions for each target variable (Supplementary Fig. 10).

Decoding significance testing with null distributions

We assessed the significance of the decoding score that resulted from the multirun cross-validation procedure by comparing it to those of a bespoke null distribution of decoding scores. To construct appropriate null distributions, we fixed the regressor matrices of neural activity and generated new vectors of target values that followed similar statistics (Supplementary Table 2), as described above. Once the new target values were generated, we carried out the full multirun cross-validation procedure described above to obtain a new decoding score. This was repeated multiple times to produce a null distribution of decoding scores: stimulus side, choice and feedback were repeated 200 times, whereas wheel speed and velocity were repeated 100 times to reduce the computational burden (Supplementary Table 3).

The null distribution was used to define a P value for each region–session pair, in which the P value was defined as 1 − ρ, where ρ was the percentile relative to the null distribution. Each brain region was recorded in ≥2 sessions, and we used two different methods for summarizing the decoding scores across sessions: (1) the median-corrected decoding score among sessions, which was used as the effect size in the main figures (the values were corrected by subtracting the median of the decoding score of the null distribution); and (2) the fraction of sessions in which decoding was significant, that is, in which the P value was less than α = 0.05, which is shown in Extended Data Fig. 6. We combined session-wise P values using Fisher’s combined probability test (also known as Fisher’s method32,33) when computing a single statistic for a region. Finally, the combined P value for a region was subjected to an FDR correction for multiple comparisons at q = 0.01. We note that the combined P value may be significant even though the computed effect size is negative. This can occur because many sessions used for decoding in that region may have been insignificant, thereby driving the effect size down, whereas a small number of sessions may have been significant, thereby causing Fisher’s combined probability test to produce a significant combined P value.

Single-cell correlates of sensory, cognitive and motor variables

We quantified the sensitivity of single neurons to three task variables: visual stimulus (left versus right location of the visual stimulus); choice (left versus right direction of wheel turning); and feedback (reward versus non-reward). We computed the sensitivity metric for each task variable using the combined-condition Mann–Whitney U-statistic1,112,113 (Supplementary Fig. 2a,c). Specifically, we compared the firing rates from trials with one task-variable value V1 (for example, trials with the stimulus on the left side) to those with the other value V2 (for example, with the stimulus on the right side) while holding the values of all other task variables fixed. In this way, we could isolate the influence of individual task variables on neural activity. To compute the U-statistic, we first assigned numeric ranks to the firing rate observations in each trial. We then computed the sums of ranks R1 and R2 for the observations coming from the n1 and n2 trials associated with the task-variable values V1 and V2, respectively. The U-statistic is defined as:

$$U=\min \left[{R}_{1}-\frac{{n}_{1}({n}_{1}+1)}{2},{R}_{2}-\frac{{n}_{2}({n}_{2}+1)}{2}\right].$$

(1)

The probability that the firing rate on V1 trials is different (greater or smaller) from the firing rate on V2 trials is computed as 1 − P, where P is given by

$$P=\frac{U}{{n}_{1}{n}_{2}},$$

(2)

which is equivalent to the area under the receiver operating characteristic curve114,115. The null hypothesis is that the distributions of firing rates on V1 and V2 trials are identical.

To obtain a single probability across conditions, we combined observations across different trial conditions j by summing the U-statistics over these conditions1:

$$P=\frac{{\sum }_{j}{U}_{j}}{{\sum }_{j}{n}_{1,j}{n}_{2,j}}.$$

(3)

Here n1,j and n2,j are the numbers of V1 and V2 trials, respectively, in the condition j.
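Equations (1)–(3) translate directly into code; in this sketch, conditions is a list of paired firing-rate arrays, one pair per fixed setting of the other task variables.

```python
import numpy as np
from scipy.stats import rankdata

def u_statistic(rates1, rates2):
    # Equation (1): rank-sum U for one condition.
    n1, n2 = len(rates1), len(rates2)
    ranks = rankdata(np.concatenate([rates1, rates2]))
    R1, R2 = ranks[:n1].sum(), ranks[n1:].sum()
    return min(R1 - n1 * (n1 + 1) / 2, R2 - n2 * (n2 + 1) / 2)

def combined_condition_p(conditions):
    # conditions: list of (rates_V1, rates_V2) pairs, one per condition j.
    U = sum(u_statistic(r1, r2) for r1, r2 in conditions)
    denom = sum(len(r1) * len(r2) for r1, r2 in conditions)
    return U / denom    # equation (3); significance comes from permutations
```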

For the visual stimulus, we compared firing rates in trials with the stimulus on the left versus the right during the time window 0–100 ms aligned to the stimulus-onset time. For choice, we compared firing rates in trials with the left versus the right choice during the time window −100 to 0 ms aligned to the first wheel-movement time. For feedback, we compared firing rates in trials with reward versus non-reward during the time window 0–200 ms aligned to the feedback-onset time.

To estimate significance, we used a permutation test in which trial labels for one task variable were randomly permuted 3,000 times in each subset of trials with fixed values of all other task variables, and the Mann–Whitney U-statistic was computed for each permutation. We computed the P value for each task variable as the fraction of permutations with a statistic P greater than that in the data. This approach controlled for correlations among task variables and allowed us to isolate the sensitivity of the neuron to a stimulus that is not due to sensitivity to block and choice, and vice versa. Random permutations, however, do not control for spurious correlations that can arise owing to autocorrelations in the time series of the firing rate and task variable107. To control for spurious correlations, we used a within-block permutation test to simultaneously control for both temporal correlations and correlations among task variables. Specifically, we generated the null distribution by randomly permuting trial labels with fixed values of all other task variables within each individual block, which effectively reduced the serial dependencies of task variables at the time scale of the block duration.

The combined-condition Mann–Whitney U-statistic is known to have a relatively high false-positive rate owing to the limited number of trials in each condition. To obtain a sufficient number of trials, we also computed a simple Mann–Whitney U-statistic without separating different conditions. We defined P < 0.001 (αMW = 0.001) as the criterion of significance for the simple Mann–Whitney U-statistic, and P < 0.05 (αCCMW = 0.05) for the combined-condition Mann–Whitney U-statistic. We defined neurons that were significant in both tests to be sensitive neurons for a specific task variable.

To quantify the overall responsiveness of single neurons to the behavioural task, we used the Wilcoxon rank-sum test to compare firing rates between the baseline (−200 to 0 ms window aligned to the stimulus onset) and the following task periods: 50–150 ms and 0–400 ms aligned to the stimulus onset; −100 to 50 ms and −50 to 200 ms aligned to the first wheel-movement time; and 0–150 ms aligned to the reward delivery. These time windows were selected on the basis of the test of responsiveness in previous work on large-scale neural coding with a similar task structure1.

To measure the behavioural movement correlates of single neurons across entire recording sessions, we computed zero time-lag Pearson’s correlation coefficients between the time series of spike counts in 50-ms bins and the time series of four behavioural variables (nose, pupil, paw and tongue), each extracted from videos of the mouse using DLC software31. To assess the significance of these correlations, we applied a time-shift test79 and computed 2K = 40 time-shifted correlations, varying the offset between the time series of spiking activity and behavioural variables from 50 to 1,000 ms (both positive and negative offsets). We then counted the number of times m that the absolute value of the time-shifted correlation exceeded that of the zero time-lag correlation and assigned the P value as this fraction, P = m/(2K + 1). We then assigned each neuron as being significantly responsive relative to a particular threshold on this P value.

We then computed the fraction of neurons in each brain region that were significantly responsive to the behavioural task, movement, visual stimulus, choice and feedback, and identified the brain regions that were most responsive to these conditions. Specifically, for each region, we computed the P value of the fraction of neurons (fi) in the i-th session by comparing the fraction with a binomial distribution of fractions due to false-positive events, Binomial(Ni, α), where Ni is the number of neurons in the i-th session and α is the false-positive rate:

$$\alpha ={\alpha }_{{\rm{M}}{\rm{W}}}\times {\alpha }_{{\rm{C}}{\rm{C}}{\rm{M}}{\rm{W}}}=0.001\times 0.05,\,\text{for stimulus, choice, and feedback}$$

(4)

We defined the P value Pi as the probability of observing a fraction larger than fi under the distribution Binomial(Ni, α). Next, we used Fisher’s combined probability test to compute a combined P value for each brain region by combining the P values of all sessions (i = 1, 2, …, m).
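Per session, this test is a binomial tail probability; a sketch (hypothetical names) using SciPy, with Fisher's method for the across-session combination, is shown below.

```python
import numpy as np
from scipy.stats import binom, combine_pvalues

def session_p(frac_significant, n_neurons, alpha=0.001 * 0.05):
    # P(X >= k) under Binomial(n_neurons, alpha), with k the observed
    # number of significant neurons.
    k = int(round(frac_significant * n_neurons))
    return binom.sf(k - 1, n_neurons, alpha)

# Combine the sessions of one region with Fisher's method:
# _, p_region = combine_pvalues([session_p(f, n) for f, n in sessions],
#                               method='fisher')
```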

After computing combined P values of each brain region, these P values were then subjected to the FDR procedure (Benjamini–Hochberg) at q = 0.01 to correct for multiple comparisons. We defined a list of regions to be significant on the basis of this FDR procedure.

Population trajectory analysis methods

We examined how responsive different brain regions were to a task variable v of interest. To do so, we constructed a pair of variable-specific supersessions (\({S}_{v},{S}_{v}^{{\prime} }\)): we partitioned all the IBL data into two, corresponding to the opposing pair of conditions for the variable (for example, for stimulus discrimination, we split the trials into the left and right stimulus conditions) and replaced the trial-by-trial responses of each cell in the condition and in each session with one trial-averaged response (Fig. 3e). These trials were aligned to a variable-specific reference time (for example, the stimulus-onset time for stimulus discrimination). We used the canonical time windows shown in Fig. 3a around the alignment time for the main figures unless stated otherwise (for example, for feedback, we used a longer time window in the temporal evolution plot to illustrate licking), with time bins of length 12.5 ms and a stride of 2 ms. The supersessions \({S}_{v},{S}_{v}^{{\prime} }\) had a number of rows equalling the number of IBL sessions that passed quality control for that variable condition times the number of cells per session; columns corresponded to time bins.

We then subdivided the supersessions by brain region r (\({S}_{v,r},{S}_{v,r}^{{\prime} }\)). These defined a pair of across-IBL response trajectories (temporal evolution of the response) to the pair of variable v conditions for each brain region.

We next computed the time-resolved difference in response of brain region r to the opposing conditions of task variable v. We restricted our analyses to regions with ≥20 rows in (\({S}_{v,r},{S}_{v,r}^{{\prime} }\)) for all analyses. Our primary distance metric, which we call dv,r(t), was computed as a simple Euclidean distance in neural space, normalized by the square root of the number of cells in the given region.

Given a time-resolved distance curve, we computed the maximum and minimum distances along the curve to define a variable-specific and region-specific modulation amplitude:

$${A}_{v,r}={\max }_{t}[{d}_{v,r}(t)]-{\min }_{t}[{d}_{v,r}(t)].$$

(5)

We obtained a variable-specific and region-specific response latency by defining it as the first time t at which \({d}_{v,r}(t)={\min }_{t}[{d}_{v,r}(t)]\,+\,\)\(0.7({\max }_{t}[{d}_{v,r}(t)]-{\min }_{t}[{d}_{v,r}(t)])\). Using modulation amplitude as a measure of effect size, we then quantified the combined modulation amplitude and latency of regions as a function of task variable.
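The distance, modulation amplitude and latency computations amount to the following sketch, in which the two supersessions for one region are matched (cells × time bins) arrays.

```python
import numpy as np

def modulation(Sa, Sb):
    # Sa, Sb: (n_cells, n_timebins) trial-averaged responses to the two
    # conditions, with rows corresponding to the same cells.
    d = np.linalg.norm(Sa - Sb, axis=0) / np.sqrt(Sa.shape[0])
    amplitude = d.max() - d.min()                    # equation (5)
    thresh = d.min() + 0.7 * amplitude
    latency_bin = int(np.argmax(d >= thresh))        # first crossing
    return d, amplitude, latency_bin
```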

To generate a significance measure for the variable-specific and region-specific distance measures, we used a pseudo-trial method for generating null distance distributions, as described below. Distances were significant if they were greater in size than the corresponding null distance distribution with P < 0.01. Although the significance of regions was therefore controlled for the effects of other task variables, note that the distance amplitudes and latencies were not.

Below we list the three task variables examined and the associated null distributions:

Stimulus supersession: \({S}_{v},{S}_{v}^{{\prime} }\) corresponded to trials with the stimulus on the left or right, respectively, aligned by the stimulus-onset time and including 0 ms before to 150 ms after onset. To generate pseudo-trials, we permuted the stimulus side labels among trials that shared the same block and choice side.

Choice supersession: \({S}_{v},{S}_{v}^{{\prime} }\) corresponded to trials with the animal’s response (wheel movement) to the left or right, respectively, aligned by the first wheel-movement time and including 0 ms before to 150 ms after onset. To generate pseudo-trials, we permuted the choice labels among trials with the same block and stimulus side.

Feedback supersession: \({S}_{v},{S}_{v}^{{\prime} }\) corresponded to trials in which the animal’s response was correct (recall that the feedback was water delivery) or incorrect (recall that the feedback was tone and time out delivery), respectively, aligned by feedback onset and including 0 ms before to 150 ms after onset. To generate pseudo-trials, we permuted the choice labels among trials with the same block and stimulus side and then compared these pseudo-choices with the true stimulus sides to obtain pseudo-feedback types.

For each \({S}_{v,r},{S}_{v,r}^{{\prime} }\) pair, we repeated the pseudo-trial process M = 1,000 times, then followed the same distance computation procedures described above to obtain a null distribution of M modulation amplitude scores. Counting the number n of pseudo-scores that were greater than the true score for the region, we obtained a P value as \(P=\frac{n+1}{M+1}\).

For regions with significant and large effect sizes to a given variable, we generated visualizations of the population dynamics by projecting the trajectories in \({S}_{v,r},{S}_{v,r}^{{\prime} }\) into a low-dimensional subspace defined by the first three principal components of the pair \({S}_{v,r},{S}_{v,r}^{{\prime} }\). In addition to the main figure results, population trajectory results on the maximal dataset are shown in Extended Data Fig. 10.

Multiple linear regression model of single-neuron activity

We fit linear regression models to single-neuron activity, measured as spikes binned into 20-ms intervals. These models aimed to express {slt}, the neural activity in time bin t ∈ [1, T] on trial l ∈ [1, L], based on D time-varying task-related regressors \(X\in {{\mathbb{R}}}^{L,T,D}\). We first represented the regressors across time using a basis of raised cosine ‘bump’ functions in log space116. Each basis function was associated with a weight in the regression model, with the value of the basis function at time t described by \(\frac{1}{2}\cos \left(\frac{2(t-\tau ){\rm{\pi }}}{2w}\right)+\frac{1}{2}\). The basis functions were computed in log space and then mapped into linear time to more efficiently capture both fast neuronal responses in the <100-ms range and slow changes beyond that time (Fig. 3c). The width w and centre τ of each basis were chosen to ensure even coverage of the total duration of the kernel. In an example kernel with three bases, three separate weights would be fit to the event in question, with the weights describing early, middle and late activity predicted by the event. These bases were convolved with a vector describing the effects of each regressor. In the case of timing events, the bases were convolved with a Kronecker delta function, which resulted in a copy of the kernel at each time when the event occurred. We describe the simple case in which each regressor has the same number B of basis functions. This produced a new regression tensor \(\widehat{X}\in {{\mathbb{R}}}^{L,T,D,B}.\)

We then sought regression weights \(\beta \in {{\mathbb{R}}}^{D,B}\) such that, as closely as possible, \({s}_{lt}={\beta }_{0}+{\sum }_{d,b}{\beta }_{db}{\widehat{x}}_{ltdb}\), where {βdb} are linear regression weights. Each single-neuron model used regressors for stimulus onset (left and right separately), first wheel-movement time (left or right), correct feedback, incorrect feedback, value of the block probability, movement initiation and wheel speed. Fitting was performed using an L2-penalized objective function (as implemented in the scikit-learn Python ecosystem as \(\Vert {\bf{s}}-{\beta }_{0}-\widehat{X}\cdot \beta {\Vert }_{2}^{2}+\alpha \times \Vert \beta {\Vert }_{2}^{2}\)), with the weight of the regularization α determined through cross-validation. Note that the intercept of the model is not included in the regularization to capture fully the mean of the distribution of s.
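For intuition, a sketch of raised-cosine bases laid out in log time is given below; the exact construction used in the paper follows ref. 116, and the spacing and clipping choices here are our assumptions.

```python
import numpy as np

def raised_cosine_log_bases(duration_s, n_bases=5, dt=0.02, offset=0.01):
    # Bumps evenly spaced in log time, mapped back to linear time, giving
    # dense coverage of fast (<100 ms) responses and sparse coverage later.
    t = np.arange(0, duration_s, dt)
    log_t = np.log(t + offset)
    centers = np.linspace(log_t.min(), log_t.max(), n_bases)
    width = centers[1] - centers[0]
    B = np.zeros((len(t), n_bases))
    for j, c in enumerate(centers):
        arg = np.clip((log_t - c) * np.pi / (2 * width), -np.pi, np.pi)
        B[:, j] = 0.5 * np.cos(arg) + 0.5     # bump falls to 0 at its edges
    return t, B   # convolve each column with the event train (delta function)
```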

We used a kernel composed of five basis functions to parameterize left and right stimulus onset, and correct and incorrect feedback. These bases spanned 400 ms and corresponded to 5 weights per regressor for each of these 4 regressors in the model.

Previous work has shown that neither the difficulty of perceptual decision-making tasks117 nor neural responses change linearly with contrast. To account for this, we modulated the height of the stimulus-onset kernels as a function of the contrast c, with height \(h=\frac{\tanh (5c)}{\tanh (5)}\). The resulting kernels produce a response that is lower at low contrasts for the same set of weights {βdb}.

To capture statistical dependencies between wheel movements and spiking, we used anticausal kernels (in which the convolution of signal and kernel produces a kernel peak before peaks in the signal) describing the effect of the first wheel-movement time for leftward and rightward movements. These kernels described 200 ms of activity preceding the first movement using 3 basis functions. We also used an additional anticausal kernel of 3 bases covering 300 ms to describe the effect of wheel speed; this kernel was convolved with the wheel-speed trace of each trial. With these regressors, we aimed to capture preparatory signals preceding movements related to the wheel.

Models were fit on a per-neuron basis with the L2 objective function using fivefold cross-validation. Trials for cross-validation were chosen from a uniform distribution, and not in contiguous blocks. Models were then fit again using a leave-one-out paradigm, with each set of regressor weights βd1…βdB being removed as a group and the resulting model fit and scored again on the same folds. The change between the base model score \({R}_{{\rm{full}}}^{2}\) and the omission model score \({R}_{-{\rm{regressor}}}^{2}\) was computed as \(\Delta {R}_{{\rm{regressor}}}^{2}={R}_{{\rm{full}}}^{2}-{R}_{-{\rm{regressor}}}^{2}\). Moreover, the sensitivity for several pairs of associated regressors, such as left or right stimulus onset and correct and incorrect feedback, was defined as \(\log | \Delta {R}_{A}^{2}-\Delta {R}_{B}^{2}| \). This computation was applied to the following pairs: right and left stimulus, right and left first wheel-movement time, and correct and incorrect feedback.

Granger analysis across simultaneously recorded regions

Granger causality has been suggested as a statistically principled technique to estimate directed information flow from a pair of time series118. We used nonparametric spectral Granger causality119, implemented in Python120, to compute a Granger score for all simultaneously recorded region pairs in the IBL’s brain-wide dataset.

For a given session, binned spikes (12.5-ms bin size) from both probes were averaged across regions to obtain a firing rate time series for the complete recording (excluding regions with fewer than ten neurons per recording). These series (typically 1.5 h long) were then divided into nonoverlapping 10-s segments (irrespective of task contingencies or alignment), resulting in a data input of shape no. of regions × no. of segments × no. of observations, from which a Granger score as a function of frequency was computed for each directed region pair with the Spectral Connectivity Python package120. We obtained a single Granger score per directed region pair by averaging across frequencies121.

Significance for a Granger score and region pair for a given session was established using a permutation test. That is, a null distribution of pseudo Granger scores was obtained by randomly swapping the two region labels across segments. A total of 1,000 of these pseudoscores were computed, and a P value was obtained by counting the number of pseudoscores that were greater than the true Granger score and dividing this count by the number of pseudoscores plus 1. P values across all Granger scores were corrected for multiple comparison using the Benjamini–Yekutieli method. Measurements were combined across sessions by taking the mean Granger score and using Fisher’s combined probability test to combine the P values.
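A sketch of this permutation test (hypothetical names; granger_fn stands for the frequency-averaged score computed with the Spectral Connectivity package):

```python
import numpy as np

def granger_permutation_p(segments_a, segments_b, granger_fn,
                          n_perm=1000, seed=0):
    # segments_a/b: (n_segments, n_obs) firing-rate segments for the two
    # regions; granger_fn(a, b) returns a frequency-averaged A -> B score.
    rng = np.random.default_rng(seed)
    true_score = granger_fn(segments_a, segments_b)
    null = np.empty(n_perm)
    for i in range(n_perm):
        # Randomly swap the two region labels, independently per segment.
        swap = rng.random(len(segments_a)) < 0.5
        a = np.where(swap[:, None], segments_b, segments_a)
        b = np.where(swap[:, None], segments_a, segments_b)
        null[i] = granger_fn(a, b)
    # P value as described in the text: count / (n_perm + 1).
    return np.sum(null > true_score) / (n_perm + 1)
```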

Visualization and comparison of results across neural analyses

To facilitate comparisons of neural analyses across brain regions, for each task variable, we visualized effect sizes in a table (for example, Fig. 4f), specifying the effect size for each analysis and brain region. Cells of the table were coloured according to effect size using the same colour map as in the corresponding flatmap. Before summing, the effect sizes for each analysis were normalized to lie in the interval from 0 to 1. This method highlights regions with large effects across all analyses and indicates the extent to which the analyses agree. For a direct comparison of analyses scores, see flatmaps in Extended Data Fig. 2 and scatter plots of scores for analysis pairs in Extended Data Fig. 3.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.