We next turned our attention to annotating the identities of the recorded neurons in brain-wide calcium imaging data. C. elegans neurons have fairly stereotyped positions in the heads of adult animals, but it has not been shown that neural identity can be inferred with full accuracy from position alone. Fluorescent reporter gene expression under well-defined genetic drivers can provide additional information. The NeuroPAL strain is especially useful in this regard: in addition to pan-neuronal NLS-tagRFP, it expresses NLS-mTagBFP2, NLS-CyOFP1, and NLS-mNeptune2.5 under a set of well-chosen genetic drivers (example image in Figure 4A; Yemini et al., 2021). With proper training, humans can manually label the identities of most neurons in this strain using neuron position and multi-spectral fluorescence. For most of the brain-wide recordings collected using our calcium imaging platform, we used a previously characterized strain with a pan-neuronal NLS-GCaMP7F transgene crossed into NeuroPAL (Atanas et al., 2023). Freely moving recordings were conducted with only NLS-GCaMP and NLS-tagRFP data acquisition, but animals were immobilized at the end of each recording to capture multi-spectral fluorescence. Humans could manually label many neurons’ identities in these multi-spectral images, and the image registration approaches described above could map ROIs in the immobilized data to ROIs in the freely moving recordings, matching neuron identities to GCaMP traces.

The AutoCellLabeler Network can automatically annotate >100 neuronal cell types in the C. elegans head.

(A) Procedure by which AutoCellLabeler generates labels for neurons. First, the tagRFP component of a multi-spectral image is passed into a segmentation neural network, which extracts neuron ROIs, assigning each pixel an arbitrary number with one number per neuron. Then, the full multi-spectral image is input into AutoCellLabeler, which outputs a probability map. This probability map is applied to the ROIs to generate labels and confidence values for those labels. The network cuboid diagrams are represented as in Figure 1A. (B) AutoCellLabeler’s training data consists of a set of multi-spectral images (NLS-tagRFP, NLS-mNeptune2.5, NLS-CyOFP1, and NLS-mTagBFP2), human neuron labels, and a pixel weighting matrix based on confidence and frequency of the human labels that controls how much each pixel is weighted in AutoCellLabeler’s loss function. (C) Pixel-weighted cross-entropy loss and pixel-weighted IoU metric scores for training and validation data. Cross-entropy loss captures the discrepancy between predicted and actual class probabilities for each pixel. The IoU metric describes how accurately the predicted labels overlap with the ground truth labels. (D) During the label extraction procedure, AutoCellLabeler is less confident of its label on pixels near the edge of ROI boundaries. Therefore, we allow the central pixels to have much higher weight when determining the overall ROI label from pixel-level network output. (E) Distributions of AutoCellLabeler’s confidence across test datasets based on the relationship of its label to the human label (‘Correct’=agree, ‘Incorrect’=disagree, ‘Human low conf’=human had low confidence, ‘Human no label’=the human did not provide a label for the neuron). ****p<0.0001, as determined by a Mann-Whitney U Test between the indicated condition and the ‘Correct’ condition where the network agreed with the human label; n=835, 25, 322, 302 labels (from 11 animals) for the conditions ‘Correct’, ‘Incorrect’, ‘Human low conf’, ‘Human no label’, respectively; U=16,700, 202,691, 210,797 for ‘Incorrect’, ‘Human low conf’, ‘Human no label’ vs ‘Correct’, respectively. (F) Categorization of neurons in test datasets based on AutoCellLabeler’s confidence. Here ‘Correct’ and ‘Incorrect’ are as in (E), but ‘No human label’ also includes low-confidence human labels. Printed percentage values are the accuracy of AutoCellLabeler on the corresponding category, computed as correct/(correct + incorrect). (G) Distributions of accuracy of AutoCellLabeler’s high confidence (>75%) labels on neurons across test datasets based on the confidence of the human labels. n.s. not significant, *p<0.05, as determined by a paired permutation test comparing mean differences (n=11 test datasets). (H) Accuracy of AutoCellLabeler compared with high-confidence labels from new human labelers on neurons in test datasets that were labeled at low confidence, not at all, or at high confidence by the original human labelers. Error bars are bootstrapped 95% confidence intervals. A dashed red line shows the accuracy of new human labelers relative to the old human labelers, when both gave high confidence to their labels. There was no significant difference between the human vs human accuracy and the network accuracy for any of these categories of labels, determined via two-tailed empirical p-values from the bootstrapped distributions. (I) Distributions of number of high-confidence labels per animal over test datasets. High confidence was 4–5 for human labels and >75% for network labels. 
We note that we standardized the manner in which split ROIs were handled for human- and network-labeled data so that the number of detected neurons could be properly compared between these two groups. n.s. not significant, ***p<0.001, as determined by a paired permutation test comparing mean differences (n=11 animals). (J) Distributions of accuracy of high-confidence labels per animal over test datasets, relative to the original human labels. A paired permutation test comparing mean differences to the full network’s label accuracy did not find any significant differences. (K) Number of ROIs per neuron class labeled at high confidence in test datasets that fall into each category, along with average confidence for all labels for each neuron class in those test datasets. ‘New’ represents ROIs that were labeled by the network as the neuron and were not labeled by the human. ‘Correct’ represents ROIs that were labeled by both AutoCellLabeler and the human as that neuron. ‘Incorrect’ represents ROIs that were labeled by the network as that neuron and were labeled by the human as something else. ‘Lost’ represents ROIs that were labeled by the human as that neuron and were not labeled by the network. ‘Network conf’ represents the average confidence of the network for all its labels of that neuron. ‘Human conf’ represents the average confidence of the human labelers for all their labels of that neuron. Neuron classes with high values in the ‘Correct’ column and low values in the ‘Incorrect’ column indicate a very high degree of accuracy in AutoCellLabeler’s labels for those classes. If those classes also have a high value in the ‘New’ column, it could indicate that AutoCellLabeler is able to find the neuron with high accuracy in animals where humans were unable to label it.

Manual annotation of NeuroPAL images is time-consuming. First, to perform accurate labeling, an individual needs substantial training. Even after being trained, labeling all ROIs in one NeuroPAL animal can take 3–5 hr. In addition, different individuals have different degrees of knowledge or confidence in labeling certain cell classes. For these reasons, it was desirable to automate NeuroPAL labeling using datasets that had previously been labeled by a panel of human labelers. In particular, the labels that they provided with a high degree of confidence would be most useful for training an automated labeling network. Previous studies have developed statistical approaches for semi-automated labeling of neural identity from NeuroPAL images, but the highest precision that we are aware of is 90% without manual correction (Yemini et al., 2021).

We trained a 3-D U-Net (Wolny et al., 2020) to label the C. elegans neuron classes in a given NeuroPAL 3-D image. As input, the network received four fluorescent 3-D images from the head of each worm: pan-neuronal NLS-tagRFP, plus the NLS-mTagBFP2, NLS-CyOFP1, and NLS-mNeptune2.5 images that label stereotyped subsets of neurons (Figure 4A). During training, the network also received the human-annotated labels of which pixels belong to which neurons. Humans provided ROI-level labels, and the boundaries of each ROI were determined using a previously described neuron segmentation network (Atanas et al., 2023) trained to label all neurons in a given image (agnostic to their identity). Finally, the network also received an array indicating the relative weight to assign each pixel during training (Figure 4B). These weights were incorporated into a pixel-weighted cross-entropy loss function (lower values indicate more accurate labeling of each pixel), which sums the per-pixel loss in a weighted manner. Pixel weighting was adjusted as follows: (1) background was given extremely low weight; (2) ROIs that humans were not able to label were given extremely low weight; (3) all other ROIs received higher weight proportional to the subjective confidence that the human had in assigning the label to the ROI and the rarity of the label (see Materials and methods for exact details). Regarding this latter point, neurons that were less frequently labeled by human annotation received higher weight so that the network could potentially learn how to classify these neurons from fewer labeled examples.
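
As a concrete illustration, a pixel-weighted cross-entropy loss of this form can be assembled as in the minimal sketch below, assuming PyTorch tensors. The tensor shapes and the helper name `pixel_weighted_ce` are illustrative rather than taken from our implementation; the exact weighting scheme is described in Materials and methods.

```python
import torch.nn.functional as F

def pixel_weighted_ce(logits, target, pixel_weights):
    """Pixel-weighted cross-entropy (illustrative sketch).

    logits: (N, C, D, H, W) raw class scores from the U-Net.
    target: (N, D, H, W) integer class label per pixel (human annotation).
    pixel_weights: (N, D, H, W) relative weight per pixel (near zero for
        background and unlabeled ROIs; higher for confidently labeled and
        rarely labeled neuron classes).
    """
    per_pixel = F.cross_entropy(logits, target, reduction="none")  # (N, D, H, W)
    return (per_pixel * pixel_weights).sum() / pixel_weights.sum().clamp_min(1e-8)
```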

We trained the network over 300 epochs using a training set of 81 annotated images and a validation set of 10 images (Figure 4C). Because the size of the training set was fairly small, we augmented the training data using both standard image augmentations (rotation, flipping, adding Gaussian noise, etc.) and a custom augmentation that warped the images to approximate worm head bending (see Materials and methods). Overall, the goal was for this network to be able to annotate neural identities in worms in any posture provided that they were roughly oriented length-wise in a 284 x 120 x 64 (x, y, z) image. Because this Automatic Cell Labeling Network (AutoCellLabeler) labels individual pixels, it was necessary to convert these pixel-wise classifications into ROI-level classifications. AutoCellLabeler outputs its confidence in its label for each pixel, and we noted that the network’s confidence for a given ROI was highest near the center of the ROI (Figure 4D). Therefore, to determine ROI-level labels, we took a weighted average of the pixel-wise labels within an ROI, weighting the center pixels more strongly. The overall confidence of these pixel scores was also used to compute an ROI-level confidence score, reflecting the network’s confidence that it labeled the ROI correctly. Finally, after all ROIs were assigned a label, heuristics were applied to identify and delete problematic labels. Labels were deleted if (1) the network already labeled another ROI as that label with higher confidence; (2) the label was present too infrequently in the network’s training data; (3) the network labeled that ROI as something other than a neuron (e.g. a gut granule or glial cell, which we included as valid labels in the training data); or (4) the network confidently predicted different parts of the ROI as different labels (see Materials and methods for details).
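
To make the ROI-level label extraction concrete, the sketch below shows one way to convert a pixel-level probability map into a single label and confidence score per ROI while up-weighting central pixels. The function and variable names are hypothetical, and the distance-to-boundary weighting shown here is just one plausible choice for emphasizing central pixels; the exact procedure and deletion heuristics are described in Materials and methods.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def roi_label(prob_map, roi_mask):
    """prob_map: (C, D, H, W) per-class softmax output from AutoCellLabeler.
    roi_mask: (D, H, W) boolean mask of a single segmented ROI."""
    # Pixels far from the ROI boundary get higher weight (distance-to-edge weighting).
    center_weight = distance_transform_edt(roi_mask)
    w = center_weight[roi_mask]                    # (n_pixels,)
    probs = prob_map[:, roi_mask]                  # (C, n_pixels)
    roi_probs = (probs * w).sum(axis=1) / w.sum()  # center-weighted average over pixels
    label = int(np.argmax(roi_probs))
    confidence = float(roi_probs[label])           # ROI-level confidence score
    return label, confidence
```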

We evaluated the performance of the network on 11 separate datasets that were reserved for testing. We assessed the accuracy of AutoCellLabeler on the subset of ROIs with high-confidence human labels (subjective confidence scores of 4 or 5, on a scale from 1 to 5). On these neurons, average network confidence was 96.8% and its accuracy was 97.1%. We furthermore observed that the network was more confident in its correct labels (average confidence 97.3%) than its incorrect labels (average confidence 80.7%; Figure 4E). More generally, AutoCellLabeler confidence was highly correlated with its accuracy (Figure 4F shows a breakdown of cell labeling at different confidences, with an inset indicating accuracy). Indeed, excluding the neurons where the network assigned low (<75%) confidence increased its accuracy to 98.1% (Figure 4—figure supplement 1A displays the full accuracy-recall tradeoff curve). With this confidence cutoff, AutoCellLabeler still assigned a label to 90.6% of all the ROIs that had high-confidence human labels, so we chose to delete the low-confidence (<75%) labels altogether (see Figure 4—figure supplement 1A for the rationale behind the 75% cutoff value).
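
The 75% cutoff reflects a tradeoff between label accuracy and the fraction of ROIs that retain a label (Figure 4—figure supplement 1A). A minimal sketch of how such an accuracy-versus-coverage curve can be computed is shown below; the array names `conf` and `correct` are hypothetical placeholders for the per-ROI network confidence and per-ROI agreement with high-confidence human labels.

```python
import numpy as np

def accuracy_vs_coverage(conf, correct, thresholds=np.linspace(0.5, 0.99, 50)):
    """conf: per-ROI network confidence; correct: per-ROI agreement (0/1) with
    high-confidence human labels. Returns (threshold, accuracy, coverage) rows."""
    rows = []
    for t in thresholds:
        kept = conf > t
        accuracy = correct[kept].mean() if kept.any() else float("nan")
        coverage = kept.mean()  # fraction of ROIs that retain a label at this cutoff
        rows.append((t, accuracy, coverage))
    return rows
```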

We also examined model performance on data where humans had either low confidence or did not assign a neuron label. In these cases, it was harder to estimate the ground truth. Overall, model confidence was much lower for neurons that humans labeled with low confidence (87.3%) or did not assign a label (81.3%). The concurrence of AutoCellLabeler with low-confidence human labels was also lower (84.1%; we note that this is not really a measure of accuracy since these ‘ground-truth’ labels had low confidence). Indeed, overall, the network’s concurrence with human labels scaled with the confidence of the human label (Figure 4G).

We carefully examined the subset of ROIs where the network had high confidence (>75%), but humans had either low confidence or entered no label at all. This was quite a large set of ROIs: AutoCellLabeler identified significantly more high-confidence neurons (119/animal) than the original human labelers (83/animal), and this could conceivably reflect a highly accurate pool of network-generated labels exceeding human performance. To determine whether this was the case, we obtained new labels from different human labelers for a random subset of these neurons. Whereas some of these ROIs still received low-confidence labels, others were now labeled with high confidence (20.9% of this group of ROIs). The new human labelers also labeled neurons that were originally labeled with high confidence, so that we could compare the network’s performance on relabeled data where the original data was unlabeled, low confidence, or high confidence. AutoCellLabeler’s performance on all three groups was similar (88%, 86.1%, and 92.1%, respectively), which was comparable to the accuracy of humans relabeling data relative to the original high-confidence labels (92.3%; Figure 4H). The slightly lower accuracy on these re-labeled data likely reflects the fact that the human labeling of the original training, validation, and testing data was highly vetted and thoroughly double-checked, whereas the re-labeling performed for this analysis was done in a single pass. Overall, these analyses indicate that the high-confidence network labels (119/animal) have similar accuracy regardless of whether the original data had been labeled by humans as unlabelable, low confidence, or high confidence. We note that this explains a subset of the cases where human low-confidence labels were not in agreement with network labels (Figure 4G). Taken together, these observations indicate that AutoCellLabeler can confidently label more neurons per dataset than individual human labelers.

We also split out model performance by cell type. This largely revealed similar trends. Model labeling accuracy and confidence were variable among the neuron types, with the highest accuracy and confidence for cell types with more frequent and higher-confidence human labels (Figure 4K). For ROIs with high-confidence network and human labels, we generated a confusion matrix to determine whether AutoCellLabeler’s mistakes followed recurring trends (Figure 4—figure supplement 1B). While mistakes of this type were very rare, we observed that the ones that occurred could mostly be categorized as either mislabeling a gut granule as the neuron RMG, or mislabeling the dorsal/ventral subclass of the neurons IL1 and IL2 (e.g. mislabeling IL2D as IL2). Together, these categories accounted for 50% of all of AutoCellLabeler’s mistakes. We also observed that across cell types, AutoCellLabeler’s confidence was highly correlated with human confidence (Figure 4—figure supplement 1C), suggesting that the main limitations on model accuracy stem from limited human labeling accuracy and confidence.
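
As an illustration, recurring error modes of this kind can be surfaced by tallying disagreements between paired high-confidence human and network labels, as in the minimal sketch below; the function name and inputs are hypothetical.

```python
from collections import Counter

def disagreement_counts(human_labels, network_labels):
    """Tally (human label, network label) pairs where the two disagree."""
    pairs = Counter(
        (h, n) for h, n in zip(human_labels, network_labels) if h != n
    )
    return pairs.most_common()  # most frequent confusions first, e.g. ('IL2D', 'IL2')
```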

To provide better insights into which network features were critical for its performance, we trained additional networks lacking some of AutoCellLabeler’s key features. To evaluate these networks, we considered both the number of high-confidence labels assigned by each network and the accuracy of those labels measured against high-confidence human labels. Surprisingly, a network that was trained with only standard image augmentations (i.e. lacking the custom augmentation that bends the images to approximate a worm head bend) had similar performance (Figure 4I). However, a network that was trained without a pixel-weighting scheme (i.e. where all pixels were weighted equally) provided far fewer high-confidence labels. This suggests that devising strategies for pixel weighting is critical for model performance, whereas our custom augmentation was not essential. Interestingly, all trained networks had similar accuracy on their high-confidence labels (Figure 4J), suggesting that the network architecture in all cases is able to accurately assess its confidence.