Testing DeepSomatic’s ability to spot cancer-related variants
We trained DeepSomatic on three of the breast cancer genomes and the two lung cancer genomes in the CASTLE reference dataset. We then tested DeepSomatic’s performance in several ways, including on the single breast cancer genome that was not included in its training data, and on chromosome 1 from each sample, which we also excluded from the training.
Results show that DeepSomatic models developed for each of the three major sequencing platforms performed better than other methods, identifying more tumor variants with higher accuracy. The tools used for comparison on short-read sequencing data were SomaticSniper, MuTect2 and Strelka2 (with SomaticSniper specifically for single nucleotide variants, or SNVs). For long-read sequencing data we compared against ClairS, a deep learning model trained on synthetic data.
In our tests DeepSomatic identified 329,011 somatic variants across the six reference cell lines and a seventh preserved sample. DeepSomatic does particularly well at identifying cancer variations that involve insertions and deletions (“Indels”) of genetic code. For these types of variants, DeepSomatic substantially increased the F1-score, a balanced measure of how well the model finds true variants in a sample (recall) while not making false positives (precision). On Illumina sequencing data the next-best method scored 80% at identifying Indels, while DeepSomatic scored 90%. On Pacific Biosciences sequencing data, the next-best method scored less than 50% at identifying Indels, and DeepSomatic scored more than 80%.