UK Biobank’s Whole-Genome Sequencing Uncovers 1.5B Variants

The recent Nature publication of the UK Biobank whole-genome sequencing (WGS) dataset represents a landmark in population genomics. Comprising nearly 500,000 genomes and ~1.5 billion variants, this dataset substantially exceeds the resolution of previous large-scale resources, enabling novel insights into the genetic architecture of health and disease.

Illumina’s DRAGEN technology was at the center of this feat. To explore the technical underpinnings of the dataset’s creation and its broader implications, Technology Networks spoke with Rami Mehio, head of Software and Informatics at Illumina. Mehio discussed the innovations that enabled high-precision variant detection in complex genomic regions, the role of Illumina Connected Analytics in large-scale data processing and how platforms are evolving to support multiomics integration and AI-driven discovery.

Molly Coddington (MC):

Senior Writer and Newsroom Team Lead

Technology Networks

Molly Coddington is a Senior Writer and Newsroom Team Lead at Technology Networks. She holds a first-class honors degree in neuroscience. In 2021 Molly was shortlisted for the Women in Journalism Georgina Henry Award.

The UK Biobank WGS dataset published in Nature is one of the largest of its kind. What makes this dataset uniquely valuable compared to previous large-scale genomic resources?

Rami Mehio (RM):

Head of Software and Informatics Development

Illumina

Mehio is the head of Software and Informatics Development at Illumina. He joined Illumina in 2018 as part of the Edico Genome acquisition, where, as head of engineering, he led the development of the DRAGEN BioIT solution.

The UK Biobank WGS dataset stands out as one of the most powerful resources for advancing medical research. Its scale and diversity enable unparalleled opportunities for disease risk prediction, diagnostics and drug target discovery.

Unlike smaller datasets, UK Biobank WGS allows researchers to explore the genetic underpinnings of a wide spectrum of common diseases and uncover complex relationships between genetics, biomarkers and environmental factors. Additionally, the dataset was processed using state-of-the-art variant calling methods, providing high accuracy and consistency – critical for maximizing discovery power and enabling robust comparisons across other biobanks.

MC:

Illumina’s DRAGEN technology identified ~1.5 billion variants – significantly more than previous methods. Could you explain how DRAGEN achieves this increased sensitivity and specificity?

RM:

The version of DRAGEN used to analyze the UK Biobank data – which is closely related to the one that won the PrecisionFDA Truth Challenge V2 in 2020 – represented a major leap forward in variant detection sensitivity.

A key innovation was the introduction of pangenome mapping. This approach allows DRAGEN to accurately align reads in difficult-to-map and highly polymorphic regions of the genome, while reducing ancestry-related reference bias. As a result, DRAGEN identifies tens of thousands more variants per sample compared to traditional pipelines. When scaled to the entire UK Biobank cohort, this translates to approximately 1.5 billion variants.

It’s worth noting that while this was groundbreaking five years ago, we’ve since made enormous improvements to the product in terms of accuracy, precision and many other critical metrics. This is likely a great motivator to re-analyze the dataset with DRAGEN 4.4 to uncover even more insights from this resource. Additional details on the pangenome methodology can be found here.

MC:

High-throughput sequencing analysis often involves trade-offs between accuracy, speed and cost. How does DRAGEN balance these factors, particularly at this unprecedented scale?

RM:

DRAGEN is engineered to deliver high accuracy without compromising speed or cost efficiency – even at population scale.

At the sample level, it combines pangenome-based mapping and alignment with machine learning-driven variant calling, accelerated by field-programmable gate array (FPGA) hardware. This architecture achieves state-of-the-art sensitivity and specificity while providing an order-of-magnitude improvement in compute efficiency, cost and speed. For joint calling, DRAGEN aggregates variants and genotypes in a way that preserves per-sample accuracy, regardless of cohort size, with near-linear scalability as sample numbers grow. Post-aggregation, machine learning models score variant quality based on sample-level distributions, ensuring high genotyping rates and consistency across trios and monozygotic twins.

All of this combines to deliver industry-leading precision alongside the throughput and cost-effectiveness required for projects like UK Biobank.
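The trio and twin consistency mentioned in this answer can be illustrated with two generic quality-control metrics. The sketch below is not DRAGEN's scoring method. It assumes biallelic, diploid autosomal sites with genotypes coded as the number of alternate alleles (0, 1 or 2), ignores de novo mutations and missing calls, and uses hypothetical function names and toy data.

```python
def transmissible(genotype):
    """Alleles (0 = reference, 1 = alternate) a parent with this genotype can pass on."""
    return {0: {0}, 1: {0, 1}, 2: {1}}[genotype]


def is_mendelian_consistent(child, mother, father):
    """True if the child's genotype is one allele from each parent."""
    return any(
        m + f == child
        for m in transmissible(mother)
        for f in transmissible(father)
    )


def mendelian_error_rate(trio_calls):
    """Fraction of (child, mother, father) genotype triples that break inheritance rules."""
    trio_calls = list(trio_calls)
    errors = sum(not is_mendelian_consistent(c, m, f) for c, m, f in trio_calls)
    return errors / len(trio_calls)


def twin_concordance(twin_a, twin_b):
    """Fraction of sites where two monozygotic twins carry the same genotype."""
    matches = sum(a == b for a, b in zip(twin_a, twin_b))
    return matches / len(twin_a)


# Toy data (hypothetical): five trio sites and four twin sites.
toy_trio_sites = [(1, 0, 1), (2, 1, 1), (0, 0, 0), (2, 0, 1), (1, 2, 2)]
toy_error_rate = mendelian_error_rate(toy_trio_sites)
toy_concordance = twin_concordance([0, 1, 2, 1], [0, 1, 2, 2])
print(f"Mendelian error rate: {toy_error_rate:.2f}")
print(f"Twin concordance: {toy_concordance:.2f}")
```

In the toy data, the last two trio sites are inconsistent: a child cannot be homozygous alternate when one parent is homozygous reference, nor heterozygous when both parents are homozygous alternate. Low error rates and high twin concordance across a cohort are commonly read as signs of consistent genotyping.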

MC:

One of the promises of WGS is detecting rare and structural variants. What new opportunities does this dataset open up for rare disease research and population-level variant interpretation?

RM:

As the cohort expands from thousands to nearly 500,000 genomes, the number of common variants plateaus. But rare variants, including ultra-rare alleles and singletons, continue to rise.

In the UK Biobank dataset, 47% of variants are singletons, 82% occur at frequencies below 1 in 100,000 and 57% are novel to the dbSNP database. Combined with UK Biobank’s rich phenotypic data, this unprecedented collection enables discovery of gene–phenotype associations that are otherwise missed by common-variant genome-wide association studies.

Furthermore, population-scale sequencing provides highly accurate allele frequency estimates, a critical factor in distinguishing benign from pathogenic variants, an essential step in clinical genomics. These insights underscore the transformative role of large-scale sequencing in advancing rare disease research and improving variant interpretation.
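To make the terms in this answer concrete, here is a minimal sketch of how singleton and rare-variant fractions can be derived from allele counts. It is an illustration on toy data, not the UK Biobank pipeline. It assumes biallelic, diploid sites coded as alternate-allele counts (0, 1 or 2, with None for a missing call), and the function names, variant IDs and threshold are hypothetical.

```python
def allele_stats(genotypes):
    """Return {variant_id: (alt_allele_count, called_allele_number)}.

    genotypes maps a variant ID to per-sample alternate-allele counts
    (0, 1 or 2 for a diploid, biallelic site; None for a missing call).
    """
    stats = {}
    for variant_id, calls in genotypes.items():
        called = [g for g in calls if g is not None]
        stats[variant_id] = (sum(called), 2 * len(called))
    return stats


def summarize_cohort(genotypes, rare_af_threshold):
    """Summarize the variants that are actually observed in the cohort."""
    observed = {
        v: (ac, an) for v, (ac, an) in allele_stats(genotypes).items() if ac > 0
    }
    if not observed:
        raise ValueError("no alternate alleles observed in this cohort")
    frequencies = {v: ac / an for v, (ac, an) in observed.items()}
    # A singleton is an alternate allele seen exactly once in the cohort.
    n_singletons = sum(1 for ac, _ in observed.values() if ac == 1)
    n_rare = sum(1 for af in frequencies.values() if af < rare_af_threshold)
    return {
        "n_variants": len(observed),
        "singleton_fraction": n_singletons / len(observed),
        "rare_fraction": n_rare / len(observed),
        "allele_frequencies": frequencies,
    }


# Toy cohort of five samples (hypothetical data).
toy_genotypes = {
    "var_common": [1, 2, 1, 0, 1],
    "var_singleton": [0, 0, 1, 0, 0],
    "var_doubleton": [0, 1, 0, None, 1],
    "var_absent": [0, 0, 0, 0, 0],
}
toy_summary = summarize_cohort(toy_genotypes, rare_af_threshold=0.2)
print(toy_summary)
```

The denominator of each frequency is the number of called alleles, which is why larger cohorts resolve smaller frequencies: a single observation among five samples can only be estimated at 1 in 10, whereas a cohort of roughly 500,000 genomes can estimate frequencies near 1 in a million.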

MC:

With such a large dataset, data handling, storage and accessibility become critical. What informatics approaches or infrastructure have been put in place to ensure researchers worldwide can use the resource effectively?

RM:

While the UK Biobank has built a secure, trusted research environment for general research access, we were thrilled to have the opportunity to partner with them to leverage Illumina Connected Analytics (ICA) for its large-scale data processing.

ICA is a modern, cloud-based genomics platform designed for population-scale sequencing and multi-biobank meta-analysis. It provides secure data access and a highly scalable compute environment that supports high concurrency and low-latency job scheduling. The platform includes both DRAGEN secondary analysis workflows and population genomics pipelines, which accelerate the transformation of raw genomic data into meaningful insights. For example, the UK Biobank 500K joint-calling project ran on ICA, completing 934,000 analyses over 8.6 million CPU hours in just 86 days.
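The figures quoted for the joint-calling project imply some back-of-the-envelope averages. The short calculation below uses only the three numbers stated above. It assumes that "CPU hours" means hours of a single CPU and that the load was spread evenly over the 86 days, which the interview does not state, so the results are rough averages rather than reported measurements.

```python
# Figures quoted in the interview.
ANALYSES = 934_000
CPU_HOURS = 8_600_000
DAYS = 86


def throughput_summary(analyses, cpu_hours, days):
    """Derive average rates from project totals, assuming an even load."""
    wall_clock_hours = days * 24
    return {
        "cpu_hours_per_analysis": cpu_hours / analyses,
        "analyses_per_day": analyses / days,
        # Average number of CPUs that must be busy at once to
        # consume cpu_hours within the wall-clock window.
        "mean_concurrent_cpus": cpu_hours / wall_clock_hours,
    }


summary = throughput_summary(ANALYSES, CPU_HOURS, DAYS)
for name, value in summary.items():
    print(f"{name}: {value:,.2f}")
```

Under those assumptions, this works out to roughly 9.2 CPU hours per analysis, about 10,900 analyses per day and around 4,200 CPUs busy on average. Peak concurrency would have been higher if the workload was bursty.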

More recently, ICA has enabled joint analyses across multiple biobanks involving over 1.2 million genomes, demonstrating its scalability and global impact.

MC:

Looking ahead, how might Illumina’s informatics platforms evolve to handle not just bigger datasets, but also multiomics integration (e.g., transcriptomics, proteomics and epigenomics) alongside genomics?

RM:

Beyond scaling for larger datasets, Illumina’s informatics platforms are expanding to support multiomics analysis through advanced pipelines for quality control, clustering and association testing. This includes the availability of industry-standard tools like CellBender and Seurat for QC, clustering and differential expression analysis. The platforms also support differential methylation analysis and epigenome-wide association studies, enabling researchers to explore complex regulatory mechanisms.

Looking forward, large multiomics datasets will play a critical role in training foundational models in biology. Illumina is investing in AI-driven capabilities to support this evolution, including hosting and developing foundational models as they become available. These advancements position the platform to accelerate discovery across genomics, transcriptomics, epigenomics and beyond.
