{"id":217506,"date":"2025-10-16T10:29:16","date_gmt":"2025-10-16T10:29:16","guid":{"rendered":"https:\/\/www.newsbeep.com\/au\/217506\/"},"modified":"2025-10-16T10:29:16","modified_gmt":"2025-10-16T10:29:16","slug":"from-genotype-to-phenotype-with-1086-near-telomere-to-telomere-yeast-genomes","status":"publish","type":"post","link":"https:\/\/www.newsbeep.com\/au\/217506\/","title":{"rendered":"From genotype to phenotype with 1,086 near telomere-to-telomere yeast genomes"},"content":{"rendered":"<p>Strain culture and DNA extraction<\/p>\n<p>We used a collection of S. cerevisiae isolates that were previously sequenced using short-read sequencing<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 19\" title=\"Peter, J. et al. Genome evolution across 1,011 Saccharomyces cerevisiae isolates. Nature 556, 339&#x2013;344 (2018).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-09637-0#ref-CR19\" id=\"ref-link-section-d889513e1835\" rel=\"nofollow noopener\" target=\"_blank\">19<\/a>. For each isolate, we obtained single colonies from frozen stock on solid YPD (1% yeast extract, 2% peptone, and 2% glucose) and cultured one colony per strain in 25\u2009ml of liquid YPD at 30\u2009\u00b0C under shaking (120\u2009rpm). After the culture reached saturation (approximately 1.5 days), the nuclear DNA was extracted from the cells using either a previously described protocol<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 69\" title=\"Tsouris, A., Brach, G., Friedrich, A., Hou, J. &amp; Schacherer, J. Diallel panel reveals a significant impact of low-frequency genetic variants on gene expression variation in yeast. Mol. Syst. Biol. 
20, 362&#x2013;373 (2024).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-09637-0#ref-CR69\" id=\"ref-link-section-d889513e1839\" rel=\"nofollow noopener\" target=\"_blank\">69<\/a> or a Monarch HMW DNA Extraction Kit (New England Biolabs). The cells from the saturated culture were treated for 2\u2009h with zymolyase (1,000\u2009U\u2009ml\u22121) in 1\u2009M sorbitol to produce spheroplasts. The spheroplasts were then processed with the Monarch HMW DNA Extraction Kit. Samples with a DNA concentration higher than 30\u2009ng\u2009\u00b5l\u22121 were retained for DNA sequencing.<\/p>\n<p>Sequencing data<\/p>\n<p>Long-read sequencing data were obtained using Oxford Nanopore sequencing technology. The library was prepared according to the following protocol, using the Oxford Nanopore SQK-LSK109 and SQK-LSK114 kits. Genomic DNA fragments were repaired and 3\u2032-adenylated with the NEBNext FFPE DNA Repair Mix and the NEBNext Ultra II End Repair\/dA-Tailing Module (New England Biolabs). Sequencing adapters provided by Oxford Nanopore Technologies (ONT) were then ligated using the NEBNext Quick Ligation Module (NEB). After purification with AMPure XP beads (Beckman Coulter), the library was mixed with the sequencing buffer (ONT) and the loading beads (ONT) and loaded on PromethION R9.4.1 and R10.4.1 flowcells. Basecalling was performed with guppy 5.0.16 (<a href=\"https:\/\/nanoporetech.com\" rel=\"nofollow noopener\" target=\"_blank\">https:\/\/nanoporetech.com<\/a>). To confirm the correspondence of novel long-read sequences with previously generated short-read sequences, we compared SNPs inferred from both types of data. Long and short reads were mapped independently to the reference genome using minimap2 v.2.24\u00a0(ref.\u00a0<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 70\" title=\"Li, H. 
Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094&#x2013;3100 (2018).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-09637-0#ref-CR70\" id=\"ref-link-section-d889513e1862\" rel=\"nofollow noopener\" target=\"_blank\">70<\/a>) and bwa-mem2 v.2.2.1\u00a0(ref.\u00a0<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 71\" title=\"Vasimuddin, Md., Misra, S., Li, H. &amp; Aluru, S. Efficient architecture-aware acceleration of BWA-MEM for multicore systems. In 2019 IEEE International Parallel and Distributed Processing Symposium 314&#x2013;324 (2019).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-09637-0#ref-CR71\" id=\"ref-link-section-d889513e1866\" rel=\"nofollow noopener\" target=\"_blank\">71<\/a>), respectively, and SNPs were inferred with longshot v.0.4.5\u00a0(ref.\u00a0<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 72\" title=\"Edge, P. &amp; Bansal, V. Longshot enables accurate variant calling in diploid genomes from single-molecule long read sequencing. Nat. Commun. 10, 4660 (2019).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-09637-0#ref-CR72\" id=\"ref-link-section-d889513e1870\" rel=\"nofollow noopener\" target=\"_blank\">72<\/a>) and gatk v.4.5.0.0\u00a0(ref.\u00a0<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 73\" title=\"Poplin, R. et al. Scaling accurate genetic variant discovery to tens of thousands of samples. Preprint at bioRxiv &#010;                https:\/\/doi.org\/10.1101\/201178&#010;                &#010;               (2018).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-09637-0#ref-CR73\" id=\"ref-link-section-d889513e1874\" rel=\"nofollow noopener\" target=\"_blank\">73<\/a>). 
The reference genome version R64-3-1 was downloaded as a fasta file from the Saccharomyces genome database<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 74\" title=\"Engel, S. R. et al. Saccharomyces Genome Database: advances in genome annotation, expanded biochemical pathways, and other key enhancements. Genetics 229, iyae185 (2025).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-09637-0#ref-CR74\" id=\"ref-link-section-d889513e1882\" rel=\"nofollow noopener\" target=\"_blank\">74<\/a> website (<a href=\"https:\/\/www.yeastgenome.org\" rel=\"nofollow noopener\" target=\"_blank\">https:\/\/www.yeastgenome.org<\/a>). We computed the pairwise distance between all samples based on short reads and long reads SNPs using plink v.1.9\u00a0(ref.\u00a0<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 75\" title=\"Purcell, S. et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J. Hum. Genet. 81, 559&#x2013;575 (2007).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-09637-0#ref-CR75\" id=\"ref-link-section-d889513e1893\" rel=\"nofollow noopener\" target=\"_blank\">75<\/a>). Cases with unclear correspondence between short and long reads were discarded.<\/p>\n<p>Reads phasing<\/p>\n<p>Sequencing data from non-polyploid heterozygous samples with coverage higher than 20x were phased to obtain one read set for each haplotype. Long reads were mapped on the reference genome using minimap2 v.2.24\u00a0(ref.\u00a0<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 70\" title=\"Li, H. Minimap2: pairwise alignment for nucleotide sequences. 
Bioinformatics 34, 3094&#x2013;3100 (2018).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-09637-0#ref-CR70\" id=\"ref-link-section-d889513e1905\" rel=\"nofollow noopener\" target=\"_blank\">70<\/a>) with the option -ax map-ont. SNPs were called with longshot v.0.4.5\u00a0(ref.\u00a0<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 72\" title=\"Edge, P. &amp; Bansal, V. Longshot enables accurate variant calling in diploid genomes from single-molecule long read sequencing. Nat. Commun. 10, 4660 (2019).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-09637-0#ref-CR72\" id=\"ref-link-section-d889513e1909\" rel=\"nofollow noopener\" target=\"_blank\">72<\/a>) with the options --no_haps --min_cov 7 --min_alt_count 7 --min_alt_frac 0.2. Regions of loss of heterozygosity, defined as 50-kb windows containing fewer than 10 SNPs, were detected and removed from the phasing process. SNPs were phased using whatshap phase v.1.4\u00a0(ref.\u00a0<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 76\" title=\"Martin, M. et al. WhatsHap: fast and accurate read-based phasing. Preprint at bioRxiv &#010;                https:\/\/doi.org\/10.1101\/085050&#010;                &#010;               (2016).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-09637-0#ref-CR76\" id=\"ref-link-section-d889513e1913\" rel=\"nofollow noopener\" target=\"_blank\">76<\/a>), and each sequencing read was tagged HP1, HP2 or unassigned. Unassigned reads were downsampled to 50% coverage with filtlong v.0.2.1 (<a href=\"https:\/\/github.com\/rrwick\/Filtlong\" rel=\"nofollow noopener\" target=\"_blank\">https:\/\/github.com\/rrwick\/Filtlong<\/a>), using the options --min_length 1000 --length_weight 10 --keep_percent 50 --min_mean_q 9, to maintain similar coverage to the phased reads. 
The read set for each haplotype was finally obtained by combining phased reads and unassigned reads.<\/p>\n<p>Genome assembly<\/p>\n<p>The genome assembly pipeline (Supplementary Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-09637-0#MOESM1\" rel=\"nofollow noopener\" target=\"_blank\">2<\/a>) was run with raw sequencing data of 1,027 samples with more than 10x sequencing coverage, in addition to phased sequencing data for 433 non-polyploid heterozygous samples. Sequencing data were systematically downsampled to 30x using filtlong --min_length 1000 --length_weight 10 --target_bases 360000000 --min_mean_q 9 and additionally to 40x when raw coverage was higher than 40x with --target_bases 480000000. Raw and downsampled sequencing reads were assembled with 3 genome assemblers: (1) Necat v.0.0.1_update20200803\u00a0(ref.\u00a0<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 77\" title=\"Chen, Y. et al. Efficient assembly of nanopore reads via highly accurate and intact error correction. Nat. Commun. 12, 60 (2021).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-09637-0#ref-CR77\" id=\"ref-link-section-d889513e1935\" rel=\"nofollow noopener\" target=\"_blank\">77<\/a>); (2) Flye v.2.9\u00a0(ref.\u00a0<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 78\" title=\"Kolmogorov, M., Yuan, J., Lin, Y. &amp; Pevzner, P. A. Assembly of long, error-prone reads using repeat graphs. Nat. Biotechnol. 
37, 540&#x2013;546 (2019).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-09637-0#ref-CR78\" id=\"ref-link-section-d889513e1939\" rel=\"nofollow noopener\" target=\"_blank\">78<\/a>); and (3) SMARTdenovo -c 1\u00a0(ref.\u00a0<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 79\" title=\"Liu, H. et al. SMARTdenovo: a de novo assembler using long noisy reads. Gigabyte 2021, 15 (2021).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-09637-0#ref-CR79\" id=\"ref-link-section-d889513e1943\" rel=\"nofollow noopener\" target=\"_blank\">79<\/a>) using reads cleaned with Necat. Redundancy within each genome assembly was removed by discarding contigs covered on more than 95% by other contigs of the draft assembly. Nuclear contigs were then selected by sequence similarity with a database of S. cerevisiae nuclear chromosomes built from 142 genome assemblies<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 7\" title=\"O&#x2019;Donnell, S. et al. Telomere-to-telomere assemblies of 142 strains characterize the genome structural landscape in Saccharomyces cerevisiae. Nat. Genet. 55, 1390&#x2013;1399 (2023).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-09637-0#ref-CR7\" id=\"ref-link-section-d889513e1951\" rel=\"nofollow noopener\" target=\"_blank\">7<\/a>, discarding chromosomes containing mitochondrial insertions. 
For each sample and each phased haplotype, the best genome assembly was selected with seven criteria chosen to favour completeness and contiguity: (1) the assembly must cover the reference genome over 95% of its length; (2) it must cover at least 80% of each reference chromosome (except for chromosome 1, for which the threshold was lowered to 75% because of its more variable size); (3) it must not cover more than 50% of the mitochondrial genome; (4) it must not contain fused chromosomes, identified as contigs containing multiple centromeres; (5) the lowest number of contigs required to cover 95% of the reference genome is favoured; (6) the lowest total number of contigs is favoured; and finally (7) the largest total length is favoured. The second-best genome assembly, obtained with a different assembler, was also kept for further use in the SV detection pipeline. Genome assemblies were then polished with both long reads using medaka consensus -m r941_prom_sup_g507 1.8.0 (<a href=\"https:\/\/github.com\/nanoporetech\/medaka\" rel=\"nofollow noopener\" target=\"_blank\">https:\/\/github.com\/nanoporetech\/medaka<\/a>) and Illumina short reads using HapoG 1.3.3\u00a0(ref.\u00a0<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 80\" title=\"Aury, J.-M. &amp; Istace, B. Hapo-G, haplotype-aware polishing of genome assemblies with accurate reads. NAR Genomics Bioinformatics 3, lqab034 (2021).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-09637-0#ref-CR80\" id=\"ref-link-section-d889513e1962\" rel=\"nofollow noopener\" target=\"_blank\">80<\/a>). For phased haplotypes, contigs of each haplotype were concatenated to perform the short-read polishing. 
Finally, scaffolding against the reference genome was performed using ragout --solid-scaffolds 2.3.1\u00a0(ref.\u00a0<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 81\" title=\"Kolmogorov, M., Raney, B., Paten, B. &amp; Pham, S. Ragout&#x2014;a reference-assisted assembly tool for bacterial genomes. Bioinformatics 30, i302&#x2013;i309 (2014).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-09637-0#ref-CR81\" id=\"ref-link-section-d889513e1966\" rel=\"nofollow noopener\" target=\"_blank\">81<\/a>). When scaffolding generated fused chromosomes, the non-scaffolded genome assembly was retained. Assembly contigs were named and ordered according to their sequence similarity to reference chromosomes. For strain XTRA_FHL, whose sequencing data were contaminated by a Kluyveromyces marxianus isolate, K. marxianus contigs were manually removed.<\/p>\n<p>Quality assessment and genome annotation<\/p>\n<p>Correctness of genome assemblies was evaluated with Merqury<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 82\" title=\"Rhie, A., Walenz, B. P., Koren, S. &amp; Phillippy, A. M. Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies. Genome Biol. 21, 245 (2020).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-09637-0#ref-CR82\" id=\"ref-link-section-d889513e1985\" rel=\"nofollow noopener\" target=\"_blank\">82<\/a>, and completeness was assessed with miniBUSCO<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 37\" title=\"Huang, N. &amp; Li, H. miniBUSCO: a faster and more accurate reimplementation of BUSCO. 
Preprint at bioRxiv &#010;                https:\/\/doi.org\/10.1101\/2023.06.03.543588&#010;                &#010;               (2023).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-09637-0#ref-CR37\" id=\"ref-link-section-d889513e1989\" rel=\"nofollow noopener\" target=\"_blank\">37<\/a>. Gene prediction and detection of TEs, centromeres and subtelomeric elements were performed through the LRSDAY pipeline v.1.7.0\u00a0(ref.\u00a0<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 83\" title=\"Yue, J.-X. &amp; Liti, G. Long-read sequencing data analysis for yeasts. Nat. Protoc. 13, 1213&#x2013;1231 (2018).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-09637-0#ref-CR83\" id=\"ref-link-section-d889513e1993\" rel=\"nofollow noopener\" target=\"_blank\">83<\/a>). Telomeric sequences were identified across all assemblies using Telofinder<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 7\" title=\"O&#x2019;Donnell, S. et al. Telomere-to-telomere assemblies of 142 strains characterize the genome structural landscape in Saccharomyces cerevisiae. Nat. Genet. 55, 1390&#x2013;1399 (2023).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-09637-0#ref-CR7\" id=\"ref-link-section-d889513e1997\" rel=\"nofollow noopener\" target=\"_blank\">7<\/a>.<\/p>\n<p>SV detection<\/p>\n<p>SVs were detected by individually comparing the generated assemblies with the reference genome (SGD R64 genome assembly of strain S288c, GenBank ID: <a href=\"https:\/\/www.ncbi.nlm.nih.gov\/assembly\/GCA_000146045.2\" rel=\"nofollow noopener\" target=\"_blank\">GCA_000146045.2<\/a>) using MUM&amp;Co v.3.8\u00a0(ref.\u00a0<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 84\" title=\"O&#x2019;Donnell, S. &amp; Fischer, G. 
MUM&amp;Co: accurate detection of all SV types through whole-genome alignment. Bioinformatics 36, 3242&#x2013;3243 (2020).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-09637-0#ref-CR84\" id=\"ref-link-section-d889513e2016\" rel=\"nofollow noopener\" target=\"_blank\">84<\/a>) with the -g 12000000 option. SV calling was run on 1,482 genome assemblies from 1,086 isolates (including 396 isolates with phased genome assemblies). The pipeline uses whole-genome alignments obtained via the MUMmer4 software<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 85\" title=\"Mar&#xE7;ais, G. et al. MUMmer4: a fast and versatile genome alignment system. PLoS Comput. Biol. 14, e1005944 (2018).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-09637-0#ref-CR85\" id=\"ref-link-section-d889513e2020\" rel=\"nofollow noopener\" target=\"_blank\">85<\/a> to detect insertions, deletions, duplications, contractions, inversions, and reciprocal translocations exhibiting a size larger than 50\u2009bp. To be validated, SVs had to be detected in at least two independent assemblies. To avoid removing singleton SVs\u2014that is, those present in a single haplotype\u2014we considered an additional set of 1,329 \u2018second-best assemblies\u2019 for 959 isolates of our collection, obtained from an alternative assembler and that met the completeness quality threshold defined. A total of 2,811 single sample VCF files were obtained and merged into a single multisample VCF file. First, insertions, deletions, duplications, contractions, and inversions were merged using Jasmine v.1.1.5\u00a0(ref.\u00a0<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 86\" title=\"Kirsche, M. et al. Jasmine and Iris: population-scale structural variant comparison and analysis. Nat. 
Methods 20, 408&#x2013;417 (2023).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-09637-0#ref-CR86\" id=\"ref-link-section-d889513e2024\" rel=\"nofollow noopener\" target=\"_blank\">86<\/a>), which is based on an SV proximity graph that considers SV breakpoint position and length. Given that Jasmine\u2019s algorithm does not consider two breakpoints for a single SV, a custom merging strategy was developed for translocations, as these involve two distinct breakpoints in the genome. This strategy is based on the construction of a translocation graph, linking pairs of translocations with both breakpoints within a 10\u2009kb region. Each connected component of this graph was treated as a single SV and appended to Jasmine\u2019s output.<\/p>\n<p>To minimize false positives, we retained only SVs detected in at least two genome assemblies, coming from different isolates, haplotypes, or genome assemblers. We then discarded the second-best assemblies from the VCF file. Finally, phased haplotypes from the same isolate were merged into phased heterozygous genotypes, resulting in the final SV VCF file containing 1,086 samples. For further analyses, we classified insertions and deletions as presence\u2013absence variants (PAVs), and duplications and contractions as CNVs, to avoid reference-biased terminology. Therefore, PAVs and indels are distinguished only by their size: larger or smaller than 50\u2009bp, respectively.<\/p>\n<p>Detection of Ty-related SVs<\/p>\n<p>The sequences of PAVs, CNVs and inversions were aligned to a Ty retrotransposon database using blast v.2.12.0\u00a0(ref.\u00a0<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 87\" title=\"Camacho, C. et al. BLAST+: architecture and applications. 
BMC Bioinformatics 10, 421 (2009).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-09637-0#ref-CR87\" id=\"ref-link-section-d889513e2039\" rel=\"nofollow noopener\" target=\"_blank\">87<\/a>) with the -dust no -perc_identity 95 options. The database was constructed from the sequences of the 48 Ty elements present in the reference genome in addition to the 4 solo LTRs sequences. SVs with more than 50% of their length covered were defined as Ty-related SVs.<\/p>\n<p>Structural diversity<\/p>\n<p>To quantify structural diversity, we adapted the classical formula for pairwise nucleotide diversity<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 88\" title=\"Nei, M. &amp; Li, W. H. Mathematical model for studying genetic variation in terms of restriction endonucleases. Proc. Natl Acad. Sci. USA 76, 5269&#x2013;5273 (1979).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-09637-0#ref-CR88\" id=\"ref-link-section-d889513e2051\" rel=\"nofollow noopener\" target=\"_blank\">88<\/a> to account for SVs. We defined the structural diversity \\({\\pi }_{{\\rm{SV}}}\\) as the average number of structural differences per site between two sequences within a population:<\/p>\n<p>$${\\pi }_{{\\rm{SV}}}=\\frac{n}{n\\,+\\,1}\\sum _{ij}{x}_{i}{x}_{j}{\\pi }_{ij}=\\frac{n}{n\\,+\\,1}\\mathop{\\sum }\\limits_{i=2}^{n}\\mathop{\\sum }\\limits_{j=1}^{i-1}2{x}_{i}{x}_{j}{\\pi }_{ij}$$<\/p>\n<p>where \\({x}_{i}\\) and \\({x}_{j}\\) are the frequencies of haplotypes \\(i\\) and \\(j\\), \\({\\pi }_{{ij}}\\) is the number of SV differences between the haplotypes i and j and \\(n\\) is the total number of haplotypes. Each SV, including PAVs, CNVs, inversions and translocations, was treated as a discrete event, regardless of its size. 
An SV was considered present in a window if it overlapped the window by at least 1\u2009bp for PAVs, CNVs, and inversions, or if a translocation breakpoint fell within the window.<\/p>\n<p>We computed \\({\\pi }_{{\\rm{SV}}}\\) for each type of SV individually on 10-kb sliding windows (1\u2009kb step). Outlier regions were defined as regions with \\({\\pi }_{{\\rm{SV}}}\\) greater than the third quartile plus 5 times the IQR for PAVs and translocations, plus 10 times the IQR for inversions, and plus 20 times the IQR for CNVs, to account for baseline variability. To detect regions associated with specific clades, we tested for the over-representation of SVs located in the region of interest in each clade using a two-sided Fisher\u2019s exact test with FDR correction.<\/p>\n<p>SNP and indel detection<\/p>\n<p>SNPs and indels were detected for the 1,086 isolates based on the alignment of paired-end Illumina reads<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 7\" title=\"O&#x2019;Donnell, S. et al. Telomere-to-telomere assemblies of 142 strains characterize the genome structural landscape in Saccharomyces cerevisiae. Nat. Genet. 55, 1390&#x2013;1399 (2023).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-09637-0#ref-CR7\" id=\"ref-link-section-d889513e2523\" rel=\"nofollow noopener\" target=\"_blank\">7<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 19\" title=\"Peter, J. et al. Genome evolution across 1,011 Saccharomyces cerevisiae isolates. 
Nature 556, 339&#x2013;344 (2018).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-09637-0#ref-CR19\" id=\"ref-link-section-d889513e2526\" rel=\"nofollow noopener\" target=\"_blank\">19<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 36\" title=\"Lee, T. J. et al. Extensive sampling of Saccharomyces cerevisiae in Taiwan reveals ecology and evolution of predomesticated lineages. Genome Res. 32, 864&#x2013;877 (2022).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-09637-0#ref-CR36\" id=\"ref-link-section-d889513e2529\" rel=\"nofollow noopener\" target=\"_blank\">36<\/a> to the reference genome. The reads were mapped to the reference genome using bwa-mem2 mem v.2.2.1\u00a0(ref.\u00a0<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 71\" title=\"Vasimuddin, Md., Misra, S., Li, H. &amp; Aluru, S. Efficient architecture-aware acceleration of BWA-MEM for multicore systems. In 2019 IEEE International Parallel and Distributed Processing Symposium 314&#x2013;324 (2019).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-09637-0#ref-CR71\" id=\"ref-link-section-d889513e2533\" rel=\"nofollow noopener\" target=\"_blank\">71<\/a>) with default parameters and samtools sort v.1.15.1\u00a0(ref.\u00a0<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 89\" title=\"Danecek, P. et al. Twelve years of SAMtools and BCFtools. Gigascience 10, giab008 (2021).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-09637-0#ref-CR89\" id=\"ref-link-section-d889513e2537\" rel=\"nofollow noopener\" target=\"_blank\">89<\/a>). 
The HaplotypeCaller command from gatk v.4.2.3.0\u00a0(ref.\u00a0<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 73\" title=\"Poplin, R. et al. Scaling accurate genetic variant discovery to tens of thousands of samples. Preprint at bioRxiv &#010;                https:\/\/doi.org\/10.1101\/201178&#010;                &#010;               (2018).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-09637-0#ref-CR73\" id=\"ref-link-section-d889513e2541\" rel=\"nofollow noopener\" target=\"_blank\">73<\/a>) was used with option --emit-ref-confidence GVCF to generate single sample GVCF files. These files were then gathered into a single multisample vcf file using commands GenomicsDBImport and GenotypeGVCFs --include-non-variant-sites, following gatk\u2019s germline short variant discovery workflow (<a href=\"https:\/\/gatk.broadinstitute.org\/hc\/en-us\/articles\/360035535932-Germline-short-variant-discovery-SNPs-Indels\" rel=\"nofollow noopener\" target=\"_blank\">https:\/\/gatk.broadinstitute.org\/hc\/en-us\/articles\/360035535932-Germline-short-variant-discovery-SNPs-Indels<\/a>). Low-quality genotypes (DP\u2009&lt;\u200910 and GQ\u2009&lt;\u200920) were set to missing using bcftools v.1.18.1\u00a0(ref.\u00a0<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 89\" title=\"Danecek, P. et al. Twelve years of SAMtools and BCFtools. Gigascience 10, giab008 (2021).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-09637-0#ref-CR89\" id=\"ref-link-section-d889513e2553\" rel=\"nofollow noopener\" target=\"_blank\">89<\/a>) with the +set-gt command. Sites with fewer than 99% informed genotypes and sites exhibiting excess of heterozygosity (ExcHet &gt; 0.99) were removed. 
Finally, SNPs and indels were separated into two vcf files using bcftools, and complex loci spanning both SNPs and indels were discarded.<\/p>\n<p>Neighbour-joining trees<\/p>\n<p>Neighbour-joining trees were constructed independently from SNPs and SV matrices (1,474,884 and 6,587 markers, respectively, for 1,086 isolates) using the R packages ape<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 90\" title=\"Paradis, E. &amp; Schliep, K. ape 5.0: an environment for modern phylogenetics and evolutionary analyses in R. Bioinformatics 35, 526&#x2013;528 (2019).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-09637-0#ref-CR90\" id=\"ref-link-section-d889513e2568\" rel=\"nofollow noopener\" target=\"_blank\">90<\/a> and SNPRelate<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 91\" title=\"Zheng, X. et al. A high-performance computing toolset for relatedness and principal component analysis of SNP data. Bioinformatics 28, 3326&#x2013;3328 (2012).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-09637-0#ref-CR91\" id=\"ref-link-section-d889513e2572\" rel=\"nofollow noopener\" target=\"_blank\">91<\/a>.<\/p>\n<p>Site frequency spectrum<\/p>\n<p>Annotations of SNPs and indels were obtained using SnpEff v.5.1\u00a0(ref.\u00a0<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 92\" title=\"Cingolani, P. et al. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff. 
Fly 6, 80&#x2013;92 (2012).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-09637-0#ref-CR92\" id=\"ref-link-section-d889513e2585\" rel=\"nofollow noopener\" target=\"_blank\">92<\/a>) with the -no-downstream -no-upstream options.<\/p>\n<p>Comparison of the number of SVs and SNPs per isolate<\/p>\n<p>We used a simple linear regression to model the relationship between the number of SNPs and SVs in each isolate. We further used the R package chisq.posthoc.test<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 93\" title=\"Beasley, T. M. &amp; Schumacker, R. E. Multiple regression approach to analyzing contingency tables: post hoc and planned comparison procedures. J. Exp. Educ. 64, 79&#x2013;93 (1995).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-09637-0#ref-CR93\" id=\"ref-link-section-d889513e2597\" rel=\"nofollow noopener\" target=\"_blank\">93<\/a> to test for clades and super clades that deviated from the linear relationship between the number of SNPs and SVs. The chisq.posthoc.test function was used with a matrix containing the mean number of SNPs and SVs for each clade, with the method\u2009=\u2009\u2018bonferroni\u2019 option.<\/p>\n<p>Gene-based pangenome<\/p>\n<p>The gene-based pangenome was built on the de novo annotated coding sequence (CDS) of genomes with a Merqury quality value (QV) superior to 40, as a lower QV is associated with an increased number of singleton gene families (present in a single isolate), potentially being false positive CDS (Supplementary Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-09637-0#MOESM1\" rel=\"nofollow noopener\" target=\"_blank\">11a<\/a>). After the Merqury QV filtering, we considered 762 genomes corresponding to 651 isolates (Supplementary Fig. 
<a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-09637-0#MOESM1\" rel=\"nofollow noopener\" target=\"_blank\">12<\/a>) for the construction of the gene-based pangenome (Supplementary Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-09637-0#MOESM1\" rel=\"nofollow noopener\" target=\"_blank\">13<\/a>).<\/p>\n<p>First, we transferred the annotation of the reference CDS to the CDS identified de novo in the assemblies with a nucleotide sequence similarity search, using blastn v.2.12.0\u00a0(ref.\u00a0<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 87\" title=\"Camacho, C. et al. BLAST+: architecture and applications. BMC Bioinformatics 10, 421 (2009).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-09637-0#ref-CR87\" id=\"ref-link-section-d889513e2621\" rel=\"nofollow noopener\" target=\"_blank\">87<\/a>) with options -dust no -perc_identity 95 -strand plus. The reference annotations (version R64-4-1) were downloaded from the Saccharomyces genome database<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 74\" title=\"Engel, S. R. et al. Saccharomyces Genome Database: advances in genome annotation, expanded biochemical pathways, and other key enhancements. Genetics 229, iyae185 (2025).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-09637-0#ref-CR74\" id=\"ref-link-section-d889513e2628\" rel=\"nofollow noopener\" target=\"_blank\">74<\/a> website (<a href=\"https:\/\/www.yeastgenome.org\" rel=\"nofollow noopener\" target=\"_blank\">https:\/\/www.yeastgenome.org<\/a>) in gff3 format. 
For each pair of de novo and reference CDS, the annotation was transferred when one of these cases was true: (1) the de novo CDS is covered by the reference CDS on more than 90% of its length; (2) the reference CDS is covered by the de novo one on more than 50% of its length; (3) the reference CDS is covered by the de novo one on more than 30% of its length and both start and end of the reference CDS are covered (that is, the alignment spans the first and last 10\u2009bp of the CDS). De novo CDS covering less than 80% of the reference sequence were annotated as truncated. Additionally, de novo CDS were annotated as truncated when their reference homologue was covered on less than 80% of its length and the alignment did not cover the start and end of the CDS. However, this additional information is purely informative and was not considered for the pangenome construction. We then filtered out all the genes shorter than 100\u2009bp.<\/p>\n<p>Second, we identified gene families with a graph-based strategy. We ran a nucleotide sequence similarity search on the 6,673 reference CDS (larger than 100\u2009bp) and 77,322 de novo CDS for which no annotation was transferred in the previous step, using blastn with the same options mentioned before. A graph was then built using CDS as nodes and sequence homology as edges, using the Python package NetworkX<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 94\" title=\"Hagberg, A., Swart, P. J. &amp; Schult, D. A. Exploring Network Structure, Dynamics, and Function Using NetworkX. Office of Scientific and Technical Information &#010;                https:\/\/www.osti.gov\/biblio\/960616&#010;                &#010;               (U.S. Department of Energy, 2008).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-09637-0#ref-CR94\" id=\"ref-link-section-d889513e2642\" rel=\"nofollow noopener\" target=\"_blank\">94<\/a>. 
Two CDS were considered homologous when an alignment between them covered at least 50% of the length of both or at least 90% of the length of either, or when a reference annotation was transferred. Connected components with a density lower than 0.4 were further split into Louvain communities, and each component or community was then considered as a gene family. For each gene family, a representative sequence was chosen as the reference CDS with the highest degree in the family, or the de novo CDS with the highest degree when no reference CDS was present. Finally, CDS with less than 100\u2009bp of unique sequence in the pangenome (that is, fragments of sequences strictly identical between multiple CDS) were iteratively removed. Identical matches within the pangenome were identified using blastn with options -dust no -strand plus -word_size 100 -penalty -10000 -ungapped.<\/p>\n<p>Third, to estimate gene presence\/absence within the 1,086 isolates, we used a sequencing depth-based approach. Illumina reads were mapped to the representative CDS of each gene family using bwa-mem2 mem v.2.2.1\u00a0(ref.\u00a0<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 71\" title=\"Vasimuddin, Md., Misra, S., Li, H. &amp; Aluru, S. Efficient architecture-aware acceleration of BWA-MEM for multicore systems. In 2019 IEEE International Parallel and Distributed Processing Symposium 314&#x2013;324 (2019).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-09637-0#ref-CR71\" id=\"ref-link-section-d889513e2649\" rel=\"nofollow noopener\" target=\"_blank\">71<\/a>) with options -U 0 -L 0,0 -O 4,4 -T 20, which remove the penalty for unpaired reads, reduce the penalties for read clipping and gap opening, and lower the minimum score required. These relaxed parameters were chosen to prevent mapping issues due to diversity within gene families. 
Read depth over each CDS was calculated using samtools depth v.1.16.1\u00a0(ref.\u00a0<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 89\" title=\"Danecek, P. et al. Twelve years of SAMtools and BCFtools. Gigascience 10, giab008 (2021).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-09637-0#ref-CR89\" id=\"ref-link-section-d889513e2653\" rel=\"nofollow noopener\" target=\"_blank\">89<\/a>) with option -aa. For each CDS in each isolate, a normalized depth was computed as the ratio of the median depth of the CDS (discarding non-unique fragments identified in the previous step) over the median depth of all sequences. Because a given normalized depth could have a different meaning depending on the ploidy of the isolate, we adjusted the normalized depth with the ploidy and a correcting factor x:<\/p>\n<p>$${\\rm{Gene\\;presence}}\\propto {\\rm{Normalized\\;depth}}\\times {{\\rm{ploidy}}}^{x}$$<\/p>\n<p>To identify the optimal value of x, we built a gold standard gene presence\/absence matrix considering 553 isolates and 5,792 genes for which both de novo annotated genome assemblies and transcriptomic data<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 20\" title=\"Caudal, &#xC9;. et al. Pan-transcriptome reveals a large accessory genome contribution to gene expression variation in yeast. Nat. Genet. 56, 1278&#x2013;1287 (2024).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-09637-0#ref-CR20\" id=\"ref-link-section-d889513e2729\" rel=\"nofollow noopener\" target=\"_blank\">20<\/a> were available. A gene was considered present in an isolate when it was: (1) annotated in the genome assembly; and (2) expressed at \u22652 transcripts per million (TPM) in the corresponding RNA-seq data. 
A gene was considered absent when it was: (1) not annotated; and (2) had expression &lt;2 TPM. All other cases were excluded to avoid ambiguity. This gold standard was solely used to evaluate the accuracy of gene presence\/absence calls from sequencing depth, by generating a precision-recall curve for various values of x (Supplementary Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-09637-0#MOESM1\" rel=\"nofollow noopener\" target=\"_blank\">11b<\/a>). The optimal performance (highest area under the precision-recall curve; Supplementary Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-09637-0#MOESM1\" rel=\"nofollow noopener\" target=\"_blank\">11c<\/a>) was achieved with x\u2009=\u20090.15. We further computed the precision and recall values using different thresholds (Supplementary Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-09637-0#MOESM1\" rel=\"nofollow noopener\" target=\"_blank\">11d<\/a>) and chose a threshold of 0.3 as the recall declined sharply beyond this point. Using this threshold, we estimated the presence of all the gene families of the pangenome across the 1,086 isolates.<\/p>\n<p>Importantly, expression data was used solely to calibrate the sequencing depth-based model, and was not used to call gene presence or absence in any genome. Final gene presence\/absence calls were made entirely from DNA sequence data using mapped short-read depth.<\/p>\n<p>Pangenome annotation<\/p>\n<p>We sought to annotate the origin and function of novel genes with protein sequences similarity search against a curated database. 
We built a blast database by coupling the RefSeq protein database with a custom database containing Fungi protein sequences from Shen et al.<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 95\" title=\"Shen, X.-X. et al. Tempo and mode of genome evolution in the budding yeast subphylum. Cell 175, 1533&#x2013;1545.e20 (2018).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-09637-0#ref-CR95\" id=\"ref-link-section-d889513e2760\" rel=\"nofollow noopener\" target=\"_blank\">95<\/a> and S. paradoxus protein sequences from Yue et al.<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 96\" title=\"Yue, J.-X. et al. Contrasting evolutionary genome dynamics between domesticated and wild yeasts. Nat. Genet. 49, 913&#x2013;924 (2017).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-09637-0#ref-CR96\" id=\"ref-link-section-d889513e2767\" rel=\"nofollow noopener\" target=\"_blank\">96<\/a>. The sequence similarity search was run using blastp<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 87\" title=\"Camacho, C. et al. BLAST+: architecture and applications. BMC Bioinformatics 10, 421 (2009).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-09637-0#ref-CR87\" id=\"ref-link-section-d889513e2771\" rel=\"nofollow noopener\" target=\"_blank\">87<\/a> with default parameters, and the obtained results were further filtered with a minimum protein identity of 30% and a minimum query coverage of 50%. We categorized the origin of each novel gene as: (1) fast-evolving gene when the best hit was a S. cerevisiae protein; (2) introgression when the best hit was a Saccharomyces protein other than S. 
cerevisiae; (3) HGT when the protein came out of the Saccharomyces genus; and (4) unknown when no sequence similarity was found. We also transferred the gene ontology (GO) terms associated with the best protein hit in RefSeq to each novel gene (using the same identity and coverage filters as above) and inferred the GO terms of the whole pangenome based on sequence using InterProScan v.4.65-97.0\u00a0(ref.\u00a0<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 97\" title=\"Jones, P. et al. InterProScan 5: genome-scale protein function classification. Bioinformatics 30, 1236&#x2013;1240 (2014).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-09637-0#ref-CR97\" id=\"ref-link-section-d889513e2788\" rel=\"nofollow noopener\" target=\"_blank\">97<\/a>).<\/p>\n<p>Transcriptomics<\/p>\n<p>Reads were mapped on the CDS of the pangenome using STAR v.2.7.9.a<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 98\" title=\"Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15&#x2013;21 (2013).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-09637-0#ref-CR98\" id=\"ref-link-section-d889513e2800\" rel=\"nofollow noopener\" target=\"_blank\">98<\/a> with default parameters. Number of reads mapped to each gene family was retrieved using samtools idxstats v.1.16.1\u00a0(ref.\u00a0<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 89\" title=\"Danecek, P. et al. Twelve years of SAMtools and BCFtools. 
Gigascience 10, giab008 (2021).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-09637-0#ref-CR89\" id=\"ref-link-section-d889513e2804\" rel=\"nofollow noopener\" target=\"_blank\">89<\/a>) and TPM were computed with a custom Python script.<\/p>\n<p>Rarefaction curves<\/p>\n<p>For both SVs and gene families, rarefaction curves were obtained using the R package iNEXT v.3.0.0\u00a0(ref.\u00a0<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 99\" title=\"Hsieh, T. C., Ma, K. H. &amp; Chao, A. iNEXT: an R package for rarefaction and extrapolation of species diversity (Hill numbers). Methods Ecol. Evol. 7, 1451&#x2013;1456 (2016).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-09637-0#ref-CR99\" id=\"ref-link-section-d889513e2816\" rel=\"nofollow noopener\" target=\"_blank\">99<\/a>). The iNEXT function was used with a presence\u2013absence matrix of SVs or genes and the options datatype\u2009=\u2009\u2018incidence_raw\u2019 and k\u2009=\u2009400. We interpreted the species richness as the total number of SVs or gene families in the species and the sample coverage estimate as the species coverage. For the core genome, we first used a matrix of missing genes (filled with 1 when the gene was absent and 0 when the gene was present) as input of the iNEXT function, with the same parameters as before. We then subtracted the rarefaction obtained from the missing genes from the species estimate of the pangenome (8,583 genes), to obtain the rarefaction of the core genome. Finally, we obtained the rarefaction of the accessory genome by subtracting the core genome rarefaction from the pangenome rarefaction.<\/p>\n<p>Clade-specific variants<\/p>\n<p>Clade-specific SVs and genes were obtained using a simple over-representation analysis based on hypergeometric tests. 
We used the fora function from the fgsea R package v.1.27.0\u00a0(ref.\u00a0<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 100\" title=\"Korotkevich, G. et al. Fast gene set enrichment analysis. Preprint at bioRxiv &#010;                https:\/\/doi.org\/10.1101\/060012&#010;                &#010;               (2021).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-09637-0#ref-CR100\" id=\"ref-link-section-d889513e2832\" rel=\"nofollow noopener\" target=\"_blank\">100<\/a>) using the list of clades and super clades as \u2018pathways\u2019, the isolates having the SV or gene as \u2018genes\u2019 and the total list of isolates as \u2018universe\u2019. The function was run for each SV or gene present in at least two isolates, excluding SVs or genes present in all isolates. Results were further concatenated, and P values were adjusted using the FDR method.<\/p>\n<p>GO analysis<\/p>\n<p>We used the GO annotations of the reference genes available from SGD (<a href=\"https:\/\/current.geneontology.org\/annotations\/sgd.gaf.gz\" rel=\"nofollow noopener\" target=\"_blank\">https:\/\/current.geneontology.org\/annotations\/sgd.gaf.gz<\/a>), in addition to the transferred GO terms for the novel genes (see \u2018Pangenome annotation\u2019) and the GO terms inferred with InterProScan<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 97\" title=\"Jones, P. et al. InterProScan 5: genome-scale protein function classification. Bioinformatics 30, 1236&#x2013;1240 (2014).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-09637-0#ref-CR97\" id=\"ref-link-section-d889513e2854\" rel=\"nofollow noopener\" target=\"_blank\">97<\/a> for the entire pangenome. 
We discarded terms with a size larger than 500, as well as terms with a size smaller than 2 (except for terms with a reliable evidence code\u2014that is, inferred from mutant phenotype (IMP), inferred from direct assay (IDA) and inferred from genetic interaction (IGI)). For some analyses, we performed a GO term semantic similarity reduction using the calculateSimMatrix and reduceSimMatrix(threshold\u2009=\u20090.7) functions of the rrvgo R package v.1.10.0\u00a0(ref.\u00a0<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 101\" title=\"Sayols, S. rrvgo: a Bioconductor package for interpreting lists of Gene Ontology terms. microPublication Biol. &#010;                https:\/\/doi.org\/10.17912\/micropub.biology.000811&#010;                &#010;               (2023).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-09637-0#ref-CR101\" id=\"ref-link-section-d889513e2858\" rel=\"nofollow noopener\" target=\"_blank\">101<\/a>). GO term enrichments were performed using the fora function from the fgsea v.1.27.0\u2009R package<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 100\" title=\"Korotkevich, G. et al. Fast gene set enrichment analysis. Preprint at bioRxiv &#010;                https:\/\/doi.org\/10.1101\/060012&#010;                &#010;               (2021).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-09637-0#ref-CR100\" id=\"ref-link-section-d889513e2865\" rel=\"nofollow noopener\" target=\"_blank\">100<\/a>. Clade-specific GO terms were detected similarly to clade-specific genes.<\/p>\n<p>Construction of an exhaustive genotype matrix<\/p>\n<p>To build the most comprehensive genotype matrix possible, we combined SNPs and indels with SVs called from both genome assembly comparisons and gene-based pangenome construction. 
We transformed the normalized depth computed for each CDS\u2013isolate pair (see \u2018Gene-based pangenome\u2019) into biallelic variants by setting multiple depth thresholds (starting from 0.25 and increasing by steps of 0.5, as we would expect for a diploid isolate). We discriminated isolates having a normalized depth below or above each threshold for each CDS. In that way, we capture both the presence\u2013absence of each CDS in the population, in addition to the variation in copy number. The complete loss of a gene was considered as deletion.<\/p>\n<p>Although combining multiple strategies for SV calling ensures the comprehensiveness of our variant catalogue, it results in redundancy from the segmental SV detections (that is, assembly comparisons) and the gene-based CNV and deletion detection. For example, a deletion spanning three consecutive genes would result in one segmental deletion and the deletion of each individual gene. The four SV records would exhibit high linkage disequilibrium as they all correspond to a single event. Additionally, aneuploidies are captured by CNVs of all genes present on the aneuploid chromosome. Although it corresponds to a single event, it results in many gene\u2013CNV records, that exhibit high linkage disequilibrium with each other. This redundancy can be easily removed with linkage pruning to prevent duplicated genetic associations in further analyses.<\/p>\n<p>Heritability estimates<\/p>\n<p>Heritability estimates were computed for complex traits, defined as traits having no association with a P value lower than 1\u2009\u00d7\u200910\u221220 by GWAS, as high effect predictors may bias the estimations. LDAK v.4.2\u00a0(ref.\u00a0<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 102\" title=\"Speed, D., Hemani, G., Johnson, M. R. &amp; Balding, D. J. Improved heritability estimation from genome-wide SNPs. Am. J. Hum. Genet. 
91, 1011&#x2013;1021 (2012).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-09637-0#ref-CR102\" id=\"ref-link-section-d889513e2893\" rel=\"nofollow noopener\" target=\"_blank\">102<\/a>) was used for the computation. Phenotypes were normalized using a rank-based inverse normal transformation. PLINK matrices of SNPs, indels and SVs were first used to generate independent kinship matrices using the LDAK thin model. Weights for each variant were first computed using LDAK with arguments --thin --window-prune .98 --window-kb 20, and kinship matrices were generated using arguments --calc-kins-direct --weights --power -.25. Trait heritability was then estimated using all three kinship matrices together (--mgrm option), using ploidy as a covariate and the option --constrain YES to ensure positive values of heritability.<\/p>\n<p>GWAS<\/p>\n<p>We ran GWAS using a linear mixed model implemented in FaST-LMM v.0.4.6\u00a0(ref.\u00a0<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 57\" title=\"Lippert, C. et al. FaST linear mixed models for genome-wide association studies. Nat. Methods 8, 833&#x2013;835 (2011).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-09637-0#ref-CR57\" id=\"ref-link-section-d889513e2905\" rel=\"nofollow noopener\" target=\"_blank\">57<\/a>). Phenotypes were normalized in the same way as for heritability estimates. SNPs, indels and SVs were filtered for MAF at 5%. This MAF filtering retained 89,906 SNPs (6.4% of all SNPs), 2,415 indels (4.3%) and 7,708 SVs (10.7%). These allele frequency differences were not accounted for. Genotypes were pruned for linkage disequilibrium using plink v.1.9\u00a0(ref.\u00a0<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 75\" title=\"Purcell, S. et al. 
PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 81, 559&#x2013;575 (2007).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-09637-0#ref-CR75\" id=\"ref-link-section-d889513e2909\" rel=\"nofollow noopener\" target=\"_blank\">75<\/a>) with option --indep-pairwise 50\u2009kb 1 0.8, yielding 54,544 SNPs, 2,203 indels and 4,540 SVs. Variants were further combined in a single plink matrix, which was used as both kinship and test set for GWAS. To preclude the effect of ploidy and aneuploidies on the genetic associations, both were added as covariates. Ploidy was encoded as a numerical covariate and aneuploidies were encoded for each chromosome as \u22121, 0 or 1, representing loss, expected copy number or gain, respectively. To correct for the large number of variants tested, a trait-specific P value threshold was defined using a permutation test with 100 permutations and alpha\u2009=\u20090.05. In brief, for each trait, associations were run 100 times on permuted phenotypes, retaining the lowest P value for each run. The P value threshold corresponds to the 5% quantile (that is, the fifth-lowest P value across permutations), corresponding to an FDR correction of 5%.<\/p>\n<p>Local variants were defined as those located in a 25-kb region around the gene of interest or linked to a pruned variant located in this region. For translocations, both breakpoints were considered for the definition of local variants. To further account for linkage disequilibrium between associated variants, linkage groups were identified, defined as connected components of variants associated with the same trait and in linkage disequilibrium (based on a 0.5 r2 threshold and a maximal physical distance of 50\u2009kb). 
For each group, only the variant with the lowest P value was retained, leading to the final number of 4,564 QTL.<\/p>\n<p>Graph construction and novel sequence detection<\/p>\n<p>We built a graph pangenome using the Minigraph-Cactus pipeline v.2.6.4\u00a0(ref.\u00a0<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 59\" title=\"Hickey, G. et al. Pangenome graph construction from genome alignments with Minigraph-Cactus. Nat. Biotechnol. 42, 663&#x2013;673 (2024).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-09637-0#ref-CR59\" id=\"ref-link-section-d889513e2942\" rel=\"nofollow noopener\" target=\"_blank\">59<\/a>) with 500 haplotypes, including the reference genome and 499 genomes selected to represent a maximum number of SVs. We used the first graph generated by Minigraph<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 58\" title=\"Li, H., Feng, X. &amp; Chu, C. The design and construction of reference pangenome graphs with minigraph. Genome Biol. 21, 265 (2020).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-09637-0#ref-CR58\" id=\"ref-link-section-d889513e2946\" rel=\"nofollow noopener\" target=\"_blank\">58<\/a>, which contains only SVs, to identify repetitive reference segments in the graph and novel sequences. Only segments larger than 100\u2009bp were considered for these analyses. The segments were mapped to the reference genome using minimap2 -ax asm5 v.2.24\u00a0(ref.\u00a0<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 70\" title=\"Li, H. Minimap2: pairwise alignment for nucleotide sequences. 
Bioinformatics 34, 3094&#x2013;3100 (2018).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-09637-0#ref-CR70\" id=\"ref-link-section-d889513e2950\" rel=\"nofollow noopener\" target=\"_blank\">70<\/a>) and the coverage depth along the genome was retrieved using samtools depth v.1.16.1\u00a0(ref.\u00a0<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 89\" title=\"Danecek, P. et al. Twelve years of SAMtools and BCFtools. Gigascience 10, giab008 (2021).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-09637-0#ref-CR89\" id=\"ref-link-section-d889513e2954\" rel=\"nofollow noopener\" target=\"_blank\">89<\/a>). To determine the fraction of the graph corresponding to TEs, the segments were aligned on a Ty sequence database constructed from the previous analyses of 1,011 genomes<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 103\" title=\"Bleykasten-Grosshans, C., Fabrizio, R., Friedrich, A. &amp; Schacherer, J. Species-wide transposable element repertoires retrace the evolutionary history of the Saccharomyces cerevisiae host. Mol. Biol. Evol. 38, 4334&#x2013;4345 (2021).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-09637-0#ref-CR103\" id=\"ref-link-section-d889513e2958\" rel=\"nofollow noopener\" target=\"_blank\">103<\/a>, using blastn v.2.12.0\u00a0(ref.\u00a0<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 87\" title=\"Camacho, C. et al. BLAST+: architecture and applications. BMC Bioinformatics 10, 421 (2009).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-09637-0#ref-CR87\" id=\"ref-link-section-d889513e2963\" rel=\"nofollow noopener\" target=\"_blank\">87<\/a>), with option -perc_identity 70. 
The same alignment was performed on each of the genome assemblies used for the graph pangenome construction. Additionally, we searched for sequence redundancy in the graph using blastn -perc_identity 95 and applied a 50% coverage threshold. We built a sequence similarity graph and selected a single representative segment for each connected component. Components containing no reference segments were considered as novel sequences. Novel segments were considered as introgressions when they showed sequence similarity higher than 95% to a database composed of Saccharomyces genomes (GCF_000292725.1, GCF_001298625.1, GCF_002079055.1, GCF_947241705.1, GCF_947243775.1, GCA_002079085.1, GCA_002079115.1, GCA_002079145.1, GCA_002079175.1) using blastn -perc_identity 95.<\/p>\n<p>Variant calling using the graph<\/p>\n<p>The pangenome graph (gfa format) was converted to gbz using the vg toolkit v.1.54.0\u00a0(ref.\u00a0<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 104\" title=\"Garrison, E. et al. Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nat. Biotechnol. 36, 875&#x2013;879 (2018).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-09637-0#ref-CR104\" id=\"ref-link-section-d889513e2979\" rel=\"nofollow noopener\" target=\"_blank\">104<\/a>) with the vg autoindex command, and snarls were detected using vg snarls. Illumina reads of 3,039 isolates with publicly available sequencing data were mapped on the graph using vg giraffe<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 105\" title=\"Sir&#xE9;n, J. et al. Pangenomics enables genotyping of known structural variants in 5202 diverse genomes. 
Science 374, abg8871 (2021).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-09637-0#ref-CR105\" id=\"ref-link-section-d889513e2983\" rel=\"nofollow noopener\" target=\"_blank\">105<\/a> with options --fragment-mean 350 --fragment-stdev 100 -b fast. GAM files were converted using vg pack with option -Q 5 to remove reads with low mapping quality. We performed variant calling using vg call<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 106\" title=\"Hickey, G. et al. Genotyping structural variants in pangenome graphs using the vg toolkit. Genome Biol. 21, 35 (2020).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-09637-0#ref-CR106\" id=\"ref-link-section-d889513e2987\" rel=\"nofollow noopener\" target=\"_blank\">106<\/a> with options --genotype-snarls --all-snarls --snarls --ref-sample S288c to obtain single sample vcf files. Calling from the graph pangenome succeeded for 2,874 out of 3,039 isolates, the remaining ones being discarded because of an aberrantly long computing time. Single sample vcf files were merged into a single multisample vcf using bcftools merge v.1.16.1\u00a0(ref.\u00a0<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 89\" title=\"Danecek, P. et al. Twelve years of SAMtools and BCFtools. Gigascience 10, giab008 (2021).\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-09637-0#ref-CR89\" id=\"ref-link-section-d889513e2991\" rel=\"nofollow noopener\" target=\"_blank\">89<\/a>). Variants supported by fewer than two reads were set to missing using bcftools +setGT -- -t q -n. -e \u2018FMT\/DP\u2009&gt;\u2009=\u20092\u2019. 
Alternate alleles absent from the resulting VCF were then removed with bcftools view --trim-alt-alleles, and variants were atomized with bcftools norm --atomize --atom-overlap \".\" --multiallelics +any. The length difference between alternative and reference alleles was used to classify alleles as SNPs, indels or SVs. SNPs show no length difference, indels have length differences smaller than 50\u2009bp and SVs show length differences of 50\u2009bp or more.<\/p>\n<p>Reporting summary<\/p>\n<p>Further information on research design is available in the\u00a0<a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41586-025-09637-0#MOESM2\" rel=\"nofollow noopener\" target=\"_blank\">Nature Portfolio Reporting Summary<\/a> linked to this article.<\/p>\n","protected":false},"excerpt":{"rendered":"Strain culture and DNA extraction We used a collection of S. 
cerevisiae isolates that were previously sequenced using&hellip;\n","protected":false},"author":2,"featured_media":217507,"comment_status":"","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[25],"tags":[64,63,17818,336,8054,1320,1321,128],"class_list":{"0":"post-217506","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-genetics","8":"tag-au","9":"tag-australia","10":"tag-functional-genomics","11":"tag-genetics","12":"tag-genomics","13":"tag-humanities-and-social-sciences","14":"tag-multidisciplinary","15":"tag-science"},"_links":{"self":[{"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/posts\/217506","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/comments?post=217506"}],"version-history":[{"count":0,"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/posts\/217506\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/media\/217507"}],"wp:attachment":[{"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/media?parent=217506"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/categories?post=217506"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/tags?post=217506"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}