The research team of Weihong Tan, Xiaohong Fang, and Tao Bing from the Hangzhou Institute of Medical Sciences, Chinese Academy of Sciences, proposed a new method for nucleic acid aptamer sequence analysis based on machine learning. This method can directly parse the secondary structure of nucleic acid aptamers from single-round screening data, thereby obtaining detailed secondary structure information of nucleic acid aptamers without iterative enrichment. This enables rational truncation and optimization of high-affinity nucleic acid aptamers, and even the design of nucleic acid aptamer molecules, significantly accelerating the discovery and optimization process of nucleic acid aptamers. The article was published as an open access Research Article in CCS Chemistry, the flagship journal of the Chinese Chemical Society.
Background information:
Nucleic acid aptamers, as nucleic acid molecules capable of specifically recognizing target molecules, exhibit high diversity and complex conformations in their secondary structures. Although SELEX technology generates a large number of candidate sequences, their functional secondary structures for binding to targets are difficult to determine. Furthermore, the nucleic acid aptamers obtained through screening may not be the optimal nucleic acid sequences, requiring further optimization. However, traditional structural characterization techniques such as electron microscopy, nuclear magnetic resonance, and X-ray crystallography are insufficient for efficiently resolving their structures, severely limiting the truncation optimization and molecular design of nucleic acid aptamers.
Highlights of this article:
To address these issues, the team established a machine learning-based analytical method (Figure 1). This method uses unsupervised autoencoder clustering and deep learning to identify core sequences within aptamer families from a single round of screening. Subsequently, using these core sequences as indices, machine learning strategies are employed to analyze large numbers of secondary structures and extract the secondary structural features that the nucleic acid aptamers share. This strategy not only achieves rational truncation and performance optimization of nucleic acid aptamer sequences but also provides a novel approach for the de novo design and generation of specific nucleic acid aptamer sequences.
To obtain the common core sequence of nucleic acid aptamers from the single-round screening library, the authors used deep learning to systematically analyze the family distribution patterns in the single-round aptamer screening sequences of the CD8 protein (Figure 2). The results showed that although the sequence background was highly heterogeneous, most families were enriched with the same core sequence “GTGAGGAGCTTGAAA”. Traditional multiple sequence alignment methods failed to extract this key homologous sequence because they struggle to detect short motifs against a low-homology background.
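The idea of a core sequence shared across otherwise heterogeneous sequences can be illustrated with a simple k-mer count. The following is only a minimal sketch and not the authors' deep learning pipeline: the function name `kmer_sequence_counts` and the toy library are hypothetical, and only the core sequence itself is taken from the article.

```python
from collections import Counter


def kmer_sequence_counts(sequences, k):
    """Count, for each k-mer, how many sequences contain it at least once."""
    counts = Counter()
    for seq in sequences:
        # A set ensures each k-mer is counted once per sequence.
        seen = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        counts.update(seen)
    return counts


# Core sequence reported in the article; the flanking bases are invented.
core = "GTGAGGAGCTTGAAA"
toy_library = [
    "ACCT" + core + "TTGCA",
    "GGATC" + core + "CA",
    "T" + core + "CGCGTAAT",
    "ACGTACGTACGTACGTACGTACGT",  # unrelated background sequence
]

counts = kmer_sequence_counts(toy_library, k=len(core))
top_kmer, top_count = counts.most_common(1)[0]
print(top_kmer, top_count)
```

In this toy library the core sequence is the only 15-mer present in three sequences, so it ranks first even though the flanking regions differ. Real single-round data is far noisier, which is why the article describes a learned clustering approach rather than plain counting.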
To verify the core sequences obtained from the screening, the authors synthesized a library containing a portion of the core sequence (5′-AGCTTGAAA-3′) and performed RE-SELEX. All of the more than 20,000 nucleic acids obtained from this screening contained core sequences consistent with those obtained in the single round of screening. To elucidate the structure-function relationship between aptamers and their targets, the authors further developed a machine learning-based algorithm to analyze the secondary structures mediated by the core sequences of aptamers in the single round of screening. In 62.4% of the sequences (24,867 in total), the fixed-region sequence “5′-AGCTTGAAA-3′” formed stem-loop secondary structures, while in the remaining 37.6% of the aptamers the fixed region could form other secondary structures. Among the aptamers that formed stem-loop structures in the fixed-base region, the sequence “GTGA” appeared in multi-branched loops in 55.2% of cases (13,711 copies) and in stem structures in 44.8% of cases (11,301 copies) (Figure 3c). The authors quantified the length of individual stem-loops within multi-branched loops and analyzed the base distribution at each position. Through deep mining of the secondary structures of different sequences using machine learning, they inferred that these sequences would form a shared secondary structure capable of binding to the same epitope of the target (Figure 3d). Guided by this secondary structure, the authors truncated and optimized the nucleic acid sequences obtained from RE-SELEX, significantly improving their affinity. This means that over 10,000 potentially active CD8-specific aptamers were successfully obtained.
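The kind of tally described above, where a motif is classified by the structural element it falls in, can be sketched on dot-bracket secondary structure strings. This is a simplified illustration under stated assumptions, not the authors' algorithm: the function `motif_context` and the toy sequences are hypothetical, the structures are assumed to be supplied by some folding tool, and the sketch only separates paired (stem) from unpaired (loop) positions without distinguishing multi-branched loops from other loop types.

```python
from collections import Counter


def motif_context(sequence, structure, motif):
    """Classify the first occurrence of motif as 'stem', 'loop', 'mixed', or 'absent'.

    structure is a dot-bracket string: '(' and ')' mark paired bases,
    '.' marks unpaired bases.
    """
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have equal length")
    start = sequence.find(motif)
    if start == -1:
        return "absent"
    segment = structure[start:start + len(motif)]
    paired = sum(ch in "()" for ch in segment)
    if paired == len(motif):
        return "stem"
    if paired == 0:
        return "loop"
    return "mixed"


# Toy hairpins (invented): GTGA sits in the stem of the first, the loop of the second.
toy = [
    ("GTGAAAAATCAC", "((((....))))"),
    ("CCCCGTGAGGGG", "((((....))))"),
    ("ACACACAC", "........"),
]

tally = Counter(motif_context(seq, struct, "GTGA") for seq, struct in toy)
print(dict(tally))
```

Dividing each count in `tally` by the number of motif-containing sequences gives percentages of the same form as those quoted in the text.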
To further confirm that the method applies to single-round aptamer screening data, the authors used it to analyze the CD8 single-round aptamer data. Of the first 1,000 sequences of CD8 protein single-round aptamers, 770 contained a core sequence, and the vast majority of sequences exhibited highly consistent shared secondary structure characteristics (Figure 4). Based on these shared secondary structure characteristics, the CD8 aptamers were truncated and optimized, increasing their affinity by an order of magnitude while maintaining high specificity and accurate recognition of the CD8 protein even in complex cellular microenvironments. Furthermore, the secondary structure characteristics of CD8 protein aptamers not only allow split-type aptamers to be designed but also open new possibilities for de novo aptamer design.
To verify the universality of this machine learning analysis method, the authors further applied it to single-round screening data for fibroblast activation protein (FAP). The results showed that the FAP aptamer family was also enriched in a highly conserved core sequence (5′-GGGGTCTGCTTCGGATTGCGG-3′). The analysis suggested that this core sequence may form a G-quadruplex structure, while the sequences at its two ends can form hairpin structures that stabilize the G-quadruplex (Figure 5). Based on this shared secondary structure, the FAP protein aptamers were truncated and optimized, significantly improving their binding affinity. This application demonstrates that the machine learning-enabled single-round aptamer analysis method can handle aptamers with different structural types.
Summary and Outlook:
In summary, this study demonstrates that single-round nucleic acid aptamer screening libraries already contain rich structural information, which was previously thought to require multiple rounds of screening. By combining high-throughput sequencing with machine learning, the authors developed a method capable of decoding the secondary structure of nucleic acid aptamers and precisely locating the key conserved functional motifs through which they bind their targets. This technique achieves rational truncation and optimization of nucleic acid aptamers, increasing affinity by more than tenfold without relying on cumbersome experimental structural analysis. This work not only improves the efficiency of nucleic acid aptamer discovery but also challenges the traditional paradigm by emphasizing the dominant role of spatial conformation in molecular recognition. The approach also opens new avenues for designing functional nucleic acids, exploring non-coding RNA-protein interactions, and developing AI-driven virtual screening platforms for nucleic acid aptamers, which will drive the rapid development of next-generation nucleic acid aptamer technologies for precision diagnosis and treatment.
This research was supported by the National Natural Science Foundation of China, the Zhejiang Provincial “Pioneer” and “Leading Goose” R&D Program, the Natural Science Foundation of Zhejiang Province, the National Health Commission Scientific Research Fund-Zhejiang Provincial Health and Wellness Major Science and Technology Program, and the Strategic Priority Research Program of the Chinese Academy of Sciences.
—
About the journal: CCS Chemistry is the Chinese Chemical Society’s flagship publication, established to serve as the preeminent international chemistry journal published in China. It is an English language journal that covers all areas of chemistry and the chemical sciences, including groundbreaking concepts, mechanisms, methods, materials, reactions, and applications. All articles are diamond open access, with no fees for authors or readers. More information can be found at https://www.chinesechemsoc.org/journal/ccschem .
About the Chinese Chemical Society: The Chinese Chemical Society (CCS) is an academic organization formed by Chinese chemists of their own accord with the purpose of uniting Chinese chemists at home and abroad to promote the development of chemistry in China. The CCS was founded during a meeting of preeminent chemists in Nanjing on August 4, 1932. It currently has more than 120,000 individual members and 184 organizational members. There are 7 Divisions covering the major areas of chemistry: physical, inorganic, organic, polymer, analytical, applied and chemical education, as well as 31 Commissions, including catalysis, computational chemistry, photochemistry, electrochemistry, organic solid chemistry, environmental chemistry, and many other sub-fields of the chemical sciences. The CCS also has 10 committees, including the Woman’s Chemists Committee and Young Chemists Committee. More information can be found at https://www.chinesechemsoc.org/ .