{"id":546011,"date":"2026-04-23T07:13:29","date_gmt":"2026-04-23T07:13:29","guid":{"rendered":"https:\/\/www.newsbeep.com\/uk\/546011\/"},"modified":"2026-04-23T07:13:29","modified_gmt":"2026-04-23T07:13:29","slug":"molecular-deep-learning-at-the-edge-of-chemical-space","status":"publish","type":"post","link":"https:\/\/www.newsbeep.com\/uk\/546011\/","title":{"rendered":"Molecular deep learning at the edge of chemical space"},"content":{"rendered":"<p>Data preprocessing and analysisMolecular representation and description<\/p>\n<p>Each molecular structure in this study was represented as a SMILES<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 42\" title=\"Weininger, D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem. Inf. Comput. Sci. 28, 31&#x2013;36 (1988).\" href=\"http:\/\/www.nature.com\/articles\/s42256-026-01216-w#ref-CR42\" id=\"ref-link-section-d95382669e2808\" rel=\"nofollow noopener\" target=\"_blank\">42<\/a> string. CATS<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 56\" title=\"Reutlinger, M. et al. Chemically Advanced Template Search (CATS) for scaffold-hopping and prospective target prediction for &#x2018;orphan&#x2019; molecules. Mol. Inform. 32, 133&#x2013;138 (2013).\" href=\"http:\/\/www.nature.com\/articles\/s42256-026-01216-w#ref-CR56\" id=\"ref-link-section-d95382669e2812\" rel=\"nofollow noopener\" target=\"_blank\">56<\/a> descriptors and 2048-bit ECFPs<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 53\" title=\"Rogers, D. &amp; Hahn, M. Extended-connectivity fingerprints. J. Chem. Inf. Model. 
50, 742&#x2013;754 (2010).\" href=\"http:\/\/www.nature.com\/articles\/s42256-026-01216-w#ref-CR53\" id=\"ref-link-section-d95382669e2816\" rel=\"nofollow noopener\" target=\"_blank\">53<\/a> (using a radius of 2) were computed for each molecule.<\/p>\n<p>Data collection and curation<\/p>\n<p>Thirty-three labelled datasets of molecular structures with their corresponding experimental target property were used:<\/p>\n<p>Moreover, small molecules were collected from ChEMBLv33<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 45\" title=\"Gaulton, A. et al. ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res. 40, D1100&#x2013;D1107 (2012).\" href=\"http:\/\/www.nature.com\/articles\/s42256-026-01216-w#ref-CR45\" id=\"ref-link-section-d95382669e2888\" rel=\"nofollow noopener\" target=\"_blank\">45<\/a> for model pretraining. Because molecular structures from ChEMBL were used for pretraining, molecules with a Bemis\u2013Murcko scaffold<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 76\" title=\"Bemis, G. W. &amp; Murcko, M. A. The properties of known drugs. 1. Molecular frameworks. J. Med. Chem. 39, 2887&#x2013;2893 (1996.\" href=\"http:\/\/www.nature.com\/articles\/s42256-026-01216-w#ref-CR76\" id=\"ref-link-section-d95382669e2892\" rel=\"nofollow noopener\" target=\"_blank\">76<\/a> similar to any such scaffold in the labelled datasets (Tanimoto similarity coefficient<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 54\" title=\"Cereto-Massagu&#xE9;, A. et al. Molecular fingerprint similarity search in virtual screening. 
Methods 71, 58&#x2013;63 (2015).\" href=\"http:\/\/www.nature.com\/articles\/s42256-026-01216-w#ref-CR54\" id=\"ref-link-section-d95382669e2896\" rel=\"nofollow noopener\" target=\"_blank\">54<\/a> on EFCPs larger than 0.7) were removed. This included molecules without a Bemis\u2013Murcko scaffold (that is, containing no rings).<\/p>\n<p>For prospective virtual screening, the most recent CDK1 (CHEMBL308) data were fetched from ChEMBL v35<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 45\" title=\"Gaulton, A. et al. ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res. 40, D1100&#x2013;D1107 (2012).\" href=\"http:\/\/www.nature.com\/articles\/s42256-026-01216-w#ref-CR45\" id=\"ref-link-section-d95382669e2903\" rel=\"nofollow noopener\" target=\"_blank\">45<\/a> (accessed in April 2025) as an additional and independent dataset. Raw data were processed in accordance with previous work<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 48\" title=\"van Tilborg, D., Alenicheva, A. &amp; Grisoni, F. Exposing the limitations of molecular machine learning with activity cliffs. J. Chem. Inf. Model. 62, 5938&#x2013;5951 (2022).\" href=\"http:\/\/www.nature.com\/articles\/s42256-026-01216-w#ref-CR48\" id=\"ref-link-section-d95382669e2907\" rel=\"nofollow noopener\" target=\"_blank\">48<\/a>.<\/p>\n<p>All bioactivity endpoints of the MoleculeACE<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 48\" title=\"van Tilborg, D., Alenicheva, A. &amp; Grisoni, F. Exposing the limitations of molecular machine learning with activity cliffs. J. Chem. Inf. Model. 
62, 5938&#x2013;5951 (2022).\" href=\"http:\/\/www.nature.com\/articles\/s42256-026-01216-w#ref-CR48\" id=\"ref-link-section-d95382669e2914\" rel=\"nofollow noopener\" target=\"_blank\">48<\/a> and CDK1 datasets were converted from continuous regression labels into binary classification labels. Molecules with an EC50 (half maximal effective concentration) or Ki (inhibitory constant) of 100\u2009nM or lower were labelled as bioactive, whereas less potent molecules were labelled as inactive. For the LIT-PCBA<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 49\" title=\"Tran-Nguyen, V.-K., Jacquemard, C. &amp; Rognan, D. LIT-PCBA: an unbiased data set for machine learning and virtual screening. J. Chem. Inf. Model. 60, 4263&#x2013;4273 (2020).\" href=\"http:\/\/www.nature.com\/articles\/s42256-026-01216-w#ref-CR49\" id=\"ref-link-section-d95382669e2926\" rel=\"nofollow noopener\" target=\"_blank\">49<\/a> and Ames mutagenicity dataset<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 50\" title=\"Hansen, K. et al. Benchmark data set for in silico prediction of Ames mutagenicity. J. Chem. Inf. Model. 49, 2077&#x2013;2081 (2009).\" href=\"http:\/\/www.nature.com\/articles\/s42256-026-01216-w#ref-CR50\" id=\"ref-link-section-d95382669e2930\" rel=\"nofollow noopener\" target=\"_blank\">50<\/a>, their original binary classification labels were used.<\/p>\n<p>For all datasets, SMILES strings were preprocessed using RDKit v. 2024.3.3<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 77\" title=\"Landrum, G. RDKit: Open-Source Cheminformatics. 
RDKit &#010;                https:\/\/www.rdkit.org&#010;                &#010;               (2006).\" href=\"http:\/\/www.nature.com\/articles\/s42256-026-01216-w#ref-CR77\" id=\"ref-link-section-d95382669e2938\" rel=\"nofollow noopener\" target=\"_blank\">77<\/a>. For each SMILES string, stereochemistry tokens as well as salts and solvents (Supplementary Table <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s42256-026-01216-w#MOESM1\" rel=\"nofollow noopener\" target=\"_blank\">2<\/a>) were removed. Each molecule was sanitized, neutralized using predefined neutralization reactions (Supplementary Table <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s42256-026-01216-w#MOESM1\" rel=\"nofollow noopener\" target=\"_blank\">3<\/a>), and its SMILES string was canonicalized. Disconnected structures and molecules that contained formal charges, contained complex ring systems (SMILES strings with a ring index of 9 or higher), non-standard isotopes or any atoms other than Cl, Br, H, C, N, O, F, S and I were removed. Molecules were removed if they contained more than 100 tokens in their canonical SMILES string or if they could not be featurized into CATS descriptors and\/or ECFP fingerprints. Sizes of datasets before and after data curation are presented in Supplementary Table <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s42256-026-01216-w#MOESM1\" rel=\"nofollow noopener\" target=\"_blank\">1<\/a>.<\/p>\n<p>Virtual screening library (retrospective)<\/p>\n<p>Molecules were gathered from three commercial screening libraries:<\/p>\n<p>All molecules were processed in the same way as the training datasets. 
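Several of the curation rules above operate purely on the SMILES string. The sketch below illustrates those string-level checks in Python; the tokenizer and the charge heuristic are deliberate simplifications for illustration (the actual pipeline uses RDKit), and the function name is hypothetical.

```python
import re

# Element whitelist from the curation rules above.
ALLOWED_ELEMENTS = {"Cl", "Br", "H", "C", "N", "O", "F", "S", "I"}

# Simplified SMILES tokenizer: two-letter halogens, bracket atoms,
# two-digit ring closures, then single characters.
TOKEN_RE = re.compile(r"Cl|Br|\[[^\]]*\]|%\d{2}|.")

def passes_string_filters(smiles: str, max_tokens: int = 100) -> bool:
    """True if a canonical SMILES survives the simple string-level filters."""
    if len(TOKEN_RE.findall(smiles)) > max_tokens:
        return False                      # SMILES longer than 100 tokens
    if "." in smiles:
        return False                      # disconnected structure
    if "+" in smiles or "-" in smiles:
        return False                      # formal charge (crude heuristic; it
                                          # also catches explicit single bonds)
    if "9" in smiles or "%" in smiles:
        return False                      # ring index of 9 or higher
    elements = {e.capitalize() for e in re.findall(r"Cl|Br|[A-Za-z]", smiles)}
    return elements <= ALLOWED_ELEMENTS   # only the nine allowed elements
```

For example, phenol (`"c1ccccc1O"`) passes, whereas a sodium acetate salt (`"CC(=O)[O-].[Na+]"`) is rejected as disconnected and charged.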
Finally, all unique molecules were aggregated.

Virtual screening library (prospective)

The most recent Specs library was downloaded from https://www.specs.net/index.php?view=databases&page=download (accessed April 2025). All molecules were processed in the same way as the training datasets. To ensure that molecules were compatible with our experimental assay (that is, likely soluble in 1% dimethylsulfoxide (DMSO)), physicochemical rules were enforced based on relaxed rule-of-five and Veber criteria:

(1) Molecular weight between 200 g mol⁻¹ and 600 g mol⁻¹.

(2) logP lower than 6.

(3) Total polar surface area between 20 Å² and 140 Å².

(4) Fewer than six hydrogen bond donors.

(5) Fewer than ten rotatable bonds.

(6) A maximum of two rule-of-five violations.

In addition, molecules with terminal enone, isocyanate, quinone, aromatic nitro, azide or epoxide groups were removed using SMILES Arbitrary Target Specification (SMARTS) patterns to prevent assay interference.
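These rules apply directly to precomputed properties. A minimal sketch, assuming molecular weight, logP, TPSA and hydrogen-bond/rotatable-bond counts are already computed (for example, with a cheminformatics toolkit); the classic rule-of-five violation count used for rule (6) (MW > 500, logP > 5, HBD > 5, HBA > 10) is an assumption of this sketch.

```python
def passes_physchem_filter(mw, logp, tpsa, hbd, hba, rot_bonds):
    """Apply rules (1)-(6) above to precomputed molecular properties."""
    if not 200 <= mw <= 600:
        return False                                    # rule (1)
    if logp >= 6:
        return False                                    # rule (2)
    if not 20 <= tpsa <= 140:
        return False                                    # rule (3)
    if hbd >= 6:
        return False                                    # rule (4)
    if rot_bonds >= 10:
        return False                                    # rule (5)
    # Rule (6): at most two violations of the classic rule of five.
    ro5_violations = sum([mw > 500, logp > 5, hbd > 5, hba > 10])
    return ro5_violations <= 2
```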
Furthermore, to prevent the selection of trivially simple molecules and to enrich the general-purpose Specs library for kinase-relevant chemical space, several general criteria were enforced:

(1) A molecule must have an ATP-mimetic core, that is, at least one heteroatom in a ring, or a fused carbocyclic system.

(2) A molecule must have a polar anchor to ensure solubility or solvent interaction.

(3) A molecule must have enough hydrophobic mass or planarity to fill the kinase ATP pocket (for example, the gatekeeper region).

(4) A molecule must have a directional H-bond donor/acceptor to enable potential hinge interaction.

(5) A molecule must have two or more rings.

These rules are intentionally permissive towards new chemotypes; 98.6% of all kinase inhibitors that have ever entered clinical trials[78] pass these filters. Finally, molecules with a Tanimoto similarity on ECFPs greater than 0.7 to any molecule in the respective target's data (PIM1 and CDK1) were removed. This left 185,298 and 185,336 screening molecules for PIM1 and CDK1, respectively.

Molecular cyclic skeletons

Cyclic skeletons[51] (core ring systems without exocyclic substituents) were extracted from molecules to serve as the most fundamental scaffold representation. From each molecule, the Bemis–Murcko scaffold[76] was obtained by removing peripheral substituents. Remaining double-bonded exocyclic substituents were then removed, and all atoms and bonds were made generic to obtain the final cyclic skeleton.

Data splitting

The curated molecular structures from ChEMBL were randomly split into a training set (80%, n = 1,230,041), a test set (10%, n = 153,755) and a validation set (10%, n = 153,755). All labelled datasets were split into a training set (~50%), a test set (~25%) and an OOD set (~25%). To determine the OOD molecules, spectral clustering was performed on the unique cyclic skeletons (see below). The molecules whose skeletons belonged to the clusters with the lowest mean similarity to all other clusters, jointly constituting approximately 25% of the total dataset, were taken as the OOD set. The remaining molecules (that is, the ~75% most similar molecules) were split randomly into a training and a test set, with the test set equal in size to the OOD set.
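This cluster-based OOD selection can be sketched in NumPy, assuming the cluster labels from spectral clustering and a pairwise Tanimoto similarity matrix are given; the function and variable names are illustrative, not the study's code.

```python
import numpy as np

def ood_mask(labels, sim, frac=0.25):
    """Mark molecules in the least-similar clusters as OOD until
    roughly `frac` of the dataset is covered."""
    labels = np.asarray(labels)
    cluster_ids = np.unique(labels)
    # Mean similarity of each cluster to all molecules outside it.
    scores = [sim[np.ix_(labels == c, labels != c)].mean() for c in cluster_ids]
    ood = np.zeros(labels.size, dtype=bool)
    for c in cluster_ids[np.argsort(scores)]:   # least similar clusters first
        if ood.mean() >= frac:
            break
        ood |= labels == c
    return ood
```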
An overview of all datasets is presented in Supplementary Table 1.

Spectral clustering

Spectral clustering was performed on a molecular similarity matrix A using scikit-learn[79]. A is an n × n matrix in which each element Aij is the Tanimoto coefficient Tij on ECFPs between two molecular structures. Using a molecular similarity matrix directly bypasses the complex, high-dimensional and non-Euclidean nature of molecular structures. From this affinity matrix, the symmetrically normalized Laplacian was constructed as

$${L}_{\mathrm{sym}}=I-{D}^{-1/2}A{D}^{-1/2},$$

(3)

where I is the identity matrix and D is the degree matrix.
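Equation (3) translates directly into NumPy; the affinity matrix in this sketch is a toy example, and all degrees are assumed to be nonzero.

```python
import numpy as np

def normalized_laplacian(A):
    """Symmetrically normalized Laplacian L_sym = I - D^{-1/2} A D^{-1/2}."""
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))   # D^{-1/2} stored as a vector
    return np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
```

L_sym is symmetric and positive semidefinite; its smallest eigenvalue is zero, and the multiplicity of that eigenvalue counts the connected components of the similarity graph.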
Subsequently, eigenvalue decomposition was performed:

$${L}_{\mathrm{sym}}={\bf{U}}\Lambda {{\bf{U}}}^{T}.$$

(4)

To determine the number of clusters k, the eigenvalues λ1, λ2, …, λn were sorted in ascending order and the elbow (the point of maximal curvature) of the resulting sequence was estimated using the kneed algorithm[80]. Finally, the spectral embeddings of the data were clustered by taking the k eigenvectors Uk corresponding to the smallest eigenvalues, normalizing the rows of Uk to unit length and performing k-means clustering on the rows (uk).

Chemical space visualization

To visualize each labelled dataset, molecules were first encoded as ECFPs. The resulting binary ECFPs were reduced to 100 components using truncated singular value decomposition and embedded into two dimensions using t-distributed stochastic neighbour embedding with a perplexity of 30 and otherwise default settings.

MCS fraction

To compute molecular core similarity, we computed the MCS fraction[55] between a molecule Ma and a reference molecule Mb as

$$\mathrm{MCS}\,\mathrm{fraction}=\frac{\left|\mathrm{MCS}\left({M}_{a},{M}_{b}\right)\right|}{\left|{M}_{a}\right|},$$

(5)

where |Ma| is the number of atoms in Ma and MCS(Ma, Mb) is the maximum common substructure between the two molecules, as determined by the FMCS algorithm[7] in RDKit. A high MCS fraction indicates that a molecule shares a large portion of its core structure with the reference molecule. Note that this measure is asymmetric.

Molecular complexity

To quantify molecular complexity, we computed the well-established Bertz complexity[81] and Böttcher complexity[82]. In addition, to align the complexity measures with the task of reconstructing SMILES strings, we estimated the complexity of the molecular graph directly[83] as

$${C}_{\mathrm{graph}}=V{\log }_{2}V-\mathop{\sum }\limits_{i\in {\mathscr{G}}}{V}_{i}{\log }_{2}{p}_{i},$$

(6)

where 𝒢 represents the molecular graph, V is the total number of elements in the graph and Vi is the number of elements in the ith set of elements. Similarly, we estimated the complexity of a SMILES string via its token entropy,

$${C}_{\mathrm{SMILES}}=-\mathop{\sum }\limits_{i\in S}{p}_{i}{\log }_{2}{p}_{i},$$

(7)

where S is the set of unique tokens in a SMILES string and pi is the probability of the ith token occurring in the string. Start-of-sequence, end-of-sequence and padding tokens were not considered.
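Equation (7) amounts to the Shannon entropy of the token distribution. A small sketch, assuming the SMILES string has already been tokenized and the special tokens stripped:

```python
from collections import Counter
from math import log2

def smiles_entropy(tokens):
    """Token-level Shannon entropy of a SMILES string (equation (7))."""
    n = len(tokens)
    return -sum((c / n) * log2(c / n) for c in Counter(tokens).values())
```

A one-token string such as 'CCCC' has zero entropy, whereas a string drawing evenly on more token types scores higher.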
Moreover, for each molecule, we counted the number of SMILES tokens (excluding padding) and the number of SMILES string branches (that is, '(' tokens), and recorded the presence of 50 unique molecular patterns[74].

Virtual screening

Top-k molecules were selected in a multi-objective manner by choosing the k molecules closest to the utopia point[68].
The distance to this ideal point is calculated as

$${d}_{\mathrm{utopia}}=\sqrt{\mathop{\sum }\limits_{i=1}^{n}{\left({\mathrm{norm}}_{i}\right)}^{2}},$$

(8)

where normi is the normalized objective that is either maximized (for example, predicted bioactivity) or minimized (for example, prediction uncertainty):

$${\mathrm{norm}}_{i}^{[\max ]}=\frac{{x}_{\max }-{x}_{i}}{{x}_{\max }-{x}_{\min }}\quad \mathrm{or}\quad {\mathrm{norm}}_{i}^{[\min ]}=\frac{{x}_{i}-{x}_{\min }}{{x}_{\max }-{x}_{\min }}.$$

(9)

Machine learning

Encoder

Canonical SMILES strings were encoded by a one-dimensional convolutional neural network. SMILES character tokens were embedded using a randomly initialized trainable embedding layer of size 128. Several one-dimensional convolutional layers were used, each followed by a ReLU activation, max pooling with a kernel size equal to that of the convolutional layer, and dropout. Both the convolutional and the pooling layers used a stride of 1 and no padding. The final output was flattened and compressed to a latent vector z of size 128 using a fully connected layer. The following hyperparameters were optimized (see 'Hyperparameter optimization'): the number of convolutional layers [2, 3], filter size [256, 512], kernel size [6, 8], weight decay on the convolutional weights [0, 1 × 10⁻⁴] and dropout [0, 0.1].

Decoder

Encoded latent molecular representations z were reconstructed back to SMILES strings using a conditioned LSTM model. A randomly initialized trainable embedding layer of 128 neurons was used to embed SMILES character tokens.
The following hyperparameters were optimized (see 'Hyperparameter optimization'): the number of LSTM layers (n_layers) [2, 3] and the LSTM hidden size (size_layers) [256, 512]. Models were trained autoregressively without teacher forcing, that is, each next token was predicted from the tokens generated in the previous steps rather than from the ground-truth tokens. A reconstruction loss normalized for sequence length was used (equation (1)). To condition the model, the LSTM hidden state h0 was initialized with z for every molecule. To match the dimensions of h0 (n_layers, size_layers), z was first transformed to n_layers × size_layers with a fully connected layer and then reshaped into n_layers chunks of size_layers.

Approximate Bayesian classifier

Labels were predicted from either ECFPs or SMILES strings encoded into latent molecular representations z. An MLP was used with several fully connected layers and an output layer of two neurons. The number of MLP layers [2, 3] and the MLP hidden size [1,024, 2,048] were optimized (see 'Hyperparameter optimization'). To estimate prediction uncertainty, anchored ensembling[47] was implemented on the MLP as in our previous work[74]. We used an ensemble of M = 10 models. The parameters θm of each model m ∈ [1…M] were regularized towards a set of 'anchor' parameters θanchor,m, which prevents the different models in the ensemble from converging to the same region of parameter space. Each model was initialized with distinct anchor parameters, controlled by different random seeds. The classification loss in our implementation is defined as

$${{\mathscr{L}}}_{\mathrm{MLP}}=\frac{1}{M}\mathop{\sum }\limits_{m=1}^{M}\Bigg(\underbrace{-\log {p}_{m}\left(y|x\right)}_{\text{prediction}}+\underbrace{\lambda {\Vert {\theta }^{m}-{\theta }_{\mathrm{anchor}}^{m}\Vert }^{2}}_{\text{anchoring}}\Bigg),$$

(10)

where λ is a regularization coefficient (set to 3 × 10⁻⁴).
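As a toy illustration of equation (10), the function below combines the negative log-likelihood of the true label with the L2 pull towards each model's anchor. `probs` holds each ensemble member's probability for the correct class, and the flattened weight vectors are stand-ins for real network parameters; none of these names come from the study's code.

```python
import numpy as np

def anchored_loss(probs, params, anchors, lam=3e-4):
    """Equation (10): ensemble mean of the prediction term -log p_m(y|x)
    plus the anchoring term lam * ||theta_m - theta_anchor_m||^2."""
    nll = -np.log(probs)                                     # prediction term
    anchoring = lam * ((params - anchors) ** 2).sum(axis=1)  # anchoring term
    return float((nll + anchoring).mean())
```

Moving a model's weights away from its anchor raises the loss even when its predictions are perfect, which is what keeps the ensemble members spread out in parameter space.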
To estimate the expected value 𝔼 for each molecule x, we take the mean prediction over the whole ensemble:

$${\mathbb{E}}\left(y|x\right)=\frac{1}{M}\mathop{\sum }\limits_{m=1}^{M}{p}_{m}\left(y|x\right).$$

(11)

Similarly, we estimate the prediction uncertainty for each molecule x as the mean entropy ℍ over all models in the ensemble:

$${\mathbb{H}}\left(y|x\right)=-\frac{1}{M}\mathop{\sum }\limits_{m=1}^{M}{p}_{m}\left(y|x\right)\log {p}_{m}\left(y|x\right).$$

(12)

JMM

Canonical SMILES strings were encoded into compressed latent molecular representations z using the encoder described above. Subsequently, z was used both for molecular property prediction with the classifier and for molecular reconstruction with the z-conditioned decoder[27]. The model was optimized jointly using the weighted composite loss

$${{\mathscr{L}}}_{\mathrm{JMM}}={{\mathscr{L}}}_{\mathrm{reconstruction}}+\gamma \times {{\mathscr{L}}}_{\mathrm{MLP}},$$

(13)

where the scalar γ was set to 0.1 based on preliminary experiments.
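Equations (11)–(13) reduce to a few lines of NumPy. In this sketch, `probs` has shape (M, number of classes), one row of strictly positive class probabilities per ensemble member, and natural logarithms are assumed.

```python
import numpy as np

def ensemble_mean(probs):
    """Equation (11): expected class probabilities across the M models."""
    return probs.mean(axis=0)

def ensemble_entropy(probs):
    """Equation (12): mean per-model entropy, used as prediction uncertainty.
    Assumes all probabilities are strictly positive."""
    return float(-(probs * np.log(probs)).sum(axis=1).mean())

def jmm_loss(recon_loss, clf_loss, gamma=0.1):
    """Equation (13): weighted composite objective of the joint model."""
    return recon_loss + gamma * clf_loss
```

An ensemble whose members all predict 50/50 is maximally uncertain (entropy log 2 for two classes), while unanimous confident members drive the entropy towards zero.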
A regular autoencoder was used because preliminary experiments showed no performance benefit from the more complex variational autoencoder[84] architecture.

RF

An RF classifier was trained on either ECFPs or CATS descriptors[56]. The following hyperparameters were optimized (see 'Hyperparameter optimization'): the number of trees [100, 250, 500, 1,000], the maximal tree depth [10, 20, 30, ∞] and the minimal number of samples per split [2, 5, 10].

Model training

Autoencoders

Encoder–decoder models were (pre)trained to reconstruct SMILES strings of general drug-like molecules from ChEMBL using the Adam optimizer. Mini-batches of 256 molecules were sampled uniformly at random from the training data for 1,000,000 steps. Gradients were clipped at a maximum norm of 5. Early stopping with a patience of up to 20 evaluation checkpoints was implemented by monitoring the validation loss every 10,000 steps.
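The checkpoint-based early stopping described above can be sketched as a small state machine; the class and method names are illustrative.

```python
class EarlyStopping:
    """Stop training after `patience` evaluation checkpoints without a
    new best validation loss (checkpoints occur every N training steps)."""

    def __init__(self, patience=20):
        self.patience = patience
        self.best = float("inf")
        self.stale = 0

    def should_stop(self, val_loss):
        """Record one checkpoint; return True once patience is exhausted."""
        if val_loss < self.best:
            self.best = val_loss   # this checkpoint's weights would be kept
            self.stale = 0
        else:
            self.stale += 1
        return self.stale >= self.patience
```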
The model checkpoint with the best validation loss was used.<\/p>\n<p>Classifiers<\/p>\n<p>Classifiers using ECFPs or SMILES strings as input were trained for molecular property prediction on each of the labelled datasets using a similar setup to the autoencoders. However, tenfold Monte Carlo cross-validation with 10% validation splits was used instead of one predefined data split. Mini batches of 64 molecules were resampled during training according to the occurrence of their class with<\/p>\n<p>$${P}_{c}=1-\\frac{{n}_{c}}{N},$$<\/p>\n<p>\n                    (14)\n                <\/p>\n<p>where \\({P}_{c}\\) is the probability of sampling class \\(c\\), \\({n}_{c}\\) is the number of samples of class \\(c\\), and \\(N\\) is the total number of samples. Models were trained for 5,000 steps with an early stopping patience of 10 evaluation checkpoints, performed every 10 steps. The model checkpoint with the best validation loss was used. For the RF control models, molecules were weighted in inverse proportion to their class frequency to mitigate class imbalance during training.<\/p>\n<p>JMMs<\/p>\n<p>Joint models, each consisting of a SMILES string encoder, a decoder and a classifier (Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"figure anchor\" href=\"http:\/\/www.nature.com\/articles\/s42256-026-01216-w#Fig1\" rel=\"nofollow noopener\" target=\"_blank\">1b<\/a>), were initialized with pretrained weights. The SMILES string encoders and classifiers used weights from models trained on the labelled datasets, and the decoder used weights from an autoencoder pretrained on ChEMBL. Using a mini batch size of 64, the joint models were finetuned for 10,000 steps with an early stopping patience of 50 evaluation checkpoints, performed every 20 steps. A learning rate of 3\u2009\u00d7\u200910\u22126 was used for the encoder and classifier, whereas a learning rate of 3\u2009\u00d7\u200910\u22127 was used for the decoder. 
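The class-balanced resampling of equation (14) can be turned into per-sample sampling probabilities with a short NumPy sketch; the helper name is hypothetical, and normalizing across samples makes the result usable directly with `np.random.choice`:

```python
import numpy as np

def sampling_probabilities(labels):
    """Turn equation (14), P_c = 1 - n_c / N, into per-sample sampling
    probabilities, so that minority-class molecules are drawn more often."""
    labels = np.asarray(labels)
    classes, inverse, counts = np.unique(
        labels, return_inverse=True, return_counts=True
    )
    p_class = 1.0 - counts / labels.size   # P_c for each class c
    per_sample = p_class[inverse]          # broadcast P_c to every sample
    return per_sample / per_sample.sum()   # normalize for np.random.choice

# Drawing a resampled mini batch of 64 (illustrative):
# batch_idx = np.random.choice(len(labels), size=64,
#                              p=sampling_probabilities(labels))
```

For example, with three samples of one class and one of another, the minority sample receives three times the sampling weight of each majority sample.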
No weight decay was applied.<\/p>\n<p>Hyperparameter optimization<\/p>\n<p>Hyperparameters, as specified previously, were optimized for all autoencoders, classifiers (using ECFPs or SMILES strings), and RF control models using a simple grid search. Tenfold Monte Carlo cross-validation was used, repeatedly using 10% of the training data as a validation split. The hyperparameters of the model with the best mean validation loss were used.<\/p>\n<p>Model evaluation<\/p>\n<p>Predictions were evaluated according to the following performance metrics:<\/p>\n<p>$$\\text{Balanced accuracy}=\\frac{1}{2}\\left(\\frac{\\mathrm{TP}}{\\mathrm{TP}+\\mathrm{FN}}+\\frac{\\mathrm{TN}}{\\mathrm{TN}+\\mathrm{FP}}\\right),$$<\/p>\n<p>\n                    (15)\n                <\/p>\n<p>$$\\text{Precision}=\\frac{\\mathrm{TP}}{\\mathrm{TP}+\\mathrm{FP}},$$<\/p>\n<p>\n                    (16)\n                <\/p>\n<p>where TP is the number of true (that is, correctly predicted) positives, TN is the number of true negatives, FN is the number of false negatives, and FP is the number of false positives. In addition, the hit rate (true positive rate) was determined for virtual screening experiments as<\/p>\n<p>$$\\text{Hit rate}=\\frac{\\mathrm{TP}}{P},$$<\/p>\n<p>\n                    (17)\n                <\/p>\n<p>and the enrichment factor as<\/p>\n<p>$$\\text{Enrichment factor}=\\frac{{\\mathrm{TP}}_{k}}{P\/N}.$$<\/p>\n<p>\n                    (18)\n                <\/p>\n<p>Here, \\({\\mathrm{TP}}_{k}\\) is the number of correctly identified positives in the subset of \\(k\\) prioritized molecules, \\(P\\) is the total number of positives in the full dataset, and \\(N\\) is the total number of molecules in the full dataset.<\/p>\n<p>Biological characterization<\/p>\n<p>Sixty screening compounds were purchased from Specs Compound Handling B.V. 
and dissolved at 10\u2009mM in 100% DMSO.<\/p>\n<p>To screen for bioactivity, a single-point screen was first performed at a concentration of 10\u2009\u00b5M (in 1% DMSO) in technical triplicates using the ADP-Glo Kinase Assay platform (Promega) with the Chemi-Verse PIM1 Kinase Assay Kit and the Chemi-Verse CDK1\/CyclinA2 Kinase Assay Kit (BPS Bioscience) in Costar flat white 96-well plates. AZD1208 (CAS 1204144-28-4) and dinaciclib (CAS 779353-01-4), purchased from TargetMol Chemicals, were used as positive controls for PIM1 and CDK1, respectively. Bioactivity was measured as the area under the curve of an 18-step luminescence scan between 398\u2009nm and 653\u2009nm with an integration time of 1\u2009s and a settle time of 100\u2009ms, normalized to the signal in buffer-only wells.<\/p>\n<p>For each target protein, the six compounds with the highest bioactivity at 10\u2009\u00b5M were followed up with an 8-point dose\u2013response curve using the same assay. Screening compounds were measured in technical triplicates from 10\u2009\u00b5M down to 0.0046\u2009\u00b5M, whereas reference compounds were measured in duplicate.<\/p>\n<p>Hardware and training set-up<\/p>\n<p>All computational experiments were performed on a Lenovo ThinkSystem SD650-N v2 server equipped with Intel Xeon Platinum 8360Y central processing units and NVIDIA A100 (40\u2009GB) graphics processing units. Up to five models were trained in parallel on a single graphics processing unit.<\/p>\n<p>Software and code<\/p>\n<p>All code was implemented in Python (v. 3.12). Deep learning models were implemented using PyTorch (v. 2.3.0)<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 85\" title=\"Paszke, A. et al. PyTorch: an imperative style, high-performance deep learning library. Adv. Neural Inf. Process. Syst. 
32, 8024&#x2013;8035 (2019).\" href=\"http:\/\/www.nature.com\/articles\/s42256-026-01216-w#ref-CR85\" id=\"ref-link-section-d95382669e5304\" rel=\"nofollow noopener\" target=\"_blank\">85<\/a>. Traditional machine learning models and clustering were implemented using scikit-learn v.1.4.0 (ref. <a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 79\" title=\"Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825&#x2013;2830 (2011).\" href=\"http:\/\/www.nature.com\/articles\/s42256-026-01216-w#ref-CR79\" id=\"ref-link-section-d95382669e5308\" rel=\"nofollow noopener\" target=\"_blank\">79<\/a>). All molecular data were handled using RDKit (v.2024.3.3)<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 77\" title=\"Landrum, G. RDKit: Open-Source Cheminformatics. https:\/\/www.rdkit.org (2006).\" href=\"http:\/\/www.nature.com\/articles\/s42256-026-01216-w#ref-CR77\" id=\"ref-link-section-d95382669e5312\" rel=\"nofollow noopener\" target=\"_blank\">77<\/a>. 
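For concreteness, the virtual-screening metrics of equations (17) and (18), defined under 'Model evaluation' above, can be computed from a ranked hit list as follows. The function and argument names are illustrative, the top-\(k\) ranked molecules are treated as the prioritized subset, and the enrichment factor follows the formula exactly as stated in the text:

```python
import numpy as np

def screening_metrics(y_true, scores, k):
    """Hit rate (equation (17)) and enrichment factor (equation (18)).

    y_true: 1 for experimentally active molecules, 0 otherwise.
    scores: model confidence used to rank the screening library.
    k:      number of prioritized (top-ranked) molecules.
    """
    y_true = np.asarray(y_true)
    order = np.argsort(-np.asarray(scores, dtype=float))  # best-scored first
    tp_k = int(y_true[order[:k]].sum())  # actives recovered among the top k
    P = int(y_true.sum())                # all actives in the full dataset
    N = y_true.size                      # all molecules in the full dataset
    hit_rate = tp_k / P                  # equation (17)
    enrichment = tp_k / (P / N)          # equation (18), as stated
    return hit_rate, enrichment
```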
For data visualization, R (v.4.3.0) and the R package ggplot2 (v.3.4.2) were used along with Adobe Illustrator.<\/p>\n<p>Reporting summary<\/p>\n<p>Further information on research design is available in the <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s42256-026-01216-w#MOESM2\" rel=\"nofollow noopener\" target=\"_blank\">Nature Portfolio Reporting Summary<\/a> linked to this article.<\/p>