Profiling and analysis of chemical compounds using pointwise mutual information
Status: PubMed-not-MEDLINE; Language: English; Country: England, United Kingdom; Medium: electronic
Document type: journal article
Grant support
RVO 68378050-KAV-NPUI
Ministry of Education of the Czech Republic
LM2018130
Ministry of Education of the Czech Republic
PubMed
33423694
PubMed Central
PMC7798221
DOI
10.1186/s13321-020-00483-y
PII: 10.1186/s13321-020-00483-y
- Keywords
- Hashed fingerprint, Information theory, Pointwise mutual information, Structural key, Synthetic accessibility
- Publication type
- Journal article (MeSH)
Pointwise mutual information (PMI) is a measure of association used in information theory. In this paper, PMI is used to characterize several publicly available databases (DrugBank, ChEMBL, PubChem and ZINC) in terms of the association strength between compound structural features, yielding a PMI interrelation profile for each database. The structural features are substructure fragments obtained by encoding individual compounds as MACCS, PubChemKey and ECFP fingerprints. The analysis of the publicly available databases reveals, in accord with other studies, unusual properties of DrugBank compounds, which further confirms the validity of the PMI profiling approach. Z-standardized relative feature tightness (ZRFT), a PMI-derived measure that quantifies how well a given compound's feature combinations fit those in a particular compound set, is applied to the analysis of compound synthetic accessibility (SA), as well as to the classification of compounds as easy (ES) or hard (HS) to synthesize. ZRFT value distributions are compared with those of SYBA and SAScore. The analysis of ZRFT values of structurally complex compounds in the SAVI database reveals oligopeptide structures that are mispredicted by SAScore as HS, while being correctly predicted by ZRFT and SYBA as ES. Compared to SAScore, SYBA and random forest (RF), ZRFT predictions are less accurate, though by a narrow margin (Acc_ZRFT = 94.5%, Acc_SYBA = 98.8%, Acc_SAScore = 99.0%, Acc_RF = 97.3%). However, the ability of ZRFT to distinguish between ES and HS compounds is surprisingly high, considering that SYBA, SAScore and random forest are dedicated SA models, whereas ZRFT is a generic measure that merely quantifies the strength of interrelations between structural feature pairs. The results presented in the current work indicate that structural feature co-occurrence, quantified by PMI or ZRFT, contains a significant amount of information relevant to the physico-chemical properties of organic compounds.
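To make the central quantity concrete, the following is a minimal sketch of how pairwise PMI between structural features could be estimated from a set of fingerprints, using the standard definition PMI(a, b) = log2( p(a, b) / (p(a) p(b)) ). It is not the authors' implementation: the function name `pmi_table`, the representation of each fingerprint as a Python set of on-bit indices, and the toy data are assumptions made for illustration only. ZRFT is not computed here, because its exact definition is not given in this record.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_table(fingerprints):
    """Estimate PMI (in bits) for every feature pair that co-occurs at least once.

    `fingerprints` is an iterable of sets; each set holds the indices of the
    features (on-bits) present in one compound. This input format is an
    assumption of the sketch, not something prescribed by the paper.
    """
    n_compounds = 0
    single_counts = Counter()  # number of compounds containing feature a
    pair_counts = Counter()    # number of compounds containing both a and b

    for fp in fingerprints:
        n_compounds += 1
        features = sorted(set(fp))
        single_counts.update(features)
        pair_counts.update(combinations(features, 2))

    table = {}
    for (a, b), n_ab in pair_counts.items():
        # p(a,b) / (p(a) * p(b)) simplifies to n_ab * N / (n_a * n_b)
        ratio = n_ab * n_compounds / (single_counts[a] * single_counts[b])
        table[(a, b)] = math.log2(ratio)
    return table


# Toy data (hypothetical feature indices): features 1 and 2 always co-occur.
toy = [{1, 2}, {1, 2}, {3}, {3}]
print(pmi_table(toy)[(1, 2)])  # log2(2 * 4 / (2 * 2)) = 1.0 bit
```

Positive values indicate features that co-occur more often than expected under independence, and a value of zero indicates independence. Pairs that never co-occur are omitted here because their PMI is negative infinity; how such pairs are handled in the published method is not stated in this record.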