Profiling and analysis of chemical compounds using pointwise mutual information

2021 Jan 10; 13(1): 3. [Epub 2021 Jan 10]

Status PubMed-not-MEDLINE Language English Country England, United Kingdom Medium electronic

Document type journal articles

Persistent link   https://www.medvik.cz/link/pmid33423694

Grant support
RVO 68378050-KAV-NPUI Ministry of Education of the Czech Republic
LM2018130 Ministry of Education of the Czech Republic

Links

PubMed 33423694
PubMed Central PMC7798221
DOI 10.1186/s13321-020-00483-y
PII: 10.1186/s13321-020-00483-y
Knihovny.cz E-resources

Pointwise mutual information (PMI) is a measure of association used in information theory. In this paper, PMI is used to characterize several publicly available databases (DrugBank, ChEMBL, PubChem and ZINC) in terms of the association strength between compound structural features, which yields a PMI interrelation profile for each database. The structural features are substructure fragments obtained by encoding individual compounds as MACCS, PubChemKey and ECFP fingerprints. The analysis of the publicly available databases reveals, in accord with other studies, unusual properties of DrugBank compounds, which further supports the validity of the PMI profiling approach. Z-standardized relative feature tightness (ZRFT), a PMI-derived measure that quantifies how well a given compound's feature combinations fit those in a particular compound set, is applied to the analysis of compound synthetic accessibility (SA) and to the classification of compounds as easy (ES) or hard (HS) to synthesize. ZRFT value distributions are compared with those of SYBA and SAScore. The analysis of ZRFT values of structurally complex compounds in the SAVI database reveals oligopeptide structures that SAScore mispredicts as HS, while ZRFT and SYBA correctly predict them as ES. ZRFT predictions are less accurate than those of SAScore, SYBA and random forest, though by a narrow margin (accuracy: ZRFT 94.5%, SYBA 98.8%, SAScore 99.0%, random forest 97.3%). However, ZRFT's ability to distinguish between ES and HS compounds is surprisingly high, considering that SYBA, SAScore and random forest are dedicated SA models, whereas ZRFT is a generic measure that merely quantifies the strength of interrelations between structural feature pairs. The results presented in the current work indicate that structural feature co-occurrence, quantified by PMI or ZRFT, contains a significant amount of information relevant to the physico-chemical properties of organic compounds.
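To make the central quantity concrete, the following is a minimal sketch of how PMI between structural feature pairs could be estimated from a compound set, using the standard definition PMI(a, b) = log2(p(a, b) / (p(a) p(b))). It is not the authors' implementation. The function name `pmi_table`, the toy data, and the representation of each fingerprint as a set of on-bit indices are assumptions made for illustration only. The abstract does not give the ZRFT formula, so ZRFT is not sketched here.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_table(fingerprints):
    """Return {(a, b): PMI in bits} for each feature pair seen together at least once.

    `fingerprints` is a list of sets of on-bit indices (a hypothetical,
    simplified stand-in for MACCS, PubChemKey or ECFP fingerprints).
    Probabilities are estimated as plain relative frequencies over the set.
    """
    n = len(fingerprints)
    single = Counter()  # number of compounds containing each feature
    pair = Counter()    # number of compounds containing both features of a pair
    for fp in fingerprints:
        feats = sorted(set(fp))
        single.update(feats)
        pair.update(combinations(feats, 2))

    table = {}
    for (a, b), n_ab in pair.items():
        p_ab = n_ab / n
        p_a = single[a] / n
        p_b = single[b] / n
        # PMI > 0: the pair co-occurs more often than expected under independence.
        table[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return table


# Toy compound set: four "compounds", each a set of feature indices.
toy_set = [{1, 2}, {1, 2}, {1, 3}, {3, 4}]
pmi = pmi_table(toy_set)
for features, value in sorted(pmi.items()):
    print(features, round(value, 3))
```

In this toy set, features 3 and 4 are positively associated, while features 1 and 3 co-occur less often than independence would predict and therefore receive a negative PMI. Pairs that never co-occur are simply absent from the table, since their PMI would be negative infinity; a real analysis would need to decide how to handle them.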

