Sachem: a chemical cartridge for high-performance substructure search
Status: PubMed-not-MEDLINE; Language: English; Country: United Kingdom, England; Medium: electronic
Document type: journal article
Grant support
LM2015047
Ministry of Education, Youth and Sports (Ministerstvo školství, mládeže a tělovýchovy)
61388963
Institute of Organic Chemistry and Biochemistry of the CAS (RVO)
PubMed
29797000
PubMed Central
PMC5966370
DOI
10.1186/s13321-018-0282-y
PII: 10.1186/s13321-018-0282-y
- Keywords
- Inverted indices, Molecule cartridges, Small molecule databases, Substructure search
- Publication type
- Journal article (MeSH)
BACKGROUND: Structure search is one of the valuable capabilities of small-molecule databases. Fingerprint-based screening methods are usually employed to enhance the search performance by reducing the number of calls to the verification procedure. In substructure search, fingerprints are designed to capture important structural aspects of the molecule to aid the decision about whether the molecule contains a given substructure. Currently available cartridges typically provide acceptable search performance for processing user queries, but do not scale satisfactorily with dataset size.

RESULTS: We present Sachem, a new open-source chemical cartridge that implements two substructure search methods: the first is a performance-oriented reimplementation of substructure indexing based on the OrChem fingerprint, and the second is a novel method that employs newly designed fingerprints stored in inverted indices. We assessed the performance of both methods on small, medium, and large datasets containing 1, 10, and 94 million compounds, respectively. Comparison of Sachem with other freely available cartridges revealed improvements in overall performance, scaling potential and screen-out efficiency.

CONCLUSIONS: The Sachem cartridge allows efficient substructure searches in databases of all sizes. The sublinear performance scaling of the second method and the ability to efficiently query large amounts of pre-extracted information may together open the door to new applications for substructure searches.
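
The screen-then-verify idea described in the abstract can be illustrated with a minimal sketch. This is a toy analogue, not Sachem's implementation. Plain strings stand in for molecules, character bigrams stand in for fingerprint features, and Python's substring test stands in for the subgraph-isomorphism verification step. All names (`features`, `build_index`, `search`, `records`) are hypothetical and chosen for illustration only.

```python
from collections import defaultdict


def features(text, n=2):
    """Return the set of length-n substrings (toy 'fingerprint' features)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def build_index(records, n=2):
    """Inverted index: map each feature to the ids of records containing it."""
    index = defaultdict(set)
    for rid, text in records.items():
        for feat in features(text, n):
            index[feat].add(rid)
    return index


def search(query, records, index, n=2):
    """Screen candidates via the inverted index, then verify each survivor.

    Returns (sorted hit ids, number of candidates that reached verification).
    """
    query_feats = features(query, n)
    if query_feats:
        # A record can only match if it contains every query feature, so
        # intersect the posting lists, starting with the shortest one.
        postings = sorted((index.get(f, set()) for f in query_feats), key=len)
        candidates = set(postings[0])
        for posting in postings[1:]:
            candidates &= posting
    else:
        # Query shorter than n: nothing to screen on, verify everything.
        candidates = set(records)
    # Verification (assumption: substring test replaces subgraph matching).
    hits = sorted(rid for rid in candidates if query in records[rid])
    return hits, len(candidates)


# Toy data: strings only loosely resembling chemical notation.
records = {1: "CCOCC", 2: "CCNCC", 3: "OCCN", 4: "COC", 5: "CNOCCO"}
index = build_index(records)
hits, verified = search("COC", records, index)
print(hits, verified)
```

In this toy run, only three of the five records reach the verification step, and one of them (record 5) is a false positive that verification rejects. Screen-out efficiency in this sense is the fraction of non-matching records eliminated before the expensive check.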