IDSM ChemWebRDF: SPARQLing small-molecule datasets
Status PubMed-not-MEDLINE Language English Country England, United Kingdom Medium electronic
Document type journal articles
Grant support
LM2018131
Ministerstvo Školství, Mládeže a Telovýchovy
RVO:61388963
Ústav Organické Chemie a Biochemie, Akademie Ved Ceské Republiky
PubMed
33980298
PubMed Central
PMC8117646
DOI
10.1186/s13321-021-00515-1
PII: 10.1186/s13321-021-00515-1
- Keywords
- Resource Description Framework, SPARQL, Small-molecule datasets
- Publication type
- journal articles (MeSH)
The Resource Description Framework (RDF), together with well-defined ontologies, significantly increases data interoperability and usability. The SPARQL query language was introduced to retrieve requested RDF data and to explore links between them. Among other useful features, SPARQL supports federated queries that combine multiple independent data source endpoints. This allows users to obtain insights that are not possible using only a single data source. Owing to all of these useful features, many biological and chemical databases present their data in RDF, and support SPARQL querying. In our project, we primarily focused on the PubChem, ChEMBL and ChEBI small-molecule datasets. These datasets are already being exported to RDF by their creators. However, none of them has an official and currently supported SPARQL endpoint. This omission makes it difficult to construct complex or federated queries that could access all of the datasets, thus underutilising the main advantage of the availability of RDF data. Our goal is to address this gap by integrating the datasets into one database called the Integrated Database of Small Molecules (IDSM) that will be accessible through a SPARQL endpoint. Beyond that, we will also focus on increasing mutual interoperability of the datasets. To realise the endpoint, we decided to implement an in-house developed SPARQL engine based on the PostgreSQL relational database for data storage. In our approach, data are stored in the traditional relational form, and the SPARQL engine translates incoming SPARQL queries into equivalent SQL queries. An important feature of the engine is that it optimises the resulting SQL queries. Together with optimisations performed by PostgreSQL, this allows efficient evaluation of SPARQL queries. The endpoint provides not only querying in the dataset, but also the compound substructure and similarity search supported by our Sachem project.
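As a rough illustration of translating a SPARQL pattern into SQL over relational storage, the following minimal sketch maps a single triple pattern to a query on one table. It is not the IDSM engine: the table `compound`, its columns, the `ex:` predicates and the function `translate_pattern` are all hypothetical, and SQLite stands in for PostgreSQL so that the example is self-contained.

```python
import sqlite3

# Hypothetical mapping from RDF predicates to (table, column) pairs.
# These names are illustrative only; they are not the IDSM schema.
PREDICATE_TO_COLUMN = {
    "ex:name": ("compound", "name"),
    "ex:molecularWeight": ("compound", "mol_weight"),
}


def translate_pattern(predicate, obj=None):
    """Translate `?s <predicate> ?o` (or `?s <predicate> obj`) into SQL.

    Returns the SQL text and the tuple of bound parameters.
    """
    table, column = PREDICATE_TO_COLUMN[predicate]
    # A triple exists only where the column holds a value.
    sql = f"SELECT id, {column} FROM {table} WHERE {column} IS NOT NULL"
    params = ()
    if obj is not None:
        # A constant object becomes an equality filter.
        sql += f" AND {column} = ?"
        params = (obj,)
    return sql, params


def demo():
    """Run two translated patterns against a tiny in-memory table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE compound (id INTEGER PRIMARY KEY, name TEXT, mol_weight REAL)"
    )
    conn.executemany(
        "INSERT INTO compound VALUES (?, ?, ?)",
        [(1, "aspirin", 180.16), (2, "caffeine", 194.19), (3, None, 46.07)],
    )
    sql, params = translate_pattern("ex:name")
    all_names = conn.execute(sql + " ORDER BY id", params).fetchall()
    sql, params = translate_pattern("ex:name", "caffeine")
    matches = conn.execute(sql, params).fetchall()
    conn.close()
    return all_names, matches
```

In this sketch, a row without a name yields no `ex:name` triple, which is why the generated SQL filters out NULL values. A real engine would additionally join several patterns and optimise the resulting SQL, as the abstract describes.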
Although the endpoint is accessible from an internet browser, it is mainly intended to be used for programmatic access by other services, for example as a part of federated queries. For regular users, we offer a rich web application called ChemWebRDF using the endpoint. The application is publicly available at https://idsm.elixir-czech.cz/chemweb/ .
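Programmatic access of the kind described above could look roughly like the following sketch, which only builds a SPARQL 1.1 Protocol GET request and does not send it. Both endpoint URLs are assumptions and should be checked against the current IDSM documentation; the helper names are hypothetical, and the query uses a generic `?s ?p ?o` pattern inside a `SERVICE` clause rather than any real dataset vocabulary.

```python
from urllib.parse import parse_qs, urlencode, urlsplit
from urllib.request import Request

# Assumed endpoint addresses; verify them before use.
IDSM_ENDPOINT = "https://idsm.elixir-czech.cz/sparql/endpoint/idsm"
REMOTE_ENDPOINT = "https://sparql.rhea-db.org/sparql"


def federated_query(remote, limit=5):
    """Build a federated query that delegates one pattern to `remote`."""
    return (
        "SELECT ?s ?p ?o WHERE {\n"
        f"  SERVICE <{remote}> {{ ?s ?p ?o }}\n"
        "}\n"
        f"LIMIT {int(limit)}"
    )


def build_request(endpoint, query):
    """Encode `query` as a SPARQL 1.1 Protocol GET request (not sent)."""
    url = endpoint + "?" + urlencode({"query": query})
    # Ask for the standard JSON serialisation of SELECT results.
    return Request(url, headers={"Accept": "application/sparql-results+json"})


def query_from_request(request):
    """Decode the query text back out of a request URL, for inspection."""
    return parse_qs(urlsplit(request.full_url).query)["query"][0]
```

To actually run the query, the request returned by `build_request(IDSM_ENDPOINT, federated_query(REMOTE_ENDPOINT))` could be passed to `urllib.request.urlopen` and the response body parsed as JSON. That step is left out here because it depends on network access and on the endpoints being available.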