IDSM ChemWebRDF: SPARQLing small-molecule datasets

2021 May 12; 13(1): 38. [epub] 20210512

Status: PubMed-not-MEDLINE; Language: English; Country: England, United Kingdom; Medium: electronic

Document type: journal articles

Persistent link   https://www.medvik.cz/link/pmid33980298

Grant support
LM2018131 Ministerstvo Školství, Mládeže a Telovýchovy
RVO:61388963 Ústav Organické Chemie a Biochemie, Akademie Ved Ceské Republiky

Links

PubMed 33980298
PubMed Central PMC8117646
DOI 10.1186/s13321-021-00515-1
PII: 10.1186/s13321-021-00515-1
Knihovny.cz E-resources

The Resource Description Framework (RDF), together with well-defined ontologies, significantly increases data interoperability and usability. The SPARQL query language was introduced to retrieve requested RDF data and to explore links between them. Among other useful features, SPARQL supports federated queries that combine multiple independent data source endpoints. This allows users to obtain insights that are not possible using only a single data source. Owing to all of these useful features, many biological and chemical databases present their data in RDF and support SPARQL querying. In our project, we primarily focused on the PubChem, ChEMBL and ChEBI small-molecule datasets. These datasets are already being exported to RDF by their creators. However, none of them has an official and currently supported SPARQL endpoint. This omission makes it difficult to construct complex or federated queries that could access all of the datasets, thus underutilising the main advantage of the availability of RDF data. Our goal is to address this gap by integrating the datasets into one database called the Integrated Database of Small Molecules (IDSM) that will be accessible through a SPARQL endpoint. Beyond that, we will also focus on increasing mutual interoperability of the datasets. To realise the endpoint, we decided to implement an in-house developed SPARQL engine based on the PostgreSQL relational database for data storage. In our approach, data are stored in the traditional relational form, and the SPARQL engine translates incoming SPARQL queries into equivalent SQL queries. An important feature of the engine is that it optimises the resulting SQL queries. Together with optimisations performed by PostgreSQL, this allows efficient evaluation of SPARQL queries. The endpoint supports not only querying of the datasets but also compound substructure and similarity searches provided by our Sachem project.
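The translation step described above can be illustrated with a toy sketch. It is not the IDSM engine: the `MAPPING` table, the `ex:` predicate names, the `compound` schema and the use of SQLite in place of PostgreSQL are all assumptions made for illustration. The sketch turns a basic graph pattern (a list of triple patterns) into a single SQL query, using shared variables as join conditions and constants as filters.

```python
import sqlite3

# Hypothetical mapping: predicate -> (table, subject column, object column).
MAPPING = {
    "ex:name": ("compound", "id", "name"),
    "ex:formula": ("compound", "id", "formula"),
}


def translate(patterns, select_vars):
    """Translate (subject, predicate, object) patterns into (sql, params)."""
    froms, wheres, params, bound = [], [], [], {}
    for i, (s, p, o) in enumerate(patterns):
        table, s_col, o_col = MAPPING[p]
        alias = f"t{i}"
        froms.append(f"{table} AS {alias}")
        for term, col in ((s, s_col), (o, o_col)):
            ref = f"{alias}.{col}"
            if isinstance(term, str) and term.startswith("?"):
                if term in bound:
                    # Variable seen before: emit a join condition.
                    wheres.append(f"{ref} = {bound[term]}")
                else:
                    # First occurrence: remember which column binds it.
                    bound[term] = ref
            else:
                # Constant term: filter through a bound SQL parameter.
                wheres.append(f"{ref} = ?")
                params.append(term)
    cols = ", ".join(f"{bound[v]} AS {v[1:]}" for v in select_vars)
    sql = f"SELECT {cols} FROM {', '.join(froms)}"
    if wheres:
        sql += " WHERE " + " AND ".join(wheres)
    return sql, params


# Toy data set standing in for the relational storage.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE compound (id INTEGER PRIMARY KEY, name TEXT, formula TEXT)"
)
conn.executemany(
    "INSERT INTO compound VALUES (?, ?, ?)",
    [(1, "benzene", "C6H6"), (2, "water", "H2O")],
)

# Roughly: SELECT ?n WHERE { ?c ex:name ?n . ?c ex:formula "C6H6" }
sql, params = translate(
    [("?c", "ex:name", "?n"), ("?c", "ex:formula", "C6H6")], ["?n"]
)
rows = conn.execute(sql, params).fetchall()
print(sql)
print(rows)
```

Even this small case hints at why optimising the generated SQL matters: the naive translation joins `compound` with itself, whereas an optimising translator could, for instance, recognise that both patterns read the same row and drop the join.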
Although the endpoint is accessible from an internet browser, it is mainly intended for programmatic access by other services, for example as a part of federated queries. For regular users, we offer a rich web application called ChemWebRDF that uses the endpoint. The application is publicly available at https://idsm.elixir-czech.cz/chemweb/.
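As a rough illustration of such programmatic access, the sketch below builds a federated query with a `SERVICE` clause and wraps it in a SPARQL 1.1 Protocol GET request using only the Python standard library. The endpoint addresses are `example.org` placeholders, not the real IDSM address, and the helper names (`federated_query`, `build_request`, `run_query`) are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def federated_query(remote_endpoint: str, pattern: str, limit: int = 5) -> str:
    """Wrap a graph pattern in a SERVICE clause delegated to remote_endpoint."""
    return (
        "SELECT * WHERE {\n"
        f"  SERVICE <{remote_endpoint}> {{ {pattern} }}\n"
        "}\n"
        f"LIMIT {limit}"
    )


def build_request(endpoint: str, query: str) -> urllib.request.Request:
    """Build a 'query via GET' request that asks for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def run_query(endpoint: str, query: str) -> list:
    """Send the request and return the result bindings (needs network access)."""
    with urllib.request.urlopen(build_request(endpoint, query), timeout=30) as resp:
        return json.load(resp)["results"]["bindings"]


# Placeholder addresses: substitute the real endpoint URLs before calling run_query.
query = federated_query("https://example.org/remote/sparql", "?s ?p ?o", limit=3)
request = build_request("https://example.org/sparql", query)
print(query)
print(request.full_url)
```

Separating `build_request` from `run_query` keeps query construction inspectable without network access. Only `run_query` contacts a server.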


