DECIMER.ai: an open platform for automated optical chemical structure identification, segmentation and recognition in scientific publications

2023 Aug 19; 14(1): 5045. [Epub 2023 Aug 19]

Status: PubMed-not-MEDLINE; Language: English; Country: England, United Kingdom; Medium: electronic

Document type: journal article

Persistent link: https://www.medvik.cz/link/pmid37598180

Grant support
239748522 Deutsche Forschungsgemeinschaft (German Research Foundation)
LM2023052 Ministerstvo Školství, Mládeže a Tělovýchovy (Ministry of Education, Youth and Sports)

Links

PubMed 37598180
PubMed Central PMC10439916
DOI 10.1038/s41467-023-40782-0
PII: 10.1038/s41467-023-40782-0
Knihovny.cz E-resources

The number of publications describing chemical structures has increased steadily over the last decades. However, the majority of published chemical information is currently not available in machine-readable form in public databases. Automating information extraction so that it requires less manual intervention remains a challenge, especially for the mining of chemical structure depictions. As an open-source platform that leverages recent advancements in deep learning, computer vision, and natural language processing, DECIMER.ai (Deep lEarning for Chemical IMagE Recognition) strives to automatically segment, classify, and translate chemical structure depictions from the printed literature. The segmentation and classification tools are the only openly available packages of their kind, and the optical chemical structure recognition (OCSR) core application yields outstanding performance on all benchmark datasets. The source code, the trained models, and the datasets developed in this work have been published under permissive licences. An instance of the DECIMER web application is available at https://decimer.ai.

Erratum in

PubMed

