DECIMER.ai: an open platform for automated optical chemical structure identification, segmentation and recognition in scientific publications
Status PubMed-not-MEDLINE Language English Country England, United Kingdom Medium electronic
Document type journal article
Grant support
239748522
Deutsche Forschungsgemeinschaft (German Research Foundation)
LM2023052
Ministerstvo Školství, Mládeže a Tělovýchovy (Ministry of Education, Youth and Sports)
PubMed
37598180
PubMed Central
PMC10439916
DOI
10.1038/s41467-023-40782-0
PII: 10.1038/s41467-023-40782-0
The number of publications describing chemical structures has increased steadily over the last decades. However, the majority of published chemical information is currently not available in machine-readable form in public databases. It remains a challenge to automate the process of information extraction in a way that requires less manual intervention, especially the mining of chemical structure depictions. As an open-source platform that leverages recent advancements in deep learning, computer vision, and natural language processing, DECIMER.ai (Deep lEarning for Chemical IMagE Recognition) strives to automatically segment, classify, and translate chemical structure depictions from the printed literature. The segmentation and classification tools are the only openly available packages of their kind, and the optical chemical structure recognition (OCSR) core application yields outstanding performance on all benchmark datasets. The source code, the trained models and the datasets developed in this work have been published under permissive licences. An instance of the DECIMER web application is available at https://decimer.ai.