Genomic benchmarks: a collection of datasets for genomic sequence classification

K. Grešová, V. Martinek, D. Čechák, P. Šimeček, P. Alexiou

Grešová, Katarína (author): Centre for Molecular Medicine, Central European Institute of Technology (CEITEC), Masaryk University, Brno, Czechia; National Centre for Biomolecular Research, Faculty of Science, Masaryk University, Brno, Czechia
Martinek, Vlastimil (author): Centre for Molecular Medicine, Central European Institute of Technology (CEITEC), Masaryk University, Brno, Czechia; National Centre for Biomolecular Research, Faculty of Science, Masaryk University, Brno, Czechia
Čechák, David (author): Centre for Molecular Medicine, Central European Institute of Technology (CEITEC), Masaryk University, Brno, Czechia; National Centre for Biomolecular Research, Faculty of Science, Masaryk University, Brno, Czechia
Šimeček, Petr (author, ORCID 0000-0002-2922-7183): Centre for Molecular Medicine, Central European Institute of Technology (CEITEC), Masaryk University, Brno, Czechia. petr.simecek@ceitec.muni.cz
Alexiou, Panagiotis (author): Centre for Molecular Medicine, Central European Institute of Technology (CEITEC), Masaryk University, Brno, Czechia

BMC Genomic Data. 2023; 24(1): 25. [pub] 20230501

BMC Genom Data
ISSN 2730-6844

Language: English. Country: England, United Kingdom

Document type: Journal Article; Research Support, Non-U.S. Gov't

Persistent link: https://www.medvik.cz/link/bmc23011522

Online full text

NLK BioMedCentral, from 2000-01-12
Directory of Open Access Journals, from 2021
PubMed Central, from 2021
ProQuest Central, from 2009-01-01
Medline Complete (EBSCOhost), from 2021-01-25
Health & Medicine (ProQuest), from 2009-01-01
ROAD: Directory of Open Access Scholarly Resources, from 2021
Springer Nature OA/Free Journals, from 2000-12-01

PubMed 37127596
DOI 10.1186/s12863-023-01123-8
Knihovny.cz E-resources

BACKGROUND: Recently, deep neural networks have been successfully applied in many biological fields. In 2020, the deep learning model AlphaFold won the protein folding competition with predicted structures within the error tolerance of experimental methods. However, this solution to the most prominent bioinformatic challenge of the past 50 years was possible only thanks to a carefully curated benchmark of experimentally determined protein structures. In genomics, we face similar challenges (annotation of genomes and identification of functional elements), but we currently lack benchmarks similar to the protein folding competition.

RESULTS: Here we present a collection of curated and easily accessible sequence classification datasets in the field of genomics. The collection combines novel datasets constructed by mining publicly available databases with existing datasets obtained from published articles. It currently contains nine datasets that focus on regulatory elements (promoters, enhancers, open chromatin regions) from three model organisms: human, mouse, and roundworm. A simple convolutional neural network is also included in the repository and can be used as a baseline model. The benchmarks and the baseline model are distributed as the Python package 'genomic-benchmarks', and the code is available at https://github.com/ML-Bioinfo-CEITEC/genomic_benchmarks.

CONCLUSIONS: Deep learning techniques have revolutionized many biological fields, but mainly thanks to carefully curated benchmarks. For the field of genomics, we propose a collection of benchmark datasets for the classification of genomic sequences with an interface for the most commonly used deep learning libraries, an implementation of a simple neural network, and a training framework that can be used as a starting point for future research. The main aim of this effort is to create a repository for shared datasets that will make machine learning for genomics more comparable and reproducible while reducing the overhead for researchers who want to enter the field, leading to healthy competition and new discoveries.
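To make the sequence classification task in the abstract concrete, the following is a minimal sketch of the kind of operation a convolutional baseline performs: one-hot encoding a DNA string, sliding a single filter along it, and taking the maximum response. It is not the baseline model shipped with 'genomic-benchmarks' and does not use that package's API; the function names, the hand-written TATA filter, and the threshold are all assumptions chosen for illustration, and a real network would learn its filters from data.

```python
# Illustrative sketch only: NOT the baseline CNN from genomic-benchmarks.
# All names here (one_hot, conv1d_max, classify, TATA_KERNEL) are hypothetical.

NUCLEOTIDES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as rows of 4 floats (A, C, G, T).

    Unknown symbols such as N become an all-zero row.
    """
    return [
        [1.0 if base == n else 0.0 for n in NUCLEOTIDES]
        for base in seq.upper()
    ]


def conv1d_max(encoded, kernel):
    """Slide one kernel (width x 4 weights) along the encoded sequence
    and return the largest activation (global max pooling)."""
    width = len(kernel)
    if len(encoded) < width:
        raise ValueError("sequence is shorter than the kernel")
    best = float("-inf")
    for start in range(len(encoded) - width + 1):
        score = sum(
            encoded[start + i][j] * kernel[i][j]
            for i in range(width)
            for j in range(4)
        )
        best = max(best, score)
    return best


# Hand-written filter that responds to the motif TATA (an assumption;
# a trained network would learn its filter weights).
TATA_KERNEL = one_hot("TATA")


def classify(seq, threshold=4.0):
    """Return 1 if the filter fully matches somewhere in seq, else 0."""
    return 1 if conv1d_max(one_hot(seq), TATA_KERNEL) >= threshold else 0
```

For example, `classify("GGCTATAAGG")` labels the sequence as positive because it contains the motif, while `classify("GGCCGGCC")` does not. A learned model stacks many such filters and replaces the fixed threshold with trained dense layers.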

Centre for Molecular Medicine Central European Institute of Technology Masaryk University Brno Czechia

National Centre for Biomolecular Research Faculty of Science Masaryk University Brno Czechia

Citations provided by Crossref.org

000: 00000naa a2200000 a 4500

001: bmc23011522

003: CZ-PrNML

005: 20230801133116.0

007: ta

008: 230718s2023 enk f 000 0|eng||

009: AR

024 7_: $a 10.1186/s12863-023-01123-8 $2 doi

035 __: $a (PubMed)37127596

040 __: $a ABA008 $b cze $d ABA008 $e AACR2

041 0_: $a eng

044 __: $a enk

100 1_: $a Grešová, Katarína $u Centre for Molecular Medicine, Central European Institute of Technology (CEITEC), Masaryk University, Brno, Czechia $u National Centre for Biomolecular Research, Faculty of Science, Masaryk University, Brno, Czechia

245 10: $a Genomic benchmarks: a collection of datasets for genomic sequence classification / $c K. Grešová, V. Martinek, D. Čechák, P. Šimeček, P. Alexiou

520 9_: $a BACKGROUND: Recently, deep neural networks have been successfully applied in many biological fields. In 2020, the deep learning model AlphaFold won the protein folding competition with predicted structures within the error tolerance of experimental methods. However, this solution to the most prominent bioinformatic challenge of the past 50 years was possible only thanks to a carefully curated benchmark of experimentally determined protein structures. In genomics, we face similar challenges (annotation of genomes and identification of functional elements), but we currently lack benchmarks similar to the protein folding competition. RESULTS: Here we present a collection of curated and easily accessible sequence classification datasets in the field of genomics. The collection combines novel datasets constructed by mining publicly available databases with existing datasets obtained from published articles. It currently contains nine datasets that focus on regulatory elements (promoters, enhancers, open chromatin regions) from three model organisms: human, mouse, and roundworm. A simple convolutional neural network is also included in the repository and can be used as a baseline model. The benchmarks and the baseline model are distributed as the Python package 'genomic-benchmarks', and the code is available at https://github.com/ML-Bioinfo-CEITEC/genomic_benchmarks. CONCLUSIONS: Deep learning techniques have revolutionized many biological fields, but mainly thanks to carefully curated benchmarks. For the field of genomics, we propose a collection of benchmark datasets for the classification of genomic sequences with an interface for the most commonly used deep learning libraries, an implementation of a simple neural network, and a training framework that can be used as a starting point for future research. The main aim of this effort is to create a repository for shared datasets that will make machine learning for genomics more comparable and reproducible while reducing the overhead for researchers who want to enter the field, leading to healthy competition and new discoveries.

650 _2: $a Humans $7 D006801

650 _2: $a Animals $7 D000818

650 _2: $a Mice $7 D051379

650 12: $a Benchmarking $7 D019985

650 12: $a Neural Networks, Computer $7 D016571

650 _2: $a Genomics $x methods $7 D023281

650 _2: $a Machine Learning $7 D000069550

650 _2: $a Chromatin $7 D002843

655 _2: $a Journal Article $7 D016428

655 _2: $a Research Support, Non-U.S. Gov't $7 D013485

700 1_: $a Martinek, Vlastimil $u Centre for Molecular Medicine, Central European Institute of Technology (CEITEC), Masaryk University, Brno, Czechia $u National Centre for Biomolecular Research, Faculty of Science, Masaryk University, Brno, Czechia

700 1_: $a Čechák, David $u Centre for Molecular Medicine, Central European Institute of Technology (CEITEC), Masaryk University, Brno, Czechia $u National Centre for Biomolecular Research, Faculty of Science, Masaryk University, Brno, Czechia

700 1_: $a Šimeček, Petr $u Centre for Molecular Medicine, Central European Institute of Technology (CEITEC), Masaryk University, Brno, Czechia. petr.simecek@ceitec.muni.cz $1 https://orcid.org/0000000229227183

700 1_: $a Alexiou, Panagiotis $u Centre for Molecular Medicine, Central European Institute of Technology (CEITEC), Masaryk University, Brno, Czechia

773 0_: $w MED00211332 $t BMC genomic data $x 2730-6844 $g Vol. 24, No. 1 (2023), p. 25

856 41: $u https://pubmed.ncbi.nlm.nih.gov/37127596 $y Pubmed

910 __: $a ABA008 $b sig $c sign $y p $z 0

990 __: $a 20230718 $b ABA008

991 __: $a 20230801133113 $b ABA008

999 __: $a ok $b bmc $g 1963752 $s 1197787

BAS __: $a 3

BAS __: $a PreBMC-MEDLINE

BMC __: $a 2023 $b 24 $c 1 $d 25 $e 20230501 $i 2730-6844 $m BMC genomic data $n BMC Genom Data $x MED00211332

LZP __: $a Pubmed-20230718
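
The field listing above follows a simple text layout: a three-character tag, optional two-character indicators, a colon, and then either a plain value or a run of `$`-prefixed subfields. As a rough sketch of how such lines could be read programmatically, the snippet below parses one line at a time. It assumes exactly the layout shown in this record (not binary MARC 21 or MARCXML), and `parse_field` is a hypothetical helper, not part of any library.

```python
import re

# Matches lines such as "024 7_: $a ... $2 doi" or "001: bmc23011522".
# Assumption: the display layout used in this record, not raw MARC 21.
FIELD_RE = re.compile(r"^(?P<tag>\w{3})(?: (?P<ind>\S{2}))?: (?P<body>.*)$")

# A subfield is "$" + one code character, whitespace, then text up to
# the next " $x " marker or the end of the line.
SUBFIELD_RE = re.compile(r"\$(\w)\s+(.*?)(?=\s+\$\w\s|$)")


def parse_field(line):
    """Parse one displayed field line into tag, indicators, and content.

    Returns a dict with keys "tag", "indicators", "subfields" (a list of
    (code, value) tuples, so repeated codes are preserved), and "value"
    (the raw text for control fields without subfields, else None).
    """
    match = FIELD_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a field line: {line!r}")
    body = match.group("body")
    if body.startswith("$"):
        subfields = SUBFIELD_RE.findall(body)
        value = None
    else:
        subfields = []
        value = body
    return {
        "tag": match.group("tag"),
        "indicators": match.group("ind"),
        "subfields": subfields,
        "value": value,
    }
```

Applied to the `024` line of this record, the helper yields the DOI under subfield `a` and the scheme `doi` under subfield `2`; control fields such as `001` come back with their text in `value` instead.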
