Genomic benchmarks: a collection of datasets for genomic sequence classification
K. Grešová, V. Martinek, D. Čechák, P. Šimeček, P. Alexiou
Language: English; Country: England, Great Britain
Document type: journal article, grant-supported work
NLK
BioMedCentral
from 2000-12-01
Directory of Open Access Journals
from 2021
PubMed Central
from 2021
ProQuest Central
from 2009-01-01
Medline Complete (EBSCOhost)
from 2021-01-25
Health & Medicine (ProQuest)
from 2009-01-01
ROAD: Directory of Open Access Scholarly Resources
from 2021
Springer Nature OA/Free Journals
from 2000-12-01
- MeSH
- benchmarking * MeSH
- chromatin MeSH
- genomics methods MeSH
- humans MeSH
- mice MeSH
- neural networks * MeSH
- machine learning MeSH
- animals MeSH
- Check Tag
- humans MeSH
- mice MeSH
- animals MeSH
- Publication type
- journal article MeSH
- grant-supported work MeSH
BACKGROUND: Recently, deep neural networks have been successfully applied in many biological fields. In 2020, the deep learning model AlphaFold won the protein folding competition with predicted structures within the error tolerance of experimental methods. However, this solution to the most prominent bioinformatic challenge of the past 50 years was possible only thanks to a carefully curated benchmark of experimentally determined protein structures. In genomics, we face similar challenges (annotation of genomes and identification of functional elements), but we currently lack benchmarks comparable to the protein folding competition. RESULTS: Here we present a collection of curated and easily accessible sequence classification datasets in the field of genomics. The proposed collection combines novel datasets constructed by mining publicly available databases with existing datasets obtained from published articles. The collection currently contains nine datasets that focus on regulatory elements (promoters, enhancers, open chromatin regions) from three model organisms: human, mouse, and roundworm. A simple convolutional neural network is also included in the repository and can be used as a baseline model. Benchmarks and the baseline model are distributed as the Python package 'genomic-benchmarks', and the code is available at https://github.com/ML-Bioinfo-CEITEC/genomic_benchmarks. CONCLUSIONS: Deep learning techniques have revolutionized many biological fields, but mainly thanks to carefully curated benchmarks. For the field of genomics, we propose a collection of benchmark datasets for the classification of genomic sequences with an interface for the most commonly used deep learning libraries, an implementation of a simple neural network, and a training framework that can be used as a starting point for future research. The main aim of this effort is to create a repository for shared datasets that will make machine learning for genomics more comparable and reproducible while reducing the overhead for researchers who want to enter the field, leading to healthy competition and new discoveries.
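As a rough illustration of what sequence classification input can look like, the sketch below one-hot encodes DNA strings and pads them to a fixed length, a common way to prepare sequences for a convolutional network. It is an assumption-laden example and is not taken from the 'genomic-benchmarks' package: the names `one_hot`, `pad_or_trim`, and `encode_batch`, the A/C/G/T column order, and the all-zero encoding for unknown bases are all hypothetical choices made here.

```python
# Illustrative sketch only; not the API of the 'genomic-benchmarks' package.
# Column order (A, C, G, T) and zero rows for unknown bases are assumptions.
ALPHABET = "ACGT"


def one_hot(sequence):
    """Encode a DNA string as a list of 4-element 0/1 rows."""
    rows = []
    for base in sequence.upper():
        row = [0] * len(ALPHABET)
        idx = ALPHABET.find(base)
        if idx >= 0:  # bases outside A/C/G/T (e.g. N) stay all-zero
            row[idx] = 1
        rows.append(row)
    return rows


def pad_or_trim(rows, length):
    """Trim to `length` rows, or right-pad with all-zero rows."""
    trimmed = rows[:length]
    padding = [[0] * len(ALPHABET) for _ in range(length - len(trimmed))]
    return trimmed + padding


def encode_batch(sequences, length):
    """Encode several sequences into equally sized one-hot matrices."""
    return [pad_or_trim(one_hot(seq), length) for seq in sequences]


if __name__ == "__main__":
    for matrix in encode_batch(["AC", "GTTA"], 3):
        print(matrix)
```

Fixed-length matrices of this kind can be stacked into a single array and passed to a deep learning library; the padding length would normally be chosen from the sequence lengths in the dataset at hand.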
- 000
- 00000naa a2200000 a 4500
- 001
- bmc23011522
- 003
- CZ-PrNML
- 005
- 20230801133116.0
- 007
- ta
- 008
- 230718s2023 enk f 000 0|eng||
- 009
- AR
- 024 7_
- $a 10.1186/s12863-023-01123-8 $2 doi
- 035 __
- $a (PubMed)37127596
- 040 __
- $a ABA008 $b cze $d ABA008 $e AACR2
- 041 0_
- $a eng
- 044 __
- $a enk
- 100 1_
- $a Grešová, Katarína $u Centre for Molecular Medicine, Central European Institute of Technology (CEITEC), Masaryk University, Brno, Czechia $u National Centre for Biomolecular Research, Faculty of Science, Masaryk University, Brno, Czechia
- 245 10
- $a Genomic benchmarks: a collection of datasets for genomic sequence classification / $c K. Grešová, V. Martinek, D. Čechák, P. Šimeček, P. Alexiou
- 650 _2
- $a humans $7 D006801
- 650 _2
- $a animals $7 D000818
- 650 _2
- $a mice $7 D051379
- 650 12
- $a benchmarking $7 D019985
- 650 12
- $a neural networks $7 D016571
- 650 _2
- $a genomics $x methods $7 D023281
- 650 _2
- $a machine learning $7 D000069550
- 650 _2
- $a chromatin $7 D002843
- 655 _2
- $a journal article $7 D016428
- 655 _2
- $a grant-supported work $7 D013485
- 700 1_
- $a Martinek, Vlastimil $u Centre for Molecular Medicine, Central European Institute of Technology (CEITEC), Masaryk University, Brno, Czechia $u National Centre for Biomolecular Research, Faculty of Science, Masaryk University, Brno, Czechia
- 700 1_
- $a Čechák, David $u Centre for Molecular Medicine, Central European Institute of Technology (CEITEC), Masaryk University, Brno, Czechia $u National Centre for Biomolecular Research, Faculty of Science, Masaryk University, Brno, Czechia
- 700 1_
- $a Šimeček, Petr $u Centre for Molecular Medicine, Central European Institute of Technology (CEITEC), Masaryk University, Brno, Czechia. petr.simecek@ceitec.muni.cz $1 https://orcid.org/0000-0002-2922-7183
- 700 1_
- $a Alexiou, Panagiotis $u Centre for Molecular Medicine, Central European Institute of Technology (CEITEC), Masaryk University, Brno, Czechia
- 773 0_
- $w MED00211332 $t BMC genomic data $x 2730-6844 $g Vol. 24, No. 1 (2023), p. 25
- 856 41
- $u https://pubmed.ncbi.nlm.nih.gov/37127596 $y Pubmed
- 910 __
- $a ABA008 $b sig $c sign $y p $z 0
- 990 __
- $a 20230718 $b ABA008
- 991 __
- $a 20230801133113 $b ABA008
- 999 __
- $a ok $b bmc $g 1963752 $s 1197787
- BAS __
- $a 3
- BAS __
- $a PreBMC-MEDLINE
- BMC __
- $a 2023 $b 24 $c 1 $d 25 $e 20230501 $i 2730-6844 $m BMC genomic data $n BMC Genom Data $x MED00211332
- LZP __
- $a Pubmed-20230718