
Genomic benchmarks: a collection of datasets for genomic sequence classification

K. Grešová, V. Martinek, D. Čechák, P. Šimeček, P. Alexiou

BMC Genomic Data. 2023; 24 (1): 25. [pub] 20230501

Language English Country England, Great Britain

Document type Journal Article, Research Support, Non-U.S. Gov't

BACKGROUND: Recently, deep neural networks have been successfully applied in many biological fields. In 2020, a deep learning model AlphaFold won the protein folding competition with predicted structures within the error tolerance of experimental methods. However, this solution to the most prominent bioinformatic challenge of the past 50 years has been possible only thanks to a carefully curated benchmark of experimentally predicted protein structures. In Genomics, we have similar challenges (annotation of genomes and identification of functional elements) but currently, we lack benchmarks similar to protein folding competition.

RESULTS: Here we present a collection of curated and easily accessible sequence classification datasets in the field of genomics. The proposed collection is based on a combination of novel datasets constructed from the mining of publicly available databases and existing datasets obtained from published articles. The collection currently contains nine datasets that focus on regulatory elements (promoters, enhancers, open chromatin region) from three model organisms: human, mouse, and roundworm. A simple convolution neural network is also included in a repository and can be used as a baseline model. Benchmarks and the baseline model are distributed as the Python package 'genomic-benchmarks', and the code is available at https://github.com/ML-Bioinfo-CEITEC/genomic_benchmarks .

CONCLUSIONS: Deep learning techniques revolutionized many biological fields but mainly thanks to the carefully curated benchmarks. For the field of Genomics, we propose a collection of benchmark datasets for the classification of genomic sequences with an interface for the most commonly used deep learning libraries, implementation of the simple neural network and a training framework that can be used as a starting point for future research. The main aim of this effort is to create a repository for shared datasets that will make machine learning for genomics more comparable and reproducible while reducing the overhead of researchers who want to enter the field, leading to healthy competition and new discoveries.


000      
00000naa a2200000 a 4500
001      
bmc23011522
003      
CZ-PrNML
005      
20230801133116.0
007      
ta
008      
230718s2023 enk f 000 0|eng||
009      
AR
024    7_
$a 10.1186/s12863-023-01123-8 $2 doi
035    __
$a (PubMed)37127596
040    __
$a ABA008 $b cze $d ABA008 $e AACR2
041    0_
$a eng
044    __
$a enk
100    1_
$a Grešová, Katarína $u Centre for Molecular Medicine, Central European Institute of Technology (CEITEC), Masaryk University, Brno, Czechia $u National Centre for Biomolecular Research, Faculty of Science, Masaryk University, Brno, Czechia
245    10
$a Genomic benchmarks: a collection of datasets for genomic sequence classification / $c K. Grešová, V. Martinek, D. Čechák, P. Šimeček, P. Alexiou
520    9_
$a BACKGROUND: Recently, deep neural networks have been successfully applied in many biological fields. In 2020, a deep learning model AlphaFold won the protein folding competition with predicted structures within the error tolerance of experimental methods. However, this solution to the most prominent bioinformatic challenge of the past 50 years has been possible only thanks to a carefully curated benchmark of experimentally predicted protein structures. In Genomics, we have similar challenges (annotation of genomes and identification of functional elements) but currently, we lack benchmarks similar to protein folding competition. RESULTS: Here we present a collection of curated and easily accessible sequence classification datasets in the field of genomics. The proposed collection is based on a combination of novel datasets constructed from the mining of publicly available databases and existing datasets obtained from published articles. The collection currently contains nine datasets that focus on regulatory elements (promoters, enhancers, open chromatin region) from three model organisms: human, mouse, and roundworm. A simple convolution neural network is also included in a repository and can be used as a baseline model. Benchmarks and the baseline model are distributed as the Python package 'genomic-benchmarks', and the code is available at https://github.com/ML-Bioinfo-CEITEC/genomic_benchmarks . CONCLUSIONS: Deep learning techniques revolutionized many biological fields but mainly thanks to the carefully curated benchmarks. For the field of Genomics, we propose a collection of benchmark datasets for the classification of genomic sequences with an interface for the most commonly used deep learning libraries, implementation of the simple neural network and a training framework that can be used as a starting point for future research. 
The main aim of this effort is to create a repository for shared datasets that will make machine learning for genomics more comparable and reproducible while reducing the overhead of researchers who want to enter the field, leading to healthy competition and new discoveries.
650    _2
$a Humans $7 D006801
650    _2
$a Animals $7 D000818
650    _2
$a Mice $7 D051379
650    12
$a Benchmarking $7 D019985
650    12
$a Neural Networks, Computer $7 D016571
650    _2
$a Genomics $x methods $7 D023281
650    _2
$a Machine Learning $7 D000069550
650    _2
$a Chromatin $7 D002843
655    _2
$a Journal Article $7 D016428
655    _2
$a Research Support, Non-U.S. Gov't $7 D013485
700    1_
$a Martinek, Vlastimil $u Centre for Molecular Medicine, Central European Institute of Technology (CEITEC), Masaryk University, Brno, Czechia $u National Centre for Biomolecular Research, Faculty of Science, Masaryk University, Brno, Czechia
700    1_
$a Čechák, David $u Centre for Molecular Medicine, Central European Institute of Technology (CEITEC), Masaryk University, Brno, Czechia $u National Centre for Biomolecular Research, Faculty of Science, Masaryk University, Brno, Czechia
700    1_
$a Šimeček, Petr $u Centre for Molecular Medicine, Central European Institute of Technology (CEITEC), Masaryk University, Brno, Czechia. petr.simecek@ceitec.muni.cz $1 https://orcid.org/0000000229227183
700    1_
$a Alexiou, Panagiotis $u Centre for Molecular Medicine, Central European Institute of Technology (CEITEC), Masaryk University, Brno, Czechia
773    0_
$w MED00211332 $t BMC genomic data $x 2730-6844 $g Vol. 24, No. 1 (2023), p. 25
856    41
$u https://pubmed.ncbi.nlm.nih.gov/37127596 $y Pubmed
910    __
$a ABA008 $b sig $c sign $y p $z 0
990    __
$a 20230718 $b ABA008
991    __
$a 20230801133113 $b ABA008
999    __
$a ok $b bmc $g 1963752 $s 1197787
BAS    __
$a 3
BAS    __
$a PreBMC-MEDLINE
BMC    __
$a 2023 $b 24 $c 1 $d 25 $e 20230501 $i 2730-6844 $m BMC genomic data $n BMC Genom Data $x MED00211332
LZP    __
$a Pubmed-20230718
