This paper presents an implementation of parallelized genetic algorithms. Three models of parallelized genetic algorithms are presented, namely the Master-Slave genetic algorithm, the Coarse-Grained genetic algorithm, and the Fine-Grained genetic algorithm. Furthermore, these models are compared with the basic serial genetic algorithm model. Four modules, Multiprocessing, Celery, PyCSP, and Scalable Concurrent Operations in Python (SCOOP), were investigated among the many parallelization options in Python. SCOOP was selected as the most favorable option, so the models were implemented using the Python programming language, RabbitMQ, and SCOOP. Based on the implementation results and testing performed, a comparison of the hardware utilization of each deployed model is provided. The resulting SCOOP-based implementation was investigated from three aspects. The first aspect was the parallelization and integration of the SCOOP module into the resulting Python module. The second was the communication within the genetic algorithm topology. The third aspect was the performance of the parallel genetic algorithm model depending on the hardware.
- Keywords
- Coarse-Grained, Fine-Grained, Master–Slave, SCOOP, parallelized genetic algorithms
- MeSH
- Algorithms * MeSH
- Computers * MeSH
- Publication type
- Journal Article MeSH
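The Master-Slave model named in the abstract above can be illustrated with a minimal sketch: a master process runs selection, crossover, and mutation, while workers only evaluate fitness. This is not the paper's SCOOP implementation. It assumes a thread pool from the Python standard library as a stand-in for the worker layer, and the `onemax` objective, the `master_slave_ga` function, and all parameter values are hypothetical choices made for illustration.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def onemax(bits):
    """Toy fitness (hypothetical stand-in for a real objective): count of 1-bits."""
    return sum(bits)


def master_slave_ga(fitness, n_bits=32, pop_size=40, generations=60,
                    mutation_rate=0.02, workers=4, seed=0):
    """Master-slave GA sketch: workers evaluate fitness, the master does the rest."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    best, best_score = None, float("-inf")
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(generations):
            # Fitness evaluation is the only step farmed out to the workers.
            scores = list(pool.map(fitness, pop))
            top = max(range(pop_size), key=scores.__getitem__)
            if scores[top] > best_score:
                best, best_score = pop[top][:], scores[top]

            def tournament():
                # Binary tournament selection, performed by the master.
                a, b = rng.randrange(pop_size), rng.randrange(pop_size)
                return pop[a] if scores[a] >= scores[b] else pop[b]

            children = [best[:]]  # elitism: carry the best individual forward
            while len(children) < pop_size:
                p1, p2 = tournament(), tournament()
                cut = rng.randrange(1, n_bits)  # one-point crossover
                child = p1[:cut] + p2[cut:]
                # Bit-flip mutation: each bit flips with probability mutation_rate.
                child = [bit ^ (rng.random() < mutation_rate) for bit in child]
                children.append(child)
            pop = children
    return best, best_score


if __name__ == "__main__":
    individual, score = master_slave_ga(onemax)
    print(score, "".join(map(str, individual)))
```

For a CPU-bound fitness function, a process-based or distributed map would replace the thread pool; only the `pool.map` call would need to change, because the master never shares mutable state with the workers.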
BACKGROUND: Identification of coordinately regulated genes according to the level of their expression during the time course of a process allows for discovering functional relationships among genes involved in the process. RESULTS: We present a single-class classification method for the identification of genes of similar function from a gene expression time series. It is based on a parallel genetic algorithm, a supervised computer learning method that exploits prior knowledge of gene function to identify unknown genes of similar function from expression data. The algorithm was tested with a set of randomly generated patterns; the results were compared with seven other classification algorithms including support vector machines. The algorithm avoids several problems associated with unsupervised clustering methods, and it shows better performance than the other algorithms. The algorithm was applied to the identification of secondary metabolite gene clusters of the antibiotic-producing eubacterium Streptomyces coelicolor. The algorithm also identified pathways associated with transport of the secondary metabolites out of the cell. We used the method for the prediction of the functional role of particular ORFs based on the expression data. CONCLUSION: Through analysis of a time series of gene expression, the algorithm identifies pathways which are directly or indirectly associated with genes of interest, and which are active during the time course of the experiment.
- MeSH
- Algorithms MeSH
- Chromosomes, Bacterial genetics MeSH
- Computer Simulation MeSH
- Oligonucleotide Array Sequence Analysis MeSH
- Gene Expression Profiling * MeSH
- Streptomyces coelicolor classification genetics metabolism MeSH
- Publication type
- Journal Article MeSH
- Research Support, Non-U.S. Gov't MeSH
Computational design of new proteins is often performed by optimizing the amino acid sequence. This sequence is characterized by an energy (lower energy means better propensity to form the desired 3D structure) that is sampled and minimized. Here, we use the parallel tempering algorithm to accelerate this task. ESMfold was used to predict the structures of the sampled proteins and calculate energy. Starting from random amino acid sequences, each sequence was sampled using the Monte Carlo method at one of a series of temperatures, and these replicas were exchanged by the parallel tempering method. A series of 100- or 200-residue proteins was designed to maximize confidence in structure prediction and globularity and to minimize surface hydrophobic residues. We show that parallel tempering is a viable alternative to Monte Carlo sampling without replica exchanges and to simulated annealing or related energy-based protein design methods, especially in the situation where a continuous flow of designed sequences is desired.
- Keywords
- ESMfold, Monte Carlo, machine learning, parallel tempering, protein design, replica exchange
- MeSH
- Algorithms * MeSH
- Hydrophobic and Hydrophilic Interactions MeSH
- Protein Conformation MeSH
- Monte Carlo Method MeSH
- Models, Molecular MeSH
- Protein Engineering * methods MeSH
- Proteins * chemistry genetics MeSH
- Amino Acid Sequence MeSH
- Thermodynamics MeSH
- Publication type
- Journal Article MeSH
- Names of Substances
- Proteins * MeSH
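The sampling scheme described in the abstract above (Monte Carlo moves at a ladder of temperatures, with exchanges between replicas) can be sketched as follows. This is an illustrative sketch, not the authors' method: the energy here is a deliberately trivial, hypothetical stand-in (a count of hydrophobic residues) rather than an ESMfold-based score, and the temperature ladder, sequence length, and function names are assumptions.

```python
import math
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
HYDROPHOBIC = set("AVILMFWC")  # assumed grouping, for the toy energy only


def toy_energy(seq):
    """Hypothetical stand-in for a structure-based score: hydrophobic residue count."""
    return sum(1 for aa in seq if aa in HYDROPHOBIC)


def parallel_tempering(energy, length=30, temps=(0.2, 0.5, 1.0, 2.0),
                       sweeps=200, seed=0):
    """Replica-exchange Monte Carlo over sequences; temps must be ascending."""
    rng = random.Random(seed)
    replicas = [[rng.choice(ALPHABET) for _ in range(length)] for _ in temps]
    energies = [energy(r) for r in replicas]
    start = min(range(len(temps)), key=energies.__getitem__)
    best_seq, best_e = replicas[start][:], energies[start]
    for sweep in range(sweeps):
        for i, t in enumerate(temps):
            # One sweep of single-residue Metropolis moves at temperature t.
            for _ in range(length):
                pos = rng.randrange(length)
                old = replicas[i][pos]
                replicas[i][pos] = rng.choice(ALPHABET)
                new_e = energy(replicas[i])
                delta = new_e - energies[i]
                if delta <= 0 or rng.random() < math.exp(-delta / t):
                    energies[i] = new_e
                else:
                    replicas[i][pos] = old  # reject: restore the residue
            # Record the lowest energy observed at the end of a replica's sweep.
            if energies[i] < best_e:
                best_seq, best_e = replicas[i][:], energies[i]
        # Attempt exchanges between neighbouring temperatures, alternating pairs.
        for i in range(sweep % 2, len(temps) - 1, 2):
            d = (1 / temps[i] - 1 / temps[i + 1]) * (energies[i] - energies[i + 1])
            if d >= 0 or rng.random() < math.exp(d):
                replicas[i], replicas[i + 1] = replicas[i + 1], replicas[i]
                energies[i], energies[i + 1] = energies[i + 1], energies[i]
    return "".join(best_seq), best_e


if __name__ == "__main__":
    sequence, e = parallel_tempering(toy_energy)
    print(e, sequence)
```

The exchange step accepts a swap with probability min(1, exp[(1/T_i - 1/T_j)(E_i - E_j)]), the standard replica-exchange criterion, so a low-energy configuration found at a high temperature tends to migrate toward the coldest replica.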
BACKGROUND: Genomic selection (GS) in forestry can substantially reduce the length of the breeding cycle and increase gain per unit time through early selection and greater selection intensity, particularly for traits of low heritability and late expression. Affordable next-generation sequencing technologies made it possible to genotype large numbers of trees at a reasonable cost. RESULTS: Genotyping-by-sequencing was used to genotype 1,126 Interior spruce trees representing 25 open-pollinated families planted over three sites in British Columbia, Canada. Four imputation algorithms were compared (mean value (MI), singular value decomposition (SVD), expectation maximization (EM), and a newly derived, family-based k-nearest neighbor (kNN-Fam)). Trees were phenotyped for several yield and wood attributes. Single- and multi-site GS prediction models were developed using the Ridge Regression Best Linear Unbiased Predictor (RR-BLUP) and the Generalized Ridge Regression (GRR) to test different assumptions about trait architecture. Finally, using PCA, multi-trait GS prediction models were developed. The EM and kNN-Fam imputation methods were superior for 30 and 60% missing data, respectively. The RR-BLUP GS prediction model produced better accuracies than the GRR, indicating that the genetic architecture for these traits is complex. GS prediction accuracies for multi-site were high and better than those of single-sites while multi-site predictability produced the lowest accuracies reflecting type-b genetic correlations and deemed unreliable. The incorporation of genomic information in quantitative genetics analyses produced more realistic heritability estimates as half-sib pedigree tended to inflate the additive genetic variance and subsequently both heritability and gain estimates. Principal component scores as representatives of multi-trait GS prediction models produced surprising results where negatively correlated traits could be concurrently selected for using PCA2 and PCA3. CONCLUSIONS: The application of GS to open-pollinated family testing, the simplest form of tree improvement evaluation methods, was proven to be effective. Prediction accuracies obtained for all traits greatly support the integration of GS in tree breeding. While the within-site GS prediction accuracies were high, the results clearly indicate that the ability of single-site GS models to predict other sites is unreliable, supporting the utilization of a multi-site approach. Principal component scores provided an opportunity for the concurrent selection of traits with different phenotypic optima.
- MeSH
- Algorithms MeSH
- Wood * MeSH
- Genomics methods MeSH
- Genotyping Techniques * MeSH
- Models, Genetic MeSH
- Sequence Analysis * MeSH
- Plant Breeding methods MeSH
- Picea genetics growth & development MeSH
- High-Throughput Nucleotide Sequencing MeSH
- Publication type
- Journal Article MeSH
- Research Support, Non-U.S. Gov't MeSH
Understanding the evolutionary conservation of complex eukaryotic transcriptomes significantly illuminates the physiological relevance of alternative splicing (AS). Examining the evolutionary depth of a given AS event with ordinary homology searches is generally challenging and time-consuming. Here, we present Catsnap, an algorithmic pipeline for assessing the conservation of putative protein isoforms generated by AS. It employs a machine learning approach following a database search with the provided pair of protein sequences. We used the Catsnap algorithm for analyzing the conservation of emerging experimentally characterized alternative proteins from plants and animals. Indeed, most of them are conserved among other species. Catsnap can detect the conserved functional protein isoforms regardless of the AS type by which they are generated. Notably, we found that while the primary amino acid sequence is maintained, the type of AS determining the inclusion or exclusion of protein regions varies throughout plant phylogenetic lineages in these proteins. We also document that this phenomenon is less seen among animals. In sum, our algorithm highlights the presence of unexpectedly frequent hotspots where protein isoforms recurrently arise to carry physiologically relevant functions. The user web interface is available at https://catsnap.cesnet.cz/.
- Keywords
- alternative splicing, bioinformatics, determinism, isoforms, machine learning, molecular evolution, transcriptome
- MeSH
- Algorithms * MeSH
- Alternative Splicing * genetics MeSH
- Phylogeny MeSH
- Conserved Sequence genetics MeSH
- Evolution, Molecular MeSH
- Mutant Proteins MeSH
- Protein Isoforms genetics MeSH
- Plants MeSH
- Amino Acid Sequence MeSH
- Animals MeSH
- Check Tag
- Animals MeSH
- Publication type
- Journal Article MeSH
- Research Support, Non-U.S. Gov't MeSH
- Names of Substances
- Mutant Proteins MeSH
- Protein Isoforms MeSH
BACKGROUND: Microarray technologies now belong to the standard functional genomics toolbox and have undergone massive development leading to increased genome coverage, accuracy and reliability. The number of experiments exploiting microarray technology has markedly increased in recent years. In parallel with the rapid accumulation of transcriptomic data, on-line analysis tools are being introduced to simplify their use. Global statistical data analysis methods contribute to the development of overall concepts about gene expression patterns and to query and compose working hypotheses. More recently, these applications are being supplemented with more specialized products offering visualization and specific data mining tools. We present a curated gene family-oriented gene expression database, Arabidopsis Gene Family Profiler (aGFP; http://agfp.ueb.cas.cz), which gives the user access to a large collection of normalised Affymetrix ATH1 microarray datasets. The database currently contains NASC Array and AtGenExpress transcriptomic datasets for various tissues at different developmental stages of wild type plants gathered from nearly 350 gene chips. RESULTS: The Arabidopsis GFP database has been designed as an easy-to-use tool for users needing an easily accessible resource for expression data of single genes, pre-defined gene families or custom gene sets, with the further possibility of keyword search. Arabidopsis Gene Family Profiler presents a user-friendly web interface using both graphic and text output. Data are stored on a MySQL server and individual queries are generated by PHP scripts. The most distinctive features of the Arabidopsis Gene Family Profiler database are: 1) the presentation of normalized datasets (Affymetrix MAS algorithm and calculation of model-based gene-expression values based on the Perfect Match-only model); 2) the choice between two different normalization algorithms (Affymetrix MAS4 or MAS5 algorithms); 3) an intuitive interface; 4) an interactive "virtual plant" visualizing the spatial and developmental expression profiles of both gene families and individual genes. CONCLUSION: Arabidopsis GFP gives users the possibility to analyze current Arabidopsis developmental transcriptomic data starting with simple global queries that can be expanded and further refined to visualize comparative and highly selective gene expression profiles.
Allelic variability in the adaptive immune receptor loci, which harbor the gene segments that encode B cell and T cell receptors (BCR/TCR), is of critical importance for immune responses to pathogens and vaccines. Adaptive immune receptor repertoire sequencing (AIRR-seq) has become widespread in immunology research, making it the most readily available source of information about allelic diversity in immunoglobulin (IG) and T cell receptor (TR) loci. Here, we present a novel algorithm for extrasensitive and specific variable (V) and joining (J) gene allele inference, allowing the reconstruction of individual high-quality gene segment libraries. The approach can be applied for inferring allelic variants from peripheral blood lymphocyte BCR and TCR repertoire sequencing data, including hypermutated isotype-switched BCR sequences, thus allowing high-throughput novel allele discovery from a wide variety of existing data sets. The developed algorithm is part of the MiXCR software. We demonstrate the accuracy of this approach using AIRR-seq paired with long-read genomic sequencing data, comparing it to a widely used algorithm, TIgGER. We applied the algorithm to a large set of IG heavy chain (IGH) AIRR-seq data from 450 donors of ancestrally diverse population groups, and to the largest reported full-length TCR alpha and beta chain (TRA and TRB) AIRR-seq data set, representing 134 individuals. This allowed us to assess the genetic diversity within the IGH, TRA, and TRB loci in different populations and to establish a database of alleles of V and J genes inferred from AIRR-seq data and their population frequencies, with free public access through the VDJ.online database.
- MeSH
- Alleles * MeSH
- Algorithms * MeSH
- Genetic Variation MeSH
- Humans MeSH
- Receptors, Antigen, B-Cell genetics immunology MeSH
- Receptors, Antigen, T-Cell genetics immunology MeSH
- Sequence Analysis, DNA methods MeSH
- Software * MeSH
- High-Throughput Nucleotide Sequencing methods MeSH
- Check Tag
- Humans MeSH
- Publication type
- Journal Article MeSH
- Names of Substances
- Receptors, Antigen, B-Cell MeSH
- Receptors, Antigen, T-Cell MeSH
BACKGROUND: Variants in the human X-linked cyclin-dependent kinase-like 5 (CDKL5) gene have been reported as being etiologically associated with early infantile epileptic encephalopathy type 2 (EIEE2). We report on two patients, a boy and a girl, with EIEE2 who presented with early onset epilepsy, hypotonia, severe intellectual disability, and poor eye contact. METHODS: Massively parallel sequencing (MPS) of a custom-designed gene panel for epilepsy and epileptic encephalopathy containing 112 epilepsy-related genes was performed. Sanger sequencing was used to confirm the novel variants. For confirmation of the functional consequence of an intronic CDKL5 variant in patient 2, an RNA study was done. RESULTS: DNA sequencing revealed de novo variants in CDKL5, a c.2578C>T (p.Gln860*) present in a hemizygous state in a 3-year-old boy, and a potential splice site variant c.463+5G>A in a heterozygous state in a 5-year-old girl. Multiple in silico splicing algorithms predicted a highly reduced splice site score for c.463+5G>A. A subsequent mRNA study confirmed an aberrant shorter transcript lacking exon 7. CONCLUSIONS: Our data confirmed that variants in CDKL5 are associated with EIEE2. There is credible evidence that the novel identified variants are pathogenic and, therefore, are likely the cause of the disease in the presented patients. In one of the patients a stop codon variant is predicted to produce a truncated protein, and in the other patient an intronic variant results in aberrant splicing.
- Keywords
- CDKL5 gene, early onset seizures, infantile epileptic encephalopathy 2, massively parallel sequencing, splice site variant
- MeSH
- Epilepsy genetics MeSH
- Epileptic Syndromes MeSH
- Exons MeSH
- Genetic Variation genetics MeSH
- Spasms, Infantile genetics MeSH
- Humans MeSH
- Mutation MeSH
- Child, Preschool MeSH
- Protein Serine-Threonine Kinases genetics metabolism MeSH
- Rett Syndrome genetics MeSH
- High-Throughput Nucleotide Sequencing MeSH
- Check Tag
- Humans MeSH
- Male MeSH
- Child, Preschool MeSH
- Female MeSH
- Publication type
- Journal Article MeSH
- Case Reports MeSH
- Names of Substances
- CDKL5 protein, human MeSH Browser
- Protein Serine-Threonine Kinases MeSH
- MeSH
- Algorithms MeSH
- Databases, Genetic MeSH
- Humans MeSH
- Melanoma chemistry genetics mortality MeSH
- Mice MeSH
- Receptors, Antigen, T-Cell chemistry genetics MeSH
- Receptors, Antigen * analysis genetics metabolism MeSH
- RNA analysis genetics MeSH
- Sequence Analysis, RNA methods MeSH
- Gene Expression Profiling methods MeSH
- Computational Biology methods MeSH
- High-Throughput Nucleotide Sequencing MeSH
- Animals MeSH
- Check Tag
- Humans MeSH
- Mice MeSH
- Animals MeSH
- Publication type
- Journal Article MeSH
- Research Support, Non-U.S. Gov't MeSH
- Names of Substances
- Receptors, Antigen, T-Cell MeSH
- Receptors, Antigen * MeSH
- RNA MeSH
Pathogenic sequence variants in the IQ motif- and Sec7 domain-containing protein 2 (IQSEC2) gene have been confirmed as causative in the aetiopathogenesis of neurodevelopmental disorders (intellectual disability, autism) and epilepsy. We report on a case of a family with three sons; two of them manifest delayed psychomotor development and epilepsy. Initially, proband A was examined using a multistep molecular diagnostics algorithm, including karyotype and array-comparative genomic hybridization analysis, both with negative results. Therefore, probands A and B and their unaffected parents were enrolled for an analysis using targeted "next-generation" sequencing (NGS) with a gene panel ClearSeq Inherited DiseaseXT (Agilent Technologies) and verification analysis by Sanger sequencing. A novel frameshift variant in the X-linked IQSEC2 gene, NM_001111125.2:c.1813_1814del, p.(Asp605Profs*3) at the protein level, was identified in both affected probands and their asymptomatic mother, who has skewed X chromosome inactivation (XCI) (100:0). As the IQSEC2 gene is known to escape XCI in humans, we expect the existence of mechanisms maintaining a normal or sufficient level of the IQSEC2 protein in the asymptomatic mother. Further analyses may help to characterize the presented novel frameshift variant in the IQSEC2 gene as well as to elucidate the mechanisms leading to the rare asymptomatic phenotypes in females.
- Keywords
- Epilepsy, IQSEC2 gene, Neurodevelopmental disorders, Pathogenic sequence variant, Targeted NGS
- MeSH
- Algorithms MeSH
- Gene Deletion MeSH
- Child MeSH
- Epilepsy complications genetics MeSH
- Phenotype MeSH
- Genetic Variation * MeSH
- X Chromosome Inactivation MeSH
- Karyotyping MeSH
- Humans MeSH
- Neurodevelopmental Disorders complications genetics MeSH
- Frameshift Mutation MeSH
- Child, Preschool MeSH
- Chromosome Banding MeSH
- Oligonucleotide Array Sequence Analysis MeSH
- Comparative Genomic Hybridization * MeSH
- Guanine Nucleotide Exchange Factors genetics MeSH
- High-Throughput Nucleotide Sequencing MeSH
- Check Tag
- Child MeSH
- Humans MeSH
- Male MeSH
- Child, Preschool MeSH
- Female MeSH
- Publication type
- Journal Article MeSH
- Case Reports MeSH
- Research Support, Non-U.S. Gov't MeSH
- Names of Substances
- IQSEC2 protein, human MeSH Browser
- Guanine Nucleotide Exchange Factors MeSH