Database mining
As the amount of genome information increases rapidly, there is a correspondingly greater need for methods that provide accurate and automated annotation of gene function. For example, many high-throughput technologies (e.g., next-generation sequencing) are being used today to generate lists of genes associated with specific conditions. However, their functional interpretation remains a challenge, and many tools attempt to characterize the function of gene lists. Such systems typically rely on enrichment analysis and aim to give quick insight into the underlying biology by presenting it in the form of a summary report. While the load of annotation may be alleviated by such computational approaches, the main challenge in modern annotation remains to develop a systems form of analysis in which a pipeline can analyze gene lists quickly and effectively and identify aggregated annotations through computerized resources. In this article we survey some of the many tools and methods that have been developed to automatically interpret the biological functions underlying gene lists. We overview current functional annotation aspects from the perspective of their epistemology (i.e., the underlying theories used to organize information about gene function into a body of verified and documented knowledge) and find that most of the currently used functional annotation methods fall broadly into one of two categories: they are based either on 'known' formally structured ontology annotations created by 'experts' (e.g., the GO terms used to describe the function of Entrez Gene entries) or, perhaps more adventurously, on annotations inferred from literature (e.g., many text-mining methods use computer-aided reasoning to acquire knowledge represented in natural languages). Overall, however, deriving detailed and accurate insight from such gene lists remains a challenging task, and improved methods are called for.
In particular, future methods need to (1) provide more holistic insight into the underlying molecular systems; (2) provide better follow-up experimental testing and treatment options; and (3) better manage gene lists derived from organisms that are not well studied. We discuss some promising approaches that may help achieve these advances, especially the use of extended dictionaries of biomedical concepts and molecular mechanisms, as well as greater use of annotation benchmarks.
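The abstract above notes that gene-list tools typically rely on enrichment analysis. As a rough illustration only, and not the method of any specific tool surveyed, the sketch below computes a one-sided hypergeometric p-value for the over-representation of a single annotation term in a gene list. The function name, its parameters, and the numbers in the usage lines are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, sample, overlap):
    """One-sided hypergeometric p-value P(X >= overlap).

    population: number of genes in the background set
    annotated:  background genes that carry the annotation term
    sample:     number of genes in the gene list
    overlap:    genes in the list that carry the term
    """
    upper = min(annotated, sample)
    total = comb(population, sample)  # all possible gene lists of this size
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical numbers, for illustration only: a background of 20000 genes,
# 200 of them annotated with the term, and a list of 100 genes with 8 hits.
p_value = hypergeom_enrichment_p(20000, 200, 100, 8)
print(f"enrichment p-value: {p_value:.3g}")
```

In practice such a test is run for every candidate term, so the resulting p-values would normally be corrected for multiple testing before being reported.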
- Keywords
- statistics, multivariate analysis, large datasets
- MeSH
- Databases, Genetic trends utilization MeSH
- Education, Distance methods trends MeSH
- Financing, Organized MeSH
- Genetic Techniques trends utilization MeSH
- Medical Informatics MeSH
- Humans MeSH
- Computer-Assisted Instruction instrumentation utilization MeSH
- Data Collection methods trends MeSH
- Statistics as Topic MeSH
- Models, Theoretical MeSH
- Data Display trends MeSH
- Check Tag
- Humans MeSH
- Publication type
- Database MeSH
Genetic variation occurring within conserved functional protein domains warrants special attention when examining DNA variation in the context of disease causation. Here we introduce a resource, freely available at www.prot2hg.com, that addresses the question of whether a particular variant falls onto an annotated protein domain and directly translates chromosomal coordinates onto protein residues. The tool can perform a multiple-site query in a simple way, and the whole dataset is available for download as well as incorporated into our own accessible pipeline. To create this resource, National Center for Biotechnology Information protein data were retrieved using the Entrez Programming Utilities. After processing all human protein domains, residue positions were reverse translated and mapped to the reference genome hg19 and stored in a MySQL database. In total, 760 487 protein domains from 42 371 protein models were mapped to hg19 coordinates and made publicly available for search or download (www.prot2hg.com). In addition, this annotation was implemented into the genomics research platform GENESIS in order to query nearly 8000 exomes and genomes of families with rare Mendelian disorders (tgp-foundation.org). When applied to patient genetic data, we found that rare (<1%) variants in the Genome Aggregation Database were significantly more often annotated onto a protein domain than common (>1%) variants. Similarly, variants described as pathogenic or likely pathogenic in ClinVar were more likely to be annotated onto a domain. In addition, we tested a dataset consisting of 60 causal variants in a cohort of patients with epileptic encephalopathy and found that 71% of them (43 variants) were propagated onto protein domains. In summary, we developed a resource that annotates variants in the coding part of the genome onto conserved protein domains in order to increase variant prioritization efficiency. Database URL: www.prot2hg.com.
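The mapping of protein residues back onto chromosomal coordinates described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the prot2hg implementation: it assumes the coding exons of one transcript are given as 1-based inclusive genomic intervals in ascending order, and the function names and example coordinates are hypothetical.

```python
def residue_to_genomic(residue, cds_exons, strand):
    """Return the genomic positions of the three codon bases of a residue.

    residue:   1-based residue index in the protein
    cds_exons: list of (start, end) coding intervals, 1-based inclusive,
               sorted in ascending genomic order
    strand:    "+" or "-"
    """
    if residue < 1:
        raise ValueError("residue index must be 1 or greater")
    # Visit the exons in transcript order (reversed on the minus strand).
    exons = cds_exons if strand == "+" else list(reversed(cds_exons))
    first = (residue - 1) * 3  # 0-based CDS offset of the first codon base
    wanted = (first, first + 1, first + 2)
    positions = []
    consumed = 0  # coding bases seen in earlier exons
    for start, end in exons:
        length = end - start + 1
        for offset in wanted:
            if consumed <= offset < consumed + length:
                within = offset - consumed
                positions.append(start + within if strand == "+" else end - within)
        consumed += length
    if len(positions) != 3:
        raise ValueError("residue lies outside the coding sequence")
    return positions


def position_in_domain(pos, domain_start, domain_end, cds_exons, strand):
    """True if a genomic position falls on a codon of the domain's residues."""
    return any(
        pos in residue_to_genomic(r, cds_exons, strand)
        for r in range(domain_start, domain_end + 1)
    )


# Hypothetical transcript with two coding exons (12 coding bases, 4 residues).
exons = [(100, 104), (200, 206)]
print(residue_to_genomic(2, exons, "+"))        # codon split across both exons
print(position_in_domain(200, 2, 3, exons, "+"))
```

A codon that spans an exon boundary maps to non-contiguous genomic positions, which is why the sketch returns three separate coordinates rather than a single interval.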
- MeSH
- Molecular Sequence Annotation methods MeSH
- Data Mining methods MeSH
- Databases, Genetic * MeSH
- Data Curation methods MeSH
- Genetic Variation * MeSH
- Genome, Human genetics MeSH
- Genomics methods MeSH
- Internet MeSH
- Humans MeSH
- Protein Domains genetics MeSH
- Proteins chemistry genetics metabolism MeSH
- Computational Biology methods MeSH
- Check Tag
- Humans MeSH
- Publication type
- Journal Article MeSH
- Research Support, Non-U.S. Gov't MeSH
Proteins are the most abundant component of the cell nucleus, where they perform a plethora of functions, including the assembly of long DNA molecules into condensed chromatin, DNA replication and repair, regulation of gene expression, synthesis of RNA molecules and their modification. Proteins are important components of nuclear bodies and are involved in the maintenance of the nuclear architecture, transport across the nuclear envelope and cell division. Given their importance, the current poor knowledge of plant nuclear proteins and their dynamics during the cell's life and division is striking. Several factors hamper the analysis of the plant nuclear proteome, but the most critical seems to be the contamination of nuclei by cytosolic material during their isolation. With the availability of an efficient protocol for the purification of plant nuclei, based on flow cytometric sorting, contamination by cytoplasmic remnants can be minimized. Moreover, flow cytometry allows the separation of nuclei in different stages of the cell cycle (G1, S, and G2). This strategy has led to the identification of a large number of nuclear proteins from barley (Hordeum vulgare), thus triggering the creation of a dedicated database called UNcleProt, http://barley.gambrinus.ueb.cas.cz/.
Lipidomics and metabolomics communities have various informatics tools at their disposal; however, software programs handling multimodal mass spectrometry (MS) data with structural annotations guided by the Lipidomics Standards Initiative are limited. Here, we provide MS-DIAL 5 for in-depth lipidome structural elucidation through electron-activated dissociation (EAD)-based tandem MS and for determining the molecular localization of lipids through MS imaging (MSI) data, using a species/tissue-specific lipidome database containing the predicted collision cross-section values. With the optimized EAD settings using 14 eV kinetic energy, the program correctly delineated lipid structures for 96.4% of authentic standards, among which 78.0% had the sn-, OH-, and/or C=C positions correctly assigned at concentrations exceeding 1 μM. We showcased our workflow by annotating the sn- and double-bond positions of eye-specific phosphatidylcholines containing very-long-chain polyunsaturated fatty acids (VLC-PUFAs), characterized as PC n-3-VLC-PUFA/FA. Using MSI data from the eye and n-3-VLC-PUFA-supplemented HeLa cells, we identified glycerol 3-phosphate acyltransferase as an enzyme candidate responsible for incorporating n-3 VLC-PUFAs into the sn1 position of phospholipids in mammalian cells, which was confirmed using EAD-MS/MS and recombinant proteins in a cell-free system. Therefore, the MS-DIAL 5 environment, combined with optimized MS data acquisition methods, facilitates a better understanding of lipid structures and their localization, offering insights into lipid biology.
- MeSH
- Data Mining * methods MeSH
- Phosphatidylcholines metabolism chemistry MeSH
- HeLa Cells MeSH
- Mass Spectrometry methods MeSH
- Humans MeSH
- Lipidomics * methods MeSH
- Lipids chemistry analysis MeSH
- Metabolomics methods MeSH
- Fatty Acids, Unsaturated metabolism chemistry MeSH
- Software MeSH
- Tandem Mass Spectrometry methods MeSH
- Animals MeSH
- Check Tag
- Humans MeSH
- Animals MeSH
- Publication type
- Journal Article MeSH
- MeSH
- Chromosomes, Plant genetics MeSH
- Chromosomes genetics MeSH
- Data Mining MeSH
- Databases, Bibliographic MeSH
- Databases, Genetic * MeSH
- Fungi genetics MeSH
- Internet MeSH
- Plants genetics MeSH
- Animals MeSH
- Check Tag
- Animals MeSH
- Publication type
- Letter MeSH
- Research Support, Non-U.S. Gov't MeSH
The aim of this study was to discover new nitrilases with useful activities, especially towards dinitriles that are precursors of high-value cyano acids. Genes coding for putative nitrilases of different origins (fungal, plant, or bacterial) with moderate similarities to known nitrilases were selected by mining the GenBank database, synthesized artificially and expressed in Escherichia coli. The enzymes were purified, examined for their substrate specificities, and classified into subtypes (aromatic nitrilase, arylacetonitrilase, aliphatic nitrilase, cyanide hydratase) which were largely in accordance with those predicted from bioinformatic analysis. The catalytic potential of the nitrilases for dinitriles was examined with cyanophenyl acetonitriles, phenylenediacetonitriles, and fumaronitrile. The nitrilase activities and selectivities for dinitriles and the reaction products (cyano acid, cyano amide, diacid) depended on the enzyme subtype. At a preparative scale, all the examined dinitriles were hydrolyzed into cyano acids and fumaronitrile was converted to cyano amide using E. coli cells producing arylacetonitrilases and an aromatic nitrilase, respectively.
- MeSH
- Aminohydrolases genetics metabolism MeSH
- Data Mining MeSH
- Escherichia coli genetics metabolism MeSH
- Gene Expression MeSH
- Cloning, Molecular MeSH
- Nitriles metabolism MeSH
- Recombinant Proteins isolation & purification metabolism MeSH
- Substrate Specificity MeSH
- Computational Biology MeSH
- Publication type
- Journal Article MeSH
- Research Support, Non-U.S. Gov't MeSH
Seed characteristics play an important role in the colonization and subsequent persistence of species during succession in disturbed sites and thus may help to predict restoration success. In the present study, we investigated how various seed characteristics participated in 11 spontaneous successional series running in different mining sites (spoil heaps, extracted sand and sand-gravel pits, extracted peatlands, and stone quarries) in the Czech Republic, Central Europe. Using 1864 samples from 1- to 100-year-old successional stages, we tested whether a species' optimum along the succession gradient could be predicted using 10 basic species traits connected with diaspores and dispersal. Seed longevity, diaspore mass, endozoochory, and autochory appeared to be the best predictors. The results indicate that seed characteristics can, to a certain degree, predict spontaneous vegetation succession, i.e., passive restoration, in the mining sites. A screening of species available in the given landscape (regional and local species pools) may help to identify those species which would potentially colonize the disturbed sites. Extensive databases of species traits, nowadays available for the Central European flora, enable such screening.
- MeSH
- Time Factors MeSH
- Plant Dispersal * MeSH
- Ecosystem MeSH
- Mining * MeSH
- Environmental Restoration and Remediation * MeSH
- Seeds growth & development MeSH
- Publication type
- Journal Article MeSH
- Geographic names
- Czech Republic MeSH