Annotation term
OBJECTIVES: Our main objective is to design a method of, and supporting software for, interactive correction and semantic annotation of narrative clinical reports, which would allow for their easier and less error-prone processing outside their original context: first, by physicians unfamiliar with the original language (and possibly also the source specialty), and second, by tools requiring structured information, such as decision-support systems. Our additional goal is to gain insights into the process of narrative report creation, including the errors and ambiguities arising therein, and also into the process of annotating reports with clinical terms. Finally, we also aim to provide a dataset of ground-truth transformations (specific to Czech as the source language), set up by expert physicians, which can be reused in the future for subsequent analytical studies and for training automated transformation procedures. METHODS: A three-phase preprocessing method has been developed to support secondary use of narrative clinical reports in electronic health records. Narrative clinical reports are narrative texts of healthcare documentation often stored in electronic health records. In the first phase, a narrative clinical report is tokenized. In the second phase, the tokenized clinical report is normalized. The normalized clinical report is easily readable for health professionals with knowledge of the language used in the narrative clinical report. In the third phase, the normalized clinical report is enriched with extracted structured information. The final result of the third phase is a semi-structured normalized clinical report in which the extracted clinical terms are matched to codebook terms. Software tools for interactive correction, expansion and semantic annotation of narrative clinical reports have been developed, and the three-phase preprocessing method has been validated in the field of cardiology. 
RESULTS: The three-phase preprocessing method was validated on 49 anonymous Czech narrative clinical reports in the field of cardiology. Descriptive statistics were calculated from the database of completed transformations. Two cardiologists participated in the annotation phase. The first cardiologist annotated 1500 clinical terms found in the 49 narrative clinical reports with codebook terms, using the classification systems ICD-10, SNOMED CT, LOINC and LEKY. The second cardiologist validated the annotations of the first cardiologist. The correct clinical terms and the codebook terms have been stored in a database. CONCLUSIONS: We extracted structured information from Czech narrative clinical reports by the proposed three-phase preprocessing method and linked it to electronic health records. The software tool, although generic, is tailored for Czech as the specific language of the pool of electronic health records under study. This will provide a potential reference standard for porting this approach to dozens of other less-spoken languages. Structured information can support medical decision making, quality assurance tasks and further medical research.
- MeSH
- Electronic Health Records standards MeSH
- International Classification of Diseases MeSH
- Writing standards MeSH
- Vocabulary, Controlled * MeSH
- Semantics * MeSH
- Guidelines as Topic MeSH
- Meaningful Use standards MeSH
- Software MeSH
- Data Accuracy MeSH
- Machine Learning * MeSH
- User-Computer Interface MeSH
- Natural Language Processing * MeSH
- Word Processing standards MeSH
- Publication type
- Journal Article MeSH
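The three phases named in the abstract above (tokenization, normalization, annotation against a codebook) can be pictured with a minimal sketch. It is not the authors' software, which is interactive and relies on expert physicians; the abbreviation table, the codebook entries and all function names here are hypothetical, chosen only to show how the output of each phase feeds the next.

```python
# Hypothetical lookup tables; a real system would use curated resources
# and interactive expert correction instead of fixed dictionaries.
ABBREVIATIONS = {"pac.": "pacient", "hypert.": "hypertenze"}
CODEBOOK = {"hypertenze": ("ICD-10", "I10")}


def tokenize(report):
    """Phase 1: split the raw report into whitespace-delimited tokens."""
    return report.split()


def normalize(tokens):
    """Phase 2: lowercase, expand known abbreviations, strip punctuation."""
    normalized = []
    for token in tokens:
        key = token.lower()
        if key in ABBREVIATIONS:
            normalized.append(ABBREVIATIONS[key])
        else:
            cleaned = key.strip(".,;:")
            if cleaned:  # drop tokens that were punctuation only
                normalized.append(cleaned)
    return normalized


def annotate(tokens):
    """Phase 3: pair each token with a codebook entry, or None if unmatched."""
    return [(token, CODEBOOK.get(token)) for token in tokens]


if __name__ == "__main__":
    print(annotate(normalize(tokenize("Pac. má hypert."))))
```

Keeping the phases as separate functions mirrors the abstract's description, in which the normalized text is already useful to human readers before any codebook matching takes place.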
BACKGROUND: Immune-response (IR) genes have an important role in the defense against highly variable pathogens, and therefore, diversity in these genomic regions is essential for species' survival and adaptation. Although current genome assemblies from Old World camelids are very useful for investigating genome-wide diversity, demography and population structure, they have inconsistencies and gaps that limit analyses at local genomic scales. Improved and more accurate genome assemblies and annotations are needed to study complex genomic regions like adaptive and innate IR genes. RESULTS: In this work, we improved the genome assemblies of the three Old World camel species - domestic dromedary and Bactrian camel, and the two-humped wild camel - via different computational methods. The newly annotated dromedary genome assembly CamDro3 served as reference to scaffold the NCBI RefSeq genomes of domestic Bactrian and wild camels. These upgraded assemblies were then used to assess nucleotide diversity of IR genes within and between species, and to compare the diversity found in immune genes and the rest of the genes in the genome. We detected differences in the nucleotide diversity among the three Old World camelid species and between IR gene groups, i.e., innate versus adaptive. Among the three species, domestic Bactrian camels showed the highest mean nucleotide diversity. Among the functionally different IR gene groups, the highest mean nucleotide diversity was observed in the major histocompatibility complex. CONCLUSIONS: The new camel genome assemblies were greatly improved in terms of contiguity and increased size with fewer scaffolds, which is of general value for the scientific community. This allowed us to perform in-depth studies on genetic diversity in immunity-related regions of the genome. 
Our results suggest that the differences in diversity across gene classes are compatible with a combined role of population history and differential exposure to pathogens, and the consequent differences in selective pressure.
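For readers unfamiliar with the statistic, nucleotide diversity is commonly defined as the average number of differences per site between all pairs of sequences in a sample. The sketch below computes it under that standard definition; the abstract does not say which tool or estimator the authors used, so the function name and the toy alignment are assumptions for illustration.

```python
from itertools import combinations


def nucleotide_diversity(sequences):
    """Mean pairwise differences per site over aligned, equal-length sequences."""
    if len(sequences) < 2:
        raise ValueError("at least two sequences are required")
    length = len(sequences[0])
    if length == 0 or any(len(seq) != length for seq in sequences):
        raise ValueError("sequences must be non-empty and of equal length")
    pairs = list(combinations(sequences, 2))
    # Count mismatching positions for every pair of sequences.
    total_differences = sum(
        sum(a != b for a, b in zip(first, second)) for first, second in pairs
    )
    return total_differences / (len(pairs) * length)


if __name__ == "__main__":
    # Toy alignment of three sequences, four sites each.
    print(nucleotide_diversity(["AAAA", "AAAT", "AATT"]))
```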
Background: Developmental coordination disorder (DCD) is described as a motor skill disorder characterized by a marked impairment in the development of motor coordination abilities that significantly interferes with performance of daily activities and/or academic achievement. Since some electrophysiological studies suggest differences between children with and without motor development problems, we prepared an experimental protocol and performed electrophysiological experiments with the aim of making a step toward a possible diagnosis of this disorder using the event-related potentials (ERP) technique. The second aim is to properly annotate the obtained raw data with relevant metadata and promote their long-term sustainability. Results: Data from 32 school children (16 with possible DCD and 16 in the control group) were collected. Each dataset contains raw electroencephalography (EEG) data in the BrainVision format and provides sufficient metadata (such as age, gender, results of the motor test, and hearing thresholds) to allow other researchers to perform analysis. For each experiment, the percentage of ERP trials damaged by blinking artifacts was estimated. Furthermore, ERP trials were averaged across different participants and conditions, and the resulting plots are included in the manuscript. This should help researchers to estimate the usability of individual datasets for analysis. Conclusions: The aim of the whole project is to find out whether any conclusions about DCD can be drawn from the EEG data obtained. For the purpose of further analysis, the data were collected and annotated respecting the current outcomes of the International Neuroinformatics Coordinating Facility Program on Standards for Data Sharing, the Task Force on Electrophysiology, and the group developing the Ontology for Experimental Neurophysiology. The data with metadata are stored in the EEG/ERP Portal.
- MeSH
- Acoustic Stimulation MeSH
- Data Curation MeSH
- Child MeSH
- Electroencephalography MeSH
- Evoked Potentials MeSH
- Comorbidity MeSH
- Quantitative Trait, Heritable MeSH
- Humans MeSH
- Computer Simulation MeSH
- Motor Skills Disorders diagnosis MeSH
- Reaction Time MeSH
- Reproducibility of Results MeSH
- Software MeSH
- Photic Stimulation MeSH
- Check Tag
- Child MeSH
- Humans MeSH
- Male MeSH
- Female MeSH
- Publication type
- Journal Article MeSH
The tarnished plant bug (TPB), Lygus lineolaris (Palisot de Beauvois), is a polyphagous, phytophagous insect that has emerged as a major pest of cotton, alfalfa, fruits, and vegetable crops in the eastern United States and Canada. Using its piercing-sucking mouthparts, TPB employs a "lacerate and flush" feeding strategy in which saliva injected into plant tissue degrades cell wall components and lyses cells whose contents are subsequently imbibed by the TPB. Polygalacturonase enzymes, which degrade the pectin in the cell walls, are known to be a major component of TPB saliva. However, not much is known about the other components of the saliva of this important pest. In this study, we explored the salivary gland transcriptome of TPB using Illumina sequencing. After in silico conversion of RNA sequences into corresponding polypeptides, 25,767 putative proteins were discovered. Of these, 19,540 (78.83%) showed significant similarity to known proteins in either the NCBI nr or UniProt databases. Gene ontology (GO) terms were assigned to 7,512 proteins, and 791 proteins in the sialotranscriptome of TPB were found to collectively map to 107 Kyoto Encyclopedia of Genes and Genomes (KEGG) database pathways. A total of 3,653 Pfam domains were identified in 10,421 predicted sialotranscriptome proteins, resulting in 12,814 Pfam annotations; some proteins had more than one Pfam domain. Functional annotation revealed a number of salivary gland proteins that potentially facilitate degradation of host plant tissues and mitigation of the host plant defense response. These transcripts/proteins and their potential roles in TPB establishment are described.
- MeSH
- Molecular Sequence Annotation MeSH
- Gene Ontology MeSH
- Heteroptera genetics growth & development metabolism MeSH
- Genes, Insect genetics MeSH
- Salivary Glands metabolism MeSH
- Gene Expression Profiling * MeSH
- Animals MeSH
- Check Tag
- Animals MeSH
- Publication type
- Journal Article MeSH
- Research Support, Non-U.S. Gov't MeSH
Recent research has shown that circular RNAs (circRNAs) are functional in gene expression regulation and potentially related to diseases. Due to their stability, circRNAs can also be used as biomarkers for diagnosis. However, the function of most circRNAs remains unknown, and it is expensive and time-consuming to discover it through biological experiments. In this paper, we predict circRNA annotations from the knowledge of their interaction with miRNAs and subsequent miRNA-mRNA interactions. First, we construct an interaction network for a target circRNA; second, we spread the information from the network nodes with known function to the root circRNA node. This idea itself is not new; our main contribution lies in proposing an efficient and exact deterministic procedure, based on the principle of probability-generating functions, to calculate the p-value of the association test between a circRNA and an annotation term. We show that our publicly available algorithm is both more effective and more efficient than the commonly used Monte-Carlo sampling approach, which may suffer from difficult quantification of sampling convergence and subsequent sampling inefficiency. We experimentally demonstrate that the new approach is two orders of magnitude faster than Monte-Carlo sampling, which makes summary annotation of large circRNA files feasible; this includes, for example, their reannotation after periodic interaction network updates. We provide a summary annotation of a current circRNA database as one of our outputs. The proposed algorithm could be generalized to other types of RNA in a straightforward way.
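The general principle of obtaining an exact p-value from a probability-generating function, rather than by sampling, can be shown on a deliberately simplified case. The sketch below is not the algorithm from the abstract above: it assumes a test statistic that is a sum of independent 0/1 indicators with known probabilities, whose generating function is the product of the factors (1 - p_i) + p_i z. Multiplying these factors out yields the full null distribution, and the p-value is an exact tail sum. All names and numbers are illustrative.

```python
def pgf_coefficients(probabilities):
    """Coefficients of prod((1 - p) + p*z); entry k equals P(X = k)."""
    coefficients = [1.0]
    for p in probabilities:
        expanded = [0.0] * (len(coefficients) + 1)
        for k, weight in enumerate(coefficients):
            expanded[k] += weight * (1.0 - p)  # indicator is 0
            expanded[k + 1] += weight * p      # indicator is 1
        coefficients = expanded
    return coefficients


def exact_p_value(probabilities, observed):
    """Exact upper-tail probability P(X >= observed) under the null model."""
    return sum(pgf_coefficients(probabilities)[observed:])


if __name__ == "__main__":
    # Three hypothetical network nodes with different annotation probabilities.
    print(exact_p_value([0.1, 0.3, 0.2], observed=2))
```

The polynomial expansion costs on the order of n squared operations for n indicators and involves no sampling error, which is the kind of advantage over Monte-Carlo estimation that the abstract reports for its own, more elaborate procedure.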
BACKGROUND: The mammalian Natural Killer Complex (NKC) harbors genes and gene families encoding a variety of C-type lectin-like proteins expressed on various immune cells. The NKC is a complex genomic region well-characterized in mice, humans and domestic animals. The major limitations of automatic annotation of the NKC in non-model animals include short-read based sequencing, methods of assembling highly homologous and repetitive sequences, orthologues missing from reference databases and weak expression. In this situation, manual annotations of complex genomic regions are necessary. METHODS: This study presents a manual annotation of the genomic structure of the NKC region in a high-quality reference genome of the domestic cat and compares it with other felid species and with representatives of other carnivore families. Reference genomes of Carnivora, irrespective of sequencing and assembly methods, were screened by BLAST to retrieve information on their killer cell lectin-like receptor (KLR) gene content. Phylogenetic analysis of in silico translated proteins of expanded subfamilies was carried out. RESULTS: The overall genomic structure of the NKC in Carnivora is rather conservative in terms of its C-type lectin receptor gene content. A novel KLRH-like gene subfamily (KLRL) was identified in all Carnivora and a novel KLRJ-like gene was annotated in the Mustelidae. In all six families studied, one subfamily (KLRC) expanded and experienced pseudogenization. The KLRH gene subfamily expanded in all carnivore families except the Canidae. The KLRL gene subfamily expanded in carnivore families except the Felidae and Canidae, and in the Canidae it eroded to fragments. CONCLUSIONS: Knowledge of the genomic structure and gene content of the NKC region is a prerequisite for accurate annotations of newly sequenced genomes, especially of endangered wildlife species. 
Identification of expressed genes, pseudogenes and gene fragments in the context of expanded gene families would allow the assessment of functionally important variability in particular species.
- MeSH
- Molecular Sequence Annotation MeSH
- Killer Cells, Natural * immunology metabolism MeSH
- Carnivora * genetics MeSH
- Phylogeny * MeSH
- Genome MeSH
- Genomics * methods MeSH
- Cats genetics MeSH
- Lectins, C-Type genetics MeSH
- Animals MeSH
- Check Tag
- Cats genetics MeSH
- Animals MeSH
- Publication type
- Journal Article MeSH
- Comparative Study MeSH
BACKGROUND: One of the major challenges in the analysis of gene expression data is to identify local patterns composed of genes showing coherent expression across subsets of experimental conditions. Such patterns may provide an understanding of underlying biological processes related to these conditions. This understanding can further be improved by providing concise characterizations of the genes and situations delimiting the pattern. RESULTS: We propose a method called semantic biclustering, which aims to detect interpretable rectangular patterns in binary data matrices. As usual in biclustering, we seek homogeneous submatrices; however, we also require that the included elements can be jointly described in terms of semantic annotations pertaining to both rows (genes) and columns (samples). To find such interpretable biclusters, we explore two strategies. The first endows an existing biclustering algorithm with the semantic ingredients. The other is based on rule and tree learning known from machine learning. CONCLUSIONS: The two alternatives are tested in experiments with two Drosophila melanogaster gene expression datasets. Both strategies are shown to detect sets of compact biclusters with semantic descriptions that also remain largely valid for unseen (testing) data. This desirable generalization is more pronounced in the strategy stemming from conventional biclustering, although it comes at the cost of more complex descriptions (a higher number of ontology terms employed); the alternative strategy yields simpler descriptions.
BACKGROUND: The advancement of sequencing technologies results in the rapid release of hundreds of new genome assemblies a year providing unprecedented resources for the study of genome evolution. Within this context, the significance of in-depth analyses of repetitive elements, transposable elements (TEs) in particular, is increasingly recognized in understanding genome evolution. Despite the plethora of available bioinformatic tools for identifying and annotating TEs, the phylogenetic distance of the target species from a curated and classified database of repetitive element sequences constrains any automated annotation effort. Moreover, manual curation of raw repeat libraries is deemed essential due to the frequent incompleteness of automatically generated consensus sequences. RESULTS: Here, we present an example of a crowd-sourcing effort aimed at curating and annotating TE libraries of two non-model species built around a collaborative, peer-reviewed teaching process. Manual curation and classification are time-consuming processes that offer limited short-term academic rewards and are typically confined to a few research groups where methods are taught through hands-on experience. Crowd-sourcing efforts could therefore offer a significant opportunity to bridge the gap between learning the methods of curation effectively and empowering the scientific community with high-quality, reusable repeat libraries. CONCLUSIONS: The collaborative manual curation of TEs from two tardigrade species, for which there were no TE libraries available, resulted in the successful characterization of hundreds of new and diverse TEs in a reasonable time frame. Our crowd-sourcing setting can be used as a teaching reference guide for similar projects: A hidden treasure awaits discovery within non-model organisms.
- Publication type
- Journal Article MeSH
Secondary hyperparathyroidism is a well-known complication of end-stage renal disease (ESRD). Both nodular and diffuse parathyroid hyperplasia occur in ESRD patients. However, their distinct molecular mechanisms remain poorly understood. Parathyroid tissue obtained from ESRD patients who had undergone parathyroidectomy was used for Illumina transcriptome screening and subsequently for discriminatory gene analysis, pathway mapping, and gene annotation enrichment analysis. Results were further validated using quantitative RT-PCR in an independent, larger cohort. Microarray screening showed homogeneity of gene transcripts in hemodialysis patients compared with the transplant cohort and primary hyperparathyroidism; therefore, further experiments were performed in hemodialysis patients only. Enrichment analysis conducted on 485 genes differentially expressed between nodular and diffuse parathyroid hyperplasia revealed highly significant differences in Gene Ontology terms and the Kyoto Encyclopedia of Genes and Genomes database in ribosome structure (P = 3.70 × 10⁻¹⁸). Next, quantitative RT-PCR validation of the top differentially expressed genes from microarray analysis showed higher expression of RAN guanine nucleotide release factor (RANGRF; P < 0.001), calcyclin-binding protein (CACYBP; P < 0.05), and exocyst complex component 8 (EXOC8; P < 0.05) and lower expression of peptidylprolyl cis/trans-isomerase and NIMA-interacting 1 (PIN1; P < 0.01) mRNA in nodular hyperplasia. Multivariate analysis revealed higher RANGRF and lower PIN1 expression, along with parathyroid weight, to be associated with nodular hyperplasia. In conclusion, our study suggests that the RANGRF transcript, which controls RNA metabolism, is likely involved in pathways associated with the switch to nodular parathyroid growth. This transcript, along with the PIN1 transcript, which influences parathyroid hormone secretion, may represent new therapeutic targets for the treatment of secondary hyperparathyroidism.
- MeSH
- Kidney Failure, Chronic complications therapy MeSH
- Renal Dialysis * MeSH
- Adult MeSH
- Focal Nodular Hyperplasia etiology genetics therapy MeSH
- Middle Aged MeSH
- Humans MeSH
- RNA, Messenger biosynthesis genetics MeSH
- Multigene Family genetics MeSH
- Parathyroid Hormone blood MeSH
- Parathyroid Glands pathology MeSH
- Parathyroidectomy MeSH
- Hyperparathyroidism, Primary pathology MeSH
- Gene Expression Regulation genetics MeSH
- Hyperparathyroidism, Secondary etiology genetics therapy MeSH
- Aged MeSH
- Gene Expression Profiling MeSH
- Transcriptome genetics MeSH
- Check Tag
- Adult MeSH
- Middle Aged MeSH
- Humans MeSH
- Male MeSH
- Aged MeSH
- Female MeSH
- Publication type
- Journal Article MeSH
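The abstract above reports a gene annotation enrichment analysis without naming the underlying test. One common choice, assumed here purely for illustration, is the one-sided hypergeometric test: given a background of genes, some of which carry an annotation term, it asks how likely it is to see at least the observed number of annotated genes in the differentially expressed set by chance. The function name, parameter names and toy counts below are hypothetical and are not taken from the study.

```python
from math import comb


def enrichment_p_value(population, annotated, selected, hits):
    """P(at least `hits` annotated genes among `selected` drawn from `population`)."""
    total_draws = comb(population, selected)
    upper = min(annotated, selected)
    # Sum the hypergeometric probabilities over the upper tail.
    tail = sum(
        comb(annotated, i) * comb(population - annotated, selected - i)
        for i in range(hits, upper + 1)
    )
    return tail / total_draws


if __name__ == "__main__":
    # Toy numbers: 1000 background genes, 50 carrying the term,
    # 100 differentially expressed genes, 15 of which carry the term.
    print(enrichment_p_value(1000, 50, 100, 15))
```

Because `math.comb` works on exact integers, the tail sum is computed without floating-point error until the final division.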
BACKGROUND: RNA-sequencing analysis is increasingly utilized to study gene expression in non-model organisms without sequenced genomes. Aethionema arabicum (Brassicaceae) exhibits seed dimorphism as a bet-hedging strategy - producing both a less dormant mucilaginous (M+) seed morph and a more dormant non-mucilaginous (NM) seed morph. Here, we compared de novo and reference-genome based transcriptome assemblies to investigate Ae. arabicum seed dimorphism and to evaluate the reference-free versus reference-dependent approach for identifying differentially expressed genes (DEGs). RESULTS: A de novo transcriptome assembly was generated using sequences from M+ and NM Ae. arabicum dry seed morphs. The transcripts of the de novo assembly contained 63.1% complete Benchmarking Universal Single-Copy Orthologs (BUSCO) compared to 90.9% for the transcripts of the reference genome. DEG detection used the strict consensus of three methods (DESeq2, edgeR and NOISeq). Only 37% of 1533 differentially expressed de novo assembled transcripts paired with 1876 genome-derived DEGs. Gene Ontology (GO) terms distinguished the seed morphs: the terms translation and nucleosome assembly were overrepresented in DEGs higher in abundance in M+ dry seeds, whereas terms related to mRNA processing and transcription were overrepresented in DEGs higher in abundance in NM dry seeds. DEGs amongst these GO terms included ribosomal proteins and histones (higher in M+), and RNA polymerase II subunits and related transcription and elongation factors (higher in NM). Expression of the inferred DEGs and other genes associated with seed maturation (e.g. those encoding late embryogenesis abundant proteins and transcription factors regulating seed development and maturation such as ABI3, FUS3, LEC1 and WRI1 homologs) was put in context with Arabidopsis thaliana seed maturation and indicated that M+ seeds may desiccate and mature faster than NM seeds. 
The 1901 transcriptomic DEG set GO-terms had almost 90% overlap with the 2191 genome-derived DEG GO-terms. CONCLUSIONS: Whilst there was only modest overlap of DEGs identified in reference-free versus -dependent approaches, the resulting GO analysis was concordant in both approaches. The identified differences in dry seed transcriptomes suggest mechanisms underpinning previously identified contrasts between morphology and germination behaviour of M+ and NM seeds.
- MeSH
- Molecular Sequence Annotation MeSH
- Brassicaceae genetics growth & development MeSH
- Genome, Plant MeSH
- Gene Ontology MeSH
- Germination MeSH
- Gene Expression Regulation, Plant * MeSH
- Plant Proteins genetics MeSH
- Seeds genetics growth & development MeSH
- Gene Expression Profiling MeSH
- Transcriptome * MeSH
- High-Throughput Nucleotide Sequencing MeSH
- Publication type
- Journal Article MeSH
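The "strict consensus of three methods" mentioned in the abstract above can be read as keeping only those genes that every method calls differentially expressed, that is, a set intersection. The sketch below illustrates that reading together with a simple overlap fraction for comparing two DEG lists; the gene identifiers and function names are hypothetical, and the actual DESeq2, edgeR and NOISeq analyses are not reproduced here.

```python
def strict_consensus(*calls):
    """Genes reported as differentially expressed by every method."""
    return set.intersection(*(set(genes) for genes in calls))


def overlap_fraction(first, second):
    """Fraction of the (non-empty) first gene set that also occurs in the second."""
    first_set = set(first)
    return len(first_set & set(second)) / len(first_set)


if __name__ == "__main__":
    # Hypothetical per-method DEG calls.
    deseq2_calls = ["g1", "g2", "g3", "g5"]
    edger_calls = ["g2", "g3", "g4", "g5"]
    noiseq_calls = ["g2", "g3", "g5", "g6"]
    consensus = strict_consensus(deseq2_calls, edger_calls, noiseq_calls)
    print(sorted(consensus))
    print(overlap_fraction(consensus, ["g2", "g7"]))
```

An intersection is conservative by construction: a gene missed by any single method is dropped, which trades sensitivity for fewer false positives.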