CATH: increased structural coverage of functional space
Language English
Country England, Great Britain
Media print
Document type Journal Article, Research Support, Non-U.S. Gov't
Grant support
Wellcome Trust - United Kingdom
203780/Z/16/A
Wellcome Trust - United Kingdom
104960/Z/14/Z
Wellcome Trust - United Kingdom
PubMed
33237325
PubMed Central
PMC7778904
DOI
10.1093/nar/gkaa1079
PII: 6006195
- MeSH
- Molecular Sequence Annotation MeSH
- COVID-19 epidemiology prevention & control virology MeSH
- Databases, Protein statistics & numerical data MeSH
- Epidemics MeSH
- Internet MeSH
- Humans MeSH
- Protein Domains * MeSH
- Proteins chemistry genetics metabolism MeSH
- SARS-CoV-2 genetics metabolism physiology MeSH
- Amino Acid Sequence MeSH
- Sequence Analysis, Protein methods MeSH
- Sequence Homology, Amino Acid MeSH
- Viral Proteins chemistry genetics metabolism MeSH
- Computational Biology methods statistics & numerical data MeSH
- Check Tag
- Humans MeSH
- Publication type
- Journal Article MeSH
- Research Support, Non-U.S. Gov't MeSH
- Names of Substances
- Proteins MeSH
- Viral Proteins MeSH
CATH (https://www.cathdb.info) identifies domains in protein structures from wwPDB and classifies these into evolutionary superfamilies, thereby providing structural and functional annotations. There are two levels: CATH-B, a daily snapshot of the latest domain structures and superfamily assignments, and CATH+, with additional derived data, such as predicted sequence domains and functionally coherent sequence subsets (Functional Families or FunFams). The latest CATH+ release, version 4.3, significantly increases coverage of structural and sequence data, with an addition of 65,351 fully classified domain structures (+15%), providing 500,238 structural domains, and 151 million predicted sequence domains (+59%) assigned to 5,481 superfamilies. The FunFam generation pipeline has been re-engineered to cope with the increased influx of data. Three times more sequences are captured in FunFams, with a concomitant increase in functional purity, information content and structural coverage. FunFam expansion increases the structural annotations provided for experimental GO terms (+59%). We also present CATH-FunVar web pages displaying variations in protein sequences and their proximity to known or predicted functional sites. We present two case studies: (1) putative cancer drivers and (2) SARS-CoV-2 proteins. Finally, we have improved links to and from CATH, including SCOP, InterPro, Aquaria and 2DProt.