Our latest structure research paper has been accepted by Genome Research (print publication to follow shortly)! This is a major paper for us, and it describes the fundamentals of our recent work in protein structure and function prediction. See the full text here: The proteome folding project: Proteome-scale prediction of structure and function, or find it on our Publications page.
The incompleteness of proteome structure and function annotation is a critical problem for biologists and, in particular, severely limits interpretation of high-throughput and next-generation experiments. We have developed a proteome annotation pipeline based on structure prediction, where function and structure annotations are generated using an integration of sequence comparison, fold recognition and grid-computing-enabled de novo structure prediction. We predict protein domain boundaries and 3D structures for protein domains from 94 genomes (including Human, Arabidopsis, Rice, Mouse, Fly, Yeast, E. coli and Worm). De novo structure predictions were distributed on a grid of over 1.5 million CPUs worldwide (World Community Grid). We generate significant numbers of new confident fold annotations (9% of domains that are otherwise unannotated in these genomes). We demonstrate that predicted structures can be combined with annotations from the Gene Ontology database to predict new and more specific molecular functions.
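To make the idea of combining several annotation methods concrete, here is a minimal sketch of one way such an integration could be organized: each domain is passed to a list of methods in order, and the first result that clears its confidence threshold is kept. This is an illustration only, not the pipeline from the paper. The names `Annotation` and `annotate_domain`, the ordering of the methods, and the thresholds are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Annotation:
    """A structure annotation for one domain (hypothetical record type)."""
    domain_id: str
    method: str
    confidence: float


# A method takes a domain sequence and returns a confidence in [0, 1],
# or None if it produces no prediction at all. (Assumed interface.)
Method = Callable[[str], Optional[float]]


def annotate_domain(
    domain_id: str,
    sequence: str,
    methods: List[Tuple[str, Method, float]],
) -> Optional[Annotation]:
    """Try each (name, method, threshold) in order; keep the first confident hit.

    Returns None when no method reaches its threshold, i.e. the domain
    stays unannotated.
    """
    for name, method, threshold in methods:
        score = method(sequence)
        if score is not None and score >= threshold:
            return Annotation(domain_id, name, score)
    return None
```

In practice the entries in `methods` would wrap real tools (for example a sequence search against known structures, a fold recognition run, and a de novo prediction); here they are left as plain callables so the control flow is visible.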
Currently, a large number of proteins in all genomes have unknown function, and the vast majority do not have any annotation of protein structure (i.e., no homology to a sequence with known structure). Although large amounts of genome-wide data are accumulating, these data sets in general do not provide specific molecular functions for these proteins. We are developing and applying methods to extend the annotation of the Arabidopsis thaliana genome and, secondarily, the rice genome by incorporating protein structural information. This work has recently resulted in a collaborative grant with Mike Purugganan and Gloria Coruzzi to apply our structure prediction methods to several plant genomes.
These methods are based on our successful structure-based functional annotation of the yeast, human and Halobacterium genomes, among many others. Structure-based methods provide an alternative route to function annotation, yet the automatic interpretation of structure-function relationships is non-trivial and requires addressing several data integration challenges. We address these challenges through a novel integration of best-in-class structure prediction (including Rosetta de novo structure prediction), phylogenetic and evolutionary analysis, and automated protein function prediction.
We use the World Community Grid as our computational platform, overcoming the vast computational barrier to genome-wide de novo structure prediction. Phylogenies of the domains of gene families are estimated, and amino acid sites under positive selection during the evolution of these gene families are identified using codon-based models. These results are then mapped onto the predicted structures, providing a structure-based evolutionary annotation of protein domains. No similar resource or set of tools is currently available to the Arabidopsis community, and we expect this approach to yield large numbers of correct structure and function predictions, which would serve as a foundation for future genetic, genomic and proteomic analyses.
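As a rough illustration of the mapping step described above, the sketch below converts alignment columns flagged as positively selected into residue numbers of one domain sequence, which is the coordinate system a predicted structure for that domain would use. It assumes that selected sites are reported as 0-based columns of a gapped alignment and that residues in the structure are numbered from 1 along the ungapped sequence. The function names are hypothetical and are not taken from our pipeline.

```python
from typing import Dict, Iterable, List


def column_to_residue_map(aligned_seq: str, gap_chars: str = "-.") -> Dict[int, int]:
    """Map 0-based alignment columns to 1-based residue numbers.

    Gap columns have no residue in this sequence and are left out of the map.
    """
    mapping: Dict[int, int] = {}
    residue_number = 0
    for column, char in enumerate(aligned_seq):
        if char in gap_chars:
            continue
        residue_number += 1
        mapping[column] = residue_number
    return mapping


def map_selected_sites(aligned_seq: str, selected_columns: Iterable[int]) -> List[int]:
    """Return residue numbers for the selected columns that fall on a residue.

    Columns that are gaps in this sequence are silently dropped, since
    there is nothing to highlight on the structure at those positions.
    """
    mapping = column_to_residue_map(aligned_seq)
    return sorted(mapping[c] for c in selected_columns if c in mapping)


# Example: columns 1, 2, 4 and 8 were flagged; column 2 is a gap here.
sites = map_selected_sites("MK-TA--LV", [1, 2, 4, 8])
print(sites)  # [2, 4, 6]
```

The resulting residue numbers could then be used to color or label positions in a structure viewer, provided the predicted model uses the same numbering as the ungapped domain sequence.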