WCG PostHPF1 wrapup and scientific
[Jul 18, 2006 10:04:18 PM]
HPF1 wrap up:
HPF1 is over on the grid. This doesn't mean it's over for the scientists. The calculations are the raw footage; now we have to edit it together to make the movie. Here is a detailed summary of our progress so far (actually this is a medium level of detail... the papers describing this work are en route and I'll put up the links when they're up; we're submitting for publication in open access venues).
Scaling up de novo structure prediction: Rosetta on the WCGrid. We have used an innovative and cost-effective distributed/grid system (that's you!) to generate the structures required for this project. Currently over 200,000 people have downloaded the client to run grid-Rosetta, comprising a network of over 370,000 processors. This amount of computational power removes the barrier represented by the computational cost of structure prediction methods. We can also, for the first time, experiment with all-atom (higher resolution) structure prediction methods at the genome-wide scale (that's hpf2).
The first step in our genome-scale folding was domain parsing: proteins are parsed into structural domains using Ginzu. Splitting proteins into their component structural domains greatly aids all downstream structure- and function-based methods (especially given the size limitations of structure prediction methods). Ginzu was used successfully in CASP4-6 to delineate domains within queries. During the first phase of the structure generation (Proteome Folding Project on wcgrid), domains from over 150 genomes were predicted. Over 150,000 domains in these genomes were approachable only by the Rosetta method and the downstream structure-based methods proposed herein. That represents a very sizable computation and a big contribution (in terms of the number of unannotated domains approached). Our access to the grid allowed us to devote our computer and human resources to the main objectives of this proposal: the new science behind annotating and distributing the results to the biological community. The number of genomes processed is currently set as stated (150 genomes spanning the tree of life), but we will endeavor to expand this list of genomes as additional blocks of time on the grid become available.
Early results on yeast: Integrating predicted models with GO. We used grid-Rosetta and Ginzu to reannotate the yeast genome (collaboration with D. Baker). In yeast, 38% of domains had a sequence-detectable homolog of known structure, and an additional 9% could confidently be annotated by fold recognition methods. In our recent work on yeast we relied on the protein function encoding described by GO-function. For the encoding of structure we rely on the SCOP database. Over 1000 confident Rosetta de novo results were produced by this recent study, significantly increasing the number of yeast domains having one or more structure and/or function annotations.
Bayesian framework for predicting function using fold and localization and/or process: We have recently demonstrated that integration with GO process and localization allows us to automatically resolve some of the functional ambiguities associated with structure alone. Given localization, L, biological process, P, and predicted structures, D, we derived GO-function. These steps (aims II.b and II.c) have been carried out with success for Saccharomyces cerevisiae.
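To give a feel for how evidence from localization (L), process (P) and predicted structure (D) can be combined, here is a minimal sketch in Python. It is not the method from our paper: it assumes the three kinds of evidence are conditionally independent given the function (a naive Bayes simplification), and the function name, the candidate functions and all the numbers are made up purely for illustration.

```python
from typing import Dict


def posterior_function(
    prior: Dict[str, float],
    lik_loc: Dict[str, float],
    lik_proc: Dict[str, float],
    lik_struct: Dict[str, float],
) -> Dict[str, float]:
    """Return P(F | L, P, D) for each candidate function F.

    Assumes (hypothetically) that L, P and D are conditionally
    independent given F, so that
        P(F | L, P, D) is proportional to P(F) * P(L|F) * P(P|F) * P(D|F).
    """
    # Unnormalized score for each candidate function.
    unnorm = {
        f: prior[f] * lik_loc[f] * lik_proc[f] * lik_struct[f]
        for f in prior
    }
    total = sum(unnorm.values())
    if total == 0:
        raise ValueError("all candidate functions have zero probability")
    # Normalize so the posteriors sum to 1.
    return {f: score / total for f, score in unnorm.items()}


# Illustrative numbers only: two candidate functions that structure
# and process cannot tell apart, while localization favors the first.
example = posterior_function(
    prior={"DNA binding": 0.5, "kinase activity": 0.5},
    lik_loc={"DNA binding": 0.8, "kinase activity": 0.2},
    lik_proc={"DNA binding": 0.5, "kinase activity": 0.5},
    lik_struct={"DNA binding": 0.5, "kinase activity": 0.5},
)
```

In this toy case the structure evidence is ambiguous on its own, and it is the localization term that tips the balance toward one function, which is the kind of ambiguity the GO integration is meant to resolve.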
The database is up for yeast and we are working to finish the post-processing for the other 149 genomes (including human). The results will be served via bioNetBuilder and via a similar web page (for both, check the Bonneau lab website in a few weeks). Until then we're beta-testing by releasing our results to the yeast community (an active community responsible for great numbers of biomedical discoveries).
An example prediction (http://www.yeastrc.org/pdr/viewPSPOverview.do?id=533920#DN1) shows that the GO integration increased our confidence that this subunit of the mediator complex (MED4) contains an N-terminal DNA-binding domain with a homeodomain-like fold.
Our second paper from the grid has been submitted. Lars, David and I are working on it now (after reviewer comments). Hopefully it's published soon, but this process usually drags on for months (accuracy and academic integrity require patience).
On to hpf2!