|
Ab Initio Structure Prediction and NMR Validation of Small Proteins with No Sequence Identity in Bacillus Subtilis Genome
original research proposal
by Eduard Chekmenev - Department of Chemistry - University of Louisville, February 8, 2002
A. SPECIFIC AIMS
Specific aim #1 is to use a homology-based search to select three B. subtilis proteins of unknown function that are likely to adopt new folds. Over the past 5 years many genomes have been sequenced [1, 2]. The protein targets for this project will be selected from the Bacillus subtilis (B. subtilis) genome [3], the best-characterized member of the Gram-positive bacteria. B. subtilis produces many small polypeptides of fewer than 85 residues [4]. Some of those polypeptides show antimicrobial activity or affect internal signal transduction. Most of these small proteins (about 81%) have no known function; only their primary sequence is known to date [4]. I hypothesize that new folds may be found among the small polypeptides of B. subtilis. The proteins to be selected have unknown function and structure. First, the chosen sequences will be tested for sequence identity. This screening procedure compares each polypeptide sequence to known sequences and structures in various databases [5, 6]. It is anticipated that proteins with the lowest sequence homology are the most likely to have an unknown fold. The three most interesting candidates (those with the lowest sequence homology) will be used for ab initio structure prediction and verification by NMR methodology.
Specific aim #2 is the rapid determination of the structure/fold of small proteins (fewer than 85 residues). This approach is based on ab initio structure prediction and NMR residual dipolar coupling (RDC) measurements in liquid crystal media without peak assignments. First, a set of reasonable 3D structures is predicted ab initio for each protein. Second, RDCs measured in aqueous medium are used to select the best structure. Importantly, the peak assignment step is unnecessary. In the final step, RDCs are used to refine the best ab initio structure. This methodology will significantly reduce experimental effort and time.
ABBREVIATIONS AND NAMES
BLAST - Basic Local Alignment Search Tool
blastp - one of the program modules of BLAST
B. subtilis - Bacillus subtilis
CO - contact order
CPU - central processing unit
DC - dipolar coupling
Fmoc - 9-fluorenylmethoxycarbonyl
HSQC - Heteronuclear Single Quantum Coherence
LC - liquid crystal
MALDI-MS - matrix-assisted laser desorption ionization-mass spectrometry
MSA - multiple sequence alignment
NIH - National Institutes of Health
NOE - nuclear Overhauser effect
nr - protein database
ORF - open reading frame
PALES - prediction of alignment from structure
PDB - Protein Data Bank
RDC - residual dipolar coupling
rmsd - root mean square deviation (refers to coordinates)
Rosetta - ab initio algorithm developed by Baker et al.
S - order parameter of polypeptide macromolecule
SPORF - small protein open reading frame
SPPS - solid phase peptide synthesis
swissprot - protein database
B. BACKGROUND AND SIGNIFICANCE
Ab initio computational methods are increasingly widespread in chemistry. A huge boost in computational power over the past 20 years [7] has made ab initio calculations an everyday tool for chemists. These capabilities will grow further, since computational power is expected to continue doubling every 18-24 months for the foreseeable future (10 years) [8]. Scientific demands have now expanded to structure prediction and calculation for large systems such as proteins and DNA fragments. For such systems the task is feasible in principle, but the large size of biomolecules makes the required computation time impractically long. Another major problem in protein structure prediction is the abundance of local minima. To overcome the time problem, one has to sacrifice structural resolution and precision to obtain results within a reasonable amount of time. The local minima problem, on the other hand, can be addressed by applying structural constraints to the system. The state-of-the-art algorithm Rosetta, as used in CASP4, handles both problems [9]. This method is capable of producing the correct protein structure with a root mean square deviation (rmsd) of 6.5 Å. Importantly, calculations start from amino acid sequence information only. The disadvantages are low resolution and a multitude of predicted structures: several structures are usually generated and ranked by their probability of being native. The size of proteins that can be computed is limited to 100 to 300 residues [9].
Experimentally, protein structure is most frequently determined by X-ray crystallography and/or NMR spectroscopy, with precision down to 1-2 Å [10]. The era of structural genomics puts time constraints on the experimental methods as well. Thus, rapid determination of protein structure is very important [11].
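The rmsd values quoted above compare the coordinates of a predicted structure with those of the native one. As a minimal sketch of that measure, the following Python function computes the coordinate rmsd between two structures, assuming they have the same atoms in the same order and are already optimally superimposed (no rotational or translational fitting is performed here). The function name and the three-point coordinates are hypothetical and serve only as an illustration; they are not taken from Rosetta or from any measured structure.

```python
import math


def rmsd(coords_a, coords_b):
    """Coordinate rmsd between two equal-length lists of (x, y, z) tuples.

    Assumption: both structures are already superimposed; this function
    does not fit a rotation or translation before comparing them.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate sets must be non-empty and of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        # Squared distance between corresponding atoms.
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Invented coordinates in angstroms, for illustration only.
native = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
model = [(0.0, 0.0, 0.0), (3.8, 3.0, 0.0), (7.6, 0.0, 4.0)]

print(round(rmsd(native, model), 3))  # prints 2.887
```

In practice the two structures would first be superimposed, for example by a least-squares fit, so that the reported value reflects differences in fold and not in overall position or orientation.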
Two years ago the National Institutes of Health (NIH) funded 7 pilot centers in the United States for testing and establishing protocols for the large-scale production of protein structures [12]. Most of those NIH centers will rely on X-ray crystallography. Unfortunately, NMR and X-ray protocols are still time consuming. The vast majority of NMR methods require NMR resonance assignments, whereas X-ray crystallography requires crystalline proteins and tedious phase determination [11]. NIH plans to determine 10,000 new protein structures in the next 10 years using the tools currently available [13]. The genomic scale (10^5 proteins for Homo sapiens), however, requires more structures to be classified. From this perspective, it is desirable to trade the superior resolution of current experimental techniques for a reduction in the time spent per structure whenever low-resolution structures are acceptable.
Selection of biologically interesting proteins for structure determination is a key issue. B. subtilis is an aerobic, rod-shaped bacterium usually found in soil, water sources and in association with plants. B. subtilis and similar bacteria are an important source of industrial enzymes such as amylases and proteases. The commercial interest in these bacteria arises from their capacity to secrete these enzymes at gram per liter concentrations. Consequently, protein secretion by this bacterium is studied for commercial applications [14]. Of the roughly 4,100 gene products in the B. subtilis genome, 42% have unknown function [3]. This percentage corresponds to the average value in most decoded genomes. There is 10% homology among B. subtilis proteins of unknown function and 29% homology with proteins of unknown function in other genomes. When proteins are homologous, their functions are potentially related. Such connectivity can be used for protein function determination. Therefore, structural information on one half of this 42% of B. subtilis proteins provides additional structural and functional information about the group as a whole. In turn, this is useful for understanding homologous proteins in other organisms.
Another interesting feature of B. subtilis is the abundance of small polypeptide open reading frames (SPORFs). SPORFs represent 8.4% of the B. subtilis genes, a higher proportion than in any other known organism. The SPORFs are also thought to be responsible for antimicrobial activity and for the action of internal signal transduction systems. Of the SPORF gene products, 81% have unknown function. From this point of view, and from that of the protein folding problem, SPORFs are the best choice for the discovery of a new fold in proteins of unknown function. There are also reasons to believe that many more active SPORFs have escaped detection and await discovery [4].
Literature Cited
1. Venter, J.C. et al., The Sequence of the Human Genome. Science, 2001. 291: p. 1304-1351.
2. Pedersen, A.G.; Jensen, L.J.; Brunak, S.; Stærfeldt, H-H. and Ussery, D.W., A DNA Structural Atlas for Escherichia coli. J. Mol. Biol., 2000. 299: p. 907-930.
3. Kunst, F. et al., The complete genome sequence of the Gram-positive bacterium Bacillus subtilis. Nature, 1997. 390(Nov. 20): p. 249-266.
4. Zuber, P., A peptide profile of the Bacillus subtilis genome. Peptides, 2001. 22: p. 1555-1577.
5. Altschul, S.F.; Madden, T.L.; Schaffer, A.A.; Zhang, J.; Zhang, Z.; Miller, W. and Lipman, D.J., Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 1997. 25: p. 3389-3402.
6. Altschul, S.F.; Gish, W.; Miller, W.; Myers, E.W. and Lipman, D.J., Basic local alignment search tool. J. Mol. Biol., 1990. 215: p. 403-410.
7. Asai, S. and Wada, Y., Technology Challenges for Integration Near and Below 0.1 µm. Proceedings of the IEEE, 1997. 85(4): p. 505-520.
8. Schulz, M., The end of the road for silicon? Nature, 1999. 399: p. 729-730.
9. Bonneau, R.; Tsai, J.; Ruczinski, I.; Chivian, D.; Rohl, C.; Strauss, C.E.M. and Baker, D., Rosetta in CASP4: Progress in ab initio protein structure prediction. Proteins: Structure, Function and Genetics, 2001. In press.
10. Baker, D. and Sali, A., Protein Structure Prediction and Structural Genomics. Science, 2001. 294: p. 93-96.
11. Prestegard, J.H.; Valafar, H. and Tian, F., Nuclear Magnetic Resonance in the Era of Structural Genomics. Biochemistry, 2001. 40: p. 8677-8685.
12. Norvell, J.C. and Machalek, A.Z., Structural genomics programs at the US National Institute of General Medical Sciences. Nat. Struct. Biol., 2000. 7: p. 931.
13. Terwilliger, T.C., Structural genomics in North America. Nat. Struct. Biol., 2000. 7: p. 935-939.
14. Harwood, C.R., Bacillus subtilis and its relatives: molecular biological and industrial workhorses. Trends Biotechnol., 1992. 10: p. 247-256. |