Advanced Course "From Data to Structures"
June 10-21 2002, Utrecht / Nijmegen

Instructions for the practical part:
1. Introduction
4. Structure calculations based on manually assigned input data using CNS
5. Structure calculations from manually assigned NOEs using ARIA
6. Automated NOE assignments and structure calculations using ARIA
A.Bonvin and C.Spronk, version June 15, 2000
1. Introduction

This practical is meant to familiarize you with the structure calculation and analysis methods used for the determination of NMR structures. A large number of programs and protocols are in use for this purpose. In this practical we will concentrate on one particular program, CNS (Crystallography and NMR System) from Axel Brünger's laboratory at Yale University (http://xplor.csb.yale.edu). CNS is the successor of the widely used program X-PLOR (see the web address above for more information), also developed at Yale. If you use CNS, please cite the following reference in any publication: "Crystallography and NMR system (CNS): a new software system for macromolecular structure determination". Brunger A.T., Adams P.D., Clore G.M., Delano W.L., Gros P., Grosse-Kunstleve R.W., Jiang J.-S., Kuszewski J., Nilges M., Pannu N.S., Read R.J., Rice L.M., Simonson T. and Warren G.L. Acta Cryst. D54, 905-921 (1998).

The interface of CNS is your web browser, and you will find that the parameters of the standard structure calculation protocols can easily be edited through this interface. A drawback of this approach is that the standard protocols may not work sufficiently well for specific structure calculation problems, in which case one has to edit the protocols at the level of the scripts. For most purposes, however, the methods used in this course are adequate and form a good basis for understanding structure calculations.

A very promising recent development in structure calculation is the program ARIA. ARIA (http://dodo.nmr.embl-heidelberg.de:9673/EMBL/Aria) can be used in combination with CNS for automated NOE assignment and structure calculation. In principle this method is faster, requires less manual intervention and allows a larger number of possible assignments per NOE. For these reasons we consider ARIA the way to do structure calculations in the future. This brings us to the main objectives of this practical.
These will be to compute the structure of the 62-residue lac repressor headpiece protein (HP62, hp62 or lac) using various structure calculation protocols in CNS, with manually assigned NOEs and chemical shifts as input data. Apart from varying the structure calculation protocols, which is only a minor part of this course, we will also look at the effect of different input data (e.g. including or excluding chemical shift data, using wrong assignments, wrong calibration of NOE intensities) on the structure calculation. After some training in the use of CNS and structure validation/visualization we will run ARIA to do various automatic NOE assignment and structure calculation protocols. At the end of the course we will compare the performance and efficiency of the various structure calculation protocols. Further, we will analyse and compare the quality of structures generated with the various protocols, using both manually and automatically assigned NOEs. Try to answer the questions you encounter throughout the instructions (they are indicated by Q:).

2. Practical Aspects

You will have to form teams consisting of two persons each. Each team will be given an account (course1, course2,...) and password. You will find in those accounts the NMR data that you will need for the structure calculations (in directory nmr_data_manual). Programs we will use during the course are:

Structure calculations (& analysis):
Various protocols for structure calculations can be used within CNS:
Structure analysis and quality:
PROCHECK http://www.biochem.ucl.ac.uk/~roman/procheck/procheck.html
PROCHECK_COMP http://www.biochem.ucl.ac.uk/~roman/procheck_comp/procheck_comp.html
WHAT IF http://www.sander.embl-heidelberg.de/whatif

Visualization:
MOLMOL http://www.mol.biol.ethz.ch/wuthrich/software/molmol
RASMOL http://www.umass.edu/microbio/rasmol

Note that it is important to take a look at the online manuals of PROCHECK and MOLMOL. Further, people who are not familiar with MOLMOL are strongly recommended to do the online tutorial of the program.

3. NOE restraints

Data have been collected on the HP62-DNA complex, which consists of two HP62 monomers and a symmetrical DNA sequence. Here we will focus on the structure determination of one HP62 monomer only, using data derived from the complex. The amino-acid sequence is:

MKPVTLYDVAEYAGVSYQTVSRVVNQASHVSAKTREKVEAAMAELNYIPNRVAQQLAGKQSL

Several experiments have been performed on the HP62-DNA complex: we will use NOE data from 4 different spectra obtained at 750 MHz with a mixing time of 100 ms:
In addition to NOE data we will be using carbon and proton chemical shifts to identify secondary structure elements in HP62 and convert the chemical shift information to dihedral angle restraints and to distance restraints for hydrogen-bonding patterns in alpha-helices. In case of the distance restraints we have created two types of input data for the structure determination:
Note about calibration of NOE intensities

When using ARIA the NOEs are calibrated automatically using statistical methods. NOE intensities can, however, also be calibrated manually using known proton-proton distances in proteins or nucleic acids and the approximate relationship:

NOE intensity = C / r^6

where r is the inter-proton distance and C is a constant that is determined by the calibration. For this one can use distances that are fixed by the covalent structure of the molecule under investigation (e.g. pairs of methylene protons or aromatic ring protons). However, the NOEs for these fixed distances suffer strongly from spin diffusion, so a calibration based on them is very inaccurate. (Note that manual calibration is always very qualitative.) It is better to use known distances in secondary structure elements that you have identified in the molecule under investigation. For secondary structure elements in proteins one can use table 7.1, page 127 of Wüthrich's NMR bible (Wüthrich, K. NMR of proteins and nucleic acids. Wiley-Interscience, New York (1986)). Here we have done the manual calibration for you based on sequential HN-HN (~2.8 Angstrom), Ha-HN (~3.5 Angstrom) and Ha-HN(i,i+3) (~3.4 Angstrom) NOEs in alpha-helices. From an average C-value in the formula above we then define classes into which the measured NOEs are subdivided according to their intensity. In this tutorial we have used classes defined as (which is part of the input of Regine2Xplor):
Note that these class definitions are quite arbitrary, but they work well and are used by many people. Some people add another 10 to 20% to these bounds as a conservative approach (reasoning that distances that are set too loose are better than distances that are set too tight). Others even set all measured NOEs to 6 Angstrom, regardless of the NOE intensity. Also note that we use no lower bounds (0 Angstrom)! The minimal distance of 1.8 Angstrom between two protons is satisfied in the structure calculation using a van der Waals term (see also the questions below).

A note on pseudo-atom corrections

Previously people often used so-called CENTER averaging for distance restraints, which means that for protons within a group (e.g. methylene protons, aromatic ring protons, valine/leucine methyl groups) for which no stereo-specific assignments are available the distance is taken to the geometric center of this group of protons. This method introduces a large uncertainty in the proton-proton distances, and so-called pseudo-atom corrections have to be added to the upper bound of the restraint. A physically more realistic treatment of the NOEs (although still not perfect) is the so-called SUM averaging method. This method does not use the geometric center of the protons within a group but sums the contributions of all inter-proton distances between the groups defined in the NOE distance restraint. The main advantage of this method is that NO pseudo-atom corrections have to be made on the distance restraints (they may even be tightened, see the reference below). In this tutorial we will only use the SUM averaging method. For an elaborate discussion of the different methods of treatment of NOEs see: Fletcher et al. J. Biomol. NMR., 8, 292-310 (1996).

Q: Can you think of physically more correct treatments of NOEs?
Q: What would happen in case two methyl groups are in direct contact and the NOE is treated as the SUM-average with distance bounds ranging from 1.8-2.8 Angstrom?
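The two notes above can be made concrete with a few lines of Python. This is only a minimal sketch, not the procedure used by Regine2Xplor or CNS: the function names are made up for illustration, the calibration simply averages C = I * r^6 over reference NOEs, and SUM averaging is written as R = (sum of r^-6)^(-1/6) over all proton pairs of the restraint.

```python
def calibration_constant(reference_pairs):
    """Average C = I * r**6 over (intensity, known_distance) reference NOEs."""
    values = [intensity * distance ** 6 for intensity, distance in reference_pairs]
    return sum(values) / len(values)


def distance_from_intensity(intensity, c):
    """Invert 'intensity = C / r**6' to obtain r = (C / intensity)**(1/6)."""
    return (c / intensity) ** (1.0 / 6.0)


def sum_averaged_distance(distances):
    """SUM averaging: effective distance (sum_i r_i**-6)**(-1/6) over all proton pairs."""
    return sum(r ** -6 for r in distances) ** (-1.0 / 6.0)


if __name__ == "__main__":
    # Hypothetical case: two methyl groups whose nine proton pairs are all 3.0 Angstrom apart.
    print(sum_averaged_distance([3.0] * 9))
```

For nine equal pair distances of 3.0 Angstrom the SUM-averaged distance is 3.0 * 9^(-1/6), about 2.08 Angstrom, i.e. clearly shorter than any individual proton-proton distance. Keep this in mind when thinking about the methyl-methyl question above.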
Q: Is the widely used lower bound of 1.8 Angstrom a realistic value?

4. Structure calculations based on manually assigned input data using CNS

The structure calculation and validation is basically done in four steps:
1) Generating a protein topology file based on the amino-acid sequence
2) Generating an extended protein starting structure
3) Simulated annealing starting from an extended structure
4) Checking of the calculated structures
These steps are explained in detail below. The protocol described below is the basic protocol that will be tested by all groups of students. As an exercise during the first part of the course, look at the scripts that are created at each of the four steps and try to understand the general idea behind them (write the procedure down schematically). This will be of use to you when in specific cases you have to deviate from the standard protocols and add procedures yourself.
1) Generating a protein topology file based on the amino-acid sequence In order to calculate molecular structures CNS first needs a molecular topology file (.mtf for molecular topology file or .psf for protein structure file: .psf is still a remainder from XPLOR) that contains all the information on bonds, angles, charges etcetera of the molecule under investigation. To generate the molecular topology file for hp62 we only need the amino-acid sequence which is written in the lac.seq file (located in your working directory).
In your web-browser go to the CNS-starting page (located under bookmarks:BNRA course) and go to
Click the "edit"-button of generate_seq.inp
Enter "lac.seq" at the protein sequence file entry
Enter "lac.psf" at the output file entry
Set all disulphide entries to "false"
Set the hydrogen flag on (we are doing NMR...)
Save the updated file as generate_seq.inp in your working directory
In the winterm run

cns < generate_seq.inp > generate_seq.out

This creates the lac.psf file and redirects the CNS log to generate_seq.out. Check the log for errors to see whether anything went wrong.
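Checking a long log by eye is error-prone, so a small helper can list the suspicious lines for you. The sketch below is an illustration only: it assumes that problems show up in the CNS log as lines containing the substring "ERR" (adjust the markers if your logs look different), and the function name is hypothetical.

```python
import os


def find_error_lines(log_lines, markers=("ERR",)):
    """Return (line_number, text) for every log line containing one of the markers."""
    hits = []
    for number, line in enumerate(log_lines, start=1):
        if any(marker in line for marker in markers):
            hits.append((number, line.rstrip()))
    return hits


if __name__ == "__main__":
    log_name = "generate_seq.out"  # the log written by the cns command above
    if os.path.exists(log_name):
        with open(log_name) as handle:
            for number, text in find_error_lines(handle):
                print(number, text)
```

An empty result does not prove that the run was fine; it only means that none of the assumed markers occurred, so still look at the end of the log.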
2) Generating an extended protein starting structure From the molecular topology file we can now generate a starting structure in an extended conformation. This structure is a well energy-minimized extended structure. The energy minimization is done in order to have a good definition of bond-lengths and bond-angles which are kept constant during Torsion Angle Dynamics (TAD). For Cartesian dynamics this is less critical since bond-lengths and angles can vary during this type of structure calculation. Go to
Edit generate_extended.inp
Enter "lac.psf" at the structure file entry
Enter "lac_extended.pdb" at the output coordinates entry
In the winterm run

cns < generate_extended.inp > generate_extended.out

Take a quick look at lac_extended.pdb:

rasmol lac_extended.pdb
3) Simulated annealing starting from an extended structure We are ready to run the structure calculation now using a simulated annealing protocol. ( The name of this run is run_std_hbonds; see also the scheme below for the other runs) The standard protocol which everyone will run is now described Go to
Edit anneal.inp:
Enter "lac.psf" at the structure file entry
Enter "lac_extended.pdb" at the input coordinates entry
Select Cartesian at the type of molecular dynamics for hot phase entry
Select Cartesian at the type of molecular dynamics for cool phase entry
Enter 20 at the number of trial or accepted structures entry
Use 2000K, 10000 md-steps, NOE-scale=50, md-timestep=0.005 and temperature-step=25 for high temperature annealing and the first slow-cooling stage, use VDW scale factor=4 for the first cooling stage
Use 1000K, 3000 md-steps, NOE-scale=50, md-timestep=0.005 and temperature-step=25 for the second slow-cooling stage
Use initial VDW scale factor=1, final VDW scale factor=4 for the second cooling stage
Use NOE-scale=50, 200 minimization steps and 10 minimization cycles for the minimization stage
Enter ./nmr_data_manual/unambig.tbl for restraint set 1 file
Enter ./nmr_data_manual/ambig.tbl for restraint set 2 file
Enter ./nmr_data_manual/hbonds_csi.tbl for hydrogen-bond distance restraints file
Empty all the 3J-bond coupling data, 1J-bond coupling data, chemical shift data and other restraint data file entries (you only need to empty the file entry fields!)
You don't need to enter the base name for input coordinate files
Enter lac_anneal as the base name for output coordinate files
Save the updated file as anneal.inp
In the winterm run

cns < anneal.inp > anneal.out &

This will take a little while.... Afterwards, take a look at the lac_anneal_*.pdb files.
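Once the three input files have been prepared, steps 1 to 3 can also be chained so that they run unattended. The following is a minimal sketch, assuming the executable is called cns and is on your path, and assuming the input files carry the names used above; the helper names are made up for illustration.

```python
import subprocess

# Steps 1-3 in the order in which they have to be run.
STEPS = ["generate_seq", "generate_extended", "anneal"]


def step_files(step):
    """Return the (input, log) file names used for one step."""
    return step + ".inp", step + ".out"


def run_steps(steps=STEPS, executable="cns"):
    """Run 'cns < step.inp > step.out' for each step and stop at the first failure."""
    for step in steps:
        input_name, log_name = step_files(step)
        with open(input_name) as stdin, open(log_name, "w") as stdout:
            result = subprocess.run([executable], stdin=stdin, stdout=stdout)
        if result.returncode != 0:
            raise RuntimeError("step %s failed, see %s" % (step, log_name))
```

Call run_steps() from your working directory. Whether CNS returns a non-zero exit status for every kind of problem is an assumption here, so check the logs as well.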
4) Structure analysis and quality assessment The calculated structures now have to be checked whether they correspond to the experimental input data and whether they have good local structure (e.g. bond-lengths, bond-angles, peptide-bond planarity) and stereochemistry (e.g. Ramachandran plots for proteins, sidechain-conformations etc.) To check the violations of experimental input data and the quality of local structure we can use the CNS accept.inp script: Go to
Edit accept.inp IMPORTANT: Make sure the accept script input is consistent with what you have used in the simulated annealing. Use lac_extended.pdb as the reference structure For the atom selection used for the calculation of the average structure use: (name n or name ca or name c) and resid 5:45 Calculate the average structure only for accepted structures
Apart from making the accept.inp input consistent with the anneal.inp input, leave everything unchanged (of course you can play around with the acceptance criteria as you like) Use as base name for input coordinate file(s): lac_anneal Use as base name for output coordinate file(s): lac_accept Save the updated file as accept.inp
In the winterm run

cns < accept.inp > accept.out &

The program has written the accepted structures, the energy-minimized average of the accepted structures (averaged on residues 5-45, backbone heavy atoms only) and an RMSD table (*_rms_*.dat file). In order to view all accepted structures at once we will join them into one file (-o for output filename):

joinpdb -o accept_all.pdb lac_accept_[1-8].pdb

(This will join lac_accept_1.pdb to lac_accept_8.pdb into accept_all.pdb.) To avoid lots of error messages in PROCHECK we also need to rename some atoms in the pdb file using the nawk script patchXPLORpdb:

patchXPLORpdb accept_all.pdb > acceptp_all.pdb

Note that since PROCHECK_COMP reads in individual structures, you should run patchXPLORpdb on every accepted structure individually to avoid error messages in PROCHECK. We will use MOLMOL and PROCHECK_COMP for visualization and further analysis of the generated structures. Inputs are the final ensembles of structures. To run PROCHECK_COMP create a file, e.g. lac.lis, containing the names of the structures you wish to analyse, e.g.:

lac_accept_1.pdb
lac_accept_2.pdb
lac_accept_3.pdb
...

To run PROCHECK_COMP then just type in your winterm:
procheck_comp lac.lis

PROCHECK_COMP generates PostScript files that can best be viewed using xpsview on SGI (type xpsview <prefix>*.ps) or Ghostview on Linux (type gs <prefix>*.ps), and that can be printed for your presentation. Note that PROCHECK_COMP can also be used to compare the quality of structures obtained with various protocols. Below is a list of properties of your structures that should always be presented, and that you should incorporate in your presentation at the end of the course. These include:
Q: What comments could you make on these last criteria?

Other parameters included in the structural statistics are mere variations on these themes. In fact, there are a few main qualities to which every structure analysis can be reduced.

Q: Which are these qualities?

Now that we have checked the ensembles for violations, we can consider the properties of their average structures. We have used the accept.inp script of CNS to calculate the energy-minimized average structure. Inspect it, see if you understand how it works and calculate a non-energy-minimized average structure for comparison with the energy-minimized structure.

Q: How would you decide which part of the structure should be used for the calculation of the average?
Q: Why do we first have to superimpose the ensemble on the region used for averaging?
Q: Determining an average structure can be less trivial than it seems. Under which circumstances would you expect the largest problems? How could you deal with these?

Note: The generation and meaning of an average structure is an important problem. Consider for example what would happen in case of conformational averaging (say a loop exists in two conformations); how good would a structure obtained by a simple positional averaging procedure be? Methods exist to deal with such problems. Check for example http://www.nmr.chem.uu.nl/~abonvin/ensemble_poster.html

Often an average structure is deposited into the Protein Data Bank (http://www.rcsb.org/pdb) as a representative structure of the ensemble of NMR structures. A better and more physical choice would be to deposit the structure closest to the average structure as the representative of the ensemble of NMR structures. The accept.inp script of CNS also determines the rmsd per residue. We can inspect these by looking at the *_rms_*.dat file. Visualize this by loading it into Xmgr.
[choose `Read block data' from the menu; select the .dat file; set X from column 1 and Y from column 2; press Accept; press the autoscale button just below the Draw button in the main window. Other options may be selected from the menu to change tick labels, add titles etc.] In addition, we can check the `stereochemical quality' of the non-energy minimized average using PROCHECK. Q: How does it compare to that of the individual structures of the ensemble? and how to the energy minimized average? As you can see the average structure is energy-minimized to `repair' any structural anomalies. Take a look at the energy minimization part of the CNS script and identify all the terms that are used in the forcefield. In case the stereochemical quality of the energy-minimized structure is not satisfactory, one can increase the number of minimization steps to try to improve the quality of the minimized average structure.
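Why a positionally averaged structure needs `repair' can be shown with a toy example. The sketch below is an illustration under simple assumptions, not the averaging code of accept.inp: the models are taken to be superimposed already, and the two-atom "structure" and its 1.5 Angstrom bond are made up for the example.

```python
import math


def average_coordinates(models):
    """Per-atom mean position over models that are already superimposed.

    Each model is a list of (x, y, z) tuples with the atoms in the same order.
    """
    count = len(models)
    n_atoms = len(models[0])
    return [
        tuple(sum(model[i][axis] for model in models) / count for axis in range(3))
        for i in range(n_atoms)
    ]


def two_state_example(bond=1.5, angle_degrees=60.0):
    """Two conformations of a bonded atom pair that differ only in bond direction."""
    angle = math.radians(angle_degrees)
    up = [(0.0, 0.0, 0.0), (bond * math.cos(angle), bond * math.sin(angle), 0.0)]
    down = [(0.0, 0.0, 0.0), (bond * math.cos(angle), -bond * math.sin(angle), 0.0)]
    return [up, down]


if __name__ == "__main__":
    mean = average_coordinates(two_state_example())
    print(math.dist(mean[0], mean[1]))  # bond length in the averaged structure
```

With the two conformations 120 degrees apart, the bond in the averaged coordinates shrinks from 1.5 to 0.75 Angstrom, although neither conformation is distorted. The same effect, on a smaller scale, is what the energy minimization of the average structure has to correct.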
Now that you have been introduced to structure calculation and validation, you are required to run at least three different protocols in addition to the standard reference protocol described above in the instructions. We will use these runs for evaluation at the end of the course. The scheme is:
Try to set up the calculations in series overnight on one machine, so that you are ready to analyse all of the runs the next day. Another interesting exercise (which takes very little time) is to calculate the structures using the default values in the anneal.inp script. You can do this for both Cartesian and Torsion Angle Dynamics.

Note: Since several groups will be running the same protocols, change the random seed number in anneal.inp to avoid obtaining exactly the same results.

Analyse and compare the resulting structures. Consider the following questions:

Q: How does Torsion Angle Dynamics compare to Cartesian dynamics in terms of cpu-time (Hint: look at the last lines of the CNS output files) and numbers of accepted structures?
Q: Does the use of chemical shift data (translated to hydrogen bond- and/or dihedral angle restraints) improve the structures in terms of quality? And in terms of efficiency (=number of accepted structures)?
Q: What happens when using the conservative approaches using only distance-restraints of 4.5 or 5.5 Angstrom? What do you think of this method?
Q: How do the structures calculated using the default input parameters for the anneal.inp script compare to those obtained in the standard run? Elaborate on this. What is your conclusion?

5. Structure calculation using ARIA scripts and manually assigned NOE data
The main advantage of ARIA is its capability to assign NOEs automatically in raw NMR data. However, the program also allows the use of manually assigned and calibrated NOE data, or combinations of manual and automatic assignments. We will start off here with manually assigned NOE data only as input for the ARIA scripts and do some variations on the input data (as in the previous sections). The basic run (run1) that all students will do is described in this section.

First open the ARIA page in your bookmarks:
Choose: ARIA.tbl format (already calibrated data)
Read the instructions carefully and enter:
Do NOT modify the path to the current ARIA program directory!
Path of the new project
run number (1)
unambiguous datalist (unambig.tbl)
ambiguous datalist (ambig.tbl)
Residual dipolar couplings 1 (lac_rdc.tbl)
sequence file (lac.seq)
chemical shift index file (lac_csi.out)
CSI dihedral restraints errors: 30
Make sure that all fields that are not used are blank!!
Now save the file as new.html
Run aria in the winterm:

aria

This will create a directory tree necessary for the calculations under ./run1/
Before we continue we need to provide ARIA with a file for stereo-specific assignments. In principle ARIA can make these assignments using a swapping routine (which swaps atom names) in the course of the calculation. However, when stereo-specific assignments have been obtained experimentally, these should of course be used. In the case of hp62 we have stereo-specific assignments for all leucine and valine prochiral methyl groups. To use them do the following:

cp ./nmr_data_manual/stereoassign.cns ./run1/data/sequence/

The next thing we have to do is to edit the ./run1/run.cns file to define which protocol we will use. To do so go back to the main ARIA page and enter the location of the run.cns file. First we have to enter/check some file and directory names:
Project name
Project directory
Run directory
Template pdb file
Protein structure file
Sequence file
Now only change the fields listed below:
Check the run_number in the readout directory
Set the use of CSI derived hbonds restraints to TRUE
Set the type of non-bonded parameters to PARALLHDG
Do you want to include dihedral angle energy terms: false
Use for the queuing command: csh
Set the number of jobs to 1
Use for the cns executable the absolute path of cns (type: "which cns" in the winterm to see this).
For the SA protocol use:
Type of molecular dynamics: Cartesian
Initial temperature for TAD: 2000
Initial temperature for TADCartesian dynamics: 2000
Final temperature after first cooling step: 1000
Final temperature after second cooling step: 50
Cartesian time step: 0.005
Factor for timestep and number of steps in TAD: 9
Initial number of MD steps: 10000
Number of MD steps for refinement: 2000
Number of MD steps during first cooling stage: 10000
Number of MD steps during second cooling stage: 3000
First iteration for cartesian refinement (after TAD): 0
Use the water refinement for the last iteration and select 10 structures
Save the updated run.cns file (overwrite)
In the winterm go to ./run1/, run aria and redirect the output to a log file:

aria1.1 >& aria.log &

Your tasks: As for the CNS structure calculations, you are required to run, next to the standard protocol described above (run1), at least one other structure calculation protocol using ARIA scripts and manually assigned data. The scheme is:
6. Structure calculation using ARIA and automatic assignment of NOE data

The final structure calculation protocol of this course will use the automated assignment possibilities in ARIA. The basic run (run1) that all students will do is described in this section. On the ARIA starting page select:
Regine format (uncalibrated data)
Edit the directory and file names:
Current ARIA program directory:
Path of the new project:
Run number:
For the spectra use the following input:

SPECTRUM 1:
Name: 15N
Peaks file: lac_15N-NOE-HSQC.pks
Shifts file: lac.shifts
PPMD for Heteronucleus 1: 0.5
PPMD for Proton 1: 0.05
PPMD for Heteronucleus 2: (leave empty)
PPMD for Proton 2: 0.05

SPECTRUM 2:
Name: 13C
Peaks file: lac_13C-NOE-HSQC.pks
Shifts file: lac.shifts
PPMD for Heteronucleus 1: 0.5
PPMD for Proton 1: 0.05
PPMD for Heteronucleus 2: (leave empty)
PPMD for Proton 2: 0.05

SPECTRUM 3:
Name: 15N15N
Peaks file: lac_15N-HMQC-NOE-HSQC.pks
Shifts file: lac.shifts
PPMD for Heteronucleus 1: 0.5
PPMD for Proton 1: 0.05
PPMD for Heteronucleus 2: 0.5
PPMD for Proton 2: 0.05

SPECTRUM 4:
Name: 2D
Peaks file: lac_2DNOE_H2O.pks
Shifts file: lac.shifts
PPMD for Heteronucleus 1: (leave empty)
PPMD for Proton 1: 0.05
PPMD for Heteronucleus 2: (leave empty)
PPMD for Proton 2: 0.05

Sequence file: lac.seq
Residual dipolar couplings 1 (lac_rdc.tbl)
Chemical shift index file: lac_csi.out
CSI dihedral restraints errors: 30
Make sure that all fields that are not used are blank!!
Save the file as new.html
Run aria in the winterm:

aria

As in the previous exercises this will create a directory tree necessary for the calculations under ./run1/
Copy the stereo-specific assignments for all leucine and valine prochiral methyl groups into the run directory:

cp ~/nmr_data_manual/stereoassign.cns ./run1/data/sequence/

The next thing we have to do is to edit the ./run1/run.cns file to define which protocol we will use. To do so go back to the ARIA starting page and do the editing. Again, enter and check file and directory names.
Check that the CNS to IUPAC conversion is set to true
Now only change the fields listed below:
Set the use of hydrogen bond restraints to false
Set the use of dihedral restraints to false
Set the use of CSI derived hbond restraints to true
Set the use of CSI derived dihedral restraints to true
Set the type of non-bonded parameters to PROLSQ
Do you want to include dihedral angle energy terms: true
For the spectra parameters: set qrelax to false for all spectra
For these calculations we will not use the parallel jobs option:
Use for the queuing command: csh
Use for the cns executable the absolute path of cns (type: which cns in the winterm to see this).
Use 1 job at a time

For the SA protocol use:
Type of molecular dynamics: Cartesian
Initial temperature for TAD: 2000
Initial temperature for TADCartesian dynamics: 2000
Final temperature after first cooling step: 1000
Final temperature after second cooling step: 50
Drop in temperature (K) per cycle of dynamics: 50
Cartesian time step: 0.003
Factor for timestep and number of steps in TAD: 9
Initial number of MD steps: 10000
Number of MD steps for refinement: 2000
Number of MD steps during first cooling stage: 5000
Number of MD steps during second cooling stage: 2000
First iteration for cartesian refinement (after TAD): 0
Use the water refinement for the last iteration and select 10 structures
Save the updated run.cns file (overwrite)

In the winterm go to ./run1/ and run aria:

aria1.1 >& aria.log &

Since we are running aria on a single processor, this will take quite some time (about 2 to 3 days).

Your tasks: Next to the standard protocol described above you are required to run one additional protocol. The second run will however be performed in Nijmegen. The scheme is:
A number of scripts to compare the manually assigned peaks with the automatic assignment results from ARIA are provided in the ./nmr_data_aria/compare directory. To run the comparison, copy the entire compare directory into the directory where the run directories (e.g. run1) are present. Edit the run_comparison.csh script and change the run number (defined by the variable ir), e.g.:

# define aria run number
set ir=3

and then run the comparison by typing in your winterm:

./run_comparison.csh

This will create a run directory containing results for each spectrum and ARIA iteration. Check the .out files and compare the number of assignments as a function of the iteration number. How many assignments are identical and how many are different? How do the various protocols compare? Gather this information from the other teams as well to obtain an overview of the various protocols and decide which setup was the best one in terms of correct/incorrect assignments.

Also compare the ARIA generated lac structures with the ones obtained from manual assignments: how close are they? What are the differences (e.g. rmsd, secondary structure, stereochemical quality)? Use for the comparison the structures obtained with the standard reference run in CNS.

Note: You can use molmol to calculate the rmsd between two structures. Another possibility is to use the program profit. The following example calculates various rmsd values between a reference structure (lac_std.pdb) and another structure (lac_aria.pdb) (the comments after ! are explanations and should not be used as commands within profit):
profit
reference lac_std.pdb ! reads in the reference structure
zone 5-45 ! defines zone in residue numbers for the fitting
atoms CA,C,N ! defines the atoms used for fitting
mobile lac_aria.pdb ! reads in the structure to be fitted
fit ! performs the fitting. You can write out the fitted
! structure if you wish with the command write "filename"
rzone 1-62 ! calculates rmsd for the entire backbone
! To define a new zone (or rzone) first clear the old
! definition with zone (or rzone) clear
ratoms ^H* ! calculates rmsd for all heavy atoms (H not included)
quit
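To see what the numbers reported for such a zone mean, the rmsd can also be computed by hand. The sketch below is only an illustration: it does no fitting, so it assumes the two structures are already superimposed (for example a fitted structure written out with the write command), it assumes standard fixed-column PDB ATOM records, and the function names are made up.

```python
import math


def backbone_coordinates(pdb_lines, first=5, last=45, names=("N", "CA", "C")):
    """Collect {(residue number, atom name): (x, y, z)} for the selected zone."""
    coords = {}
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue
        name = line[12:16].strip()   # atom name, PDB columns 13-16
        resid = int(line[22:26])     # residue number, PDB columns 23-26
        if name in names and first <= resid <= last:
            coords[(resid, name)] = (
                float(line[30:38]), float(line[38:46]), float(line[46:54])
            )
    return coords


def rmsd(coords_a, coords_b):
    """Root-mean-square distance over the atoms present in both selections."""
    shared = sorted(set(coords_a) & set(coords_b))
    if not shared:
        raise ValueError("no atoms in common")
    total = sum(math.dist(coords_a[key], coords_b[key]) ** 2 for key in shared)
    return math.sqrt(total / len(shared))
```

Read the lines of both PDB files, pass them to backbone_coordinates and compare the two selections with rmsd. Without a prior fit the value also contains the overall displacement of the two structures, so it is only comparable to the profit output after superposition.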
In this part of the course you will be made more familiar with some widely used tools to check the quality of protein structures. Here we will distinguish between the quality in terms of how well the structures represent (fit) the experimental data, and the quality in terms of local and overall geometry.
procheck accept_1.pdb 2
Explanation: PROCHECK compares accept_1.pdb to the averages for X-ray structures of 2 Angstrom resolution
You will get a number of files summarizing the results. Inspect the postscript files and go through the .out and .sum file. Try to understand the output.
Now do an analysis of the complete ensemble using PROCHECK_NMR and look at the differences in PROCHECK and PROCHECK_NMR outputs.
Important note: people often use only the so-called well-defined regions of their structures in the PROCHECK_NMR analysis. This improves the apparent quality of the structures, especially for the Ramachandran plot. But that is only aesthetics, and you must be careful about how you present such analyses. It should not be used to hide possible problems in less well-defined regions.
Now we will do some more critical checks using WHAT IF.
whatif
Explanation: Start whatif
dolog
Explanation: This writes the screen-output to a log file. Enter a filename and leave the request for text blank
SETWIF 593 100000
Explanation: This command makes the output of whatif very long, so you will see all errors
%fulchk y accept_1.pdb accept_1.pdb
Explanation: This performs all the checks on "accept_1.pdb"
n
Explanation: Don't exit yet
nolog
Explanation: Close the log file
exit
Explanation: Now exit WHAT IF
You can now read the report in the log file. Go through it and try to understand the checks that have been performed. Pay special attention to:
Exercise:
Now, just to make you feel better about your own structure and to illustrate the use of validation, have a look at structures 1i1s and 1ka3 (take any member of the ensembles in /home1/coursead/1i1s/ and /home1/coursead/1ka3/).
Now that you have become familiar with the checking software we want to get some statistics on the structure ensembles you have calculated in the first part of the course. To keep you from having to run all checks on individual structures I have written a script that does it for you and provides the statistics in a summary file. The script only runs on machines obiof1, obiof10 and obiof11 (sorry, don't blame me). So we have to divide the jobs a little (ask me!).
Do the following:
Move the different ensembles you calculated into separate directories. Note that the structures should not be joined into one file but kept separately. All structures should be numbered sequentially as in xxx_1.pdb xxx_2.pdb etc.:
mkdir lac_test
mv accept*pdb lac_test
cd lac_test
Create a ZONES (=filename) file in this directory.
The ZONES file should contain the regions you want to use for superposition of the structures in the ensemble. It should look something like:
Now run the script (login to one of the abovementioned computers):
~/NMR_VALIBASE/nmr_valibase.py -nmr_check filebase no_residues
Explanation: filebase = base name of the files e.g. "accept_" or whatever name you have.
no_residues = number of residues in the sequence (=62 for lac headpiece)
The script should be run on each ensemble you calculated. The statistics you will get for each ensemble can now be used for comparison of the different calculation protocols. Your final task is now to do this comparison and to prepare figures for the presentation. Try to illustrate the results using both the PROCHECK and WHAT IF results. For example, it will be very illustrative for the understanding of the "abstract" Z-scores for the Ramachandran and Chi1-Chi2 rotamer normality if you compare them to the plots generated by PROCHECK. Try to find the correlation between the Z-score and the percentages in the most favoured regions of the Ramachandran plot. Do something similar for the Chi1-Chi2 distributions (use the plots). If you do not have enough data to get a correlation, then use the results from the other groups.
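For the correlation between Z-scores and PROCHECK percentages, a plain Pearson correlation coefficient is a reasonable first look. The sketch below is a generic implementation and not part of the validation script; the function name is chosen for illustration, and you supply your own values (one pair per ensemble or per structure).

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long lists of numbers."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least two values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("one of the lists has no variation")
    return sxy / math.sqrt(sxx * syy)
```

Pass the Ramachandran Z-scores as xs and the percentages of residues in the most favoured regions as ys. With only a handful of ensembles the coefficient is not very meaningful, which is one more reason to pool the results of several groups.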
Good luck!!
Every group has to present some of the results that were obtained during the course. The time for one presentation is 10-15 minutes, followed by questions and discussion (~5 minutes). Since there is only limited time for the presentation, try to present your results in a compact manner. Discuss the essential differences (e.g. acceptance scores, quality parameters, notable differences in global fold) in the results of the structure calculations using the various protocols and various input datasets.