Advanced Course

"From Data to Structures"

June 10-21 2002, Utrecht Nijmegen

Instructions for the practical part:

A.Bonvin and C.Spronk, version June 15, 2000
P.Fuchs and B. van Buuren, update June, 2002
C.Spronk, Added structure validation section, June 2002


1.1 Introduction

This practical is meant to familiarize you with structure calculation and analysis methods in use for the determination of NMR structures. A large number of programs and protocols are being used for this purpose. In this practical we will however concentrate on the use of one particular program, CNS (Crystallographic and NMR System) from Axel Brünger's laboratory at Yale university ( CNS is the follow-up of the widely used program X-PLOR (see www address above for more information) also developed at Yale. If using CNS, please cite the following reference in any publication:

"Crystallography and NMR system (CNS): a new software system for macromolecular structure determination". Brunger A.T., Adams P.D., Clore G.M., Delano W.L., Gros P., Grosse-Kunstleve R.W., Jiang J.-S., Kuszewski J., Nilges M., Pannu N.S., Read R.J., Rice L.M., Simonson T. and Warren G.L. Acta Cryst. D54, 905-921 (1998).

The interface of CNS is your web-browser and you will find that editing of parameters for standard structure calculation protocols can be easily performed using this interface. A drawback of this method is of course that the standard protocols may not work sufficiently well for specific structure calculation problems and one may have to edit the protocols at the level of the scripts. However, for most purposes the methods used in this course are fine and form a good basis for understanding structure calculations.

A very promising recent development in structure calculation is the program ARIA. ARIA ( can be used in combination with CNS for automated NOE assignment and structure calculation. In principle this method is faster, with less manual intervention and allows for higher numbers of possible assignments for NOEs. For these reasons ARIA is, according to us, the way to do structure calculations in the future.

This brings us to the main objectives of this practical. These will be to compute the structure of the 62-residue lac repressor headpiece protein (HP62, hp62 or lac) using various structure calculation protocols in CNS with manually assigned NOEs and chemical shifts as input data. Apart from varying the structure calculation protocols, which is only a minor part of this course, we will also look at the effect of different input data (i.e. including/excluding chemical shift data, using wrong assignments, wrong calibration of NOE-intensities etc) on the structure calculation. After some training in the use of CNS and structure validation/visualization we will run ARIA to do various automatic NOE assignment and structure calculation protocols.

At the end of the course we will compare the performance and efficiency of the various structure calculation protocols. Further, we will analyse and compare the quality of structures generated with the various protocols, using both manually and automatically assigned NOEs.

Try to answer the questions you encounter throughout the instructions (they are indicated by Q:).

2. Practical Aspects

You will have to form teams consisting of two persons each. Each team will be given an account (course1, course2,...) and password. You will find in those accounts the NMR data that you will need for the structure calculations (in directory nmr_data_manual).

Programs we will use during the course are:

Structure calculations (& analysis):

  • CNS
  • ARIA

Various protocols for structure calculations can be used within CNS:

  • distance geometry combined with simulated annealing
  • simulated annealing in Cartesian space starting from an extended structure
  • simulated annealing in torsion angle space. (This protocol is in principle similar to the one in Cartesian space, but instead of using all degrees of freedom (including bond lengths and angles) only dihedral angles are considered. This considerably decreases the total number of degrees of freedom and can result in a more efficient search of conformational space. In addition the ratio of observations/parameters is increased.)
Here we will only test the latter two options using different input data-sets.

Structure analysis and quality:







Note that it is important to take a look at the online manuals of PROCHECK and MOLMOL. Further, people who are not familiar with MOLMOL are strongly recommended to do the online tutorial of the program.

3. NOE restraints

Data have been collected on the HP62-DNA complex, which consists of two HP62 monomers and a symmetrical DNA sequence. Here we will focus on the structure determination of one HP62 monomer only, using data derived from the complex. The amino-acid sequence is:


Several experiments have been performed on the HP62-DNA complex: we will use NOE data from 4 different spectra obtained at 750 MHz with a mixing time of 100 ms:

  • 2D NOESY

In addition to NOE data we will be using carbon and proton chemical shifts to identify secondary structure elements in HP62 and convert the chemical shift information to dihedral angle restraints and to distance restraints for hydrogen-bonding patterns in alpha-helices.

In case of the distance restraints we have created two types of input data for the structure determination:

  1. The manual input restraints were created from NOE-data assigned and integrated in REGINE4. These data were converted to XPLOR-type restraints using the Regine2Xplor program and are ready for use in CNS.

    These datafiles are:

    • unambig.tbl for unambiguously assigned restraints

    • ambig.tbl for ambiguously assigned restraints

  2. A second set of input data are the raw- (integrated) peakfiles and a list of the chemical shifts of the different resonances in HP62. This combination will be used by ARIA to assign the NOEs in the peakfiles automatically in an iterative manner.

Note about calibration of NOE intensities

When using ARIA the NOEs are calibrated automatically using statistical methods. However NOE intensities can be calibrated manually using known proton-proton distances in proteins or nucleic acids and the approximate relationship:

    NOE-intensity = C / r6.

Where r is the inter-proton distance and C is constant that can be used for calibration. For this one can use distances that are fixed by the covalent structure of the molecule under investigation (e.g. pairs of methylene protons or aromatic ring protons). However, these fixed distances always suffer a lot from spin-diffusion and are very inaccurate. (Note that manual calibration is always very qualitative). It is better to use known distances in secondary structure elements that you have identified in your molecule under investigation. For secondary structure elements in proteins one can use table 7.1, page 127 of Wüthrichs NMR-bible (Wüthrich, K. NMR of proteins and nucleic acids. Wiley-Interscience, New York (1986)).

Here we have done the manual calibration for you based on sequential HN-HN (~2.8 Angstrom, Ha-HN (~3.5 Angstrom) and Ha-HN(i,i+3) (~3.4 Angstrom) NOEs in a-helices. From an average C-value in the abovementioned formula we then define classes in which the measured NOEs are subdivided according to their intensity.

In this tutorial we have used classes defined as (which is part of the input of Regine2Xplor):

  • 0-2.8 Angstrom
  • 0-3.5 Angstrom
  • 0-4.5 Angstrom
  • 0-5.5 Angstrom

Note that these class-definitions are quite arbitrary, but they work well and are used by many people. Some people add another 10 to 20% to these bounds as a conservative approach (reasoning that distances that are set too loose are better than distances that are set too tight). Others even set all measured NOEs to 6 Angstrom, regardless of the NOE intensity.

Also note that we use no lower bounds (0 Angstrom)! The minimal distance of 1.8 Angstrom between two protons is satisfied in the structure calculation using a van der Waals term (see also questions below).

A note on pseudo-atom corrections

Previously people often used so-called CENTER-averaging for distance restraints, which means that for protons within a group (e.g. methylene, aromatics ring protons, valine/leucine methyl groups) for which no stereo-specific assignments are available the distance is taken to the geometric center of this group of protons. This method introduces a large uncertainty on the proton-proton distances and so-called pseudo-atom corrections have to be added to the upper-bound of the restraint.

A physically more realistic treatment (although still not perfect) of the NOEs is done with the so-called SUM-averaging method. This method does not use the geometric center of the protons within a group but sums the contributions of all inter-proton distances between the groups defined in the NOE-distance restraint. The main advantage of this method is that NO pseudo-atom corrections have to be made on the distance restraints (they may even be tightened, see reference below). In this tutorial we will only use the SUM averaging method. For an elaborate discussion of the different methods of treatment of NOEs see:

Fletcher et al. J. Biomol. NMR., 8, 292-310 (1996).

Q: Can you think of physically more correct treatments of NOEs?

Q: What would happen in case two methyl groups are in direct contact and the NOE is treated as the SUM-average with distance bounds ranging from 1.8-2.8 Angstrom? Is the widely used lower bound of 1.8 Angstrom a realistic value?

4. Structure calculations based on manually assigned input data using CNS

The structure calculation and validation is basically done in four steps:

    1) Generating a protein topology file based on the amino-acid sequence

    2) Generating an extended protein starting structure

    3) Simulated annealing starting from an extended structure

    4) Checking of the calculated structures

which will be explained in detail below. The protocol described below will be the basic protocol tested by all groups of students. As an exercise during the first part of the course look at the scripts that are created at each of the four steps and try to understand what is the general idea behind it (write the procedure down schematically). This will be of use to you when in specific cases you may have to deviate from the standard protocols and add procedures yourself.


1) Generating a protein topology file based on the amino-acid sequence

In order to calculate molecular structures CNS first needs a molecular topology file (.mtf for molecular topology file or .psf for protein structure file: .psf is still a remainder from XPLOR) that contains all the information on bonds, angles, charges etcetera of the molecule under investigation. To generate the molecular topology file for hp62 we only need the amino-acid sequence which is written in the lac.seq file (located in your working directory).


In your web-browser go to the CNS-starting page (located under bookmarks:BNRA course) and go to

  • input files
    • general

Click the "edit"-button of generate_seq.inp

Enter "lac.seq" at the protein sequence file entry

Enter "lac.psf" at the output file entry

Set all disulphide entries to "false"

Set the hydrogen flag on (we are doing NMR...)

Save the updated file as generate_seq.inp in your working directory


In the winterm run

cns < generate_seq.inp > generate_seq.out

this creates the lac.psf file and redirects the CNS log to generate_seq.out. Check the log for errors to see if anything goes wrong.


2) Generating an extended protein starting structure

From the molecular topology file we can now generate a starting structure in an extended conformation. This structure is a well energy-minimized extended structure. The energy minimization is done in order to have a good definition of bond-lengths and bond-angles which are kept constant during Torsion Angle Dynamics (TAD). For Cartesian dynamics this is less critical since bond-lengths and angles can vary during this type of structure calculation.

Go to

  • input files
    • NMR

Edit generate_extended.inp

Enter "lac.psf" at the structure file entry

Enter "lac_extended.pdb" at the output coordinates entry


In the winterm run

cns < generate_extended.inp > generate_extended.out

Take a quick look at the lac_extended.pdb

rasmol lac_extended.pdb


3) Simulated annealing starting from an extended structure

We are ready to run the structure calculation now using a simulated annealing protocol.

( The name of this run is run_std_hbonds; see also the scheme below for the other runs)

The standard protocol which everyone will run is now described

Go to

  • input files
    • NMR

Edit anneal.inp:

Enter "lac.psf" at the structure file entry

Enter "lac_extended.pdb" at the input coordinates entry

Select Cartesian at the type of molecular dynamics for hot phase entry

Select Cartesian at the type of molecular dynamics for cool phase entry

Enter 20 at the number of trial or accepted structures entry

Use 2000K, 10000 md-steps, NOE-scale=50, md-timestep=0.005 and temperature-step=25 for high temperature annealing and the first slow-cooling stage, use VDW scale factor=4 for the first cooling stage

Use 1000K, 3000 md-steps, NOE-scale=50, md-timestep=0.005 and temperature-step=25 for the second slow-cooling stage

Use initial VDW scale factor=1, final VDW scale factor=4 for the second cooling stage

Use NOE-scale=50, 200 minimizaton steps and 10 minimization cycles for the minimization stage

Enter ./nmr_data_manual/unambig.tbl for restraint set 1 file

Enter ./nmr_data_manual/ambig.tbl for restraint set 2 file

Enter ./nmr_data_manual/hbonds_csi.tbl for hydrogen-bond distance restraints file

Empty all the 3J-bond coupling data, 1J-bond coupling data, chemical shift data and other restraint data file entries (you only need to empty the file entry fields!)

You don't need to enter the base name for input coordinate files

Enter lac_anneal as the base name for output coordinate files

Save the updated file as anneal.inp


In the winterm run

cns < anneal.inp > anneal.out &

This will take a little while....

Take a look at the lac_anneal_*.pdb


4) Structure analysis and quality assessment

The calculated structures now have to be checked whether they correspond to the experimental input data and whether they have good local structure (e.g. bond-lengths, bond-angles, peptide-bond planarity) and stereochemistry (e.g. Ramachandran plots for proteins, sidechain-conformations etc.)

To check the violations of experimental input data and the quality of local structure we can use the CNS accept.inp script:

Go to

  • input files
    • NMR
    • Edit accept.inp

      IMPORTANT: Make sure the accept script input is consistent with what you have used in the simulated annealing.

      Use lac_extended.pdb as the reference structure

      For the atom selection used for the calculation of the average structure use:

      (name n or name ca or name c) and resid 5:45

      Calculate the average structure only for accepted structures

      Apart from making the accept.inp input consistent with the anneal.inp input, leave everything unchanged (of course you can play around with the acceptance criteria as you like)

      Use as base name for input coordinate file(s): lac_anneal

      Use as base name for output coordinate file(s): lac_accept

      Save the updated file as accept.inp


      In the winterm run

      cns < accept.inp > accept.out &

      The program has written the accepted structures, the energy-minimized average of the accepted structures (averaged on residue 5-45, backbone heavy atoms only) and an RMSD table (*_rms_*.dat file).

      In order to view all accepted structures at once we will join them into 1 file (-o for output filename):

      joinpdb -o accept_all.pdb lac_accept_[1-8].pdb 

      (This will join lac_accept_1.pdb to lac_accept_8.pdb into accept_all.pdb)

      To avoid lots of error messages in PROCHECK we also need to rename some atoms in the pdb file using the nawk script patchXPLORpdb:

      patchXPLORpdb accept_all.pdb > acceptp_all.pdb

      Note that since PROCHECK_COMP reads in individual structures, you should run patchXPLORpdb on every accepted structure individually to avoid error messages in procheck.

      We will use MOLMOL and PROCHECK_COMP for visualization and further analysis of the generated structures. Inputs are the final ensembles of structures.

      To run PROCHECK_COMP create a file, e.g. lac.lis containing the names of the structures you wish to analyse. e.g.:

      To run PROCHECK_COMP then just type in your winterm:

      procheck_comp lac.lis

      PROCHECK_COMP generates PostScript files that can best be viewed using xpsview on SGI (type xpsview <prefix>*.ps) or Ghostview on Linux (type gs <prefix>*.ps) can be printed for your presentation.

      Note Procheck_comp can also be used to compare the quality of structures obtained following various protocols.

      Below is a list of properties of your structures that should always be presented, and that you should incorporate in your presentation at the end of the course. These include:

      1. Restraint violations. Generally people choose a certain threshold (typically somewhere between 0.1 and 0.5 A for distance restraints, and 1 to 5 degrees for dihedral angle restraints) above which the violations are scored.

          Note: Restraint violations can be found in the output generated by accept.inp. A number of awk scripts are available in the ~/scripts directory to extract restraint violation from the accept.out file. To perform the analysis copy the scripts from the ~/scripts directory into your working directory and execute them by typing:
          ./ana_noe_viol.csh accept.out >ana_noe_viol.out
      2. Q: What are the default threshholds for NOE and dihedral restraints in the CNS accept.inp script you have used?

        Q: Why do people use this threshold, why not just count all violations?

      3. CNS energies (subdivided into the empirical bond, angle, improper, VdW energies and those from the experimental force field). Often also root-mean-square deviations from ideal geometry (for bonds, angles and impropers).

        Q: What comments could you make on these last criteria?

      4. R.m.s. deviations from experimental restraints

      5. Cartesian positional root-mean-square differences. Often, values are calculated separately for backbone atoms and all heavy atoms. It can also be informative to calculate local rmsd's to illustrate local/mutual flexibilities or ill-definedness.

        Q: How could you illustrate mutual flexibility, like around a hinge region, in terms of rmsd's? How could one best visualize this?

      6. Other parameters included in the structural statistics are mere variations on these themes. In fact, there are a few main qualities to which every structure analysis can be reduced.

        Q: Which are these qualities?

        Now that we have checked the ensembles for violations, we can consider the properties of their average structures. We have used the accept.inp script of CNS to calculate the energy-minimized average structure. Inspect it, see if you understand how it works and calculate a non-energy minimized average structure for comparison with the energy minimized structure.

        Q: How would you decide which part of the structure should be used for the calculation of the average?

        Q: Why do we first have to superimpose the region used for averaging of the ensemble?

        Q: Determining an average structure can be less trivial than it seems. Under which circumstances would you expect the largest problems? How could you deal with these?

        Note: The generation and meaning of an average structure is an important problem. Consider for example what would happen in case of conformational averaging (say a loop exists in two conformations); how good would a structure obtained by a simple positional averaging procedure be? Methods exists to deal with such problems. Check for example Often an average structure is deposited into the Protein Data Bank ( as a representative structure of the ensemble of NMR structures. A better and more physical choice would be to deposit the structure the closest to the average structure as representation of the ensemble of NMR structures.

        The accept.inp script of CNS also determines the rmsd per residue. We can inspect these by looking at the *_rms_*.dat file. Visualize this by loading it into Xmgr.

        [choose `Read block data' from the menu; select the .dat file; set X from column 1 and Y from column 2; press Accept; press the autoscale button just below the Draw button in the main window. Other options may be selected from the menu to change tick labels, add titles etc.]

        In addition, we can check the `stereochemical quality' of the non-energy minimized average using PROCHECK.

        Q: How does it compare to that of the individual structures of the ensemble? and how to the energy minimized average?

        As you can see the average structure is energy-minimized to `repair' any structural anomalies. Take a look at the energy minimization part of the CNS script and identify all the terms that are used in the forcefield. In case the stereochemical quality of the energy-minimized structure is not satisfactory, one can increase the number of minimization steps to try to improve the quality of the minimized average structure.

    Your tasks:

    Now you have been introduced to structure calculation and validation you are required to run at least three different protocols in addition to the standard reference protocol described above in the intructions. We will use these runs for evaluation at the end of the course. The scheme is:


    Name of the run




    This is the reference run described in the intructions above. Each group is required to run it and use it further for comparison with the other runs.

    Description (compared to the reference run above!!!)






    Leave out the hydrogen bonds

    Add dihedral angle restraints based on chemical shift index. Enter dihedrals_csi.tbl at other restraints file entry

    Add dihedral angle restraints; leave out hydrogen bonds







    Use dihedral energy term (use different parameter file: *param_dihe)

    Use hydrogen bonding restraints between residues i,i+4 only. Use hbonds_csi_i-i+4.tbl (different from the previous one, check it!)

    Use hydrogen bonding restraints between residues i,i+4 only; use torsion angle dynamics (50000K, 1000 md-steps for hot md; 50000K, VDW scale factor=1 for cooling stages)






    Use unambig_4.5A.tbl and ambig_4.5A.tbl for NOE restraints

    Use unambig_5.5A.tbl and ambig_5.5A.tbl for NOE restraints

    Use unambig_wcal.tbl and ambig_wcal.tbl for NOE restraints. These tables are created with the NOE intensities scaled with a factor of 5 and are therefore not correctly calibrated.





    As point 2 in Scheme 1

    The same but with dipolar couplings: extra files tensor.param; tensor.psf; ./nmr_data_manual/lac_rdc_cns.tbl; lac_extended_tensor.pdb (instead of lac_extended.pdb) DFS=0; Axial=4; Rhombicity=0.5

    The same but with Torsion Angle Dynamics instead of Cartesian (see point 3 Scheme 2)


    Scheme of group 1


    Scheme of group 2



    Scheme of group 3



    Scheme of group 4




    Use unambig_false_ass.tbl instead of unambig.tbl for NOE restraints

    In this file a number of wrong assignments have been randomly introduced. Try, by analysing the violations to identify them.

    Try to set-up the calculations in series overnight on one machine, so next day you are ready to analyse all of the runs. Another interesting exercise (which only takes very little time) is to calculate the structures using the default values in the anneal.inp script. You can do this for both Cartesian and Torsion angle dynamics.

    Note: Since several groups will be running the same protocols, to avoid obtaining exactly the same results, change the random seed number in anneal.inp.

    Analyse and compare the resulting structures.

    Consider the following questions:

    Q: How does Torsion Angle Dynamics compare to Cartesian dynamics in terms of cpu-time (Hint: look at the last lines of the CNS output files) and numbers of accepted structures?

    Q: Does the use of chemical shift data (translated to hydrogen bond- and/or dihedral angle restraints) improve the structures in terms of quality? And in terms of efficiency (=number of accepted structures)?

    Q: What happens when using the conservative approaches using only distance-restraints of 4.5 or 5.5 Angstrom? What do you think of this method?

    Q: How do the structures calculated using the default input parameters for the anneal.inp script compare to those used in the standard run? Elaborate on this. What is your conclusion?

    5. Structure calculation using ARIA scripts and manually assigned NOE data

    The main advantage of ARIA is its capabilities of doing automatic NOE assignments in raw NMR data. However, the program also allows the use of manually assigned and calibrated NOE-data or combinations of manual and automatic assignments. We will start of here with the use of manually assigned NOE data only as input for the ARIA scripts and do some variations on the input data (such as in the previous sections)

    The basic run (run1) that all students will do is described in this section.

    First open the ARIA page in your bookmarks:


      ARIA.tbl format (already calibrated data)

      Read the instructions carefully and enter:

        Do NOT modify the path to the current ARIA program directory!

        Path of the new project

        run number (1)

        unambiguous datalist (unambig.tbl)

        ambiguous datalist (ambig.tbl)

        Residual dipolar couplings 1 (lac_rdc.tbl)

        sequence file (lac.seq)

        chemical shift index file (lac_csi.out)

        CSI dihedral restraints errors: 30

        Make sure that all fields that are not used are blank!!

      Now save the file as new.html

    Run aria in the winterm:


    This will create a directory tree necessary for the calculations under ./run1/


    Before we continue we need to provide ARIA with a file for stereo-specific assignments. In principle ARIA can do these assignments using a swapping routine (swaps atoms names) in the course of the calculation. However, when the stereo-specific assignments have been done using experimental methods this is of course the way to go.

    In the case of hp62 we have stereo-specific assignments for all leucine and valine prochiral methyl groups. To use them do the following:

    cp ./nmr_data_manual/stereoassign.cns ./run1/data/sequence/ 

    The next thing we have to do is to edit the ./run1/run.cns file to define which protocol we will use. To do so go back to the main ARIA page and enter the location of the run.cns file.

    First we have to enter/check some file and directory names:

      Project name

      Project directory

      Run directory

      Template pdb file

      Protein structure file

      Sequence file

      Now only change the fields listed below:

      Check the run_number in the readout directory

      Set the use of CSI derived hbonds restraints to TRUE

      Set the type of non-bonded parameters to PARALLHDG

      Do you want to include dihedral angle energy terms: false

      Use for the queing command: csh

      Set the number of jobs to 1

      Use for the cns executable the absolute path of cns (type: "which cns" in the winterm to see this).


    For the SA protocol use:

      Type of molecular dynamics: Cartesian

      Initial temperature for TAD: 2000

      Initial temperature for TADCartesian dynamics: 2000

      Final temperature after first cooling step: 1000

      Final temperature after second cooling step: 50

      Cartesian time step: 0.005

      Factor for timestep and number of steps in TAD: 9

      Initial number of MD steps: 10000

      Number of MD steps for refinement: 2000

      Number of MD steps during first cooling stage: 10000

      Number of MD steps during second cooling stage: 3000

      First iteration for cartesian refinement (after TAD): 0

      Use the water refinement for the last iteration and select 10 structures

    Save the updated run.cns file (overwrite)


    In the winterm go to ./run1/

    and run aria and redirect the output to a log file

    aria1.1 > &aria.log& 

    Your tasks:

    Similar to what you did for the CNS structure calculations you are required to run, next to the standard protocol described above (run1), at least one other structure calculation protocol using ARIA scripts and manually assigned data. The scheme is:


    Name of the run




    This is the reference run described in the intructions above. Each group is required to run it and use it further for comparison with the other run.

    Description (compared to the reference run above!!!)



    No water refinement



    Use PROLSQ parameters for the forcefield (widely used crystallographic forcefield)



    Use PARAMALLH6 parameters for the forcefield (CHARMM VdW radius scaled to 0.8 instead of 0.78)



    Use dihedral angle energy term



    Use torsion angle dynamics with default parameters (for TAD!!)



    Use Cartesian dynamics with default parameters (for Cartesian dynamics!!)



    As run 6, but with dipolar couplings (put dipolar couplings class1 to SANI use Ax=4 and Rh=0.5)



    As run 7, but with dipolar couplings (put dipolar couplings class1 to SANI use Ax=4 and Rh=0.5)

    5. Structure calculation using ARIA and automatic assignment of NOE data

    The final structure calculation protocol of this course will use the automated assignment possibilities in ARIA.

    The basic run (run1) that all students will do is described in this section.

    On the ARIA starting page select:

      Regine format (uncalibrated data)

    Edit the directory and file names:

      Current ARIA program directory:

      Path of the new project:

      Run number:

      For the spectra use the following input:

        SPECTRUM 1:

          Name: 15N

          Peaks file: lac_15N-NOE-HSQC.pks

          Shifts file: lac.shifts

          PPMD for Heteronucleus 1: 0.5

          PPMD for Proton 1: 0.05

          PPMD for Heteronucleus 2: (leave empty)

          PPMD for Proton 2: 0.05

        SPECTRUM 2:

          Name: 13C

          Peaks file: lac_13C-NOE-HSQC.pks

          Shifts file: lac.shifts

          PPMD for Heteronucleus 1: 0.5

          PPMD for Proton 1: 0.05

          PPMD for Heteronucleus 2: (leave empty)

          PPMD for Proton 2: 0.05

        SPECTRUM 3:

          Name: 15N15N

          Peaks file: lac_15N-HMQC-NOE-HSQC.pks

          Shifts file: lac.shifts

          PPMD for Heteronucleus 1: 0.5

          PPMD for Proton 1: 0.05

          PPMD for Heteronucleus 2: 0.5

          PPMD for Proton 2: 0.05

        SPECTRUM 4:

          Name: 2D

          Peaks file: lac_2DNOE_H2O.pks

          Shifts file: lac.shifts

          PPMD for Heteronucleus 1: (leave empty)

          PPMD for Proton 1: 0.05

          PPMD for Heteronucleus 2: (leave empty)

          PPMD for Proton 2: 0.05

        Sequence file: lac.seq

        Residual dipolar couplings 1 (lac_rdc.tbl)

        Chemical shift index file: lac_csi.out

        CSI dihedral restraints errors: 30

    Make sure that all fields that are not used are blank!!

    Save the file as new.html

    run aria in the winterm:


    As in the previous exercises this will create a directory tree necessary for the calculations under ./run1/

    Copy the stereo-specific assignments for all leucine and valine prochiral methyl groups into the run directory.

    cp ~/nmr_data_manual/stereoassign.cns ./run1/data/sequence/ 

    The next thing we have to do is to edit the ./run1/run.cns file to define which protocol we will use. To do so go back to the ARIA-starting page and do the editing.

    Again, enter and check file and directory names.

    Check that the CNS to IUPAC conversion is set to true

    Now only change the fields listed below:

      Set the use of hydrogen bond restraints to false

      set the use of dihedral restraints to false

      Set the use of CSI derived hbond restraints to true

      Set the use of CSI derived dihedral restraints to true

      set the type of non-bonded parameters to PROLSQ

      Do you want to include dihedral angle energy terms: true

    For the spectra parameters:

      set qrelax to false for all spectra

    For these calculations we will not use the parallel jobs option

      use for the queing command: csh

      use for the cns executable the absolute path of cns (type: which cns in the winterm to see this).

      use 1 job at a time

    For the SA protocol use:

      type of molecular dynamics: Cartesian

      initial temperature for TAD: 2000

      initial temperature for TADCartesian dynamics: 2000

      final temperature after first cooling step: 1000

      finale temperature after second cooling step: 50

      drop in temperature (K) per cycle of dynamics: 50

      Cartesian time step: 0.003

      factor for timestep and number of steps in TAD: 9

      initial number of MD steps: 10000

      number of MD steps for refinement: 2000

      number of MD steps during first cooling stage: 5000

      number of MD steps during second cooling stage: 2000

      first iteration for cartesian refinement (after TAD): 0

      Use the water refinement for the last iteration and select 10 structures

    Save the updated run.cns file (overwrite)

    In the winterm go to ./run1/ and run aria

    aria1.1 > &aria.log &

    Since we are running aria in a single processor, this will take quite some time (about 2 to 3 days).

    Your tasks:

    Next to the standard protocol described above you are required to run one additional protocol. The second run will however be performed in Nijmegen. The scheme is:


    Name of the run




    This is the reference run described in the intructions above. Each group is required to run it and use it further for comparison with the other run.

    Description (compared to the reference run above!!!)



    No CSI data (set CSI dihedrals and hbonbs to false)



    only CSI hbonds data (set CSI dihedrals to false)



    only CSI dihedral data (set CSI hbonds to false)



    Maximum number of possible assignments set to 20 instead of 5 (MAXN for each iteration)



    use small errors on the peaks (0.025 instead of 0.05 for protons)



    use large errors on the peaks (0.10 instead of 0.05 for protons)



    as run1 but with dipolar couplings (use Ax=4 and Rh=0.5)



    as run1 but with dipolar couplings (use Ax=1 and Rh=0.5)

    A number of scripts to compare the manually assigned peaks with the automatic assignment results from ARIA are provided in the ./nmr_data_aria/compare directory.

    To run the comparison, copy the entire compare directory in the same directory where the run (e.g. run1) directories are present. Edit the run_comparison.csh script and change the run number (define by the variable ir), e.g.:

    # define aria run number
    set ir=3

    and then run the comparison by typing in your winterm:


    This will create a run directory containing results for each spectrum and ARIA iteration. Check the .out files and compare the number of assignments as a function of the iteration number. How many identical and different assignments? How do the various protocols compare? Gather this information from the other teams as well to obtain an overview of the various protocols and decide which setup was the best one in terms of correct/incorrect assignments.

    Also compare the ARIA generated lac structures with the ones obtained from manual assignments: how close are they? What are the differences (e.g. rmsd, secondary structure, stereochemical quality). Use for the comparison the structures obtained with the standard reference run in CNS.

    Note: You can use molmol to calculate rmsd between two structures. Another possibility is to use the program profit. The following example calculates various rmsd values between a reference structure (lac_std.pdb) and another structure (lac_aria.pdb) (the comments after ! are explanations and should not be used as commands within profit):

      reference lac_std.pdb ! reads in the reference structure
      zone 5-45             ! defines zone in residue numbers for the fitting
      atoms CA,C,N          ! defines the atoms used for fitting
      mobile lac_aria.pdb   ! reads in the structure to be fitted
      fit                   ! performs the fitting. You can write out the fitted
                            ! structure if you wish with the command write "filename"
      rzone 1-62	        ! calculates rmsd for the entire backbone
                            ! To define a new zone (or rzone) first clear the old
    			! definition with zone (or rzone) clear
      ratoms ^H*            ! calculates rmsd for all heavy atoms (H not included)

7. Validation of NMR structures


In this part of the course you will be made more familiar with some widely used tools to check the quality of protein structures. Here we will distinguish between the quality in terms of how well the structures represent (fit) the experimental data, and the quality in terms of local and overall geometry.

Validation of structures against experimental data

In a typical NMR structure determination one first tries to find a set of structures that fulfil the restraints derived from experimental procedures, such as NOE-derived distance restraints, dihedral angle restraints, hydrogen bonding restraints and residual dipolar couplings. To do this an ensemble of structures is calculated and a subset of this ensemble is selected that fulfils the input restraints to a certain degree. In a sence this is already validation of the structures in terms of experimental data (actually it's the most important criterium).
Selection of structures is usually done using certain cut-off criteria for the degree of violation of the input restraints. For example, structures should not contain violations of distance restraints larger than 0.2 or 0.5 Angstrom and dihedral angle violations of 5 degrees. These values are quite arbitrary and depend on the estimation of the error on the restraints, and whether tight- or loose restraints were used in the calculation. It must be noted that there is no real consensus on which criteria to use. Sometimes one even finds structures in the databases that contain some larger violations. This may be due to incorrect treatment of the distance restraint, or to the presence of multiple conformations. Another problem with these selection procedures is that often restraints are being modified or even left out of calculations. You will understand now that it is hard to assess the true meaning of statements like "structures contained no violations of distance restraints larger than 0.2A" without having the original data available. Original data in this case means spectra, and not just the restraint tables that are sometimes deposited with structures in the databases.
The validation against experimental data is usually done using the programs that were used to calculate the structures. XPLOR and CNS for example provide the restraint violations and in addition, as another measure of the fit of the structures to the data, the RMS deviation for restraints. With respect to the latter one can say that in general the lower the RMSD value is the better the fit of the structure to the data. Another program that is sometimes used for validation of experimental restraints is AQUA . This program also provides tools for interconversion of restraints between different formats and checking the completeness of your data-sets.

So far for the fit of the structures to the experimental data........

Validation of local and overall geometry of NMR structures

Now we will focuss a bit more on other generally used quality indicators for structures. Let's start with a statement:

"NMR does not provide enough experimental information to determine the structure of a biomolecule. The best we can do is to derive a model based on the NMR data and a description of the molecular topology and intra- and inter molecular forces."

How nice.....

The reason I am stating this is that it is possible to build models that fulfill the NMR data perfectly but would be completely wrong if not a lot of chemical/physical information on the molecular topology was used. You will say now: OK, just use that information and build the model. As you will see, this is exactly the point where things often go wrong in NMR structure determination.
In contrast to NMR, X-ray crystallography does allow the determination of a structure, provided that the resolution of the data is high enough. This does not mean that no errors are made in very high resolution, well-refined X-ray structures. They are just more reliable structures. At lower resolutions the final result of X-ray structure determination also depends a lot on the quality of the model building.
The validation of structures, both X-ray and NMR derived, is done using reference data sets. These reference data sets are made up of high resolution X-ray structures of small molecules and biomolecules. For parameters describing the covalent geometry, i.e. bond lengths and bond angles, Engh & Huber derived very precise reference averages and standard deviations from the Cambridge Small molecule Database (CSD). Other parameters concerning the local structure properties that are usually checked are the omega angle, chirality deviations and side-chain planarity. The reference values for these parameters were derived from high resolution X-ray structures of proteins. It might seem trivial to treat these parameters properly in structure calculations, however, practice shows us that this is often not the case. An important example is the fact that in many NMR structures calculations the omega angle is kept very close to 180 degrees (+/- 2 degrees or so), whereas we know that the standard deviation for omega angles is around 6 degrees. Such a difference has a considerable effect on the freedom of the protein backbone and the Ramachandran plot.
Other checks that are done deal with overall quality of structures. The best known example is the Ramachandran plot which provides combinations of the phi- and psi backbone dihedral angles that are favoured or not allowed. This is a very important quality indicator and for NMR structures there is a relation between the number of restraints in a certain region and its the appearance in the Ramachandran plot. Another important example is the validation of the rotameric states of side chains (Chi1-Chi2). If the force field does not include energy terms for side chain dihedral angles, energetically unfavourable eclipsed conformations may occur. Another check I want to mention here is the packing quality of protein structures. Different amino acids prefer different environments and this check provides a measure of how well a residue is packed in a protein.
Finally, problems arising from simplified treatment of the non-bonded interactions in NMR structures should be discussed. For example, Van der Waals interactions are often treated as if atoms were hard spheres with only a repulsive term (REPEL option in XPLOR/CNS) and electrostatics usually are not used at all. This simplified treatment of the van der Waals and electrostatic interactions is applied to speed up the structure calculation process, but, as we will soon see, causes some problems in the structures.
One of these problems is the presence of so-called "bad contacts", or inter-atomic overlaps. Small bumps are not necessarily wrong, however larger overlap is physically unrealistic and the occurence of many inter-atomic bumps indicate that the structure was not refined properly and is not to be trusted.
The absence of electrostatic forces in the force field may lead to unrealistic charge distributions and non-optimal hydrogen bonding networks in the considered molecule. This becomes particularly evident for charged side chains which are allowed in conformations that are not very likely considering the local electrostatics. (insert some references)

Protein structure validation software

The programs we will use to check our structures are PROCHECK, PROCHECK_NMR and WHAT IF. WHAT IF provides by far the most information and is more critical than PROCHECK. However, we will illustrate the use, similarities and differences in all these programs because they are widely used in the structure community. For a more elaborate explanation of the checks that these programs perform have a look at the

Practical part

First we will start playing around a little with 1 structure, just to get familiar with the software. Take a structure from your ensemble that is good in terms of fit to the experimental data. Now run procheck on this structure:

You will get a number of files summarizing the results. Inspect the postscript files and go through the .out and .sum file. Try to understand the output.

Now do an analysis of the complete ensemble using PROCHECK_NMR and look at the differences in PROCHECK and PROCHECK_NMR outputs.

Important note: often people use only the so-called well-defined regions of their structures in the PROCHECK_NMR analysis. This improves the apparent quality of the structures, especially for the Ramachandran plot. But that is only estetics and you must be careful on how you present such analyses. It should not be used for hiding possible problems in less well-defined regions.

Now we will do some more critical checks using WHAT IF.

You can now read the report in the log file. Go through it and try to understand the checks that have been performed. Pay special attention to:

At the bottom of the file you will find the summary of some of the most important checks. Now try to understand the output. Take a look at the structure with molmol and see if you can actually see the problems in the structure.

Now, just to make you feel better about your own structure and to illustrate the use of validation, have a look at structures 1i1s and 1ka3 (take any member of the ensembles in /home1/coursead/1i1s/ and /home1/coursead/1ka3/).

Now that you have become familiar with the checking software we want to get some statistics on the structure ensembles you have calculated in the first part of the course. To keep you from having to run all checks on individual structures I have written a script that does it for you and provides the statistics in a summary file. The script only runs on machines obiof1, obiof10 and obiof11 (sorry, don't blame me). So we have to divide the jobs a little (ask me!).

Do the following:

The script should be run on each ensemble you calculated. The statistics you will get for each ensemble can now be used for comparison of the different calculation protocols. Your final task is now to do this comparison and to prepare figures for the presentation. Try to illustrate the results using both the PROCHECK and WHAT IF results. For example, it will be very illustrative for the understanding of the "abstract" Z-scores for the Ramachandran and Chi1-Chi2 rotamer normality if you compare them to the plots generated by PROCHECK. Try to find the correlation between the Z-score and the percentages in the most favoured regions of the Ramachandran plot. Do something similar for the Chi1-Chi2 distributions (use the plots). If you do not have enough data to get a correlation, then use the results from the other groups.

Good luck!!

8. Instructions for the presentation

Every group has to present some of their results that were obtained during the course. The time for one presentation is 10-15 minutes, followed by questions and discussion (~5 minutes). Since their is only limited time for the presentation try to present your results in a compact manner. Discuss the essential differences (i.e. acceptance scores, quality parameters, noteable differences in global fold etc) in the results of the structure calculations using the various protocols and various input datasets.