BNRA course 2004

"From Data to Structures"

Program for the practical
"Structure calculation and validation"


Chris Spronk and Aart Nederveen, May 2004


1. General remarks

The intention of the practical program of the 2004 BNRA course "From data to structures" is to make you familiar with the basics of NMR structure calculation and validation. Based on our previous experiences in giving this practical course, we have chosen to lead you through the coarse trajectory that one would follow when working on a biological macromolecule (see below). This means that a lot of the details of structure calculation will be skipped and some of the software will be used as a "black box". Our aim is that from the basic knowledge that is learned throughout the course, you will be able to set up your own structure calculations, and independently expand your knowledge on this topic to higher levels.
In practice, this means that instead of running many variations of different types of structure calculation protocols, a simple "standard" structure calculation protocol based on NOE-derived distance restraints and J-coupling-derived dihedral angle restraints will be performed. The calculated structures will then be analysed in detail using various structure validation methods and visual inspection. If time allows, it is possible to expand the practical to run structure calculations with other types of restraints and/or structure calculation protocols.

2. Setup

In order to reflect the reality of an NMR structure determination to some extent (and to allow for some variation in the final presentations), we have chosen to take a project-based approach. This means that each group will be assigned to calculate a set of structure models for a given protein. From these structures you will select "good" structures, investigate their global and local properties, identify possible problems in the structure, etcetera. Following the detailed analysis of the structures, the focus shifts towards the biological relevance of the protein. This means that you will identify the regions in the protein structure that are of main biological interest, such as binding sites, active sites and so on (if present, of course), and use the available literature and structures from the Protein Data Bank (PDB) to draw conclusions. You will present, compare and discuss your findings in a final presentation at the end of the course.

For each project the following is provided:
  • The deposited pdb-file
  • NMR distance and dihedral angle restraints in XPLOR/CNS format
  • The original paper describing the structure in pdf-format
The general goals for each project are:
  • Calculate and refine a structure ensemble using CNS
  • Analyse the structural properties and general quality of the molecule
  • Compare to the results described in the literature
  • Prepare a final presentation of the results

Table 1: Projects
Group   Project   Login      Computer
1       1txa      course5    bug28
2       1txa      course6    bug29
3       1kdf      course7    bug30
4       1kdf      course8    bug31
5       1j0t      course9    bug32
6       1j0t      course10   bug33
7       1b3c      course11   bug34
8       1b3c      course12   bug35
9       1b3c      course13   bug36


3. Software

3.1 Structure calculation

A large number of programs and protocols can be used to calculate macromolecular structures from NMR data. In this practical, however, we will concentrate on one particular program, CNS ("Crystallography & NMR System") from Axel Brünger's laboratory. CNS is the successor to the widely used program X-PLOR. More information on both programs can be found through the links at http://atb.slac.stanford.edu/ .

3.2 Structure validation

Various programs are available for structure validation. In this course we will use the two most powerful and widely used packages, PROCHECK and WHAT IF. PROCHECK provides nice illustrative output; WHAT IF is more powerful but less user-friendly. Throughout the course you will be made familiar with the strengths of both programs.

PROCHECK - Single structure analysis
PROCHECK-NMR - Structure ensemble analysis
PROCHECK-COMP - Comparison of two structures
WHAT IF - Single structure and/or ensemble analysis ( help pages )

3.3 Structure visualization

Throughout this course we plan to use molmol as a viewer. Maybe you are already familiar with this program. If not, take some time during this course to read the tutorial on the web. Several visualization tools that might be interesting for you are listed below.
PyMol - Powerful viewer ( tutorial )
Molmol - Powerful viewer and analysis program, but not very intuitive in its use (manual tutorial )
Rasmol - A simple and easy to use viewer, nice for quick inspection but rather limited in its capabilities ( tutorial )
Yasara - Not used in this practical, but it is an extremely powerful and user-friendly viewer and therefore worth mentioning (all you need is a good graphics card...)


4. Literature

The following papers are good overviews and starting points for reading:
  • P. Guentert. Structure calculation of biological macromolecules from NMR data. Quarterly Reviews of Biophysics 31, 145-237 (1998).
  • C.A.E.M. Spronk, S.B. Nabuurs, E. Krieger, G. Vriend & G.W. Vuister. Validation of protein structures derived by NMR spectroscopy. Progress in NMR Spectroscopy, in preparation.
  • S.B. Nabuurs, C.A.E.M. Spronk, G. Vriend & G.W. Vuister. Concepts and tools for NMR restraint analysis and validation. Concepts in Magnetic Resonance, accepted.


5. Practical aspects

5.1 Time schedule

The coarse time-schedule for the practical is given in Table 2. Depending on the speed of progress there is room for more exercises.

Table 2. Coarse schedule
Date and program
Tuesday, May 18
  • Introduction to the practical
  • Generation of molecular topology
  • Generation of extended starting structure
  • Simulated annealing structure calculation and solvent refinement (overnight)
  • Tutorial structure visualization
Wednesday, May 19
  • Break: Introduction to structure validation
  • Start of validation of "project" structures (late afternoon)
May 20 - May 23: Spring break
  • Literature study (not obligatory, but useful for yourself)
Monday, May 24
  • Answers to questions so far... (1 hour)
  • Analysis of "project" structures
    • Inspection of validation reports
    • Visual inspection
    • Structure classification
    • etcetera
Tuesday, May 25
  • Continuation of analysis of project structures
Wednesday, May 26
  • Comparison of results to literature
  • Optional: Analysis of structures calculated using various types of data and protocols
Thursday, May 27
  • Break: RDC practical by Blackledge
Friday, May 28
  • Preparation for the presentation


5.2 Directory setup, files and scripts

All the necessary data and files are provided in your home directory. For group 1 the home directory is /home/course5, for group 2 /home/course6, etc. Below, /home/course# should be read as /home/course5, /home/course6, etc. The scripts that you will be running throughout the course will create subdirectories within your home directory that contain all the output files. An overview of the relevant scripts, files and subdirectories that are created is given in Table 3.

Note that most scripts are freely available from the RECOORD database.


Table 3. Actions, output files and directories
Action                                      Run command           Start in dir    Output
Generate molecular topology                 generate.sh*          project/        project_cns.mtf**
Generate extended starting structure        generate_extended.sh  project/        project_cns_extended.pdb
Run simulated annealing                     refineLongSep.sh      /home/course#/  structures in str/: project_cns_1.pdb, ..., project_cns_100.pdb
                                                                                  job files, CNS input and output files in cnsRef/
Do refinement in water                      re_h2o.sh             /home/course#/  structures in str/wt/: project_cns_w_1.pdb, ..., project_cns_w_25.pdb
                                                                                  job files, CNS input and output files in cnsWtRef/
Do validation using WHATCHECK and PROCHECK  validPDB.sh           str/wt/         validation results (summary files):
                                                                                  v_project_cns_w/whatcheck/WHATCHECK_project_cns_w.SUM
                                                                                  v_project_cns_w/procheck/PROCHECK_project_cns_w.SUM
Analysis of violations                      calcViolOrg.sh        project/        violations/viol_project_cns_w_0.3
Summary of violation analysis               analysViol.sh         project/        violations/viol_results
*For an explanation of a script and its options, type the script name without any command line arguments and press enter.
**Directories are written with a trailing slash; project (as part of a filename) stands for the project ID that has been assigned to you (see Table 1), e.g. project_cns_1.pdb = 1txa_cns_1.pdb for groups 1 & 2, 1kdf_cns_1.pdb for groups 3 & 4, etcetera. Only files relevant to the course are specified.
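
The naming convention in Table 3 can be made concrete with a short sketch. The helper name expected_outputs is hypothetical and not part of the course scripts; it only substitutes a project ID from Table 1 into the file names listed in Table 3, assuming the default numbers of 100 annealed and 25 water-refined models.

```python
def expected_outputs(project, n_models=100, n_refined=25):
    """Return the output file names of Table 3 for one project ID.

    Hypothetical helper for illustration only; it does not run any of the
    course scripts, it just fills in the 'project' placeholder.
    """
    return {
        "topology": f"{project}_cns.mtf",
        "extended": f"{project}_cns_extended.pdb",
        "annealed": [f"str/{project}_cns_{i}.pdb"
                     for i in range(1, n_models + 1)],
        "refined": [f"str/wt/{project}_cns_w_{i}.pdb"
                    for i in range(1, n_refined + 1)],
        "whatcheck": f"v_{project}_cns_w/whatcheck/WHATCHECK_{project}_cns_w.SUM",
        "procheck": f"v_{project}_cns_w/procheck/PROCHECK_{project}_cns_w.SUM",
    }


# Example for groups 1 and 2 (project 1txa in Table 1)
outputs = expected_outputs("1txa")
print(outputs["topology"])
print(outputs["annealed"][0], "...", outputs["annealed"][-1])
print(outputs["refined"][0], "...", outputs["refined"][-1])
```

Comparing such a list with the actual directory contents is one quick way to see how far a calculation has progressed.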



Finally, your home directory will be organized as follows:

/home/course#/
  project/
    str/
      wt/
        v_project_cns_w/
          procheck/
          whatcheck/
    cnsRef/
    cnsWtRef/
    violations/
  ValidationPractical/
    1ka3/
      procheck/
      whatcheck/
    1i1s/
      procheck/
      whatcheck/


5.3 Further guidelines for running the scripts

Now you will start using the scripts mentioned in Table 3. All calculations should be carried out on the computer that is assigned to you in Table 1. To log in remotely to these machines use 'ssh bugXX', where XX stands for the number of the computer in Table 1 (e.g. group 1: 'ssh bug28').
After a successful login you can start typing your commands:
  • Generate the molecular topology (project_cns.mtf file) from the primary sequence or pdb-file.
    • Use the script generate.sh.
    • The generation of the topology file may take 1-2 minutes. Check the files that are now created.
  • If applicable, check the following (look in the topology file and the original .pdb file):
    • What do you think is the difference between the .pdb file and the .mtf file?
    • What is the difference between atom names and atom types in the topology file?
    • Protonation state of the histidines.
      Check the protonation state of the HIS residues in both the original .pdb file and in the .mtf file. They should be identical.
    • Presence or absence of disulfide bridges.
      Check if you can find them in the original .pdb file. They will be mentioned in the header of the .mtf file as well.
    • Occurrences of cis-peptides.
      To find out if cis-peptides are present in the original .pdb file, you can load the .pdb file in molmol and then choose Calc -> Angles (select the omega angle ca-c-n-ca). If cis-peptides are present they are incorporated in the .mtf file. Just do 'grep CCIS' on your .mtf file and find out if carbon atoms are present with type CCIS. 
  • Generate a starting structure in an extended conformation
    • Use the script generate_extended.sh. Now an extended structure will be generated. Check this structure with molmol.
    • The starting structure will be the input for the simulated annealing protocol; the ensemble of structures will be generated from it.
  • Now start the structure calculation (simulated annealing part):
    • Run the script refineLongSep.sh. Note: this script should be run from your home directory, i.e. the parent of the project directory. The argument should be the project name. For example (group 1):
      /home/course5> refineLongSep.sh 1txa
    • This script does the following for you:
      • A CNS parameter file is generated, called run.cns. In this file the protocol that we use is specified.
      • Job files are generated for your models and are processed consecutively on the bug you are working on. This will take a couple of hours, so use another window to check what is going on in the project directory. By default 100 models will be generated; in general one would calculate more.
      • Have a look in the directory cnsRef; here you can find the CNS input and output files together with the job files.
      • When a model is finished, it is written to the directory str. Check the header for violations and energy values. For checking the violations and the energies you can use the aliases 'sortener' and 'sortviol' in the str directory.
      • The restraints that are used throughout the calculations are present in the .tbl files. There can be three different files: unambig.tbl (NOE distance restraints), hbonds.tbl and dihedrals.tbl. Which ones do you have? Have a look in those files and figure out how the experimental potentials (see lecture Bonvin) are built up from those numbers. Also calculate NMR data density for your project: the number of restraints per residue.
  • The structure calculation will take a couple of hours. In the meantime, try to get familiar with the molmol program using the tutorial. You can also start reading about the structure in the paper.
  • Optional: you can also redo this whole procedure by making a new directory project_var in your home directory and leaving out the dihedral restraints, decreasing the upper bounds by e.g. 10 %, or adding some errors to the restraint files. Copy the restraints and the original .pdb file (now project_var.pdb) into the directory project_var and start again with the script generate.sh.
  • When the structure calculation is finished (100 models should be present in the directory str), you can go on to the next stage: refinement of your models in water. For this we use the script re_h2o.sh, which works similarly to the previous script. This script does the following:
    • In the directory str a new directory wt is created, into which the 25 lowest-energy structures are copied. These structures will now be refined and will get names like project_cns_w_1.pdb, project_cns_w_2.pdb, etc. Use the alias 'sortener' in the directories str and str/wt to find out whether the lowest-energy structures are now the water-refined ones.
    • Check the directory cnsWtRef for the CNS input and output files and the job files.
  • For the 25 models that are refined in water we will run whatcheck and procheck. Go to the directory str/wt where the refined models are stored. The script validPDB.sh will now do the work for you (first run it without arguments to learn its use, and only validate the water-refined models with '_w_' in their filenames). This script waits until the water refinement stage is finished and the refined models are available, so you can start it immediately after starting the water refinement stage.
  • For calculating the RMSD, take the secondary structure elements as specified in the paper and superimpose the models on these regions in molmol. If the regions are not specified in the paper, you can find them in the procheck output.
  • Now you can use the scripts calcViolOrg.sh and analysViol.sh to check the distance restraint and dihedral angle violations in your water-refined ensemble of 25 models.
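
The NMR data density asked for above (number of restraints per residue) can be estimated with a few lines of Python. This is a minimal sketch, not one of the course scripts. It assumes that every restraint in a .tbl file starts on its own line with the XPLOR/CNS keyword "assign" (or its abbreviation "assi") and that the residues can be counted from the CA atoms of the first model in the .pdb file. The function names are hypothetical.

```python
import re

# Assumption: each restraint starts on a new line with 'assi' or 'assign';
# comment lines (starting with '!') and continuation lines are not counted.
ASSIGN = re.compile(r"^\s*assi", re.IGNORECASE)


def count_restraints(tbl_text):
    """Count restraint entries in the text of one .tbl file."""
    return sum(1 for line in tbl_text.splitlines() if ASSIGN.match(line))


def count_residues(pdb_text):
    """Count residues by their CA atoms in the first model of a PDB file."""
    residues = set()
    for line in pdb_text.splitlines():
        if line.startswith("ENDMDL"):
            break  # only look at the first model of an ensemble
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            # chain ID plus residue number (with insertion code)
            residues.add((line[21], line[22:27]))
    return len(residues)


def restraints_per_residue(tbl_texts, pdb_text):
    """Data density: total number of restraints divided by residue count."""
    total = sum(count_restraints(text) for text in tbl_texts)
    return total / count_residues(pdb_text)


# Tiny inline demonstration with made-up data (not a real restraint set)
demo_tbl = (
    "! made-up example\n"
    "assign (resid 1 and name HA) (resid 2 and name HN) 2.5 0.7 0.7\n"
    "assign (resid 2 and name HA) (resid 3 and name HN) 3.0 1.2 1.2\n"
)
demo_pdb = "\n".join(
    f"ATOM  {i:5d}  CA  ALA A{i:4d}" for i in range(1, 5)
)
print(restraints_per_residue([demo_tbl], demo_pdb))
```

For a real project you would read unambig.tbl, hbonds.tbl and dihedrals.tbl (whichever are present) with open(...).read() and pass their contents as the list of texts.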

5.4 Guidelines for the presentation

Every group has to present some of their results that were obtained during the course. The time for one presentation is 15 minutes, followed by questions and discussion (~5 minutes). Since there is only limited time for the presentation, try to present your results in a compact manner. 

Your presentation should be subdivided into an introduction, a results section and a discussion section. Below we list some aspects that you may want to include in your analysis and presentation.
  • Introduction of the project
    • Description of the protein: e.g.: fold, function, active sites, biological relevance, special features such as cis-peptides, disulphide bridges, etc.
    • Description of the NMR data: data density, data types
  • Results
    • Calculated ensemble of structures
    • Characteristics and quality (e.g.):
      • Secondary structure elements
      • Discuss value of rms deviations, energies, violations
      • Ramachandran plot
      • RMSD of ensemble
      • Other quality indicators
  • Discussion and conclusion
    • Discuss differences between procheck and whatcheck quality indicators
    • Identify problematic regions
    • Comparison with structure as described by the authors: discrepancies and agreement:
      • Quality indicators
      • Secondary structure
      • RMSD
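
For the RMSD figures listed above, the underlying arithmetic is simple once the models have been superimposed. The sketch below is an illustration only. It assumes that the models were already superimposed (for example in molmol) and that the coordinates of the same atoms, in the same order, have been extracted as (x, y, z) tuples. It does not perform any fitting itself, and the function names are hypothetical.

```python
import math
from itertools import combinations


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long coordinate lists.

    No superposition is done here: the inputs are assumed to be fitted already.
    """
    if not coords_a or len(coords_a) != len(coords_b):
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


def mean_pairwise_rmsd(models):
    """Average RMSD over all pairs of models in an ensemble."""
    pairs = list(combinations(models, 2))
    if not pairs:
        raise ValueError("need at least two models")
    return sum(rmsd(a, b) for a, b in pairs) / len(pairs)


# Made-up two-atom "models" to show the call pattern
model_1 = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
model_2 = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(rmsd(model_1, model_2))
print(mean_pairwise_rmsd([model_1, model_2, model_1]))
```

Note that the mean pairwise RMSD and the RMSD to the mean structure are different numbers, so state which one you report when comparing with the paper.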

Appendix 1. Introduction to structure validation

In this introduction to structure validation the use of the programs WHAT IF and PROCHECK will be demonstrated. Two rather recent NMR structures, 1i1s and 1ka3, obtained from the PDB will be used to illustrate how to detect major problems in protein structures (note that these are outliers, but therefore very useful for illustration purposes). Instead of checking the complete ensembles, only the first model of each entry will be investigated.

The structures that you will check are:

/home/course#/ValidationPractical/1ka3/1ka3_001.pdb
/home/course#/ValidationPractical/1i1s/1i1s_001.pdb

  • PROCHECK:
    • In your home directory go to ValidationPractical/1ka3/procheck/
    • To run procheck on this structure type:

      procheck ../1ka3_001.pdb 2

      Explanation: Structure 1ka3_001.pdb is compared to X-ray structures of 2 Ångstrom resolution

  • WHAT IF:
    • In your home directory go to ValidationPractical/1ka3/whatcheck/
    • To start the program type:

      whatif

    • Set a parameter:

      setwif 593 100000

      Explanation: “setwif” is needed here to make the output of whatif very long (we need this for 1ka3). It sets the internal parameter 593 to 100000, which means that the output can be 100000 lines (not important to know for now, I just mention this for completeness)

    • Enter the WHAT IF checking menu:

      check fulchk ../1ka3_001.pdb ../1ka3_001.pdb

      Explanation: “check fulchk” is used to enter the structure checking menu in WHAT IF and starts the full checking of a structure (individual checks can also be run from the check menu if desired). This option asks twice for the filename of the structure that needs to be checked, so it is entered twice.

    • Quit the program when the checks are finished. Answer the first question with “n”, and then stop the program with the command “fullstop y”:

      n
      fullstop y

    • The output is written to “pdbout.txt” in the directory where whatif was run.


Now repeat the same procedure for structure 1i1s.
When you have finished running the programs investigate the graphical output of PROCHECK and the “pdbout.txt” file created by WHAT IF. Tips:

  1. Start with the summary at the end of the “pdbout.txt” file to get a quick indication of what might be wrong. In the case of 1i1s you will find a rather remarkable value for one of the validation parameters of which we can say with certainty that the forcefield that was used is bad. Find out which parameter this is. Also use the PROCHECK output to verify this. Have a look at the structure in the structure viewer.

  2. In both cases WHAT IF will complain a lot (the checks are very critical). In the case of 1ka3, however, there is one property for which the program needs particularly many lines to list all the problems. Find out which one this is and compare the WHAT IF output to the PROCHECK output describing the same property. Have a look at the structure in the structure viewer. From the websites describing the program output, can you think of reasons for the differences in output? And can you think of what possibly caused this structure to go wrong in such a major way?

  3. Now that you have found the major problems in the structures, try to understand the rest of the output of PROCHECK and WHAT IF (if you have time left).

  4. Have a look at the residue specific Ramachandran plots displayed by PROCHECK. Investigate the green areas (=probability distributions: the greener, the more often a residue is found in this phi-psi combination in the reference data base) and look at the differences for different residues. Can you find residues that are more often found in helices than in sheets?
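
The phi-psi combinations shown in a Ramachandran plot are ordinary dihedral angles between backbone atoms. The sketch below shows how such an angle can be computed from four atom positions with the standard atan2 formula. It is an illustration with hypothetical function names, not the code used by PROCHECK or WHAT IF, and the 30 degree window in is_cis is an assumed cut-off.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p1, p2, p3, p4):
    """Dihedral angle p1-p2-p3-p4 in degrees, in the range -180 to 180."""
    b1, b2, b3 = _sub(p2, p1), _sub(p3, p2), _sub(p4, p3)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


def phi_psi(c_prev, n, ca, c, n_next):
    """phi = C(i-1)-N-CA-C and psi = N-CA-C-N(i+1) for residue i."""
    return dihedral(c_prev, n, ca, c), dihedral(n, ca, c, n_next)


def is_cis(ca, c, n_next, ca_next, tolerance=30.0):
    """True if omega (CA-C-N-CA) is within an assumed window around 0."""
    return abs(dihedral(ca, c, n_next, ca_next)) < tolerance


# Toy geometry: four points that form a right-angle dihedral
print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))
```

The same dihedral function applied to the atoms CA-C-N-CA gives the omega angle, so it can also be used to look for cis-peptides (omega near 0 degrees instead of near 180 degrees).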