BNRA course 2004
"From Data to Structures"
Program for the practical "Structure calculation and validation"
Chris Spronk and Aart Nederveen, May 2004
1. General remarks
The intention of the practical program of the 2004 BNRA course
"From data to structures" is to make you familiar with the
basics of NMR structure calculation and validation. Based on
our previous experiences in giving this practical course, we have chosen
to lead you through the coarse trajectory that one would follow when working
on a biological macromolecule (see below). This means that a lot of the details
of structure calculation will be skipped and some of the software will be
used as a "black box". Our aim is that from the basic knowledge that is
learned throughout the course, you will be able to set up your own structure
calculations, and independently expand your knowledge on this topic to higher
levels.
In practice, this means that instead of running many variations of
different types of structure calculation protocols, a simple "standard"
structure calculation protocol based on NOE-derived distance restraints
and J-coupling-derived dihedral angle restraints will be performed. The
calculated structures will then be analysed in detail using various structure
validation methods and visual inspection. If time allows, it is possible
to expand the practical to run structure calculations with other types of
restraints and/or structure calculation protocols.
2. Setup
In order to reflect the reality of an NMR structure determination to
some extent (and to allow for some variation in the final presentations),
we have chosen to take a project-based approach.
This means that each group will be assigned to calculate a set of structure
models for a given protein. From these structures you will select "good"
structures, investigate their global and local properties, identify possible
problems in the structure etcetera. Following the detailed analysis of the
structures, the focus shifts towards the biological relevance of the protein.
This means that you will identify the regions in the protein structure that are
of main biological interest, such as binding sites and active sites
(if present, of course), and use the available literature and structures
from the Protein Data Bank (PDB) to draw conclusions. You will present, compare
and discuss your findings in a final presentation at the end of the course.
For each project the following is provided:
- The deposited pdb-file
- NMR distance and dihedral angle restraints in XPLOR/CNS format
- The original paper describing the structure in pdf-format
The general goals for each project are:
- Calculate and refine a structure ensemble using CNS
- Analyse the structural properties and general quality of the
molecule
- Compare to the results described in the literature
- Prepare a final presentation of the results
Table 1: Projects
| Group | Project | Login | Computer |
| 1 | 1txa | course5 | bug28 |
| 2 | 1txa | course6 | bug29 |
| 3 | 1kdf | course7 | bug30 |
| 4 | 1kdf | course8 | bug31 |
| 5 | 1j0t | course9 | bug32 |
| 6 | 1j0t | course10 | bug33 |
| 7 | 1b3c | course11 | bug34 |
| 8 | 1b3c | course12 | bug35 |
| 9 | 1b3c | course13 | bug36 |
3. Software
3.1 Structure calculation
A large number of programs and protocols can be used to calculate macromolecular
structures from NMR data. In this practical, however, we will concentrate
on one particular program, CNS ("Crystallographic and NMR System"), from Axel
Brünger's laboratory. CNS is the successor of the widely used program X-PLOR.
More information on both programs can be found through the links at
http://atb.slac.stanford.edu/.
3.2 Structure validation
Various programs are available for structure validation.
In this course we will use the two most powerful and widely used packages,
PROCHECK and WHAT IF. PROCHECK provides clear, illustrative output; WHAT
IF is more powerful but less user-friendly. Throughout the course you will
be made familiar with the strengths of both programs.
PROCHECK - Single structure analysis
PROCHECK-NMR - Structure ensemble analysis
PROCHECK-COMP - Comparison of two structures
WHAT IF - Single structure and/or ensemble analysis (help pages)
3.3 Structure visualization
Throughout this course we plan to use Molmol as a viewer. Maybe you
are already familiar with this program. If not, take some time during this
course to read the tutorial on the web. Several visualization tools that
might be interesting for you are listed below.
PyMol - Powerful viewer (tutorial)
Molmol - Powerful viewer and analysis program, but not very intuitive
in its use (manual, tutorial)
Rasmol - A simple and easy-to-use viewer, nice for quick inspection but rather
limited in its capabilities (tutorial)
Yasara - Not used in this practical, but an extremely powerful
and user-friendly viewer and therefore worth mentioning (all you need
is a good graphics card...)
4. Literature
The following papers are good overviews and starting points for reading:
- P. Guentert. Structure calculation of biological macromolecules
from NMR data. Quarterly Reviews of Biophysics 31, 145-237 (1998).
- C.A.E.M. Spronk, S.B. Nabuurs, E. Krieger, G. Vriend & G.W. Vuister.
Validation of protein structures derived by NMR spectroscopy.
Progress in NMR Spectroscopy, in preparation.
- S.B. Nabuurs, C.A.E.M. Spronk, G. Vriend & G.W. Vuister. Concepts and
tools for NMR restraint analysis and validation.
Concepts in Magnetic Resonance, accepted.
5. Practical aspects
5.1 Time schedule
The coarse time-schedule for the practical is given in Table 2. Depending
on the speed of progress there is room for more exercises.
Table 2. Coarse schedule
| Date | Program |
| Tuesday, May 18 | Introduction to the practical; generation of molecular topology; generation of extended starting structure; simulated annealing structure calculation and solvent refinement (overnight); tutorial structure visualization |
| Wednesday, May 19 | Break: Introduction to structure validation; start of validation of "project" structures (late afternoon) |
| May 20 - May 23 | Spring break; literature study (not obligatory, but useful for yourself) |
| Monday, May 24 | Answers to questions so far... (1 hour); analysis of "project" structures (inspection of validation reports, visual inspection, structure classification, etcetera) |
| Tuesday, May 25 | Continuation of analysis of project structures |
| Wednesday, May 26 | Comparison of results to literature; optional: analysis of structures calculated using various types of data and protocols |
| Thursday, May 27 | Break: RDC practical by Blackledge |
| Friday, May 28 | Preparation for the presentation |
5.2 Directory setup, files and scripts
All the necessary data and files are provided in your home directory.
For group 1 the home directory is /home/course5, for group 2 /home/course6,
etc. Below, /home/course# should be read as /home/course5, /home/course6,
etc. The scripts that you will be running throughout the course will create
subdirectories within your home directory that contain all the output files.
An overview of the relevant scripts, files and subdirectories that are created
is given in Table 3.
Note that most scripts are freely available from the RECOORD database.
Table 3. Actions, output files and directories
| Action | Run command | Start in dir | Output |
| Generate molecular topology | generate.sh* | project/ | project_cns.mtf** |
| Generate extended starting structure | generate_extended.sh | project/ | project_cns_extended.pdb |
| Run simulated annealing | refineLongSep.sh | /home/course#/ | structures: str/ (project_cns_1.pdb, ..., project_cns_100.pdb); job files, CNS input and output files: cnsRef/ |
| Do refinement in water | re_h2o.sh | /home/course#/ | structures: str/wt/ (project_cns_w_1.pdb, ..., project_cns_w_25.pdb); job files, CNS input and output files: cnsWtRef/ |
| Do validation using WHATCHECK and PROCHECK | validPDB.sh | str/wt | validation results (summary files): v_project_cns_w/whatcheck (WHATCHECK_project_cns_w.SUM), v_project_cns_w/procheck (PROCHECK_project_cns_w.SUM) |
| Analysis of violations | calcViolOrg.sh | project/ | violations/ (viol_project_cns_w_0.3) |
| Summary of violation analysis | analysViol.sh | project/ | violations/ (viol_results) |
*For an explanation of a script and its options, type the
script name without any command line arguments and press enter.
**Directories are indicated in bold; project in italic
(as part of a filename) indicates the project ID that has been assigned
to you (see Table 1), e.g. project_cns_1.pdb = 1txa_cns_1.pdb for
groups 1 & 2, 1kdf_cns_1.pdb for groups 3 & 4, etcetera. Only files
relevant to the course are specified.
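The naming convention from the footnote can be spelled out in a few lines of Python. This is only a sketch for checking which file names to expect; the helper names `topology_file` and `ensemble_files` are made up for this illustration and are not part of the course scripts.

```python
def topology_file(project):
    """Name of the topology file for a project ID such as '1txa'."""
    return f"{project}_cns.mtf"


def ensemble_files(project, n_models=100, water_refined=False):
    """Expected model file names, following the pattern in Table 3.

    water_refined=True gives the '_cns_w_' names used after refinement in water.
    """
    infix = "_cns_w_" if water_refined else "_cns_"
    return [f"{project}{infix}{i}.pdb" for i in range(1, n_models + 1)]


if __name__ == "__main__":
    print(topology_file("1txa"))
    print(ensemble_files("1txa", n_models=3))
    print(ensemble_files("1kdf", n_models=3, water_refined=True))
```

For groups 1 and 2 the first call gives 1txa_cns.mtf, and the model lists start at 1txa_cns_1.pdb, in line with the example in the footnote.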
Finally, your home directory will be organized as follows (main directory,
followed by its subdirectory tree):
/home/course#/
  - project/
    - str/
      - wt/
        - v_project_cns_w/
          - procheck/
          - whatcheck/
    - cnsRef/
    - cnsWtRef/
    - violations/
  - ValidationPractical/
    - 1ka3/
      - procheck/
      - whatcheck/
    - 1i1s/
      - procheck/
      - whatcheck/
5.3 Further guidelines for running the scripts
Now you will start using the scripts mentioned in Table 3. All script
names are in bold-italic in this section. All calculations should be carried
out on the computer that is assigned to you in Table 1. To log in remotely to
these machines use 'ssh bugXX', where XX stands for the number of the computer
in Table 1 (e.g. group 1: 'ssh bug28').
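To look up your own project, home directory and ssh command quickly, Table 1 can be written down as data. The Python sketch below is only an illustration; the function name `login_info` is hypothetical and not part of the course scripts.

```python
# Table 1 transcribed as data: group -> (project, login, computer).
TABLE_1 = {
    1: ("1txa", "course5", "bug28"),
    2: ("1txa", "course6", "bug29"),
    3: ("1kdf", "course7", "bug30"),
    4: ("1kdf", "course8", "bug31"),
    5: ("1j0t", "course9", "bug32"),
    6: ("1j0t", "course10", "bug33"),
    7: ("1b3c", "course11", "bug34"),
    8: ("1b3c", "course12", "bug35"),
    9: ("1b3c", "course13", "bug36"),
}


def login_info(group):
    """Return project ID, home directory and ssh command for a group."""
    project, login, computer = TABLE_1[group]
    return {
        "project": project,
        "home": f"/home/{login}",
        "ssh": f"ssh {computer}",
    }


if __name__ == "__main__":
    for group in sorted(TABLE_1):
        info = login_info(group)
        print(group, info["project"], info["home"], info["ssh"])
```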
After a successful login you can start typing your commands:
- Generate the molecular topology (project_cns.mtf file)
from the primary sequence or pdb-file.
  - Use the script generate.sh.
  - The generation of the topology file may take 1-2 minutes. Check
  the files that are now created.
  - If applicable, check the following (look in the topology file
  and the original .pdb file):
    - What do you think is the difference between the .pdb file and
    the .mtf file?
    - What is the difference between atom names and atom types in the
    topology file?
    - Protonation state of the histidines.
    Check the protonation state of the HIS residues in both the original
    .pdb file and in the .mtf file. They should be identical.
    - Presence or absence of disulfide bridges.
    Check if you can find them in the original .pdb file. They will be
    mentioned in the header of the .mtf file as well.
    - Occurrences of cis-peptides.
    To find out if cis-peptides are present in the original .pdb
    file, you can load the .pdb file in molmol and then choose Calc -> Angles
    (select the omega angle ca-c-n-ca). If cis-peptides are present, they
    are incorporated in the .mtf file. Run 'grep CCIS' on your .mtf file
    to find out whether carbon atoms with type CCIS are present.
- Generate a starting structure in an extended conformation
  - Use the script generate_extended.sh. An extended structure will now
  be generated. Check this structure with molmol.
  - The starting structure is the input for the simulated annealing
  protocol, from which the ensemble of structures will be generated.
- Now start the structure calculation (simulated annealing part):
  - Run the script refineLongSep.sh. Note: this script should be run from
  your home directory, one level above the project directory. The argument
  should be the project name. For example:
  /home/course5> refineLongSep.sh 1txa
  - This script does the following for you:
    - A CNS parameter file is generated, called run.cns. In
    this file the protocol that we use is specified.
    - Job files are generated for your models and they are processed
    consecutively on the bug you are working on. This will take a couple of
    hours, so use another window to check what is going on in the project
    directory. By default 100 models are generated here; in a real structure
    determination this number will generally be higher.
  - Have a look in the directory cnsRef; here you can
  find the CNS input and output files together with the job files.
  - When a model is finished, it is written to the directory
  str. Check the header for violations and energy values. For checking
  the violations and the energies you can use the aliases 'sortener' and
  'sortviol' in the str directory.
  - The restraints that are used throughout the calculations are
  present in the .tbl files. There can be three different files: unambig.tbl
  (NOE distance restraints), hbonds.tbl and dihedrals.tbl. Which ones do you
  have? Have a look in those files and figure out how the experimental
  potentials (see the lecture by Bonvin) are built up from those numbers. Also
  calculate the NMR data density for your project: the number of restraints
  per residue.
  - The structure calculation will take a couple of hours; in the
  meantime, try to get familiar with the molmol program using the tutorial.
  You can also start reading about the structure in the paper.
- Optional: you can also redo this whole procedure by making a new directory
project_var in your home directory and leaving out the dihedral
restraints, decreasing the upper bounds by e.g. 10 %, or adding some errors to
the restraints files. Copy the restraints and the original .pdb file (now
project_var.pdb) into the directory project_var
and start again with the script generate.sh.
- When the structure calculation is finished (100 models should
be present in the directory str), you can go on to the next stage: refinement
of your models in water. For this we use the script
re_h2o.sh, which works similarly to the previous script. This
script does the following:
  - In the directory str a new directory wt is
  created, into which the 25 lowest-energy structures are copied. These
  structures will now be refined and will get names like project_cns_w_1.pdb,
  project_cns_w_2.pdb, etc. Use the alias 'sortener' in the directories
  str and str/wt to find out whether the lowest-energy structures are now
  the water-refined ones.
  - Check the directory cnsWtRef for the CNS input and
  output files and the job files.
- For the 25 models that are refined in water we will run whatcheck
and procheck. Go to the directory str/wt where the refined models
are stored. Now the script
validPDB.sh will do the work for you (first use it without an argument
to learn its use; only validate the water-refined models with '_w_' in their
filenames). This script waits until the water refinement stage is finished
and the refined models are available, so you can use it immediately after
starting the water refinement stage.
- For calculating the RMSD, superimpose the models in molmol on the
secondary structure elements specified in the paper. If the regions are
not specified in the paper, you can find them in the procheck output.
- Now you can use the scripts
calcViolOrg.sh and
analysViol.sh to check the distance restraint and dihedral angle
violations in your water-refined ensemble of 25 models.
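To make the link between the numbers in a restraint file and a violation concrete, here is a minimal Python sketch. It assumes the common X-PLOR/CNS layout of a distance restraint, `assign (selection 1) (selection 2) d dminus dplus`, in which the allowed range runs from d - dminus to d + dplus. It does not reproduce what calcViolOrg.sh or analysViol.sh actually do. The 0.3 cutoff is an assumption suggested by the output name viol_project_cns_w_0.3, and the restraints and distances in the demo are made up.

```python
def parse_distance_bounds(tbl_text):
    """Return (lower, upper) bounds for single-line 'assign' statements.

    Assumes: assign (selection 1) (selection 2) d dminus dplus
    Multi-line (ambiguous) restraints are not handled in this sketch.
    """
    bounds = []
    for line in tbl_text.splitlines():
        line = line.split("!", 1)[0]  # drop a trailing '!' comment
        if not line.strip().lower().startswith("assi") or ")" not in line:
            continue
        numbers = line.rsplit(")", 1)[-1].split()
        if len(numbers) < 3:
            continue
        d, dminus, dplus = (float(x) for x in numbers[:3])
        bounds.append((d - dminus, d + dplus))
    return bounds


def violations(distances, bounds, cutoff=0.3):
    """List (restraint index, violation) pairs for violations above cutoff."""
    found = []
    for index, (dist, (lower, upper)) in enumerate(zip(distances, bounds)):
        excess = max(lower - dist, dist - upper, 0.0)
        if excess > cutoff:
            found.append((index, excess))
    return found


if __name__ == "__main__":
    example = """
    assign (resid 3 and name HA) (resid 5 and name HN) 4.0 2.2 1.0 ! made up
    assign (resid 1 and name HN) (resid 2 and name HN) 3.0 1.2 0.5
    """
    bounds = parse_distance_bounds(example)
    measured = [5.2, 3.9]  # hypothetical distances measured in one model
    print(bounds)
    print(violations(measured, bounds))
```

Dividing `len(bounds)` by the number of residues gives the data density (restraints per residue) asked for earlier in this section.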
5.4 Guidelines for the presentation
Every group has to present some of their results that were obtained during
the course. The time for one presentation is 15 minutes, followed by questions
and discussion (~5 minutes). Since there is only limited time for the presentation,
try to present your results in a compact manner.
Your presentation should be subdivided into an introduction, a results section
and a discussion section. Below we present some aspects that you may want
to consider including in your analysis and presentation.
- Introduction of the project
- Description of the protein: e.g.: fold, function, active sites,
biological relevance, special features such as cis-peptides, disulphide
bridges, etc.
- Description of the NMR data: data density, data types
- Results
- Calculated ensemble of structures
- Characteristics and quality (e.g.):
- Secondary structure elements
- Discuss value of rms deviations, energies, violations
- Ramachandran plot
- Discussion and conclusion
- Discuss differences between procheck and whatcheck quality
indicators
- Identify problematic regions
- Comparison with the structure as described by the authors: discrepancies
and agreement
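Cis-peptides, listed above as a special feature worth reporting, are recognised by the omega dihedral angle (CA-C-N-CA): close to 0 degrees for cis, close to 180 degrees for trans. The Python sketch below shows how such a dihedral angle follows from four atom positions. The 30 degree tolerance in `is_cis` is an assumption chosen for illustration, and the coordinates in the demo are idealised points, not data from any of the projects.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Dihedral angle in degrees defined by four points (e.g. CA, C, N, CA)."""
    b1 = _sub(p1, p0)
    b2 = _sub(p2, p1)
    b3 = _sub(p3, p2)
    n1 = _cross(b1, b2)  # normal of the plane through p0, p1, p2
    n2 = _cross(b2, b3)  # normal of the plane through p1, p2, p3
    x = _dot(n1, n2)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    return math.degrees(math.atan2(y, x))


def is_cis(omega, tolerance=30.0):
    """True if omega is within the (assumed) tolerance of 0 degrees."""
    return abs(omega) < tolerance


if __name__ == "__main__":
    # Idealised planar arrangements: outer atoms on the same or opposite side.
    cis = dihedral((1, 0, 0), (0, 0, 0), (0, 1, 0), (1, 1, 0))
    trans = dihedral((1, 0, 0), (0, 0, 0), (0, 1, 0), (-1, 1, 0))
    print(cis, is_cis(cis))
    print(trans, is_cis(trans))
```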
In this introduction to structure validation the use of the programs
WHAT IF and PROCHECK will be demonstrated. Two rather recent NMR structures,
1i1s and 1ka3, obtained from the PDB, will be used to illustrate how to detect
major problems in protein structures (note that these are outliers, which makes
them very useful for illustration purposes). Instead of checking the complete
ensembles, only the first model of each entry will be investigated.
The structures that you will check are:
/home/course#/ValidationPractical/1ka3/1ka3_001.pdb
/home/course#/ValidationPractical/1i1s/1i1s_001.pdb
- PROCHECK:
  - In your home directory go to
  ValidationPractical/1ka3/procheck/
  - To run procheck on this structure type:
  procheck ../1ka3_001.pdb 2
  Explanation: Structure 1ka3_001.pdb
  is compared to X-ray structures of 2 Ångström resolution.
- WHAT IF:
  - In your home directory go to
  ValidationPractical/1ka3/whatcheck/
  - To start the program type:
  whatif
  - Set a parameter:
  setwif 593 100000
  Explanation: “setwif” is needed
  here to make the output of whatif very long (we need this for 1ka3). It
  sets the internal parameter 593 to 100000, which means that the output can
  be 100000 lines (not important to know for now; mentioned only for
  completeness).
  - Enter the WHAT IF checking menu:
  check fulchk
  ../1ka3_001.pdb ../1ka3_001.pdb
  Explanation: “check fulchk”
  is used to enter the structure checking menu in WHAT IF and starts the
  full checking of a structure (individual checks can also be run from the
  check menu if desired). This option asks twice for the filename of the
  structure that needs to be checked, so it is entered twice.
  - Quit the program when the checks are finished. Answer the first
  question with “n”, and then stop the program with “fullstop y”:
  n
  fullstop
  y
  - The output is written to “pdbout.txt” in the directory where
  whatif was run.
Now repeat the same procedure for structure 1i1s.
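Since the same steps are repeated for both structures, it can help to see them written out per entry. The Python sketch below only assembles the commands and WHAT IF entries quoted above as text; it does not run either program, and whether WHAT IF accepts these entries from a file or pipe is not something this text states. The function names are made up for this illustration.

```python
STRUCTURES = ["1ka3", "1i1s"]


def procheck_command(pdb_id, resolution=2):
    """Command line to type in ValidationPractical/<pdb_id>/procheck/."""
    return f"procheck ../{pdb_id}_001.pdb {resolution}"


def whatif_entries(pdb_id):
    """Entries typed, in order, after starting 'whatif' in the whatcheck/ directory."""
    pdb_file = f"../{pdb_id}_001.pdb"
    return [
        "setwif 593 100000",  # allow very long output
        "check fulchk",       # enter the check menu and start the full check
        pdb_file,             # the file name is asked for twice
        pdb_file,
        "n",                  # answer to the first question after the checks
        "fullstop",
        "y",
    ]


if __name__ == "__main__":
    for pdb_id in STRUCTURES:
        print(f"cd ValidationPractical/{pdb_id}/procheck/")
        print(procheck_command(pdb_id))
        print(f"cd ValidationPractical/{pdb_id}/whatcheck/")
        print("whatif")
        for entry in whatif_entries(pdb_id):
            print(entry)
```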
When you have finished running the programs investigate the graphical
output of PROCHECK and the “pdbout.txt” file created by WHAT IF. Tips:
- Start with the summary at the end of the “pdbout.txt” file to
get a quick indication of what might be wrong. In the case of 1i1s you will
find a rather remarkable value for one of the validation parameters, from which
we can say with certainty that the force field that was used is bad. Find
out which parameter this is. Also use the PROCHECK output to verify this.
Have a look at the structure in the structure viewer.
- In both cases WHAT IF will complain a lot (the checks are very
critical). In the case of 1ka3, however, there is one property for which the
program needs particularly many lines to list all the problems. Find
out which one this is and compare the WHAT IF output to the PROCHECK output
describing the same property. Have a look at the structure in the structure
viewer. From the websites describing the program output, can you think
of reasons for the differences in output? And can you think of what possibly
caused this structure to go wrong in such a major way?
- Now that you have found the major problems in the structures, try
to understand the rest of the output of PROCHECK and WHAT IF (if you have
time left).
- Have a look at the residue specific Ramachandran plots displayed
by PROCHECK. Investigate the green areas (=probability distributions:
the greener, the more often a residue is found in this phi-psi combination
in the reference data base) and look at the differences for different residues.
Can you find residues that are more often found in helices than in sheets?