Protein Data Bank

Knowing the DNA sequence is just the beginning of understanding an organism. The proteins generated by the DNA fold into secondary structures, which in turn fold into tertiary structures, which in turn can fold into quaternary structures. The folds can bring close together regions of the DNA are distant along the DNA sequence. These nearby regions can work in concert. Most of the information we currently have about the 3D structures of macromolecules (including proteins, peptides, viruses, carbohydrates, etc.) has been obtained via X-ray crystallography or nuclear magnetic resonance studies. Here are X-ray images taken from a study of the muscle fibers of a bumblebee:

The shapes of the proteins can also be determined computationally, by modelling the dynamics between the components of the molecules, but this is extremely computationally expensive. There is a massively distributed effort to calculate the foldings of proteins, called Folding at Home. You can download a program onto your home PC that will occasionally download a folding problem from a central repository. Once on your machine, the folding program runs in the background on your PC (you can tweak the program so that it will run only when your PC is idle). Whenever the problem is solved, the answer is shipped back to the repository. There are currently more than a million PCs taking part in this effort. (There is a similar massively distributed project to search for extraterrestrial signals in radio telescope readings called SETI at home.)

More than a hundred thousand protein structures have been stored in the Protein Data Bank. The format of the data stored there is updated occasionally, so a good Bioinformaticist will probably have to update his Perl scripts for accessing the data from time to time. The structure of each macromolecule (including proteins) is stored in a single file. These files are grouped into directories. (Tisdall's book supplies a small directory of this type.) So we need to know how Perl accesses files and directories.

Directories and Perl

The opendir function is analogous to the open function. It opens a directory of a given pathname and stores a corresponding file pointer in a given variable:

opendir(FOLDER, '/tmp');

(Note, again, that the identifiers for file pointers don't start with a "$".) Once we've got a pointer to the directory, we can read the contents of the directory via the readdir function, and then close the file pointer (like any conciencious programmer should always do!)

@files = readdir(FOLDER);
closedir(FOLDER);

Here's a program (example11-1.pl) for opening a directory and listing its contents. You'll notice "files" named "." and ".." (do you know what those are?). We want to get rid of those entries. We'll use the grep function to do that (borrowed from the Unix OS). The syntax for grep is basically the same as for the matching operator. In addition, the directory entries listed by the program may, in turn, be subdirectories. In that case, we'd like to list their contents. This program (example11-2.pl) does that. Note the use of the -f operator to identify which pathnames correspond to files and the -d operator to determine which are directories.

Of course this could go on ad infinitum, so we'll just attack the problem recursively in this program (example11-3.pl). You might use the Perl debugger to examine the tracebacks at the recursive invocations of the list_recurively function (use the 'b', 'c', and 'T' commands).

Because traversing directories is such a common use of Perl, most Perl's now come with a module for this purpose, File::Find. (If your Perl doesn't have this, you can download it from www.CPAN.org, a web site you should become familiar with!) The iteration mechanism is similar to the mapcar and mapcdr functions in Lisp. The find function takes a reference to a subroutine and the pathname of a directory. It then applies the subroutine to each of the pathnames encountered during the recursive traversal of the given directory. Within that subroutine, the identifier $File::Find::name is bound to the pathname of the current file/directory within the traversal. This program (example11-4.pl) does the same thing as our earlier programs, but see how much less code is required by using the File::Find module.

Parsing PDB Files

PDB files tend to be quite lengthy, because they document every atom in the molecule. Before we start looking into parsing these files, you should take a look at this example. It is possible to read enough information from these files to render graphical images of the molecules. The PDB web site contains documentatation of the syntax description of these files.

Each line of a PDB files consists of at most 80 characters. Each line starts with a pre-defined record name. Each record type has a different number of fields, and these are denoted by columnar position. Some of the records will refer to other biological databases, and to records within this same file. (See Table 11-1 in the book for some examples.)

The SEQRES record type provides the amino acid sequence for the structure. Here is the documentation for the syntax.

Here is a part of a sample SEQRES taken from the documentation:

         1         2         3         4         5         6         7
1234567890123456789012345678901234567890123456789012345678901234567890
SEQRES   1 A   21  GLY ILE VAL GLU GLN CYS CYS THR SER ILE CYS SER LEU
SEQRES   2 A   21  TYR GLN LEU GLU ASN TYR CYS ASN
SEQRES   1 B   30  PHE VAL ASN GLN HIS LEU CYS GLY SER HIS LEU VAL GLU

The top two lines of the sample are not actually part of a file, they are so that you can see the column numbers. The first column after the "SEQRES" specifies a line number within the specification of a particular chain. The letter in the next column is an identifier for the chain. The next column specifies the number of "residues" in the chain. Each residue is specified via a three-character sequence.

The program example11-5.pl parses the SEQRES data in a sample PDB file to extract the amino acid sequence therein. The program consists of three major subroutines, which we now discuss.

The parsePDBrecordtypes subroutine takes an array, consisting of all the lines of the PDB file, and returns a hash. Each record-type serves as a key to the hash, with its associate value being a single string equal to the concatenation of the the records of that type. (Note that these will appear sequentially in the file, by definition.) If we run the program in debugging mode and stop right after calling parsePDBrecordtypes, you can see that $recordtypes{'SEQRES'} returns the concatenation of the original lines in their entireity:

  DB<3> p $recordtypes{'SEQRES'}
SEQRES   1 A  136  SER GLY GLY LEU GLN VAL LYS ASN PHE ASP PHE THR VAL
SEQRES   2 A  136  GLY LYS PHE LEU THR VAL GLY GLY PHE ILE ASN ASN SER

The next step is to extract from the SEQRES records only the sequence data (the sets of three-character sequences.) The extractSEQRES subroutines does that, returning an array of strings, each of which is a "chain", consisting of the concatenation of all the sequences from the lines of that chain. The example PDB file has just a single chain (labelled "A"), so the array has just one element. Here is that element:

main::(example11-5.pl:19):          print "$chain\n";
  DB<2> p $chain
SER GLY GLY LEU GLN VAL LYS ASN PHE ASP PHE THR VAL GLY LYS PHE LEU THR VAL GLY
GLY PHE ILE ASN ASN SER PRO GLN ARG PHE SER VAL ASN VAL GLY GLU SER MET ASN SER
LEU SER LEU HIS LEU ASP HIS ARG PHE ASN TYR GLY ALA ASP GLN ASN THR ILE VAL MET
ASN SER THR LEU LYS GLY ASP ASN GLY TRP GLU THR GLU GLN ARG SER THR ASN PHE THR
LEU SER ALA GLY GLN TYR PHE GLU ILE THR LEU SER TYR ASP ILE ASN LYS PHE TYR ILE
ASP ILE LEU ASP GLY PRO ASN LEU GLU PHE PRO ASN ARG TYR SER LYS GLU PHE LEU PRO
PHE LEU SER LEU ALA GLY ASP ALA ARG LEU THR LEU VAL LYS LEU GLU

Note the use of the substr function to grab the fields of each line. Also note that we choose to form the concatenation by iterating through the line of the input array, rather than by using a single regular expression. The iteration is more efficient here because we are guaranteed that all the lines of a record type will be consecutive. Consider how the function would work if there was more than one chain defined in the file.

The last step is to convert these three-letter specifications of amino acids into the single-character representations we've been using in the course so far (taken from chapter 4, pp.30-1.) The function iub3to1 does this.

  DB<3> p iub3to1($chain)
SGGLQVKNFDFTVGKFLTVGGFINNSPQRFSVNVGESMNSLSLHLDHRFNYGADQNTIVMNSTLKGDNGWETEQRSTNFT
LSAGQYFEITLSYDINKFYIDILDGPNLEFPNRYSKEFLPFLSLAGDARLTLVKLE
  DB<4>

Atomic Spatial Coordinates

The ATOM record type specifies the spatial coordinates of the atoms that comprise the protein. (Actually, there are several record types that do this, but we'll just stick with ATOM. Some of the other record types, like ORGIGXn, SCALEn, MTRIXn and TVECT are related to computer graphics algorithms that are used to render an image of the molecule via various linear algebra transformation operations.) Here is the syntax specification for the ATOM record type. And here is a line from a sample file (again the first two lines are there only to identify the column numbers, they are not actually part of the file):

         1         2         3         4         5         6         7         8
12345678901234567890123456789012345678901234567890123456789012345678901234567890
ATOM    145  N   VAL A  25      32.433  16.336  57.540  1.00 11.92      A1   N

The example provides information about the atom with serial number 145, a Nitrogen atom, part of a Valine amino acid in chain A, sequence 25 of that chain. The next three floating point numbers (columns 33 to 54) are the x, y and z coordinates of the atom. The program example11-6.pl extracts these spatial coordinates for each atom. The design of the program is similar to that of the previous program, using parsePDBrecordtypes to form a hash of all the record types in the file. The parseATOM function returns a hash where each atom number hashes to the coordinates of that atom and the elemental name of the atom. Again, the program uses the substr function to extract individual fields.

We can use a much smaller code fragment to parse only the ATOM data of a PDB file:

while(<>) {
        /^ATOM/ or next;

        my($n, $x, $y, $z, $element)
           = ($_ =~ /^.{6}(.{5}).{19}(.{8})(.{8})(.{8}).{22}(..)/);

        # Get rid of excess whitespace
        $n      =~ s/^\s*//;
        $element =~ s/^\s*//;

        if (($n == 1) or ($n == 1078)) {
                printf "%8.3f%8.3f%8.3f %2s\n", $x, $y, $z, $element;
        }
}

There are a couple of things to note here. First, recall what the "{ }" terms in the regular expression do. Second, recall what the "( )" terms do. Third, the printf subroutine in the last line provides a means of generating formatted output. The syntax is borrowed from the C programming language. If you are unsure of the formatting syntax, you should check the perldoc.

Fourth, this is our first use of the "<>" operator to access the file contents. Suppose we have saved this code as the file, getAtoms.pl. There are a variety of ways to run the program using command line arguments. The first is to supply the name of the input file as a command line argument:

$ perl getAtoms.pl BeginningPerl/examples/pdb/c1/pdb1c1f.ent
   1.888  -8.251  -2.511  N
  -0.873   9.368  16.046  C

If no argument is provided, then the standard input stream, STDIN, is used. There are at least two ways to bind the contents of the input file to STDIN. The first uses a Unix-style pipeline operator. The second uses a Unix-style redirection operator:

$ cat BeginningPerl/examples/pdb/c1/pdb1c1f.ent | perl getAtoms.pl
   1.888  -8.251  -2.511  N
  -0.873   9.368  16.046  C


$ perl getAtoms.pl < BeginningPerl/examples/pdb/c1/pdb1c1f.ent
   1.888  -8.251  -2.511  N
  -0.873   9.368  16.046  C