Reading and Writing PDB and SMD Files: An Exercise with FortranFormat

19 January 2010, Comments Comments Off on Reading and Writing PDB and SMD Files: An Exercise with FortranFormat

I recently released an open source Java library for parsing input and formatting output using Fortran Format. This library is very valuable for chemoinformaticists as many chemical file formats are easily parsed and generated with Fortran Format, and some even require it! You can learn more about the library, FortranFormat, at its official webpage, so I will not go into the details. Briefly, FortranFormat allows Java programmers to easily input and output data to and from strict column width formats.

This story will help to answer the following questions:

  1. How do I read or write a PDB file?
  2. How do I read or write a SMD file?
  3. How do I use FortranFormat?

There are several great reasons for adding FortranFormat to your parsing library. It is free and released under the Berkeley Software Distribution (BSD) license. It significantly reduces the amount of code you will need to tediously write, as we will see. You may use the same FortranFormat object to both read and write data lines, so you can create it once statically and reduce overhead. And, maybe most importantly, it conveniently works with the Autoboxing/Unboxing feature incorporated into Java 1.5.

In this story I will demonstrate how to use FortranFormat to parse and generate two chemical file formats: (1) Protein Data Bank (PDB) and (2) Standard Molecular Data 4.2 (SMD). PDB format is very popular for 3-dimensional data, while SMD format is very similar to other standard connection tables, but is defined with Fortran Format. Please note that I did not test the following code, so it may need some minor tweaking.

To begin, I will define three data structures: Atom, Bond and Molecule. They are self-explanatory. I will keep the code light and avoid more robust practices (using private members, getters and setters, error handling, constructors, etc.) for simplicity.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
class Atom{
    public double x, y, z;
    public int charge;
    public String symbol;
}

class Bond{
    public Atom one, two;
    public int order;
}

class Molecule{
    public Vector<Atom> atoms;
    public Vector<Bond> bonds;
}

PDB

Let us start with PDB. The structure of a PDB file can be as simple or complex as you want, however, we will consider a basic example with just a header and an atom table. To learn more about the PDB format, please visit the RCSB Protein Data Bank co-managed by my alma mater, Rutgers. The entire PDB specification is quite large, at over 200 pages, with dozens of different record types, but we will be focusing on the HETATM record. Note that the ATOM record is specifically reserved for the standard amino acid and nucleotide monomers in the PDB monomer databse, which we will not be discussing here. Additionally, PDB files do not usually contain bonds, so we will be skipping the CONECT record as well.

The HETATM record format is as follows:

COLUMNS DATA TYPE FIELD DEFINITION
1 – 6 Record name “HETATM”
7 – 11 Integer serial Atom serial number.
13 – 16 Atom name Atom name.
17 Character altLoc Alternate location indicator.
18 – 20 Residue name resName Residue name.
22 Character chainID Chain identifier.
23 – 26 Integer resSeq Residue sequence number.
27 AChar iCode Code for insertion of residues.
31 – 38 Real(8.3) x Orthogonal coordinates for X.
39 – 46 Real(8.3) y Orthogonal coordinates for Y.
47 – 54 Real(8.3) z Orthogonal coordinates for Z.
55 – 60 Real(6.2) occupancy Occupancy.
61 – 66 Real(6.2) tempFactor Temperature factor.
77 – 78 LString(2) element Element symbol; right-justified.
79 – 80 LString(2) charge Charge on the atom.

The corresponding Fortran Format specification string is this:

“(A6,I5,1X,A4,A1,A3,1X,A1,I4,A1,3X,3F8.3,2F6.2,10X,2A2)”
DATA TYPE Record Name Integer Atom Character Residue Name Character Integer AChar 3xReal(8.3) 2xReal(6.2) 2xLString(2)
Fortran Edit Descriptor A6 I5 A4 A1 A3 A1 I4 A1 3F8.3 2F6.2 2A2

Note that there are unused columns in the PDB specification, so those columns are ignored by using the X Fortran Format edit descriptor, preceded by the number of columns to skip.

The following is all the code necessary to read and write PDB files!

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
// create the FortranFormat object (it is only created once and used many times)
static FortranFormat formatter;

static{
    //set up formatter and set to automatically add carriage returns to the end of lines when formatting
    try {
        formatter = new FortranFormat("(A6,I5,1X,A4,A1,A3,1X,A1,I4,A1,3X,3F8.3,2F6.2,10X,2A2)");
    } catch (ParseException e) {
        e.printStackTrace();
    }
    formatter.setAddReturn(true);
}

// function to read PDB files, returns a Vector of Atoms, requires a BufferedReader wrapping some input source (maybe a FileReader or an InputStreamReader)
Vector<Atom> parsePDB(BufferedReader br){
    // Vector of Atoms to be returned
    Vector<Atom> atoms = new Vector<Atom>();
    String line = null;
    // read lines from the BufferedReader
    while((line = br.readLine())!=null){
        // check if the line is an atom record
        if(line.startsWith("HETATM")){
            // use FortranFormat to parse the line, avoiding the tedious substrings and casts!
            Vector<Object> objects = formatter.parse(line);
            // create the Atom object from the objects in the Vector, they are in an order identical to the Fortran Format specification string
            Atom a = new Atom();
            // set up the Atom object
            // notice that Java 1.5 Autounboxing allows us to quickly assign the object values to the primitive values
            a.x = (Double)objects.get(8);
            a.y = (Double)objects.get(9);
            a.z = (Double)objects.get(10);
            a.symbol = (String)objects.get(13);
            // charge is a String with a "+/-" in front of it, so we will have to manually handle the token
            String charge = (String)objects.get(14);
            boolean isNegative = charge.startsWith("-");
            a.charge = (isNegative?-1:1)*Integer.parseInt(Character.toString(charge.charAt(1)));
        }
    }
    return atoms;
}

// writes a PDB file to a BufferedWriter describing the atoms in the Vector parameter
void writePDB(BufferedWriter bw, Vector<Atom> atoms){
    ...
    // write a header to the BufferedWriter
    ...
    // create the Vector of objects outside the loop to reduce overhead
    Vector<Object> objects = new Vector<Object>(15);
    for(Atom a : atoms){
        // clear the Vector of the previous iteration's objects
        objects.clear();
        // add the atom's data to the Vector in the same order that they appear in the Fortran Format specification string
        // notice that Java 1.5 Autoboxing allows us to quickly place primitive values into the array
        objects.add("HETATM");
        objects.add(atoms.indexOf(a) + 1);
        // this data is left-aligned in the PDB specification (Fortran Format is right-aligned; FortranFormat has an option to alter alignment globally for when the need arises)
        objects.add(a.symbol + (a.symbol.length()==1?"   ":"  "));
        objects.add("");
        objects.add("UNK");
        objects.add("");
        objects.add(0);
        objects.add(null);
        objects.add(a.x);
        objects.add(a.y);
        objects.add(a.z);
        // FortranFormat accepts both Floats and Doubles for Reals, however it only returns Doubles
        objects.add(0f);
        objects.add(0f);
        objects.add(a.symbol);
        // again, we must format the charge correctly first since it is not a simple Integer
        boolean isNegative = a.charge < 0;
        objects.add((isNegative ? "" : "+")+Integer.toString(a.charge));
        // append the return
        bw.append(formatter.format(objects));
    }
    ...
    // write other information to the BufferedWriter
    ...
}

And that’s it! You can now read and write PDB atom tables with a minimal amount of code all while avoiding the tedious parsing involved.

SMD

SMD is not as popular as PDB (or MDL formats for that matter), but is an interesting example as Java programmers cannot interpret or generate SMD files without FortranFormat. This is because parsing of SMD connection tables is dependent on Fortran Format specification strings embedded in them. To learn more about SMD files, please read the publication by Bebak, Buse, Donner, Hoever, Jacob, Klaus, Pesch, Roemelt, Schilling, Woost and Zirz.

Observe the following excerpt from this publication:

We will examine the two most important blocks from the specification, the CT (Connection Table) and CO (Coordinate) blocks. Notice that Fortran Format specification strings are an integral part of these blocks (“(A2,5I2)”, “(6I2)” and “(3I12)”). The blurbs to the right state that the Fortran Format strings are optional, but the file would be completely ambiguous without them as there is no specified default format. Since Java does not support Fortran Formatting (until now!), this filetype would be a nightmare to parse! Using FortranFormat, we simply find the Fortran Format specification string, create a FortranFormat object with that string, and let that object do all the work.

The following is all the code necessary to read and write SMD files! It is more complex than the code for PDB files.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
// function to read SMD files, returns a Molecule, requires a BufferedReader wrapping some input source (maybe a FileReader or an InputStreamReader)
Molecule parseSMD(BufferedReader br){
    // Vector of Atoms to be included in the Molecule object
    Vector<Atom> atoms;
    // Vector of Bonds to be included in the Molecule object
    Vector<Bond> bonds;
    String line = null;
    // read lines from the BufferedReader
    while((line = br.readLine())!=null){
        // check if the line is the beginning of the CT or CO blocks
        if(line.startsWith(">CT")){
            // observe the following line
            line = br.readLine();
            // tokenize the line to extract the number of atom records, number of bond records, and the corresponding Fortran Format specification strings
            StringTokenizer st = new StringTokenizer(line);
            int numberOfAtoms = Integer.parseInt(st.nextToken());
            int numberOfBonds = Integer.parseInt(st.nextToken());
            String atomSpecificationString = st.nextToken();
            String bondSpecificationString = st.nextToken();
            // set up Atom and Bond Vectors, specifying sizes to reduce overhead
            atoms = new Vector<Atom>(numberOfAtoms);
            bonds = new Vector<Bond>(numberOfBonds);
            // set up FortranFormat objects
            FortranFormat atomParser = new FortranFormat(atomSpecificationString);
            FortranFormat bondParser = new FortranFormat(atomSpecificationString);
            // create atoms
            for(int i = 0; i<numberOfAtoms; i++){
                Vector<Objects> objects = atomParser.parse(br.readLine());
                Atom a = new Atom();
                a.symbol = (String)objects.get(0);
                a.charge = (Integer)objects.get(3);
            }
            // create bonds
            for(int i = 0; i< numberOfBonds; i++){
                Vector<Objects> objects = bondParser.parse(br.readLine());
                Bond b = new Bond();
                b.one = atoms.get(((Integer)objects.get(0))-1);
                b.two = atoms.get(((Integer)objects.get(1))-1);
                b.order = (Integer)objects.get(2);
            }
        } else if(line.startsWith(">CO")){
            // observe the following line
            line = br.readLine();
            // tokenize the line to extract the coordinate exponent and the coordinate Fortran Format specification string
            StringTokenizer st = new StringTokenizer(line);
            int exponent = Integer.parseInt(st.nextToken());
            String coordinateSpecificationString = st.nextToken();
            // set up FortranFormat object
            FortranFormat coordinateParser = new FortranFormat(coordinateSpecificationString);
            // set atom coordinates
            for(int i = 0; i<numberOfAtoms; i++){
                Vector<Objects> objects = coordinateParser.parse(br.readLine());
                a.x = (Integer) v.get(0) / Math.pow(10, exponent);
                a.y = (Integer) v.get(1) / Math.pow(10, exponent);
                a.z = (Integer) v.get(2) / Math.pow(10, exponent);
            }
        }
    }
    Molecule m = new Molecule();
    m.atoms = atoms;
    m.bonds = bonds;
    return m;
}

// writes a SMD file to a BufferedWriter describing the Molecule parameter
void writeSMD(BufferedWriter bw, Molecule m){
    ...
    // write a header to the BufferedWriter
    ...
    // begin the CT block
    bw.append(">CT Example "+m.atoms.size()+m.bonds.size()+1+"\n");
    // we will use the same Fortran Format specification strings, but you can choose any describing the same data
    String atomSpecificationString = "(A2,5I2)";
    String bondSpecificationString = "(6I2)";
    // set up FortranFormat objects
    FortranFormat atomParser = new FortranFormat(atomSpecificationString);
    FortranFormat bondParser = new FortranFormat(atomSpecificationString);
    atomParser.setAddReturn(true);
    bondParser.setAddReturn(true);
    bw.append(m.atoms.size()+" "+m.bonds.size()+" "+ atomSpecificationString +" "+ bondSpecificationString+"\n");
    // create the Vector of objects outside the loop to reduce overhead (for atoms)
    Vector<Object> objects = new Vector<Object>(6);
    for(Atom a : m.atoms){
        // clear the Vector of the previous iteration's objects
        objects.clear();
        // add the atom's data to the Vector in the same order that they appear in the Fortran Format specification string
        objects.add(c.label);
        objects.add(0);
        objects.add(0);
        objects.add(c.charge);
        objects.add(0);
        objects.add(0);
        // append the return
        bw.append(formatter.format(objects));
    }
    for(Bond b : m.bonds){
        // clear the Vector of the previous iteration's objects
        objects.clear();
        // add the bond's data to the Vector in the same order that they appear in the Fortran Format specification string
        objects.add(atoms.indexOf(b.one)+1);
        objects.add(atoms.indexOf(b.two)+1);
        objects.add(b.order);
        objects.add(0);
        objects.add(0);
        objects.add(0);
        // append the return
        bw.append(formatter.format(objects));
    }
    // begin the CO block
    bw.append(">CO ANGSTROEMS "+m.atoms.size()+1+"\n");
    // we will use the same Fortran Format specification strings, but you can choose any describing the same data
    String coordinateSpecificationString = "(3I12)";
    // set up FortranFormat object
    FortranFormat coordinateParser = new FortranFormat(atomSpecificationString);
    coordinateParser.setAddReturn(true);
    bw.append(3+" "+ coordinateSpecificationString+"\n");
    // create the Vector of objects outside the loop to reduce overhead (for atoms)
    Vector<Object> objects = new Vector<Object>(3);
    for(Atom a : m.atoms){
        // clear the Vector of the previous iteration's objects
        objects.clear();
        // add the atom's data to the Vector in the same order that they appear in the Fortran Format specification string
        objects.add((int)Math.round(a.x * 1000));
        objects.add((int)Math.round(a.y * 1000));
        objects.add((int)Math.round(a.z * 1000));
        // append the return
        bw.append(formatter.format(objects));
    }
    ...
    // write other information to the BufferedWriter
    ...
}

So there it is, the code to read and write both PDB and SMD files using FortranFormat. I hope this library is very useful to many, and please let us know if it helps you.

As an aside, ChemDoodle v1.5 is still on track and the new features are incredible. Several new chemical file types have been introduced, including PDB and SMD files. All customers will receive this update for free! We know you will enjoy it.

FortranFormat is an open source Java library for parsing input and formatting output using Fortran Format specification. FortranFormat is written by Kevin Theisen and adheres to the coding standards and style of Java 1.5, complete with JavaDoc commenting. FortranFormat is very useful for those wishing to port Fortran code to Java, especially in academic environments, and also for parsing and generating files that conform to strict column widths.

To learn more about FortranFormat or obtain a copy, please visit the official FortranFormat website.