iChemLabs | News > Reading and Writing PDB and SMD Files: An Exercise with FortranFormat

I recently released an open source Java library for parsing input and formatting output using Fortran Format. This library is very valuable for chemoinformaticists as many chemical file formats are easily parsed and generated with Fortran Format, and some even require it! You can learn more about the library, FortranFormat, at its official webpage, so I will not go into the details. Briefly, FortranFormat allows Java programmers to easily input and output data to and from strict column width formats.

This story will help to answer the following questions:

How do I read or write a PDB file?
How do I read or write a SMD file?
How do I use FortranFormat?

There are several great reasons for adding FortranFormat to your parsing library. It is free and released under the Berkeley Software Distribution (BSD) license. It significantly reduces the amount of code you will need to tediously write, as we will see. You may use the same FortranFormat object to both read and write data lines, so you can create it once statically and reduce overhead. And, maybe most importantly, it conveniently works with the Autoboxing/Unboxing feature incorporated into Java 1.5.

In this story I will demonstrate how to use FortranFormat to parse and generate two chemical file formats: (1) Protein Data Bank (PDB) and (2) Standard Molecular Data 4.2 (SMD). PDB format is very popular for 3-dimensional data, while SMD format is very similar to other standard connection tables, but is defined with Fortran Format. Please note that I did not test the following code, so it may need some minor tweaking.

To begin, I will define three data structures: Atom, Bond and Molecule. They are self-explanatory. I will keep the code light and avoid more robust practices (using private members, getters and setters, error handling, constructors, etc.) for simplicity.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15

class Atom{
public double x, y, z;
public int charge;
public String symbol;
}

class Bond{
public Atom one, two;
public int order;
}

class Molecule{
public Vector<Atom> atoms;
public Vector<Bond> bonds;
}

PDB

Let us start with PDB. The structure of a PDB file can be as simple or complex as you want, however, we will consider a basic example with just a header and an atom table. To learn more about the PDB format, please visit the RCSB Protein Data Bank co-managed by my alma mater, Rutgers. The entire PDB specification is quite large, at over 200 pages, with dozens of different record types, but we will be focusing on the HETATM record. Note that the ATOM record is specifically reserved for the standard amino acid and nucleotide monomers in the PDB monomer databse, which we will not be discussing here. Additionally, PDB files do not usually contain bonds, so we will be skipping the CONECT record as well.

The HETATM record format is as follows:

COLUMNS	DATA TYPE	FIELD	DEFINITION
1 – 6	Record name	“HETATM”
7 – 11	Integer	serial	Atom serial number.
13 – 16	Atom	name	Atom name.
17	Character	altLoc	Alternate location indicator.
18 – 20	Residue name	resName	Residue name.
22	Character	chainID	Chain identifier.
23 – 26	Integer	resSeq	Residue sequence number.
27	AChar	iCode	Code for insertion of residues.
31 – 38	Real(8.3)	x	Orthogonal coordinates for X.
39 – 46	Real(8.3)	y	Orthogonal coordinates for Y.
47 – 54	Real(8.3)	z	Orthogonal coordinates for Z.
55 – 60	Real(6.2)	occupancy	Occupancy.
61 – 66	Real(6.2)	tempFactor	Temperature factor.
77 – 78	LString(2)	element	Element symbol; right-justified.
79 – 80	LString(2)	charge	Charge on the atom.

The corresponding Fortran Format specification string is this:

“(A6,I5,1X,A4,A1,A3,1X,A1,I4,A1,3X,3F8.3,2F6.2,10X,2A2)”

DATA TYPE	Record Name	Integer	Atom	Character	Residue Name	Character	Integer	AChar	3xReal(8.3)	2xReal(6.2)	2xLString(2)
Fortran Edit Descriptor	A6	I5	A4	A1	A3	A1	I4	A1	3F8.3	2F6.2	2A2

Note that there are unused columns in the PDB specification, so those columns are ignored by using the X Fortran Format edit descriptor, preceded by the number of columns to skip.

The following is all the code necessary to read and write PDB files!

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79

// create the FortranFormat object (it is only created once and used many times)
static FortranFormat formatter;

static{
//set up formatter and set to automatically add carriage returns to the end of lines when formatting
try {
formatter = new FortranFormat("(A6,I5,1X,A4,A1,A3,1X,A1,I4,A1,3X,3F8.3,2F6.2,10X,2A2)");
} catch (ParseException e) {
e.printStackTrace();
}
formatter.setAddReturn(true);
}

// function to read PDB files, returns a Vector of Atoms, requires a BufferedReader wrapping some input source (maybe a FileReader or an InputStreamReader)
Vector<Atom> parsePDB(BufferedReader br){
// Vector of Atoms to be returned
Vector<Atom> atoms = new Vector<Atom>();
String line = null;
// read lines from the BufferedReader
while((line = br.readLine())!=null){
// check if the line is an atom record
if(line.startsWith("HETATM")){
// use FortranFormat to parse the line, avoiding the tedious substrings and casts!
Vector<Object> objects = formatter.parse(line);
// create the Atom object from the objects in the Vector, they are in an order identical to the Fortran Format specification string
Atom a = new Atom();
// set up the Atom object
// notice that Java 1.5 Autounboxing allows us to quickly assign the object values to the primitive values
a.x = (Double)objects.get(8);
a.y = (Double)objects.get(9);
a.z = (Double)objects.get(10);
a.symbol = (String)objects.get(13);
// charge is a String with a "+/-" in front of it, so we will have to manually handle the token
String charge = (String)objects.get(14);
boolean isNegative = charge.startsWith("-");
a.charge = (isNegative?-1:1)*Integer.parseInt(Character.toString(charge.charAt(1)));
}
}
return atoms;
}

// writes a PDB file to a BufferedWriter describing the atoms in the Vector parameter
void writePDB(BufferedWriter bw, Vector<Atom> atoms){
...
// write a header to the BufferedWriter
...
// create the Vector of objects outside the loop to reduce overhead
Vector<Object> objects = new Vector<Object>(15);
for(Atom a : atoms){
// clear the Vector of the previous iteration's objects
objects.clear();
// add the atom's data to the Vector in the same order that they appear in the Fortran Format specification string
// notice that Java 1.5 Autoboxing allows us to quickly place primitive values into the array
objects.add("HETATM");
objects.add(atoms.indexOf(a) + 1);
// this data is left-aligned in the PDB specification (Fortran Format is right-aligned; FortranFormat has an option to alter alignment globally for when the need arises)
objects.add(a.symbol + (a.symbol.length()==1?" ":" "));
objects.add("");
objects.add("UNK");
objects.add("");
objects.add(0);
objects.add(null);
objects.add(a.x);
objects.add(a.y);
objects.add(a.z);
// FortranFormat accepts both Floats and Doubles for Reals, however it only returns Doubles
objects.add(0f);
objects.add(0f);
objects.add(a.symbol);
// again, we must format the charge correctly first since it is not a simple Integer
boolean isNegative = a.charge < 0;
objects.add((isNegative ? "" : "+")+Integer.toString(a.charge));
// append the return
bw.append(formatter.format(objects));
}
...
// write other information to the BufferedWriter
...
}

And that’s it! You can now read and write PDB atom tables with a minimal amount of code all while avoiding the tedious parsing involved.

SMD

SMD is not as popular as PDB (or MDL formats for that matter), but is an interesting example as Java programmers cannot interpret or generate SMD files without FortranFormat. This is because parsing of SMD connection tables is dependent on Fortran Format specification strings embedded in them. To learn more about SMD files, please read the publication by Bebak, Buse, Donner, Hoever, Jacob, Klaus, Pesch, Roemelt, Schilling, Woost and Zirz.

Observe the following excerpt from this publication:

We will examine the two most important blocks from the specification, the CT (Connection Table) and CO (Coordinate) blocks. Notice that Fortran Format specification strings are an integral part of these blocks (“(A2,5I2)”, “(6I2)” and “(3I12)”). The blurbs to the right state that the Fortran Format strings are optional, but the file would be completely ambiguous without them as there is no specified default format. Since Java does not support Fortran Formatting (until now!), this filetype would be a nightmare to parse! Using FortranFormat, we simply find the Fortran Format specification string, create a FortranFormat object with that string, and let that object do all the work.

The following is all the code necessary to read and write SMD files! It is more complex than the code for PDB files.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132

// function to read SMD files, returns a Molecule, requires a BufferedReader wrapping some input source (maybe a FileReader or an InputStreamReader)
Molecule parseSMD(BufferedReader br){
// Vector of Atoms to be included in the Molecule object
Vector<Atom> atoms;
// Vector of Bonds to be included in the Molecule object
Vector<Bond> bonds;
String line = null;
// read lines from the BufferedReader
while((line = br.readLine())!=null){
// check if the line is the beginning of the CT or CO blocks
if(line.startsWith(">CT")){
// observe the following line
line = br.readLine();
// tokenize the line to extract the number of atom records, number of bond records, and the corresponding Fortran Format specification strings
StringTokenizer st = new StringTokenizer(line);
int numberOfAtoms = Integer.parseInt(st.nextToken());
int numberOfBonds = Integer.parseInt(st.nextToken());
String atomSpecificationString = st.nextToken();
String bondSpecificationString = st.nextToken();
// set up Atom and Bond Vectors, specifying sizes to reduce overhead
atoms = new Vector<Atom>(numberOfAtoms);
bonds = new Vector<Bond>(numberOfBonds);
// set up FortranFormat objects
FortranFormat atomParser = new FortranFormat(atomSpecificationString);
FortranFormat bondParser = new FortranFormat(atomSpecificationString);
// create atoms
for(int i = 0; i<numberOfAtoms; i++){
Vector<Objects> objects = atomParser.parse(br.readLine());
Atom a = new Atom();
a.symbol = (String)objects.get(0);
a.charge = (Integer)objects.get(3);
}
// create bonds
for(int i = 0; i< numberOfBonds; i++){
Vector<Objects> objects = bondParser.parse(br.readLine());
Bond b = new Bond();
b.one = atoms.get(((Integer)objects.get(0))-1);
b.two = atoms.get(((Integer)objects.get(1))-1);
b.order = (Integer)objects.get(2);
}
} else if(line.startsWith(">CO")){
// observe the following line
line = br.readLine();
// tokenize the line to extract the coordinate exponent and the coordinate Fortran Format specification string
StringTokenizer st = new StringTokenizer(line);
int exponent = Integer.parseInt(st.nextToken());
String coordinateSpecificationString = st.nextToken();
// set up FortranFormat object
FortranFormat coordinateParser = new FortranFormat(coordinateSpecificationString);
// set atom coordinates
for(int i = 0; i<numberOfAtoms; i++){
Vector<Objects> objects = coordinateParser.parse(br.readLine());
a.x = (Integer) v.get(0) / Math.pow(10, exponent);
a.y = (Integer) v.get(1) / Math.pow(10, exponent);
a.z = (Integer) v.get(2) / Math.pow(10, exponent);
}
}
}
Molecule m = new Molecule();
m.atoms = atoms;
m.bonds = bonds;
return m;
}

// writes a SMD file to a BufferedWriter describing the Molecule parameter
void writeSMD(BufferedWriter bw, Molecule m){
...
// write a header to the BufferedWriter
...
// begin the CT block
bw.append(">CT Example "+m.atoms.size()+m.bonds.size()+1+"\n");
// we will use the same Fortran Format specification strings, but you can choose any describing the same data
String atomSpecificationString = "(A2,5I2)";
String bondSpecificationString = "(6I2)";
// set up FortranFormat objects
FortranFormat atomParser = new FortranFormat(atomSpecificationString);
FortranFormat bondParser = new FortranFormat(atomSpecificationString);
atomParser.setAddReturn(true);
bondParser.setAddReturn(true);
bw.append(m.atoms.size()+" "+m.bonds.size()+" "+ atomSpecificationString +" "+ bondSpecificationString+"\n");
// create the Vector of objects outside the loop to reduce overhead (for atoms)
Vector<Object> objects = new Vector<Object>(6);
for(Atom a : m.atoms){
// clear the Vector of the previous iteration's objects
objects.clear();
// add the atom's data to the Vector in the same order that they appear in the Fortran Format specification string
objects.add(c.label);
objects.add(0);
objects.add(0);
objects.add(c.charge);
objects.add(0);
objects.add(0);
// append the return
bw.append(formatter.format(objects));
}
for(Bond b : m.bonds){
// clear the Vector of the previous iteration's objects
objects.clear();
// add the bond's data to the Vector in the same order that they appear in the Fortran Format specification string
objects.add(atoms.indexOf(b.one)+1);
objects.add(atoms.indexOf(b.two)+1);
objects.add(b.order);
objects.add(0);
objects.add(0);
objects.add(0);
// append the return
bw.append(formatter.format(objects));
}
// begin the CO block
bw.append(">CO ANGSTROEMS "+m.atoms.size()+1+"\n");
// we will use the same Fortran Format specification strings, but you can choose any describing the same data
String coordinateSpecificationString = "(3I12)";
// set up FortranFormat object
FortranFormat coordinateParser = new FortranFormat(atomSpecificationString);
coordinateParser.setAddReturn(true);
bw.append(3+" "+ coordinateSpecificationString+"\n");
// create the Vector of objects outside the loop to reduce overhead (for atoms)
Vector<Object> objects = new Vector<Object>(3);
for(Atom a : m.atoms){
// clear the Vector of the previous iteration's objects
objects.clear();
// add the atom's data to the Vector in the same order that they appear in the Fortran Format specification string
objects.add((int)Math.round(a.x * 1000));
objects.add((int)Math.round(a.y * 1000));
objects.add((int)Math.round(a.z * 1000));
// append the return
bw.append(formatter.format(objects));
}
...
// write other information to the BufferedWriter
...
}

So there it is, the code to read and write both PDB and SMD files using FortranFormat. I hope this library is very useful to many, and please let us know if it helps you.

As an aside, ChemDoodle v1.5 is still on track and the new features are incredible. Several new chemical file types have been introduced, including PDB and SMD files. All customers will receive this update for free! We know you will enjoy it.

FortranFormat is an open source Java library for parsing input and formatting output using Fortran Format specification. FortranFormat is written by Kevin Theisen and adheres to the coding standards and style of Java 1.5, complete with JavaDoc commenting. FortranFormat is very useful for those wishing to port Fortran code to Java, especially in academic environments, and also for parsing and generating files that conform to strict column widths.

To learn more about FortranFormat or obtain a copy, please visit the official FortranFormat website.

ChemDoodle 2D

ChemDoodle 3D

ChemDoodle Mobile

ChemDoodle Web Components

ChemStack

Reading and Writing PDB and SMD Files: An Exercise with FortranFormat

PDB

SMD

End Users