I recently released an open source Java library for parsing input and formatting output using Fortran Format. This library is very valuable for chemoinformaticists as many chemical file formats are easily parsed and generated with Fortran Format, and some even require it! You can learn more about the library, FortranFormat, at its official webpage, so I will not go into the details. Briefly, FortranFormat allows Java programmers to easily input and output data to and from strict column width formats.
This story will help to answer the following questions:
- How do I read or write a PDB file?
- How do I read or write a SMD file?
- How do I use FortranFormat?
There are several great reasons for adding FortranFormat to your parsing library. It is free and released under the Berkeley Software Distribution (BSD) license. It significantly reduces the amount of code you will need to tediously write, as we will see. You may use the same FortranFormat object to both read and write data lines, so you can create it once statically and reduce overhead. And, maybe most importantly, it conveniently works with the Autoboxing/Unboxing feature incorporated into Java 1.5.
In this story I will demonstrate how to use FortranFormat to parse and generate two chemical file formats: (1) Protein Data Bank (PDB) and (2) Standard Molecular Data 4.2 (SMD). PDB format is very popular for 3-dimensional data, while SMD format is very similar to other standard connection tables, but is defined with Fortran Format. Please note that I did not test the following code, so it may need some minor tweaking.
To begin, I will define three data structures: Atom, Bond and Molecule. They are self-explanatory. I will keep the code light and avoid more robust practices (using private members, getters and setters, error handling, constructors, etc.) for simplicity.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 | class Atom{ public double x, y, z; public int charge; public String symbol; } class Bond{ public Atom one, two; public int order; } class Molecule{ public Vector<Atom> atoms; public Vector<Bond> bonds; } |
PDB
Let us start with PDB. The structure of a PDB file can be as simple or complex as you want, however, we will consider a basic example with just a header and an atom table. To learn more about the PDB format, please visit the RCSB Protein Data Bank co-managed by my alma mater, Rutgers. The entire PDB specification is quite large, at over 200 pages, with dozens of different record types, but we will be focusing on the HETATM record. Note that the ATOM record is specifically reserved for the standard amino acid and nucleotide monomers in the PDB monomer databse, which we will not be discussing here. Additionally, PDB files do not usually contain bonds, so we will be skipping the CONECT record as well.
The HETATM record format is as follows:
COLUMNS | DATA TYPE | FIELD | DEFINITION | |
1 – 6 | Record name | “HETATM” | ||
7 – 11 | Integer | serial | Atom serial number. | |
13 – 16 | Atom | name | Atom name. | |
17 | Character | altLoc | Alternate location indicator. | |
18 – 20 | Residue name | resName | Residue name. | |
22 | Character | chainID | Chain identifier. | |
23 – 26 | Integer | resSeq | Residue sequence number. | |
27 | AChar | iCode | Code for insertion of residues. | |
31 – 38 | Real(8.3) | x | Orthogonal coordinates for X. | |
39 – 46 | Real(8.3) | y | Orthogonal coordinates for Y. | |
47 – 54 | Real(8.3) | z | Orthogonal coordinates for Z. | |
55 – 60 | Real(6.2) | occupancy | Occupancy. | |
61 – 66 | Real(6.2) | tempFactor | Temperature factor. | |
77 – 78 | LString(2) | element | Element symbol; right-justified. | |
79 – 80 | LString(2) | charge | Charge on the atom. |
The corresponding Fortran Format specification string is this:
DATA TYPE | Record Name | Integer | Atom | Character | Residue Name | Character | Integer | AChar | 3xReal(8.3) | 2xReal(6.2) | 2xLString(2) |
Fortran Edit Descriptor | A6 | I5 | A4 | A1 | A3 | A1 | I4 | A1 | 3F8.3 | 2F6.2 | 2A2 |
Note that there are unused columns in the PDB specification, so those columns are ignored by using the X Fortran Format edit descriptor, preceded by the number of columns to skip.
The following is all the code necessary to read and write PDB files!
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 | // create the FortranFormat object (it is only created once and used many times) static FortranFormat formatter; static{ //set up formatter and set to automatically add carriage returns to the end of lines when formatting try { formatter = new FortranFormat("(A6,I5,1X,A4,A1,A3,1X,A1,I4,A1,3X,3F8.3,2F6.2,10X,2A2)"); } catch (ParseException e) { e.printStackTrace(); } formatter.setAddReturn(true); } // function to read PDB files, returns a Vector of Atoms, requires a BufferedReader wrapping some input source (maybe a FileReader or an InputStreamReader) Vector<Atom> parsePDB(BufferedReader br){ // Vector of Atoms to be returned Vector<Atom> atoms = new Vector<Atom>(); String line = null; // read lines from the BufferedReader while((line = br.readLine())!=null){ // check if the line is an atom record if(line.startsWith("HETATM")){ // use FortranFormat to parse the line, avoiding the tedious substrings and casts! Vector<Object> objects = formatter.parse(line); // create the Atom object from the objects in the Vector, they are in an order identical to the Fortran Format specification string Atom a = new Atom(); // set up the Atom object // notice that Java 1.5 Autounboxing allows us to quickly assign the object values to the primitive values a.x = (Double)objects.get(8); a.y = (Double)objects.get(9); a.z = (Double)objects.get(10); a.symbol = (String)objects.get(13); // charge is a String with a "+/-" in front of it, so we will have to manually handle the token String charge = (String)objects.get(14); boolean isNegative = charge.startsWith("-"); a.charge = (isNegative?-1:1)*Integer.parseInt(Character.toString(charge.charAt(1))); } } return atoms; } // writes a PDB file to a BufferedWriter describing the atoms in the Vector parameter void writePDB(BufferedWriter bw, Vector<Atom> atoms){ ... // write a header to the BufferedWriter ... // create the Vector of objects outside the loop to reduce overhead Vector<Object> objects = new Vector<Object>(15); for(Atom a : atoms){ // clear the Vector of the previous iteration's objects objects.clear(); // add the atom's data to the Vector in the same order that they appear in the Fortran Format specification string // notice that Java 1.5 Autoboxing allows us to quickly place primitive values into the array objects.add("HETATM"); objects.add(atoms.indexOf(a) + 1); // this data is left-aligned in the PDB specification (Fortran Format is right-aligned; FortranFormat has an option to alter alignment globally for when the need arises) objects.add(a.symbol + (a.symbol.length()==1?" ":" ")); objects.add(""); objects.add("UNK"); objects.add(""); objects.add(0); objects.add(null); objects.add(a.x); objects.add(a.y); objects.add(a.z); // FortranFormat accepts both Floats and Doubles for Reals, however it only returns Doubles objects.add(0f); objects.add(0f); objects.add(a.symbol); // again, we must format the charge correctly first since it is not a simple Integer boolean isNegative = a.charge < 0; objects.add((isNegative ? "" : "+")+Integer.toString(a.charge)); // append the return bw.append(formatter.format(objects)); } ... // write other information to the BufferedWriter ... } |
And that’s it! You can now read and write PDB atom tables with a minimal amount of code all while avoiding the tedious parsing involved.
SMD
SMD is not as popular as PDB (or MDL formats for that matter), but is an interesting example as Java programmers cannot interpret or generate SMD files without FortranFormat. This is because parsing of SMD connection tables is dependent on Fortran Format specification strings embedded in them. To learn more about SMD files, please read the publication by Bebak, Buse, Donner, Hoever, Jacob, Klaus, Pesch, Roemelt, Schilling, Woost and Zirz.
Observe the following excerpt from this publication:
We will examine the two most important blocks from the specification, the CT (Connection Table) and CO (Coordinate) blocks. Notice that Fortran Format specification strings are an integral part of these blocks (“(A2,5I2)”, “(6I2)” and “(3I12)”). The blurbs to the right state that the Fortran Format strings are optional, but the file would be completely ambiguous without them as there is no specified default format. Since Java does not support Fortran Formatting (until now!), this filetype would be a nightmare to parse! Using FortranFormat, we simply find the Fortran Format specification string, create a FortranFormat object with that string, and let that object do all the work.
The following is all the code necessary to read and write SMD files! It is more complex than the code for PDB files.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 | // function to read SMD files, returns a Molecule, requires a BufferedReader wrapping some input source (maybe a FileReader or an InputStreamReader) Molecule parseSMD(BufferedReader br){ // Vector of Atoms to be included in the Molecule object Vector<Atom> atoms; // Vector of Bonds to be included in the Molecule object Vector<Bond> bonds; String line = null; // read lines from the BufferedReader while((line = br.readLine())!=null){ // check if the line is the beginning of the CT or CO blocks if(line.startsWith(">CT")){ // observe the following line line = br.readLine(); // tokenize the line to extract the number of atom records, number of bond records, and the corresponding Fortran Format specification strings StringTokenizer st = new StringTokenizer(line); int numberOfAtoms = Integer.parseInt(st.nextToken()); int numberOfBonds = Integer.parseInt(st.nextToken()); String atomSpecificationString = st.nextToken(); String bondSpecificationString = st.nextToken(); // set up Atom and Bond Vectors, specifying sizes to reduce overhead atoms = new Vector<Atom>(numberOfAtoms); bonds = new Vector<Bond>(numberOfBonds); // set up FortranFormat objects FortranFormat atomParser = new FortranFormat(atomSpecificationString); FortranFormat bondParser = new FortranFormat(atomSpecificationString); // create atoms for(int i = 0; i<numberOfAtoms; i++){ Vector<Objects> objects = atomParser.parse(br.readLine()); Atom a = new Atom(); a.symbol = (String)objects.get(0); a.charge = (Integer)objects.get(3); } // create bonds for(int i = 0; i< numberOfBonds; i++){ Vector<Objects> objects = bondParser.parse(br.readLine()); Bond b = new Bond(); b.one = atoms.get(((Integer)objects.get(0))-1); b.two = atoms.get(((Integer)objects.get(1))-1); b.order = (Integer)objects.get(2); } } else if(line.startsWith(">CO")){ // observe the following line line = br.readLine(); // tokenize the line to extract the coordinate exponent and the coordinate Fortran Format specification string StringTokenizer st = new StringTokenizer(line); int exponent = Integer.parseInt(st.nextToken()); String coordinateSpecificationString = st.nextToken(); // set up FortranFormat object FortranFormat coordinateParser = new FortranFormat(coordinateSpecificationString); // set atom coordinates for(int i = 0; i<numberOfAtoms; i++){ Vector<Objects> objects = coordinateParser.parse(br.readLine()); a.x = (Integer) v.get(0) / Math.pow(10, exponent); a.y = (Integer) v.get(1) / Math.pow(10, exponent); a.z = (Integer) v.get(2) / Math.pow(10, exponent); } } } Molecule m = new Molecule(); m.atoms = atoms; m.bonds = bonds; return m; } // writes a SMD file to a BufferedWriter describing the Molecule parameter void writeSMD(BufferedWriter bw, Molecule m){ ... // write a header to the BufferedWriter ... // begin the CT block bw.append(">CT Example "+m.atoms.size()+m.bonds.size()+1+"\n"); // we will use the same Fortran Format specification strings, but you can choose any describing the same data String atomSpecificationString = "(A2,5I2)"; String bondSpecificationString = "(6I2)"; // set up FortranFormat objects FortranFormat atomParser = new FortranFormat(atomSpecificationString); FortranFormat bondParser = new FortranFormat(atomSpecificationString); atomParser.setAddReturn(true); bondParser.setAddReturn(true); bw.append(m.atoms.size()+" "+m.bonds.size()+" "+ atomSpecificationString +" "+ bondSpecificationString+"\n"); // create the Vector of objects outside the loop to reduce overhead (for atoms) Vector<Object> objects = new Vector<Object>(6); for(Atom a : m.atoms){ // clear the Vector of the previous iteration's objects objects.clear(); // add the atom's data to the Vector in the same order that they appear in the Fortran Format specification string objects.add(c.label); objects.add(0); objects.add(0); objects.add(c.charge); objects.add(0); objects.add(0); // append the return bw.append(formatter.format(objects)); } for(Bond b : m.bonds){ // clear the Vector of the previous iteration's objects objects.clear(); // add the bond's data to the Vector in the same order that they appear in the Fortran Format specification string objects.add(atoms.indexOf(b.one)+1); objects.add(atoms.indexOf(b.two)+1); objects.add(b.order); objects.add(0); objects.add(0); objects.add(0); // append the return bw.append(formatter.format(objects)); } // begin the CO block bw.append(">CO ANGSTROEMS "+m.atoms.size()+1+"\n"); // we will use the same Fortran Format specification strings, but you can choose any describing the same data String coordinateSpecificationString = "(3I12)"; // set up FortranFormat object FortranFormat coordinateParser = new FortranFormat(atomSpecificationString); coordinateParser.setAddReturn(true); bw.append(3+" "+ coordinateSpecificationString+"\n"); // create the Vector of objects outside the loop to reduce overhead (for atoms) Vector<Object> objects = new Vector<Object>(3); for(Atom a : m.atoms){ // clear the Vector of the previous iteration's objects objects.clear(); // add the atom's data to the Vector in the same order that they appear in the Fortran Format specification string objects.add((int)Math.round(a.x * 1000)); objects.add((int)Math.round(a.y * 1000)); objects.add((int)Math.round(a.z * 1000)); // append the return bw.append(formatter.format(objects)); } ... // write other information to the BufferedWriter ... } |
So there it is, the code to read and write both PDB and SMD files using FortranFormat. I hope this library is very useful to many, and please let us know if it helps you.
As an aside, ChemDoodle v1.5 is still on track and the new features are incredible. Several new chemical file types have been introduced, including PDB and SMD files. All customers will receive this update for free! We know you will enjoy it.
FortranFormat is an open source Java library for parsing input and formatting output using Fortran Format specification. FortranFormat is written by Kevin Theisen and adheres to the coding standards and style of Java 1.5, complete with JavaDoc commenting. FortranFormat is very useful for those wishing to port Fortran code to Java, especially in academic environments, and also for parsing and generating files that conform to strict column widths.
To learn more about FortranFormat or obtain a copy, please visit the official FortranFormat website.