This article is the second part of a three-part series on chemical data recovery written by Kevin Theisen, President of iChemLabs:

  1. Embedded Chemical Data Recovery
  2. Chemical Image Recovery
  3. Legacy Chemical Data Recovery

We launched ChemDoodle 2D v11.4 on April 2nd, 2021. A new chemical image recovery function was included for automatically rebuilding chemical drawings from an image, which this article discusses in detail.

Image

In this laboratory setting, an android is using its ability to see and understand molecule drawings and communicate with a scientist. New chemical image recovery features in ChemDoodle 2D make this future a possibility.

What is chemical image recovery?

When we communicate as chemists, we often use images of molecules because a picture is the most effective way to communicate information to visual creatures such as us. For well over a century, images of molecules have been created for use in documents, databases, notebooks, websites, etc. But an image is just an image; we cannot do anything more than look at it. Maybe we can enlarge it, copy it or print it, but the original chemical data it represents is only realized in our minds.

Chemical image recovery (CIR) is the process of taking an image of a chemical drawing, with no provided information other than the defined pixels, and using a computer to recreate the original chemical data to be used or edited further. For instance, take the following image of galanthamine. The image on the left is the input image, and the image on the right is the result of the CIR function in ChemDoodle 2D.

Image
An image of the molecule galanthamine
Image
The recovered chemical drawing of galanthamine

The first impression may be, "Great! I now have a less blurry image." Yet the result is much more significant. The actual chemical data, the arrangement of atoms and bonds, is digitized. We can now further process this information. For instance, we can produce a molecular formula, resolve the CIP stereochemical configurations, and change the graphical style to ACS 1996. We may even optimize the molecular structure in 3D and calculate a distance for the hydrogen bond. All of this output is easily produced from the result of the CIR function on the input image. Without the CIR function, a person would be required to redraw this molecular structure in a program to perform any computational chemistry task. Transcribing one image is work by itself; imagine having to transcribe thousands of chemical drawings.

Image
The recovered chemical drawing is further processed: we are able to change styles, resolve stereochemical configurations and produce a molecular formula
Image
Even further processing on the recovered chemical drawing, optimizing a 3D structure using the MMFF94 force field and measuring the hydrogen bond
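
Once the atoms and bonds are digitized, derived values such as the molecular formula become simple computations. The following is a minimal sketch, not ChemDoodle's code: the `hill_formula` function and the flat list of element symbols (with hydrogens already made explicit) are assumptions made for illustration.

```python
from collections import Counter


def hill_formula(atoms):
    """Return a molecular formula in Hill order for a list of element symbols."""
    counts = Counter(atoms)
    if "C" in counts:
        # Hill order: carbon first, hydrogen second, then the rest alphabetically.
        order = ["C"] + (["H"] if "H" in counts else [])
        order += sorted(s for s in counts if s not in ("C", "H"))
    else:
        # Without carbon, every element (hydrogen included) is alphabetical.
        order = sorted(counts)
    return "".join(s + (str(counts[s]) if counts[s] > 1 else "") for s in order)


# Hypothetical atom list matching the composition of galanthamine.
galanthamine_atoms = ["C"] * 17 + ["H"] * 21 + ["N"] + ["O"] * 3
print(hill_formula(galanthamine_atoms))  # C17H21NO3
```

A real recovered drawing is skeletal, so implicit hydrogens would first have to be inferred from the bond graph before counting.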

By recreating the chemical data originally lost in images, CIR makes it possible to produce many solutions for scientists. You can have a program automatically catalog drawings from laboratory notebooks. Students can simply point their camera at a chemical structure on a poster and get the associated IUPAC name. An assistive tool can be produced to help vision-impaired chemists. A researcher can snap a picture of a molecule from a publication they are reading and immediately find more information from a chemical search engine. We may even be able to produce androids for our labs with the ability to observe and understand chemical drawings and then complement scientists so they can get their work done faster. The android could also protect the chemist if safety becomes a concern.

If you would like to try ChemDoodle's CIR function, you may use the File>Recover from Image... menu item in ChemDoodle 2D. You may also use the ChemDoodle Web Components CIR demo here.

Background

Chemical image recovery is not a new concept. In fact, a few solutions already exist, known in the cheminformatics industry as Optical Structure Recognition (OSR) tools. I am not a fan of this name. It is called OSR because Optical Character Recognition (OCR, the computer recognition of character glyphs) already exists, and since Optical Chemical Recognition would have the same abbreviation, "chemical" was replaced with "structure". This is unfortunate. If you say "Optical Structure Recognition" to a chemist, it is very unlikely they will know what you are referring to. "Chemical Image Recovery" gets a chemist closer to the meaning: we are attempting to recover the chemical information from an image.

Probably the earliest, most impactful literature for CIR algorithms is the Optical Recognition Of Chemical Graphics paper out of IBM Almaden in 1993. Their work began in 1988, and they outline a general procedure for a CIR algorithm: you break down the input pixels into shape-based features, after which you build the chemical information from the ground up by interpreting those shapes. I call this procedural CIR. Since the IBM Almaden paper, many other procedural CIR solutions have come and gone. In 2008, NIH OSRA was introduced, providing an open source (GPLv2+) CIR solution, and it is the best-known CIR solution today. More recently, machine learning (ML) has matured and become more applicable to solving problems. ML CIR may also be effective.

So this brings us to the present, and at iChemLabs we are building our own CIR solution. Since OSRA already exists, some may ask, "This is good, why reinvent the wheel?" I certainly encourage everyone to make use of open source resources and contribute to those projects. At iChemLabs, we also produce open source projects. But I think we can build something better in the ChemDoodle ecosystem, where OSRA is not an ideal solution because (a) the license is not compatible, and (b) OSRA is a C++ tool, while ChemDoodle is Java. I would also say, "If you want to create, just create!" Just have fun creating, and that is what iChemLabs excels at. In the next section, I go over our algorithm and how we are attempting to create the best CIR solution.

Implementation

We developed a procedural CIR tool. Machine learning is certainly impressive, but it requires a large volume of training data to be generally applicable. Our algorithm should handle chemical images as generically as possible, without having seen a similar style of image before. For that goal, we consider the procedural approach optimal. Let's take a closer look at the example from the introduction.

Image
The input image of the molecule galanthamine

The first step is to categorize and normalize the image, to recognize what must be done to understand the graphics as a chemical drawing. Screenshots of a chemical drawing from a publication at different scales all need to be handled differently, and they will be very different from a high resolution picture of the same chemical drawing taken on your phone's camera. We invented a creative solution allowing ChemDoodle to see the molecule in its entirety and make decisions about handling it from the image pixels before any processing begins, similar to how a human would be able to look at all these different images and recognize the molecule drawings within. The image is then intelligently normalized using image scaling, binarization and thinning functions.

Image
The result of analyzing the image and normalizing it for digestion
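
To make the binarization step concrete, here is a minimal sketch using Otsu's method, a standard global thresholding technique. This is an illustration only; the article does not say which binarization ChemDoodle uses, and the function names and the tiny grayscale sample are assumptions.

```python
def otsu_threshold(pixels):
    """Return the gray level (0-255) that maximizes between-class variance."""
    hist = [0] * 256
    for row in pixels:
        for value in row:
            hist[value] += 1
    total = sum(hist)
    sum_all = sum(level * n for level, n in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_dark = 0    # pixels at or below the candidate threshold
    sum_dark = 0  # sum of their gray levels
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        w_light = total - w_dark
        if w_dark == 0:
            continue
        if w_light == 0:
            break
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        between_var = w_dark * w_light * (mean_dark - mean_light) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


def binarize(pixels):
    """Map dark ink to 1 and light background to 0."""
    t = otsu_threshold(pixels)
    return [[1 if value <= t else 0 for value in row] for row in pixels]


# Hypothetical 3x4 grayscale patch: dark ink strokes on a light page.
image = [
    [250, 248, 252, 249],
    [251,  30,  25, 250],
    [249,  28, 247, 251],
]
for row in binarize(image):
    print(row)
```

Scaling and thinning are not shown here. A thinning pass would then reduce each stroke of the binary image to a one-pixel-wide skeleton.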

A custom vectorization is employed to break the features into shapes. Those shapes are then analyzed to further break them down while grouping and categorizing.

Image
The pixels are separated and characterized into shapes
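
A simple way to picture the separation of pixels into shapes is connected-component labeling. The sketch below is not ChemDoodle's custom vectorization, which the article does not detail. It only groups touching ink pixels, and the function names and sample grid are assumptions.

```python
from collections import deque


def connected_components(grid):
    """Group ink pixels (value 1) into 8-connected components."""
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill starting from an unvisited ink pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and grid[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components


def bounding_box(pixels):
    """Return (min_row, min_col, max_row, max_col) for a component."""
    ys = [y for y, _ in pixels]
    xs = [x for _, x in pixels]
    return (min(ys), min(xs), max(ys), max(xs))


# Hypothetical binary image: a diagonal stroke and a separate 2x2 blob.
grid = [
    [1, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 1],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 0, 0, 0],
]
for shape in connected_components(grid):
    print(len(shape), bounding_box(shape))
```

Each component can then be categorized further, for example as a candidate bond line or a candidate text glyph, based on its size and geometry.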

All procedural CIR algorithms perform these steps, with some level of success. The remainder of the algorithm is the most important, and ChemDoodle CIR excels here. The interpretation of the shapes is not trivial. Take a look at the two shapes pointed out by arrows. The top one looks like a fork coming off of a complex ring system, and the bottom one looks like a bent arm with a hand. How are these to be perceived? Our goal was to produce an algorithm that matches how a human's mind would perceive a chemical drawing and mimics those decisions. Mathematical models are used for all of the interpretation; we do not rely on any arbitrary distance comparisons. Relying on distances is where most CIR algorithms fail, because there is no standard defined chemical structure distance, and every image may be unique in this respect, from size and resolution to atom label spacing, bond thickness and object congestion.

Image
The perceived bonds
Image
The perceived atom labels
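
To illustrate why fixed pixel distances are fragile, consider the same drawing at two scales. The sketch below is not ChemDoodle's mathematical model, which is not described here. It only contrasts a hard-coded pixel cutoff with a measure taken relative to the drawing itself, and all names, segments and ratios are assumptions.

```python
from math import hypot
from statistics import median


def length(segment):
    (x1, y1), (x2, y2) = segment
    return hypot(x2 - x1, y2 - y1)


def touch_fixed(p, q, max_px=5.0):
    """Join two endpoints if they are within a hard-coded pixel distance."""
    return hypot(p[0] - q[0], p[1] - q[1]) <= max_px


def touch_relative(p, q, segments, ratio=0.2):
    """Join two endpoints if their gap is small relative to the median segment length."""
    scale = median(length(s) for s in segments)
    return hypot(p[0] - q[0], p[1] - q[1]) <= ratio * scale


def scaled(segments, k):
    """Return the same segments as if the image were enlarged by a factor k."""
    return [((x1 * k, y1 * k), (x2 * k, y2 * k)) for (x1, y1), (x2, y2) in segments]


# Hypothetical line segments with a 3-pixel gap between the first two.
small = [((0, 0), (20, 0)), ((23, 0), (43, 0)), ((43, 0), (53, 17))]
large = scaled(small, 3)

print(touch_fixed(small[0][1], small[1][0]))            # True
print(touch_fixed(large[0][1], large[1][0]))            # False
print(touch_relative(small[0][1], small[1][0], small))  # True
print(touch_relative(large[0][1], large[1][0], large))  # True
```

The fixed cutoff gives a different answer for the enlarged image even though the drawing is identical, while the relative measure does not.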

Everything is pieced back together to recover the original drawing. The final structure is then analyzed in a chemistry sense for any flaws. This is another step where the ChemDoodle algorithm excels, as ChemDoodle is one of the best and most thorough cheminformatics systems in existence.

Image
The recovered chemical drawing of galanthamine
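
As a rough idea of what a chemistry-level flaw check can look like, the sketch below flags atoms whose summed bond orders exceed a typical neutral valence. It is a deliberately simplified assumption, not ChemDoodle's analysis: charges, radicals and hypervalent elements are ignored, and the table and function names are hypothetical.

```python
# Typical maximum valences for neutral atoms (simplified, illustrative table).
MAX_VALENCE = {"H": 1, "C": 4, "N": 3, "O": 2, "F": 1, "Cl": 1, "Br": 1, "I": 1}


def valence_flaws(atoms, bonds):
    """Return (index, symbol, used, limit) for every over-bonded atom.

    atoms: list of element symbols
    bonds: list of (atom_index_a, atom_index_b, bond_order)
    """
    used = [0] * len(atoms)
    for a, b, order in bonds:
        used[a] += order
        used[b] += order
    flaws = []
    for index, (symbol, total) in enumerate(zip(atoms, used)):
        limit = MAX_VALENCE.get(symbol)
        if limit is not None and total > limit:
            flaws.append((index, symbol, total, limit))
    return flaws


# A plausible recovery: C-C-O skeleton of ethanol.
print(valence_flaws(["C", "C", "O"], [(0, 1, 1), (1, 2, 1)]))  # []

# A suspicious recovery: a neutral oxygen with a double and a single bond.
print(valence_flaws(["C", "O", "C"], [(0, 1, 2), (1, 2, 1)]))  # [(1, 'O', 3, 2)]
```

A flagged atom suggests that a bond order or an atom label was misread and that the region deserves a second look.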

One last unique thing about our implementation: we wrote the entirety of the algorithm from scratch and did not rely on any third-party libraries. Image processing, thinning, scaling, normalization, vectorization, optical character recognition, our mathematical models, etc. were all developed in-house at iChemLabs. This allows us to tailor any part of the process specifically to chemical drawing information. Another upside is the absence of any license obligations and restrictions. Many existing CIR tools are dependent on Microsoft's proprietary OCR products. All you need to run ChemDoodle's CIR tool is ChemDoodle.

Performance

Our goal is to generically handle any image of a chemical drawing. To evaluate our success, we created a large testing suite of random images of chemical drawings from various sources: the internet, articles, books, cameras, software and more. Here is a selection of some of the varied images. ChemDoodle's CIR algorithm handles all these images perfectly.

Image
A sample of the testing suite variety used to evaluate ChemDoodle's CIR algorithm

Our initial goal was to handle 100 of these random images perfectly, which we achieved. But we didn't want to handle only complex images; we also wanted to recognize simplistic edge cases other CIR tools would overlook. Take the following examples. Again, ChemDoodle's CIR algorithm handles all of them perfectly.

Image
Simplistic tests often overlooked by other CIR algorithms

To summarize, ChemDoodle is able to recognize complex atom labels consisting of elements, numbers, abbreviations, formulae and more. Text may be formatted, including superscripts and subscripts. Bonds will be detected and overlapping bonds resolved, both where one bond is broken at the overlap and where the two bonds cross. Single through sextuple bonds, as well as wedge and bold bond types, are understood. Rings will be automatically created based on perceived atoms and bonds, including the interpretation of aromatic circles. Charges, radicals and isotopes are recognized. ChemDoodle is also able to identify isolated non-chemical text and output it as a label shape. Multiple structures in the same image will be properly handled, even if they have different styles.
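
Once atoms and bonds are perceived, the number of independent rings follows directly from the graph: bonds minus atoms plus the number of separate structures. The sketch below illustrates this count with a small union-find. It is not ChemDoodle's ring perception, and the function name and atom numbering are assumptions.

```python
def ring_count(num_atoms, bonds):
    """Return the number of independent rings: bonds - atoms + components."""
    parent = list(range(num_atoms))

    def find(i):
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    components = num_atoms
    for a, b in bonds:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b
            components -= 1
    return len(bonds) - num_atoms + components


# Naphthalene skeleton: two six-membered rings sharing the 4-5 bond.
naphthalene = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0),
               (5, 6), (6, 7), (7, 8), (8, 9), (9, 4)]
print(ring_count(10, naphthalene))  # 2

# Two structures in one image: a six-membered ring and a separate two-atom chain.
ring_and_chain = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (6, 7)]
print(ring_count(8, ring_and_chain))  # 1
```

Counting components also covers the case of multiple structures in the same image.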

Runtime is also an important consideration. ChemDoodle's CIR algorithm processes the galanthamine image with an average runtime of 76.4 ms. The most time-intensive part of the algorithm is optical character recognition. The longest runtime we have found is for an image we created with 13 complex atom labels, resulting in an average runtime of 115.1 ms. So expect the runtime to scale with the amount of text in the image requiring recognition. All benchmarks were performed on a 2017 iMac running macOS 11.2.2 with a 4.2 GHz Quad-Core Intel Core i7 CPU. Each image was recovered 20 times, with the first iteration disregarded as a warm-up. The remaining 19 iterations were averaged. Java version 11.0.8 was used to compile and run the tests.
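
The benchmark protocol described above (20 runs, the first discarded as a warm-up, the remaining 19 averaged) can be sketched as follows. This is a generic harness, not the Java code used for the published numbers; the `benchmark` function and the stand-in workload are assumptions.

```python
import time
from statistics import mean


def benchmark(task, runs=20, warmup=1):
    """Time task() `runs` times and average all but the first `warmup` runs.

    Returns (average_milliseconds, number_of_measured_runs).
    """
    timings_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        task()
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    measured = timings_ms[warmup:]
    return mean(measured), len(measured)


def stand_in_workload():
    # Placeholder for a call that would recover one image.
    return sum(i * i for i in range(10_000))


average_ms, measured_runs = benchmark(stand_in_workload)
print(f"{average_ms:.3f} ms averaged over {measured_runs} runs")
```

Discarding the first iteration keeps one-time costs, such as class loading and just-in-time compilation on the JVM, out of the average.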

Finally, there are some limitations, as is to be expected with a CIR project. CIR will work on computer-generated, skeletal images of chemical structures. Hand-drawn images may not work well. Clearer images at a crisp resolution will have the best results. The messier or blurrier the image is, the more ChemDoodle will have to use intuition to resolve the chemical structure, similar to a human looking at an unclear image. ChemDoodle may interpret graphics differently than you do. Our goal is to have the CIR features perfectly match what you perceive in the image, but you should expect to perform some level of post-editing on some images.

The Future

I hope this discussion has provided an in-depth view into our current CIR work in ChemDoodle. The initial results are excellent, and ChemDoodle's CIR outperforms many competing CIR solutions. We really hope this feature will help eliminate the effort you spend transcribing chemical structures from images. If you are not happy with the results, please send us the image so we may improve the algorithm. We will continuously develop it. As always, ChemDoodle subscribers, Site and Lifetime licensees are entitled to our latest ChemDoodle features, and our customers will continue to benefit from our work. Thank you for your support, as you make these projects possible.

Moving forward, our focus includes the recognition of more bond types, the ability to dissect overlapping features (such as a bad graphic where a bond incorrectly intersects an atom label), and the perception of reaction arrows.

And if you are trying to create androids with the ability to see and perceive chemical drawings, please reach out. We would be happy to work with you! I am looking forward to seeing what our partners build with this capability.