Image OCR feature request

Hi Mikhail,

I see GGA is actively developing an image-to-molecule package called "image OCR". Are there plans to incorporate this as a KNIME node? This would be amazing functionality to have in KNIME, and it is currently an unmet need. It would allow editable structures to be retrieved from the PDFs of patents and journal articles. Users could then work with these structures afterwards, for example by calculating physicochemical properties.

I am sure this would be a very popular and well-used node if it were implemented.

Thanks,

Simon

Hi Simon,
 

Currently we work only with images of individual molecules, not with whole article pages. The current version cannot automatically extract molecules from chemical articles. Would such functionality still be of interest? Could you provide test data? In which scenarios would KNIME integration be essential?

 

Best regards,
Mikhail

 

If it can only deal with an image of a single molecule, then this wouldn't be very useful. It would be very useful if you could take a PDF article (e.g. a patent or journal paper), manually draw a box around each molecule, and have each one pulled out as an individual molecule in a separate row of a KNIME table.

It would then enable the user to build up an SAR table around a particular receptor target area and look for trends in the molecules through MCS, Murcko scaffolds, matched-pairs analysis, etc. This is where doing this in KNIME would be really convenient and powerful.

Simon.

Image OCR is cool if it can be used as an add-on for KNIME. Then we can convert between PDF, image and Word formats fluently.

johndoee, Simon asked about a node that can recognize depicted molecule images. This problem is quite different from the widely known optical recognition of text. We are working on molecule OCR, and you can find some examples here: http://ggasoftware.com/opensource/imago

I think that generic text OCR could also be quite useful in KNIME, but it is not my area.

I've dusted off an old workflow I made for this and checked that it works with the current version of KNIME.

Requirements:
KNIME 2.7.4
Community Nodes - RDKit/Indigo (to view structures) See http://tech.knime.org/community
OSRA 1.4 http://osra.sourceforge.net/

I've posted it over on the myexperiment.org site: http://www.myexperiment.org/workflows/3573.html

The img2structure workflow requires the OSRA structure recognition binaries: http://osra.sourceforge.net/

So you must have a functioning installation of OSRA and its dependencies. This may require advanced knowledge of the compiler on your platform and may not be a trivial task.

The img2structure workflow is incomplete. It is meant to illustrate the potential of KNIME to process PDFs. It currently reads only the first PDF found in the working directory.

The images found in the PDF are aligned with their interpreted structure in the final table. Frequently, these structures contain errors and need to be corrected. It would be great if one of the chemoinformatics packages could provide in-place editing of structures in a table.
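For readers who want to see the image-to-structure step outside of KNIME, here is a minimal Python sketch of how one extracted image could be passed to OSRA and turned into table rows. It is not taken from the img2structure workflow. It assumes that an `osra` binary is on the PATH and that it prints one recognised structure per line as SMILES when called with just an image filename, so check this against your OSRA version. The function names `parse_osra_output` and `image_to_rows` are made up for illustration.

```python
import shutil
import subprocess
from pathlib import Path


def parse_osra_output(stdout: str) -> list[str]:
    """Return one SMILES string per non-empty line of OSRA's text output.

    Assumption: each line starts with a SMILES string. Anything after the
    first whitespace (extra columns that some options may add) is ignored.
    """
    return [line.split()[0] for line in stdout.splitlines() if line.strip()]


def image_to_rows(image_path: Path, osra_binary: str = "osra") -> list[dict]:
    """Run OSRA on one image and return one row per recognised structure."""
    if shutil.which(osra_binary) is None:
        raise FileNotFoundError(f"OSRA binary '{osra_binary}' not found on PATH")
    # Assumption: calling OSRA with only the image filename writes SMILES to stdout.
    result = subprocess.run(
        [osra_binary, str(image_path)],
        capture_output=True,
        text=True,
        check=True,
    )
    # Keep the source image name next to each structure so that errors in
    # the interpreted structure can be checked against the original picture.
    return [
        {"image": image_path.name, "smiles": smiles}
        for smiles in parse_osra_output(result.stdout)
    ]
```

Each returned dictionary corresponds to one row of the final table, with the image name alongside its interpreted structure.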

To do:
1. Edit-in-place structure correction.
2. Loop through all the PDFs found in the working directory, not just the first one.
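The second to-do item could be sketched in plain Python as follows. This only illustrates the looping logic and is not how the KNIME workflow is built. The helper name `find_pdfs` is hypothetical, and the extension match is made case-insensitive as an assumption about what "all the PDFs" should include.

```python
from pathlib import Path


def find_pdfs(working_dir) -> list[Path]:
    """Return every PDF file directly inside working_dir, sorted by path."""
    return sorted(
        path
        for path in Path(working_dir).iterdir()
        # Match ".pdf" and ".PDF" alike; subdirectories are not searched.
        if path.is_file() and path.suffix.lower() == ".pdf"
    )


if __name__ == "__main__":
    # Process each PDF in turn instead of stopping at the first one.
    for pdf in find_pdfs("."):
        print(pdf.name)
```

Inside KNIME the same effect would come from listing the files and wrapping the existing processing steps in a loop, but the exact nodes to use are not specified in this thread.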

(Also see the text2structure workflow, which uses KNIME text mining nodes to convert chemical and biological terms found in documents: http://www.myexperiment.org/workflows/3549.html)
 

(the other) Simon

Hi all,

I have attempted to do OCR with Tess4J (a Java JNA wrapper for the Tesseract OCR API), but KNIME keeps crashing. I am using a Java Snippet node and have added the necessary JAR files from the Tess4J package. See the attached workflow that I have been using.

You can download the Tess4J source from http://sourceforge.net/projects/tess4j/

Any assistance would be greatly appreciated. Thanks.