Extracting data from MedChem patent?


I’d like to create a workflow to extract both the molecules and the tables containing the corresponding IC50s from a patent. I’m very new to KNIME, so apologies if this is trivial.

  1. How do I extract the example numbers? I’m able to extract the molecules but not the corresponding example numbers from the patent. I tried both the SDF reader (SDF generated from SciFinder) and the File Reader (CSV from SureChEMBL).
  2. How do you extract a table of data from a patent? The columns are: example #, IC50 target A, IC50 target B. I tried reading the PDF with the PDF reader, but the output is not great. I need more guidance, please.
  3. I want to create a table that has the molecule (as an image) and both IC50s. I guess I would use a Joiner.

Thanks a lot for your assistance!


Welcome Pinkcellou!
Not the simplest of tasks to start with. Have you checked whether surechembl.org has already extracted the data?
I don’t have a complete solution, but you might look at these two old workflows to get a feel for what’s involved.
They are very old (KNIME ver 2!), but they give you an idea of what’s possible for extracting chemical terms and/or structures, though not table data. That’ll be your task!

(the other) Simon


Thanks Simon!

Unfortunately the patent is not in SureChEMBL yet. I tried another patent that is available on SureChEMBL, but I don’t see the example numbers extracted. For the table of data I was able to use Tabula, but it has its issues (does not recognize structure).
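
For the join in question 3, here is a minimal sketch in plain Python of the idea, assuming the molecules and the IC50 table have each been exported to CSV and that both carry some form of example label. The sample data, the column names and the helpers `example_key` and `join_on_example` are all made up for illustration; a real export will look different. The labels are reduced to a bare integer first, because the two sources rarely write the example number the same way.

```python
import csv
import io
import re

# Hypothetical sample data; real column names and values will differ.
MOLECULES_CSV = """example,smiles
Example 1,CCO
Ex. 2,c1ccccc1
3,CC(=O)O
"""

IC50_CSV = """Example #,IC50 target A (nM),IC50 target B (nM)
1,12,340
2,5.5,>1000
4,80,95
"""


def example_key(label):
    """Return the first integer found in a label such as 'Example 12', else None.

    Assumption: example numbers are plain integers. Labels such as '12a'
    and '12b' would both collapse to 12 and need a stricter pattern.
    """
    match = re.search(r"\d+", label)
    return int(match.group()) if match else None


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def join_on_example(molecules, activities, mol_col, act_col):
    """Inner join of the two tables on the normalized example number."""
    by_key = {example_key(row[act_col]): row for row in activities}
    joined = []
    for mol in molecules:
        key = example_key(mol[mol_col])
        if key is not None and key in by_key:
            row = {"example": key, "smiles": mol["smiles"]}
            # Copy the activity columns, skipping the original label column.
            row.update({k: v for k, v in by_key[key].items() if k != act_col})
            joined.append(row)
    return joined


rows = join_on_example(
    read_rows(MOLECULES_CSV), read_rows(IC50_CSV), "example", "Example #"
)
for row in rows:
    print(row)
```

Examples that appear in only one of the two tables are dropped here (an inner join), and values such as `>1000` are kept as text rather than converted to numbers. In KNIME terms this roughly corresponds to cleaning the key column in both tables before feeding them to the Joiner.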


Hi, maybe you could upload an example file here, so that people outside the field can see what the structure of the patent document looks like and help you better.

