Extracting fuzzy text strings from PDF files

Antoine_Lacour · June 4, 2018, 4:36pm

Hello all,
I’m trying to automate the extraction of a certain string of text from several PDF files.
The string will look like this “1H NMR: xxx .”
This always starts with 1H NMR and finishes with a full stop, and occurs only once in my files. However, the “xxx” part is of variable value and length.
Therefore my thinking was to tell KNIME to extract the text starting at “1” and ending at the next full stop.
What I would like to do is generate an Excel file with one column for the file name and a second column for the extracted text (concatenated if it spans multiple lines), so that each extracted string is matched with the file it came from.
Is there a node for this kind of fuzzy/variable length search term in KNIME?
Note: I have started by having the Tika Parser node generate a table with one row per file, containing the entire file as text, but I am struggling to search this data and match it to the original file.

Thanks for your help,
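
For readers who want to see the idea outside KNIME, here is a minimal Python sketch of the extraction described above. It is not a KNIME workflow. It assumes the full text of each PDF is already available as a string keyed by file name (the sample dictionary and the helper names `extract_nmr`, `build_rows`, and `rows_to_csv` are made up for illustration). One extra assumption: NMR data usually contains decimals such as 7.26, so the pattern ends at the first full stop that is not followed by a digit rather than at the very first full stop.

```python
import csv
import io
import re

# Assumption: the entry starts with "1H NMR" and ends at the first full stop
# that is not followed by a digit, so decimals like 7.26 do not cut it short.
NMR_PATTERN = re.compile(r"1H NMR.*?\.(?!\d)", re.DOTALL)


def extract_nmr(text):
    """Return the first '1H NMR ... .' entry with line breaks collapsed, or None."""
    match = NMR_PATTERN.search(text)
    if match is None:
        return None
    # Collapse newlines and repeated spaces so multi-line entries become one line.
    return " ".join(match.group(0).split())


def build_rows(documents):
    """Turn {file name: full text} into (file name, extracted string) rows."""
    return [(name, extract_nmr(text) or "") for name, text in documents.items()]


def rows_to_csv(rows):
    """Serialise the rows as CSV text, which Excel can open."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["file_name", "extracted_text"])
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical sample data standing in for the parsed text of each PDF.
documents = {
    "compound_a.pdf": "Yield 80%. 1H NMR: (400 MHz, CDCl3) d 7.26 (s, 1H),\n"
                      "3.80 (s, 3H). 13C NMR: ...",
    "compound_b.pdf": "No spectra reported.",
}

rows = build_rows(documents)
csv_text = rows_to_csv(rows)
```

Files without a match get an empty string in the second column, which keeps one row per file so nothing silently disappears from the table.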

swatkat9 · June 5, 2018, 12:32pm

Hi Antoine,

Why don’t you use the Wildcard Tagger node with a matching expression like 1H NMR:* ?

Hope it works!

Thanks.
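
To give a feel for what a wildcard expression like the one suggested above matches, here is a small sketch using generic glob-style matching from Python's standard `fnmatch` module. This is only an analogy: how KNIME's Wildcard Tagger actually interprets the expression (for example per term or per sentence) is not shown in this thread, and the candidate strings are made up.

```python
from fnmatch import fnmatchcase

# The expression suggested in the reply, read here as a plain glob pattern.
PATTERN = "1H NMR:*"

# Hypothetical candidate strings.
candidates = [
    "1H NMR: 7.26 (s, 1H).",
    "13C NMR: 128.4.",
    "1H NMR: 7.26 (s, 1H). Next sentence.",
]

# Keep only the candidates that the glob pattern matches in full.
matches = [c for c in candidates if fnmatchcase(c, PATTERN)]
```

In glob semantics the `*` matches any run of characters, including full stops, so the third candidate matches as a whole. If the text has to end at the closing full stop, the wildcard alone may not be enough and an additional trimming step may be needed.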
