Hi,
I have a problem with regular expressions in KNIME, and it seems to be caused either by KNIME or by my misunderstanding of it. I am new to KNIME, but I have been stuck on this problem for a while.
Basically, I want to read PDFs and apply a regex to extract some data. The data is surrounded by certain combinations of words, so I need to match phrases that contain whitespace. The problem is that I cannot get a regular expression to match a phrase that contains whitespace. I tested the regex in a web-based regex tester, where it works fine, but in KNIME I cannot get it to work. My test workflow is:
PDF Parser -> Document Data Extractor -> String Manipulation
Example Text extracted in Document body text: “Das ist ein Test.”
Expression: regexReplace($Document body text$,"(Das ist)","replacementtext")
When the regex does not contain whitespace, it works:
regexReplace($Document body text$,"Das","replacementtext")
I tried to find out the encoding of the PDF by copying the text out of Adobe Reader and the Document Viewer and inspecting it with external tools. I am not 100% sure, but it seems to be UTF-8, so I changed the charset in the PDF Parser accordingly. However, that did not work. I get the same behaviour when I use the search field in the Document Viewer.
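In case it helps to narrow this down, here is a minimal Java sketch of one possible cause. It rests on two unconfirmed assumptions: that regexReplace in the String Manipulation node follows Java regex semantics, and that the text extracted from the PDF contains a non-breaking space (U+00A0) where a normal space is expected. The class name, method names and sample string are made up for illustration and are not part of KNIME.

```java
public class WhitespaceCheck {
    // Hypothetical sample: the test sentence, but with a non-breaking
    // space (U+00A0) between "Das" and "ist" instead of a normal space.
    static final String SAMPLE = "Das\u00A0ist ein Test.";

    // Same pattern as in the KNIME expression: a literal U+0020 space.
    static String replaceLiteralSpace(String text) {
        return text.replaceAll("(Das ist)", "replacementtext");
    }

    // \h (Java 8+) matches any horizontal whitespace, including U+00A0.
    static String replaceHorizontalSpace(String text) {
        return text.replaceAll("(Das\\h+ist)", "replacementtext");
    }

    // Lists the code point of every character, so unexpected whitespace
    // characters become visible.
    static String describeCodePoints(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(replaceLiteralSpace(SAMPLE));    // text is left unchanged
        System.out.println(replaceHorizontalSpace(SAMPLE)); // phrase is replaced
        System.out.println(describeCodePoints("Das\u00A0ist"));
    }
}
```

If that assumption holds, a pattern such as "(Das\\h+ist)" in place of "(Das ist)" might behave differently in the String Manipulation node as well, but I have not verified this in KNIME.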
Can anybody help me with this?
Thanks
Martin