String manipulation problems, possibly related to PDF parser

AngusVeitch · August 11, 2017, 6:15am

After much frustration, I've identified an inconsistency in how the String Manipulation node (specifically the regex replacer) performs on text that has arrived through different pathways.

I tried but failed to perform a fairly simple regex replacement on a set of documents that originated via the PDF Parser. I was working with the strings produced by the Document Data Extractor, rather than with the Knime document format produced by the PDF Parser. The operation was supposed to find hyphens followed by line breaks and more text, and remove the line breaks. The specific code I used was <(-[\\r\\n]+)(\\S+)" ,"-$2")>, but several other approaches failed as well.
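For reference, the intended replacement can be sketched in plain Java. This assumes the String Manipulation node's regexReplace follows `java.util.regex` semantics, which the doubled backslashes in the expression suggest. The class and method names here are hypothetical, and the sample sentence is invented for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of KNIME: reproduces the replacement
// described in the post using the standard java.util.regex API.
final class HyphenLineBreakJoiner {

    // A hyphen, then one or more CR/LF characters, then a run of non-whitespace.
    private static final Pattern HYPHEN_BREAK =
            Pattern.compile("(-[\\r\\n]+)(\\S+)");

    static String join(String text) {
        // "-$2" keeps the hyphen and the text after the break,
        // and drops the line break characters captured in group 1.
        return HYPHEN_BREAK.matcher(text).replaceAll("-$2");
    }

    public static void main(String[] args) {
        // Invented sample input containing a hyphen followed by a line break.
        System.out.println(join("a well-\nknown result"));
    }
}
```

Because the character class is `[\\r\\n]+`, the same pattern covers both Unix (`\n`) and Windows (`\r\n`) line endings.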

Eventually I decided to write the data table to a CSV, and then load the CSV and apply the same replacement operation. Lo and behold, it worked! And it continued to work after I converted the strings to documents and back again, so I don't think the issue is with the Document Data Extractor.

So now I have a workaround, at least. But ideally I don't want to interrupt my workflow with the writing and opening of a CSV file every time. So can anyone suggest an actual solution, or explain why this happens in the first place?

Note that I'm currently working with Knime v3.2.1 rather than the latest version (the automatic update hasn't been working for a while), so my apologies if this issue has since been resolved.

kilian.thiel · August 16, 2017, 8:59am

Hi Sugna,

Which node exactly did not work? I am using version 3.4 and there is no Regex Replacer node. However, there is a Replacer node in the Textprocessing extension. Is the node that you are using part of the Textprocessing extension? If so, your regex will not work because the node works on tokens. Words that are joined by e.g. a hyphen (in the original data) will most likely be split by the tokenization into two tokens. The regex is applied to each token, one after the other. The hyphen itself will also be a token. This means that your regex will never match.

To use a regex you need to work directly on strings. You can extract the textual data from documents as strings by using the Document Data Extractor. Is that what you tried as well? Can you share a workflow with example data?

Cheers, Kilian
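The difference between token-wise and string-wise matching can be illustrated with a small Java sketch. The token list below is a hand-written assumption about how a tokenizer might split a hyphenated word; it is not the output of the Textprocessing extension's actual tokenizer, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical demo, not KNIME code: applies the same pattern
// once to a whole string and once to each token separately.
final class TokenVsStringDemo {

    private static final Pattern HYPHEN_BREAK =
            Pattern.compile("(-[\\r\\n]+)(\\S+)");

    // Whole-string replacement: the pattern can see the hyphen,
    // the line break, and the following word together.
    static String replaceInString(String text) {
        return HYPHEN_BREAK.matcher(text).replaceAll("-$2");
    }

    // Token-wise replacement: each token is matched in isolation.
    static List<String> replacePerToken(List<String> tokens) {
        return tokens.stream()
                .map(t -> HYPHEN_BREAK.matcher(t).replaceAll("-$2"))
                .collect(Collectors.toList());
    }

    // True if the pattern is found inside at least one single token.
    static boolean matchesAnyToken(List<String> tokens) {
        return tokens.stream().anyMatch(t -> HYPHEN_BREAK.matcher(t).find());
    }

    public static void main(String[] args) {
        // Assumed tokenization of "well-\nknown": the hyphen is its own token.
        List<String> tokens = Arrays.asList("well", "-", "known");
        System.out.println(matchesAnyToken(tokens));
        System.out.println(replacePerToken(tokens));
        System.out.println(replaceInString("well-\nknown"));
    }
}
```

No single token contains a hyphen, a line break, and trailing text at once, so the token-wise pass leaves every token unchanged, while the whole-string pass joins the word.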

AngusVeitch · August 16, 2017, 12:12pm

I’m working directly on the strings, using the regexReplace function in the String Manipulation node. I’ll try to put together an example workflow shortly.

AngusVeitch · August 17, 2017, 5:31am

I tried to make an example workflow but I could not replicate the problem. Then I tried my original workflow again, and I still couldn't replicate it. Now the regex command just works as it should. Sigh.

So either I am officially going crazy or something strange was happening that has since fixed itself (I have restarted the computer since then, for example). I'd like to think it is the latter, since I spent several hours trying to get it to work before, and I have not changed anything in the workflow that should have made any difference.

If I notice the problem again, I'll let you know.
