I have noticed a slowdown of the Strings to Document node after manipulating the text to be included in the document. In the attached example workflow, the node executes in 2 minutes on a 1600-page PDF file when no manipulation is performed, but takes 18 minutes when the regex filter is applied. Without getting into the practical value of the task, I would like to understand the reason for this dramatic slowdown, which is totally unexpected from a user’s perspective: if anything, the node should run faster, since the regex node removes some terms. I have noticed the same issue when a string manipulation is performed after the Document Data Extractor node and before the conversion back to a document.
test_time_difference.knwf (33.7 KB)
Hi @mpenalver and welcome to the KNIME forum,
Is it possible for you to provide the PDF file as well?
May I ask why you extract the content and then convert it to a document again?
As I said, let’s not focus on the usefulness of the example provided. There are situations in which converting a doc to a string and back to a doc with more complex processing in between might be appropriate. The point of the post is why a simple variation to the doc’s text, before or after the conversion to a string, should make the conversion back to a doc so much more costly when no new content has been added (actually some was removed).
Any explanation for this unexpected performance penalty?
Just to clarify: the loop in the workflow is there only to average the time spent by the Strings to Document node over several executions. The question is why a regex node that eliminates terms makes the conversion back to a doc so much more expensive than leaving the original terms intact.
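For anyone who wants to reproduce this kind of measurement outside the workflow, here is a minimal sketch of averaging the wall-clock time of a step over several runs. It is not taken from the workflow itself; `average_runtime`, `task` and `repeats` are hypothetical names chosen for illustration.

```python
import time
from statistics import mean


def average_runtime(task, repeats=5):
    """Run `task` `repeats` times and return the mean duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution timer
        task()
        durations.append(time.perf_counter() - start)
    return mean(durations)


if __name__ == "__main__":
    # Placeholder workload standing in for the step being timed.
    seconds = average_runtime(lambda: sum(range(100_000)), repeats=5)
    print(f"mean runtime: {seconds:.6f} s")
```

Averaging over several runs smooths out one-off effects such as caching or JVM warm-up, which is the same reason the loop is in the workflow.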
Hi @mpenalver -
Thanks for your workflow - I can see the same behavior here as well. It’s not obvious to me why this is happening, so I’ve asked one of our developers to take a look.
That’s great, ScottF. Very much appreciated.
Due to the removal of punctuation, the sentence tokenizer recognizes the whole string as one sentence, which apparently cannot be handled efficiently by the underlying framework. I need to dig a bit deeper to find the cause of this issue and will create a ticket for it.
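To illustrate the effect, here is a rough sketch of how stripping punctuation collapses a text into a single sentence. It does not use KNIME's actual tokenizer: `split_sentences` is a deliberately naive, hypothetical stand-in that splits after `.`, `!` or `?`, and `strip_punctuation` is an assumed filter, not the regex from the attached workflow.

```python
import re

# Naive stand-in for a sentence tokenizer (assumption, not KNIME's implementation):
# a sentence ends at '.', '!' or '?' followed by whitespace.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text):
    """Split text into sentences at the naive boundary defined above."""
    return [s for s in SENTENCE_BOUNDARY.split(text.strip()) if s]


def strip_punctuation(text):
    """Assumed filter: drop every character that is not a word character or whitespace."""
    return re.sub(r"[^\w\s]", "", text)


original = "The node ran quickly. Then a filter was added! Why is it slow now?"
filtered = strip_punctuation(original)

sentences_before = split_sentences(original)
sentences_after = split_sentences(filtered)

print(len(sentences_before), "sentence(s) before filtering")
print(len(sentences_after), "sentence(s) after filtering")
```

With the sentence delimiters gone, the splitter has nothing to break on, so the whole text comes back as one sentence. On a 1600-page PDF that single sentence would be extremely long, which matches the behavior described above.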
Ok Julian. Thanks a lot for your feedback.
Just to keep you in the loop: @julian.bunzel found the performance leak and is currently fixing it. We’re confident that the fix will ship with 4.2.
That’s great news, @Mark_Ortmann. Thank you very much!
Thanks again for letting us know. We were able to fix the issue, and the fix will be available in 4.2.
Very grateful for the speedy reaction. Such helpful support definitely enhances the usefulness of this great platform.