Slow text mining PDF processing

Hello all,
I have a text mining workflow to process PDF files. I am using the Dictionary Tagger because I would like to find specific words in reports.
However, when I upload a PDF file, e.g. a scientific paper of 12 pages, KNIME gets stuck or the workflow is extremely slow.
Do you know why?

Thanks all!
Cheers,
Nazareno.

Hi,
is there a specific node that slows things down, or is it the whole workflow? Can you share the workflow and an example file with us so we can analyze the runtime better?
Kind regards
Alexander


Hi Alexander,
Thanks for the answer.
The “Number Filter” node is quite slow, and then the workflow gets stuck at the “Punctuation Erasure” node.

workflow: Dictionary tagger.knwf (28.5 KB)
Paper: https://reader.elsevier.com/reader/sd/pii/S0160412017322328?token=53A0814768C9C1D62B75AC40AEFD05039DEB3419931BC9967CFD66757576E2C47224CC0E6F27A0167E09A3E1717CF726


Hi,
can you also share the file “OHEJPGlossary.xlsx” so I can execute the workflow myself?
Kind regards
Alexander

Hi Alexander,
for several reasons I cannot share the original file. However, here is an example file you can use instead.
Best regards,
Nazareno.

OHEJPGlossary2.xlsx (9.0 KB)

Hi,
I think the problem is that you have the Bag of Words Creator before all the preprocessing. It should come after the preprocessing, because otherwise you apply the preprocessing to the document reference stored in each individual term's row. If you have a look at your data, you will see that the Bag of Words Creator creates one row for each term, but also keeps a reference to the original document in another column. The preprocessing nodes do not know that it is always the same document and simply do their work on every row.
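To illustrate the effect Alexander describes, here is a minimal, hypothetical Python sketch (not KNIME's actual implementation): each bag-of-words row carries a reference to the whole document, so a per-row preprocessing step such as punctuation erasure touches the same document once per unique term instead of once per document. The `preprocess` function and the row layout are assumptions for illustration only.

```python
def preprocess(text):
    """Toy 'Punctuation Erasure' step: strip punctuation from a document."""
    return "".join(ch for ch in text if ch.isalnum() or ch.isspace())

document = "Hello, world! Hello again, world."

# Unique terms of the document (a toy stand-in for the Bag of Words Creator).
terms = set(preprocess(document).lower().split())

# Bag-of-words layout: one row per term, each row referencing the full document.
bow_rows = [{"term": t, "document": document} for t in sorted(terms)]

# Preprocessing AFTER the Bag of Words: the document is cleaned once per row.
calls_after = 0
for row in bow_rows:
    row["document"] = preprocess(row["document"])
    calls_after += 1

# Preprocessing BEFORE the Bag of Words: the document is cleaned exactly once.
calls_before = 1

print(calls_after, calls_before)  # 3 vs 1 for this three-term toy document
```

For a 12-page paper with thousands of unique terms, the "after" ordering means thousands of passes over the same document, which matches the slowdown reported here.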
Another thing that is strange in your workflow: you use Strings to Document, then a Column Filter that removes the document column, and then Strings to Document again. That seems redundant.
Kind regards,
Alexander


Hi,
I thought that the Bag of Words Creator always had to be at the beginning. I moved the Bag of Words Creator, Tag Filter, and TF nodes to the end of the workflow, and although it no longer gets stuck, it is still very slow.
Kind regards,
Nazareno.

Hi @Nazareno -

Have you tried increasing the memory allocated to KNIME, as described below? For example, on my laptop with 16 GB of memory, I set my Java heap space to 12 GB.

https://www.knime.com/faq#q4_2

This is generally a good thing to do, but it’s especially important for text processing workflows, which tend to be resource intensive.
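For reference, the heap size is controlled by the `-Xmx` JVM option in the `knime.ini` file in your KNIME installation folder, as described in the FAQ linked above. Assuming a machine with 16 GB of RAM, the relevant lines might look like this (the `-Xmx12g` value is just an example, not a universal recommendation):

```ini
-vmargs
-Xmx12g
```

Restart KNIME after editing the file for the change to take effect.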


Hi Scott,
last week we also tried that and set the memory to 8 GB, but it did not change much. Unfortunately, our PCs do not have 16 GB.
Do you think this might be the cause?
Cheers,
Nazareno.

Try this. I edited your workflow a bit (based on what Alexander already mentioned: removed the extra Strings to Document and filtering nodes, and moved the Bag of Words Creator to the end) and added a Timer node.

I ran this on my desktop, and it executed in just a couple of seconds. Granted, I’m only parsing the single document you linked to.

Is this any faster for you?

Dictionary_tagger_SF_Edit.knwf (35.0 KB)


Good morning Scott,
it is fast now!! I thought the Bag of Words Creator had to come first, but it seems not…
Thank you!
Cheers,
Nazareno.


This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.