Slow text mining PDF processing

Hello all,
I have a text mining workflow to process PDF files. I am using the Dictionary Tagger because I would like to find specific words in reports.
However, when I upload a PDF file, e.g. a scientific paper of 12 pages, KNIME gets stuck or the workflow is extremely slow.
Do you know why?

Thanks all!
Cheers,
Nazareno.

Hi,
is there a specific node that slows things down, or is it the whole workflow? Can you share the workflow and an example file with us so we can analyze the runtime better?
Kind regards
Alexander


Hi Alexander,
Thanks for the answer.
The “Number Filter” node is quite slow, and then the workflow gets stuck at the “Punctuation Erasure” node.

workflow: Dictionary tagger.knwf (28.5 KB)
Paper: https://reader.elsevier.com/reader/sd/pii/S0160412017322328?token=53A0814768C9C1D62B75AC40AEFD05039DEB3419931BC9967CFD66757576E2C47224CC0E6F27A0167E09A3E1717CF726


Hi,
can you also share the file “OHEJPGlossary.xlsx” so I can execute the workflow myself?
Kind regards
Alexander

Hi Alexander,
for several reasons I cannot share the original file. However, here is an example file you can use instead.
Best regards,
Nazareno.

OHEJPGlossary2.xlsx (9.0 KB)

Hi,
I think the problem is that you have the Bag of Words Creator before all the preprocessing. It should come after the preprocessing, because otherwise you apply the preprocessing to the document reference stored in each individual term's row. If you have a look at your data, you will see that the Bag of Words Creator creates one row for each term, but also keeps a reference to the original document in another column. The preprocessing nodes do not know that it is always the same document and simply do their work on every row.
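To illustrate the effect Alexander describes, here is a minimal, hypothetical Python sketch (not KNIME's actual implementation): each bag-of-words row carries a reference to the whole document, so a per-row preprocessing step such as punctuation erasure touches the same document once per unique term instead of once per document. The `preprocess` function and the row layout are assumptions for illustration only.

```python
def preprocess(text):
    """Toy 'Punctuation Erasure' step: strip punctuation from a document."""
    return "".join(ch for ch in text if ch.isalnum() or ch.isspace())

document = "Hello, world! Hello again, world."

# Unique terms of the document (a toy stand-in for the Bag of Words Creator).
terms = set(preprocess(document).lower().split())

# Bag-of-words layout: one row per term, each row referencing the full document.
bow_rows = [{"term": t, "document": document} for t in sorted(terms)]

# Preprocessing AFTER the Bag of Words: the document is cleaned once per row.
calls_after = 0
for row in bow_rows:
    row["document"] = preprocess(row["document"])
    calls_after += 1

# Preprocessing BEFORE the Bag of Words: the document is cleaned exactly once.
calls_before = 1

print(calls_after, calls_before)  # 3 vs 1 for this three-term toy document
```

For a 12-page paper with thousands of unique terms, the "after" ordering means thousands of passes over the same document, which matches the slowdown reported here.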
Another thing that is strange in your workflow: you use Strings to Document, then a Column Filter that removes the document column, and then Strings to Document again. That seems redundant.
Kind regards,
Alexander


Hi,
I thought that the Bag of Words Creator always had to be at the beginning. I moved the Bag of Words Creator, Tag Filter, and TF nodes to the end of the workflow, and although it no longer gets stuck, it is still very slow.
Kind regards,
Nazareno.

Hi @Nazareno -

Have you tried increasing the memory allocated to KNIME, as described below? For example, on my laptop with 16 GB of memory, I set my Java heap space to 12 GB.

https://www.knime.com/faq#q4_2

This is generally a good thing to do, but it’s especially important for text processing workflows, which tend to be resource intensive.
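For reference, the heap size is controlled by the `-Xmx` JVM option in the `knime.ini` file in your KNIME installation folder, as described in the FAQ linked above. Assuming a machine with 16 GB of RAM, the relevant lines might look like this (the `-Xmx12g` value is just an example, not a universal recommendation):

```ini
-vmargs
-Xmx12g
```

Restart KNIME after editing the file for the change to take effect.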


Hi Scott,
last week we also tried that and set the memory to 8 GB, but it did not change much. Unfortunately, our PCs do not have 16 GB.
Do you think this might be the cause?
Cheers,
Nazareno.

Try this. I edited your workflow a bit (based on what Alexander already mentioned: removed the extra Strings to Document and filtering nodes, and moved the Bag of Words Creator to the end) and added a Timer node.

I ran this on my desktop, and it executed in just a couple of seconds. Granted, I’m only parsing the single document you linked to.

Is this any faster for you?

Dictionary_tagger_SF_Edit.knwf (35.0 KB)


Good morning Scott,
it is fast now!! I thought the Bag of Words Creator had to come first, but it seems not…
Thank you!
Cheers,
Nazareno.


This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.