I have got an issue with the Tika Parser node. I read PDFs from disc but the content column has the content duplicated several times. Here is an example PDF: https://kursplan.lnu.se/utbildningsplaner/utbildningsplan-VASIN-2.pdf
You can reproduce that behaviour by downloading the PDF, parsing it using the Tika Parser and copying the extracted content from it into an external editor (or use other nodes to post-process the content). This doesn’t happen with all PDF documents, but I have many of them that the Tika parser duplicates. The provided example results in 4x duplicated content.
Any idea what’s wrong there?
VASIN-2_content.docx (40.9 KB)
Hi @ante and welcome to the forum.
Thanks for uploading a sample document for us to check with. I can reproduce the behavior you’re seeing, and when I asked internally, one of our other data scientists mentioned he had run across this too. We’ll do a little more digging and come back with an update.
Can you reproduce (or not) this behavior with a non-KNIME engine? For instance with Python or a commercially available tool? I’m investigating if this is an issue with the PDF itself and not the Tika Parser. Thank you.
I’ve reproduced this behavior in Python using the Tika library, which suggests that the cause is either the PDF itself or the Tika engine that KNIME uses.
Update: It is indeed the PDF causing this issue. I tried another Python package and it also creates 4 copies of the text.
Notice that the word "intensive" appears 8 times in the extracted text, whereas it appears only 2 times in the original document.
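If you want to check the duplication factor yourself, a word count is a quick way to do it. This is only a minimal sketch: the helper name `count_word` is made up for illustration, and the two strings are hypothetical stand-ins, not the real text of the PDF.

```python
import re


def count_word(text: str, word: str) -> int:
    """Count whole-word, case-insensitive occurrences of `word` in `text`."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


# Hypothetical stand-ins: NOT the actual content of the example PDF.
original = "An intensive course. Another intensive module."
extracted = " ".join([original] * 4)  # simulates the 4x duplicated output

print(count_word(original, "intensive"))   # 2
print(count_word(extracted, "intensive"))  # 8
```

Dividing the count in the extracted text by the count you see in the document gives a rough estimate of how many copies the parser produced.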
Shameless self promotion:
I’ll also be hosting a free webinar on PDF text extraction using regex, KNIME, and Python, if you’re interested:
Come learn how to read and extract data from PDFs using KNIME with Regex and Python.
Thank you @ScottF for the confirmation and @victor_palacios for the investigation.
But isn’t it too easy to just blame the PDF itself? As a comparison, I have tried with an online tool which works fine (but couldn’t find out what kind of engine it uses). Maybe there is another Java or Python library that works and can be used with the Java snippet or Python script node. Or do you have any suggestion for a simple post-processing fix?
It may sound odd, but in my experience Adobe Acrobat recognizes tables worse than Word does. Also, a PDF file may contain multiple overlapping frames. If you open it with MS Word and re-save it as a PDF, it becomes plain text without frames. Possibly Adobe does this for security purposes, to preserve the original content. Unfortunately, this makes text recognition much harder.
I would try:
Tika Parser → Cell Splitter (split the content into sentences at each period) → Duplicate Row Filter
Let me know whether that removes the duplicated text.
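If you would rather do this in a Python Script node, the same idea (split into sentences, then keep only the first occurrence of each) could look roughly like the sketch below. The function name `drop_repeated_sentences` is hypothetical, the sentence splitting is deliberately naive, and I am assuming the copies are exact repeats apart from whitespace and letter case.

```python
import re


def drop_repeated_sentences(text: str) -> str:
    """Keep the first occurrence of each sentence, preserving order."""
    # Naive split: break after '.', '!' or '?' when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    seen = set()
    kept = []
    for sentence in sentences:
        # Normalize whitespace and case so trivially different copies match.
        key = " ".join(sentence.split()).lower()
        if key and key not in seen:
            seen.add(key)
            kept.append(sentence.strip())
    return " ".join(kept)


# Hypothetical sample text, not taken from the real PDF.
sample = "First sentence. Second one. First sentence. Second one. Third."
print(drop_repeated_sentences(sample))  # First sentence. Second one. Third.
```

Keep in mind that this also removes sentences that are legitimately repeated in the document, and that abbreviations such as "e.g." will be split in the wrong place, so check the result on a few of your PDFs first.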