Hi there!
I have an issue with the Tika Parser node. I read PDFs from disk, but the content column contains the extracted text duplicated several times. Here is an example PDF: https://kursplan.lnu.se/utbildningsplaner/utbildningsplan-VASIN-2.pdf
You can reproduce this behaviour by downloading the PDF, parsing it with the Tika Parser, and copying the extracted content into an external editor (or using other nodes to post-process the content). This doesn’t happen with all PDF documents, but many of mine are affected. The example above results in 4x duplicated content.
Thanks for uploading a sample document for us to check with. I can reproduce the behavior you’re seeing, and when I asked internally, one of our other data scientists mentioned he had run across this too. We’ll do a little more digging and come back with an update.
Can you reproduce (or not) this behavior with a non-KNIME engine, for instance with Python or a commercially available tool? I’m investigating whether this is an issue with the PDF itself rather than with the Tika Parser. Thank you.
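For anyone who wants to try this outside of KNIME, here is a minimal sketch. It assumes the `tika` (tika-python) and `pdfminer.six` packages are installed, and that the sample PDF has been downloaded locally; the filename used below is just an assumption:

```python
# Compare text extraction from Apache Tika (via tika-python, which runs a
# local Tika server) against pdfminer.six, to see whether the duplication
# comes from Tika or from the PDF itself.
from tika import parser                       # pip install tika
from pdfminer.high_level import extract_text  # pip install pdfminer.six

PDF_PATH = "utbildningsplan-VASIN-2.pdf"  # hypothetical local copy of the sample PDF

tika_text = (parser.from_file(PDF_PATH).get("content") or "").strip()
pdfminer_text = extract_text(PDF_PATH).strip()

# If Tika duplicates the content, its output will be roughly an integer
# multiple of the pdfminer output in length.
print("tika length:    ", len(tika_text))
print("pdfminer length:", len(pdfminer_text))
print("ratio:          ", round(len(tika_text) / max(len(pdfminer_text), 1), 2))
```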
Thank you @ScottF for the confirmation and @victor_palacios for the investigation.
But isn’t it too easy to just blame the PDF itself? As a comparison, I tried an online tool, which works fine (though I couldn’t find out what kind of engine it uses). Maybe there is another Java or Python library that works and can be used with the Java Snippet or Python Script node. Or do you have any suggestion for a simple post-processing fix?
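As a possible starting point for such a post-processing fix, here is a minimal sketch in plain Python (it could be adapted for the Python Script node; the function name `collapse_exact_repeats` is hypothetical). It only handles the case where the extracted string is one block repeated verbatim, which matches the 4x duplication described above:

```python
def collapse_exact_repeats(text: str) -> str:
    """Collapse a string that consists of one block repeated verbatim k times.

    If no exact repetition is found, the text is returned unchanged. Note that
    this only works when the duplicated copies are byte-for-byte identical.
    """
    n = len(text)
    # Try every block length that divides the total length; the shortest block
    # that reconstructs the full text is the repeated unit.
    for length in range(1, n // 2 + 1):
        if n % length == 0 and text[:length] * (n // length) == text:
            return text[:length]
    return text


# Example: "abcabcabc" collapses to "abc"; non-repeating text is left alone.
print(collapse_exact_repeats("abcabcabc"))    # -> "abc"
print(collapse_exact_repeats("unique text"))  # -> "unique text"
```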
It may sound odd, but in my experience Adobe Acrobat recognizes tables worse than Word does. Also, a PDF file may contain multiple overlapping frames. If you open it with MS Word and re-save it as PDF, it becomes plain text without frames. Possibly Adobe does this for security purposes, to preserve the original content. Unfortunately, this makes text recognition much harder.