Text Processing PDF - Preprocessing of PDF for Text Mining

Hello, I am a beginner, so I hope my question is not too simple. I am trying to process an equity report in PDF format, and I tried the Tika Parser for this. The issue is that the PDF also contains diagrams and other visuals, and the Tika Parser does not respect the structure of the PDF when converting it to text. I also cannot simply filter out “numbers”, because I would then lose the “relevant numbers” within the text. So basically I just want to process the “core” text later on. Do you have any idea how best to tackle this issue? Is there an advanced filter that I can apply?

Thanks a lot!

Hi @JacquelineR188,

parsing PDFs can indeed be quite cumbersome, since PDFs are not really designed to be parsed. There are different formats and schemas that define how PDFs are written, which often makes it hard to read them properly. The Tika Parser is the way to go; however, there is no out-of-the-box filter that extracts just the core text. You would need to try different preprocessing nodes until you get an acceptable result.
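
To illustrate what such a preprocessing step could look like, here is a minimal Python sketch of a line filter. It is not a KNIME node or a Tika feature, just one possible heuristic: it assumes the extracted text arrives as plain text with one line per text fragment, and it keeps only lines that contain enough words and are mostly letters. The function name `keep_core_text`, the thresholds, and the sample text are made up for illustration and would need tuning on the real report.

```python
import re


def keep_core_text(text, min_words=5, min_alpha_ratio=0.6):
    """Keep lines that look like prose; drop lines that look like chart residue.

    Both thresholds are illustrative assumptions, not recommended values.
    """
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        # Words are runs of at least two letters (Unicode-aware).
        words = re.findall(r"[^\W\d_]{2,}", stripped)
        non_space = [c for c in stripped if not c.isspace()]
        alpha_ratio = sum(c.isalpha() for c in non_space) / len(non_space)
        # Sentences with a few numbers pass; axis ticks and labels do not.
        if len(words) >= min_words and alpha_ratio >= min_alpha_ratio:
            kept.append(stripped)
    return "\n".join(kept)


# Hypothetical extracted text: two sentences and two lines of axis labels.
sample = (
    "Revenue grew 12% to EUR 3.4bn in 2019, driven by higher trading volumes.\n"
    "2017 2018 2019 2020E\n"
    "0 5 10 15 20 25\n"
    "We keep our target price of EUR 8.50 unchanged."
)
core = keep_core_text(sample)
print(core)
```

Because the filter works on whole lines rather than on individual numbers, figures that appear inside a sentence are kept. Short headings or short sentences would be dropped with these thresholds, so they are a starting point at best.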

Best,
Julian


Hi Julian,

thanks for your response. I also realized that preprocessing a PDF containing many different diagrams and images can be more time-consuming and difficult than expected. However, I found a pattern within my PDF: figures are enclosed by the symbol “|”. Do you think I could define a regular expression to extract all the necessary information between these symbols? And if so, do you know whether there is an example workflow that also cleans PDFs using regex?

Thanks a lot!

Hey @JacquelineR188,

yes, that should be possible.
Would it be possible to send a small example from the PDF? I would like to have a look and give it a try.
Otherwise, I’d recommend having a look at the String Manipulation node, which provides several functions to process and manipulate text, for instance the regexReplace function, which lets you replace matching expressions with another string. The Column Expressions node provides the same functionality, but there you can also write JavaScript expressions to work with regular expressions.
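
To make the regex idea concrete, here is a small sketch in plain Python. It uses Python's `re` module, not the KNIME nodes. It assumes that every figure is wrapped in exactly one opening and one closing “|” and that “|” does not occur anywhere else in the text. The helper names and the sample sentence are invented for illustration. The same pattern could probably be reused in regexReplace, but I have not tested it there.

```python
import re

# One span: a pipe, anything that is not a pipe (newlines included), a pipe.
PIPE_SPAN = re.compile(r"\|([^|]*)\|")


def extract_between_pipes(text):
    """Return the content of every |...| span, trimmed of whitespace."""
    return [match.strip() for match in PIPE_SPAN.findall(text)]


def remove_pipe_spans(text):
    """Drop every |...| span and collapse the whitespace left behind."""
    without_spans = PIPE_SPAN.sub(" ", text)
    return re.sub(r"\s+", " ", without_spans).strip()


# Hypothetical example text, not taken from the actual report.
sample = "Net income rose 8% in Q3. | Figure 2: Net income 2015-2019 | Costs fell 3%."
figures = extract_between_pipes(sample)
core_text = remove_pipe_spans(sample)
print(figures)
print(core_text)
```

Depending on whether you want the figure content or the surrounding text, you would use the first or the second helper. If a “|” is missing or appears in normal text, the pairing shifts and the wrong parts get matched, so it is worth checking the number of “|” symbols per page first.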

Best,
Julian


Yes, I can do that. Unfortunately, the forum says that PDFs are not allowed to be uploaded. Is there any other option?

Best,
Jacqueline

I just attached the front page of the report and the fourth page as an example in PNG format.

Thank you,
Jacqueline

Hi @JacquelineR188,

can you upload it elsewhere and send the link to me via direct message?
I will have a look then.

Best,
Julian

Hi @julian.bunzel,

I uploaded the PDF file to Google Drive. Here’s the link: DeutscheBank.pdf - Google Drive

Will that work for you? Thank you very much for offering to help :smiley: and sorry for all the inconvenience.

Best, Jacqueline

