Challenges in Analyzing Scanned PDF Documents

MarekV · February 27, 2025, 9:44am

Hi,
I’m not sure about the issue, but I need to analyze scanned PDF documents. The text is generally recognizable and readable, but in some cases, the scan did not go well. Some parts of the text are not marked. Unfortunately, this is a fact, and I have to deal with this disadvantage.

It involves loading approximately 50,000 PDF documents. So far, I have about 2,500 PDF documents available. I tried to load and check the condition of the scanned documents. I used Tika parser and PDF parser. The content is visible in the preview and is relatively okay.

However, if I want to select only certain parts of the text, I split the text using CELL SPLITTER. Unfortunately, I found that the scanned document does not have a line break. In the Tika Parser preview, it shows that there is a line, but such a character is not there. I don’t understand why line breaks are displayed in the Content preview, but there is no line when parsing the text. This causes me great problems because I cannot accurately select parts of the text from the document. If the line was identified, it would be relatively simple, but if it is not there, some words are repeated in different places, which prevents the correct identification of information.

The OCR I use is IBM Datacap OCR/A. Unfortunately, the information is highly sensitive, so I cannot include it here. I do not have access to the scanner, so I cannot simulate scanning on invented data. I am attaching only screenshot(Blue background ,text is recognized in pdf white not). Is there any way to identify the end of the line through Tika parser or PDF parser?

prashant7526 · February 27, 2025, 12:59pm

@MarekV , First of all, someone needs to know what is going on in your extraction. Don’t put scanned, but try to imitate the line with some dummy data in the same format that is coming in a single line (after the body extract).

Then only someone can help.

mlauber71 · February 27, 2025, 1:05pm

@MarekV I have tried these approaches to use local LLMs to extract information from PDFs. You could try to adapt these two approaches with Ollama and GPT4All. It will involves quite some prompting and then to define and force a JSON structure that you can then import.