Hi,
I’m not sure about the issue, but I need to analyze scanned PDF documents. The text is generally recognizable and readable, but in some cases, the scan did not go well. Some parts of the text are not marked. Unfortunately, this is a fact, and I have to deal with this disadvantage.
It involves loading approximately 50,000 PDF documents. So far, I have about 2,500 PDF documents available. I tried to load and check the condition of the scanned documents. I used Tika parser and PDF parser. The content is visible in the preview and is relatively okay.
However, if I want to select only certain parts of the text, I split the text using CELL SPLITTER. Unfortunately, I found that the scanned document does not have a line break. In the Tika Parser preview, it shows that there is a line, but such a character is not there. I don’t understand why line breaks are displayed in the Content preview, but there is no line when parsing the text. This causes me great problems because I cannot accurately select parts of the text from the document. If the line was identified, it would be relatively simple, but if it is not there, some words are repeated in different places, which prevents the correct identification of information.
The OCR I use is IBM Datacap OCR/A. Unfortunately, the information is highly sensitive, so I cannot include it here. I do not have access to the scanner, so I cannot simulate scanning on invented data. I am attaching only screenshot(Blue background ,text is recognized in pdf white not). Is there any way to identify the end of the line through Tika parser or PDF parser?