Compare PDF files

Hi,
Is it possible to compare the structure of two PDF files?
I have monthly reports that are automatically compiled from the database into PDF format. I am using the Python PyPDF2 library to read the data and then compare it directly against the database report, to find possible differences or inconsistencies introduced when the information is passed to the PDF.

However, I must validate not only the data but also the format of the PDF, because the PDF generator sometimes produces inconsistencies: for example, data falling outside its corresponding field, or fields overlapping. Here is an example of what the correct format looks like and how it looks when corrupted:
[Image: Data] [Image: FixData]

Is there a way I can convert the PDF to an image and compare it against a standard format, to validate that the boxes, like the data, are always in the same position?

This is an interesting question. I don’t have that much background in image processing myself, but let me see if I can rope in someone who does.


Hi @daviddelos

There are no nodes that directly convert a PDF to an image, which is what you would need here. You could take a look at calling the poppler library from Python or Java. There is an example workflow in an old post of mine on this topic:

best,
Gabriel


Hi @gab1one
Thanks for answering.
Would what you suggest do the same as the Tika Parser node?
If so, it does not work for me. As you can see in the images in my post, when extracting images from the PDF, that node only saves the gray boxes as images, not the text, so I have no way to check whether the text is placed correctly in the box.

That is why I need to convert the entire PDF page itself to an image, in order to analyze whether the text ends up in the proper position.
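
To make the goal concrete, here is a rough sketch of the check I have in mind once both pages exist as rendered images. It assumes the pages were rendered to binary PPM files (the default output of poppler's pdftoppm when no format flag is given), with 8-bit color and no comment lines in the header. The names read_ppm and diff_bbox are my own placeholders, not part of any library.

```python
import re


def read_ppm(data):
    """Parse a binary PPM (P6) image and return (width, height, pixel_bytes).

    Assumes an 8-bit image whose header contains no comment lines.
    """
    match = re.match(rb"P6\s+(\d+)\s+(\d+)\s+(\d+)\s", data)
    if match is None:
        raise ValueError("not a binary PPM (P6) image")
    width, height, maxval = (int(group) for group in match.groups())
    pixels = data[match.end():]
    if maxval != 255 or len(pixels) != width * height * 3:
        raise ValueError("unsupported or truncated PPM data")
    return width, height, pixels


def diff_bbox(reference, candidate, tolerance=0):
    """Return (left, top, right, bottom) of the area where two pages differ.

    Returns None when no channel differs by more than `tolerance`.
    """
    ref_w, ref_h, ref_px = read_ppm(reference)
    cand_w, cand_h, cand_px = read_ppm(candidate)
    if (ref_w, ref_h) != (cand_w, cand_h):
        raise ValueError("page sizes differ")

    row_len = ref_w * 3
    left, top, right, bottom = ref_w, ref_h, -1, -1
    for y in range(ref_h):
        ref_row = ref_px[y * row_len:(y + 1) * row_len]
        cand_row = cand_px[y * row_len:(y + 1) * row_len]
        if ref_row == cand_row:
            continue  # identical rows are skipped cheaply
        for x in range(ref_w):
            a = ref_row[3 * x:3 * x + 3]
            b = cand_row[3 * x:3 * x + 3]
            if any(abs(p - q) > tolerance for p, q in zip(a, b)):
                left, right = min(left, x), max(right, x)
                top, bottom = min(top, y), max(bottom, y)
    return None if right < 0 else (left, top, right, bottom)
```

If diff_bbox returns None for a page rendered from a template with the same content, the layout matches; otherwise the returned box shows where a field or its text has moved. For real reports the data itself changes every month, so the comparison would probably have to be limited to the box outlines or to known field regions.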

I think you need to create a new workflow, similar to the one I posted, that uses different poppler commands to get a rendered image of each whole PDF page instead of the pictures contained in it.
I did not explain that, so I am sorry I caused that confusion.
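
As a rough illustration only, and not the workflow from my earlier post: poppler ships a command-line tool called pdftoppm that renders each page as an image. The sketch below calls it from Python and assumes pdftoppm is installed and on the PATH. The function names, the "page" file prefix, and the 150 dpi default are arbitrary choices of mine.

```python
import subprocess
from pathlib import Path


def build_pdftoppm_command(pdf_path, out_prefix, dpi=150):
    """Build the pdftoppm call that renders every page of a PDF to PNG."""
    # -png selects PNG output, -r sets the resolution in dots per inch.
    return ["pdftoppm", "-png", "-r", str(dpi), str(pdf_path), str(out_prefix)]


def render_pdf_pages(pdf_path, out_dir, dpi=150):
    """Render all pages into out_dir and return the image paths, sorted."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    prefix = out_dir / "page"
    # check=True raises CalledProcessError if pdftoppm reports a failure.
    subprocess.run(build_pdftoppm_command(pdf_path, prefix, dpi), check=True)
    # pdftoppm appends the page number to the prefix, e.g. page-1.png.
    return sorted(out_dir.glob("page-*.png"))


# Example call (needs poppler installed and an existing file):
# pages = render_pdf_pages("report.pdf", "rendered")
```

The rendered pages include the text as it is drawn on the page, so they can be compared against a rendering of a known good report.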

best,
Gabriel
