Matching correlations between PDFs, Text Processing


#1

Hello everyone,
Thank you ahead of time for your patience. I am slowly getting better at understanding what KNIME is capable of processing.

Is there a workflow out there that can generate and compile a list of correlating PDFs? In other words, a workflow that can compare two PDFs and determine whether there is a high correlation between them. I am currently opening a bunch of PDFs in Spanish and translating them, only to find that I am repeatedly copying over the same text from multiple PDFs that are merely formatted differently. I am dealing with a lot of messy information documented in PDFs, and these PDFs must have been translated and copied over multiple times in their lifetime. Ideally, the system would loop so that I could compare 50 or so documents. I know this isn't a simple workflow, but I'm trying to figure out where to start. I've watched multiple videos on YouTube and scrolled through this forum. Please tag any previous discussions that could possibly help me generate some ideas. Anything will be much appreciated.

Thank you
Michael
Graduate Student


#2

Hi,

I am not aware of any solution designed for a task like yours. However, instead of correlation you could try clustering the documents with the k-means method. See the screenshot below as well as the attached workflow.

The first few nodes preprocess the text and create a term vector, which is used (in our simplified approach) as a document’s fingerprint. Clustering on the vector items then groups similar documents into clusters. There is no predetermined optimal number of clusters, so you need to find it yourself by experiment: check the “number of clusters” option in the “k-means” node. We also do not take term frequency into account (which would also be possible), only the presence of a specific term in a document. The solution is far from perfect, but I hope it helps.
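
For anyone who wants to see the idea outside KNIME, here is a minimal Python sketch of the same approach: binary term vectors (presence only, no term frequency) followed by k-means. It is not an export of the attached workflow. It assumes the text has already been extracted from the PDFs, and the function names, the tiny stop-word list, the sample texts, and the deterministic "farthest point" starting centroids are all made up for illustration.

```python
import re

# Tiny illustrative stop-word list (an assumption; use a fuller list for real data).
STOP_WORDS = {"the", "a", "of", "and", "in", "to", "is", "el", "la", "de", "y", "en"}


def tokenize(text):
    """Lower-case, keep runs of letters (accents included), drop stop words and short tokens."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    return {w for w in words if len(w) > 2 and w not in STOP_WORDS}


def binary_vectors(docs):
    """Return (names, vocabulary, vectors); an entry is 1.0 if the term is present, else 0.0."""
    term_sets = {name: tokenize(text) for name, text in docs.items()}
    vocab = sorted(set().union(*term_sets.values()))
    names = list(docs)
    vectors = [[1.0 if t in term_sets[n] else 0.0 for t in vocab] for n in names]
    return names, vocab, vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def init_centroids(vectors, k):
    """Deterministic start: first vector, then repeatedly the vector farthest from all chosen ones."""
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        farthest = max(vectors, key=lambda v: min(sq_dist(v, c) for c in centroids))
        centroids.append(list(farthest))
    return centroids


def k_means(vectors, k, max_iter=100):
    """Plain Lloyd iteration: assign each vector to its nearest centroid, then recompute means."""
    centroids = init_centroids(vectors, k)
    labels = [-1] * len(vectors)
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: sq_dist(v, centroids[c])) for v in vectors]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # leave an empty cluster's centroid where it was
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical sample texts standing in for text extracted from four PDFs.
docs = {
    "a.pdf": "The contract between the supplier and the buyer covers delivery of steel pipes.",
    "b.pdf": "Contract between supplier and buyer:   covers DELIVERY of steel pipes",
    "c.pdf": "Weather report: heavy rain and strong wind expected over the coast tomorrow.",
    "d.pdf": "Weather report - heavy rain, strong wind expected over the coast tomorrow",
}

names, vocab, vectors = binary_vectors(docs)
labels = k_means(vectors, k=2)
for name, label in zip(names, labels):
    print(name, "-> cluster", label)
```

Documents that differ only in formatting end up with the same term vector and therefore in the same cluster. As in the workflow, the number of clusters `k` has to be chosen by trial and error.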

Martin K.

image

PDF_clustering.knwf (38.0 KB)