Find best matching documents (PDF, Word) for a given text input

Hello,

I'm fairly new to KNIME. My task: build a search function over different types of documents (mostly PDF or Word documents) that finds the best matching documents for a given text input. Matches do not have to be exact, e.g. text input: damage; match in docs: scratch, etc.

What I've done so far for PDF files (attached workflow):

-data import

-preprocessing (finished, in my opinion)

Now I'm struggling with the transformation and clustering part. I have already searched the forum and the example workflows and looked at the similarity search thread https://www.knime.com/forum/r-statistics-nodes-and-integration/similarity-search , but it hasn't helped so far. I'm not sure what I have to do next.

I hope you understand where my problem is.

 

fynn

Hi Fynn,

With a similarity search based on the cosine distance between basic document vectors and the query vector, you will not overcome the problem of exact matching. Of course you can do stemming or lemmatization, which helps a bit with this problem (especially lemmatization), but you will still not be able to match damage to scratch.
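To illustrate the limitation, here is a minimal Python sketch (not KNIME nodes) of cosine similarity on plain term-count vectors. The helper names and the example sentences are made up for illustration. A document that contains "scratch" but not "damage" shares no term with the query, so its similarity is zero.

```python
import math
from collections import Counter


def bow(text):
    """Lowercase the text, split on whitespace, and count the terms."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity of two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up example texts.
query = bow("damage")
doc_exact = bow("report of damage on the left door")
doc_related = bow("deep scratch on the left door")

sim_exact = cosine_similarity(query, doc_exact)      # shares the term "damage"
sim_related = cosine_similarity(query, doc_related)  # shares no term with the query
```

Only the document that literally contains the query term gets a non-zero score, no matter how closely related the other one is in meaning.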

What you are aiming for is semantic matching or semantic search. This is not an easy task.

Methods that you could try are e.g. LSI/LSA and Doc2Vec. Similarity search using cosine distance on LSI vectors or Doc2Vec vectors of documents could work.
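For reference, the usual textbook formulation of LSI/LSA can be written as follows. The symbols are notation introduced here, not taken from the workflow: X is the term-document matrix (terms as rows), k the number of latent dimensions kept, d a document's term vector and q the query's term vector.

```latex
X \approx U_k \Sigma_k V_k^{\top}, \qquad
\hat{d} = \Sigma_k^{-1} U_k^{\top} d, \qquad
\hat{q} = \Sigma_k^{-1} U_k^{\top} q, \qquad
\operatorname{sim}(q, d) = \frac{\hat{q} \cdot \hat{d}}{\lVert \hat{q} \rVert \, \lVert \hat{d} \rVert}
```

Because terms that tend to occur in the same documents end up close together in the k-dimensional space, a query and a document can be similar there without sharing a literal term.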

Another possible method, if you want to use basic document vectors, would be e.g. keyword expansion of the search query by e.g. frequent term neighbors.
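A rough Python sketch of such a query expansion, assuming "term neighbors" means terms that occur in the same documents as the query term. The function names, the thresholds `top_n` and `min_count`, and the tiny corpus are all hypothetical.

```python
from collections import Counter, defaultdict


def cooccurrence_neighbors(documents):
    """Map each term to a Counter of terms that share a document with it."""
    neighbors = defaultdict(Counter)
    for text in documents:
        terms = set(text.lower().split())
        for term in terms:
            for other in terms:
                if other != term:
                    neighbors[term][other] += 1
    return neighbors


def expand_query(query, neighbors, top_n=2, min_count=2):
    """Add up to top_n neighbors seen with a query term in at least min_count documents."""
    expanded = query.lower().split()
    for term in list(expanded):
        for other, count in neighbors[term].most_common(top_n):
            if count >= min_count and other not in expanded:
                expanded.append(other)
    return expanded


# Made-up, already preprocessed documents (stop words removed).
corpus = [
    "damage scratch door",
    "damage scratch hood",
    "damage dent roof",
    "invoice payment date",
]
neighbors = cooccurrence_neighbors(corpus)
expanded = expand_query("damage", neighbors)  # "scratch" co-occurs twice with "damage"
```

The expanded query can then be turned into a document vector and used in the same cosine-based search as before.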

Or you could use a dictionary approach or a semantic network approach and also index the documents by the synonyms, generalizations, and concepts of their terms.
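As a minimal sketch of the dictionary idea in Python: the `CONCEPTS` mapping below is a hypothetical, hand-made dictionary (a real one would come from a thesaurus or a semantic network), and the file names and texts are invented.

```python
from collections import defaultdict

# Hypothetical hand-made dictionary mapping terms to a shared concept.
CONCEPTS = {
    "damage": "damage",
    "scratch": "damage",
    "dent": "damage",
    "invoice": "billing",
    "payment": "billing",
}


def build_concept_index(documents):
    """Index document ids by the concept of each term (or by the term itself if unknown)."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[CONCEPTS.get(term, term)].add(doc_id)
    return index


def search(query, index):
    """Return the ids of documents that share at least one concept with the query."""
    hits = set()
    for term in query.lower().split():
        hits |= index.get(CONCEPTS.get(term, term), set())
    return sorted(hits)


# Made-up documents.
docs = {
    "a.pdf": "deep scratch on the door",
    "b.pdf": "payment received",
    "c.pdf": "small dent in the roof",
}
index = build_concept_index(docs)
results = search("damage", index)
```

Here the query "damage" finds the documents mentioning "scratch" and "dent", because all three terms are indexed under the same concept.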

 

Regarding your workflow and clustering problem:

I don't fully understand why you want to do a clustering on the results of the similarity search. The result of the similarity search gives you the most similar documents to your query. This is already your search result.

A clustering of documents would make sense before searching them. You could e.g. cluster them using k-medoids or k-means into X groups. Then you run the similarity search only on the cluster means (or medoids) instead of on every document.
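A small Python sketch of this cluster-first search, under some assumptions: the cluster assignment is taken as given (e.g. from an earlier k-means run), the vectors are invented, and ranking the members of the best cluster afterwards is one possible way to produce the final result list.

```python
import math


def cosine(a, b):
    """Cosine similarity of two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def centroid(vectors):
    """Component-wise mean of equally long vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def search_via_centroids(query, clusters):
    """Pick the cluster whose mean is closest to the query, then rank only its members."""
    means = {label: centroid(list(members.values())) for label, members in clusters.items()}
    best = max(means, key=lambda label: cosine(query, means[label]))
    members = clusters[best]
    ranked = sorted(members, key=lambda doc: cosine(query, members[doc]), reverse=True)
    return best, ranked


# Hypothetical result of a clustering step: cluster label -> {document: vector}.
clusters = {
    0: {"a.pdf": [3.0, 1.0, 0.0], "b.pdf": [2.0, 2.0, 0.0]},
    1: {"c.pdf": [0.0, 0.0, 4.0], "d.pdf": [0.0, 1.0, 3.0]},
}
best_cluster, ranking = search_via_centroids([1.0, 0.0, 0.0], clusters)
```

The query is compared against X means first, and only the documents of the closest group are compared individually, which saves work on large collections.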

Hierarchical clustering makes sense if you have only a few documents e.g. below 2000 or so. The algorithm is computationally expensive and not useful for larger data sets. Alternatively use k-medoids or k-means.

I hope this helps.

Cheers, Kilian

Hello Kilian,
Thanks for your quick and helpful answer. I realised that the task isn't that easy to solve, so I decided to break it down into a simpler workflow (semantic search isn't off the table, I will give it a try later). Now the workflow just searches for exact matches (String Matcher node) and lists the best matching documents in an Excel file.

Maybe you can help as well with two new problems.

  1. File path:
    -in the end it would be awesome if the workflow opened my matched documents with a single mouse click
    -at the moment the list of documents is written to an Excel file (name, distance, number of matches, file path), where I turn the file path into a hyperlink
    -the extracted file path has an issue: it is the path of the document as parsed in KNIME, e.g. M:\bbw88019\user\knime_3.5.2"M:\bbw88019\user\fynn\examples for knime\textmining\example.pdf" (this will not open via the hyperlink)
    I would like to have the exact file path in my directory, e.g. M:\bbw88019\user\fynn\examples for knime\textmining\example.pdf
    Do you understand where my problem is? I just need the part inside the quotation marks.

  2. Term position in a document:
    -later it would be helpful to have the position of the matched word in a document (in documents with many pages it takes a lot of time to find the matched word)
    -is there a node or another way to get that kind of information, or to highlight the word in the original PDF?
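Both points could be sketched in plain Python, for example as a starting point for a scripting node. This is only an illustration under assumptions: the function names are made up, the path is the example from item 1, and the page texts are invented (getting the text of each page out of the PDF is assumed to happen elsewhere). Highlighting inside the original PDF is not covered.

```python
import re


def extract_quoted_path(raw):
    """Return the text between the first pair of double quotes, or raw unchanged if there is none."""
    match = re.search(r'"([^"]+)"', raw)
    return match.group(1) if match else raw


def find_term_positions(pages, term):
    """Return (page number, character offset) for each case-insensitive whole-word match."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    positions = []
    for page_number, text in enumerate(pages, start=1):
        for match in pattern.finditer(text):
            positions.append((page_number, match.start()))
    return positions


# The path string from item 1.
raw = r'M:\bbw88019\user\knime_3.5.2"M:\bbw88019\user\fynn\examples for knime\textmining\example.pdf"'
clean = extract_quoted_path(raw)

# Hypothetical page texts, one string per page.
pages = ["Inspection report for the vehicle.", "Damage found: damage to the left door."]
hits = find_term_positions(pages, "damage")
```

`clean` then holds only the part inside the quotation marks, and `hits` lists the page number and the offset within that page for every match, which could be written to the Excel file next to the file path.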

I hope you can help me with my new problem.

Cheers, fynn

simple search function.knwf (56.9 KB)