I'm currently working on a use case to match one document to a corpus of a large number of documents and find the document with the closest match. I tried the string similarity node but the results don't make sense. I did some research and found out that Latent Semantic Analysis might solve this use case. But Knime seems not to offer this. Are there alternatives or do I have to make use of R (for instance)
Whatever the algorithm to be applied, the documents, including the one to be matched, should be preprocessed. I suppose that's been done.
If the String Distances node is connected to Similarity Search, you'll have a wider offer of string distance functions at your disposal. The choice depends indirectly on considerations such as importance of word order, word frequency, document length, etc.
If this simple approach fails, you could try the Bag of Words approach. For your application, you would boost rarer words rather than only frequent words, so the combined term weight TFxIDF is appropriate to build a term dictionary. After term feature selection, Similiarity Search can be applied using a numeric distance function. The choice of the function depends on whether bigger differences should be penalised or not.
The advantage of the Bag of Words approach is that you'll be able to trace the search results to the source. This will give you precious insight into the adequacy of the preprocessing and term dictionary steps.
Only after all this failed, I would go for more complex methods.
Thanks Geo. I tried the string similarity but the length and content of the documents differs to much to get good results.
The Bag of words approach sounds interesting. I have doen alle the preprocessing and have a wordvector available. I used the similarity search node to find the closest documents. I already see some interesting things but need to get rid of the noise to get usable results
Which distance functions did you use ? For the BoW based analysis, Cosine or Manhattan distance may be a good choice. It should also be a good idea to normalize for document length, e.g. by dividing the term weights by the number of words or characters per document. Otherwise big documents will always dominate the term dictionary and the resulting document vector will produce after all not so relevant results for your queries. Finally, you may have to clean some domain specific stop words, which are easily identified by the absolute TF for all documents.
Thanks again for the advise. Normalizing for document length: to be sure, this is also necessary in case of relative TF? I expect to normalize against the count of terms involved in the distance calculation, not for the original document length, right? And how to do this in practice? I can calculate the number of (involved) terms per document. You suggest to devide the term weights by this number. But isn't is possible (and easier) to devide the result of the distance calculation by the number of (involved) terms per document? If not possible, how do I devide all numbers in an entire (and dynamic) wordvector by the number of words?
I already cleaned for domain specific stopwords. I also analyzed the most relevant words per category and filtered the wordvector for these words. The result seems to be better then with the complete wordvector.
By adjusting the term weights for document length, bigger documents will be less likely to populate the relevant term dictionary. As a consequence, the said term dictionary will better represent the overall corpus. Whether you use relative TF or absolute TF does not matter. Actually, the most known term weight used for information retrieval is TFxIDF, which will boost the TF of rarer terms.
Perform the document length adjustment on the unpivoted bag of words, where you have a single term weight column. A single Math node will do, after you've extracted the number of terms using Document Data Extractor (on the cleaned documents). After this step, proceed with generating the term dictionary and creating the document vector based on the dictionary. Then apply the same preprocessing on the query document(s), except that any words not present in the term dictionary should be filtered out.
Thanks for this article.Good overview of this complex subject.
I made some progress with similarity search and also implemented normalization of document length. Sometimes the flow gives very good results, sometimes not. I am thinking of an automated feedback mechanism of relevancy of words. Say I got ten results of my matching, 8 good and 2 poor. If you evaluate the differences between the good and the poor wordvectors you will find differences. I'm thinking of a mechanism to add the terms that lead to a wrong match to a reference table that excludes these terms in future for the matching. To break this question down peaces:
- Is it possible to give feedback on a flow?
- I need an algorithm that searches for differences in wordvectors and find the terms of the 2 bad matches that differ from the 8 good
- The result of the algorithm will be a reference table that excludes columns if needed
There may be one more optimization to look for: how do you prepare the string to search for ? For example, IDF cannot be calculated based on the search string but IMO should be "imported" per term from the corpus. The same goes for the dictionary of relevant terms. Maybe you can elaborate on this ?
The search (dependant) string is prepared as part of the corpus of the independant strings. So the IDF is based on the complete set of independant strings (thousands) and the search string.
We tried to generate a dictionary of relevant terms based on the origin of the search strings but unfortunately this dictionary gives only a limited set of terms that exist in the independant strings and hardly don't discriminate between the independant strings. With the pivoting node I can make wordlists for classes that work well for these, but unfortunately most of the search strings are unique and are not easy to classify. Hence my idea for a feedback mechanism to let the flow learn from feedback on the results. In this way it will create a dictionary automatically that supports an optimal matching in every case. Is this doable?
Basically, terms from the search string also leak to the main corpus? Such a leak will influence the similarity calculation, which is not a good idea. The best would be to separate the two as follows: a) preprocess the main corpus, b) extract dictionary with total tfxidf and average idf per term, c) preprocess the search strings and calculate tf only, d) merge terms with the dictionary with the search string BoW, e) calculate tfxidf for the search strings using idf from the dictionary, and f) filter any terms not in the dictionary from the search strings.
Well, a lot of homework again :-) but I executed your suggestions. Unfortunately the result is far less then the former process, so there might be a mistake somewhere. Things I'm unsure of:
- TFxIDF for the corpus is calculated by using TF and put the button 'Relative' on. The result of this is not the same when I calculate the result of TF absolute * IDF. TF relative gives figueres like 0.00X and TF absolute x IDF give figueres like X.X
- The result of wordvector of the search string consist of terms that only exist in de dictionary of the corpus. But the number of columns/terms is less than the the wordvector of the corpus. Does the similarity node takes this in account or do I have to add the missing columns and fill them with zero's? It this is the case which node should I use?
Your fast answers to my questions are realy helpful, Thanks a lot.
ad b) You have to add the missing columns. After converting the dictionary into a reference table structure, you can create the term columns which do not exist in the search strings table. Then with Missing Value you should set missing to zero.
ad a) Relative or absolute should a priori not change the similarity calculations as the relative frequency is per document and not relative to the corpus. Yet, I think for the term dictionary you should use absolute TF because you may want to perform a term selection by applying GroupBy by term on the corpus and filtering the domain-specific stop words above a given absolute TF threshold - for a search application you do not need to put any lower bound.
The workflow with the suggested adaptions already works well with interestig results. But for production conditions we need to be more specific. I am considering to make use of semantics and am thinking of building an ontology in f.i. Protégé and using that for similarity purposes. Searching for ontologies on the Knime forum I saw two suggestions: making use of dictionary tagging and using network functionality. I think the network functionality is the most promising but how do I load a Protégé ontology as a Network in Knime? Formats don't seem to suit.
What kind of file format is available ? XML ?
I had gone through your dicussion and I hope you have sucessfully acehieved of what you are working on.
Actually I am also trying to do the same process in R-Language but its becoming a bit difficult to acehive it. A friend of mine suggest me to do with KNIME and in my R&D process, I found your dicussion and which is closely related to my requirment. If possible could you please attach your workflow and the supporting links(if there). So that it can help me to achieve my task sucessfully.
Awating for your reply,
Thanking in Anticipation.
If you want to pursue your endeavours in R, there's an excellent book named "Machine Learning in R" by Brett Lantz, which provides a hands-on section on text mining in R with the powerful, flexible `tm` package. That said, text mining in KNIME also works beautifully once you've gotten used to the behaviour of the Strings to Documents parser.