Hi all. I need some help. I have 5 CSV files named 15Century, 16Century, 17Century, 18Century and 19Century. Each file contains the 200 most frequently used words from that century. I also have another text file named "unanimous doc.txt". I want to compare all 5 century files with this one file (unanimous doc.txt) so that the result shows which century this document belongs to. I am attaching all the files. I need a workflow for this.
Where specifically are you facing problems?
Hi Philip. Thanks for your reply. Actually, I am not very good with KNIME; I am just a beginner. I need a workflow for this problem. I haven't created one yet.
Building the workflow yourself is the best way of learning how things work :)
Some suggestions to get you started: have a look at the Text Processing extension to transform your text document into the same structure as the XXCentury files (lines with [word, count]).
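For anyone who wants to see what that transformation amounts to outside KNIME, here is a minimal Python sketch. It is only an illustration under assumptions: the function name `word_counts`, the simple regex tokenizer and the sample sentence are all made up for this example, and the Text Processing nodes may tokenize differently.

```python
import re
from collections import Counter

def word_counts(text, top_n=200):
    """Return the top_n most frequent words as (word, count) pairs."""
    # Assumption: lowercase everything and keep alphabetic tokens only.
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens).most_common(top_n)

# Made-up sample text, standing in for the contents of the document.
sample = "The king and the queen and the court"
rows = word_counts(sample)

for word, count in rows:
    print(word, count)
```

Each printed line has the same [word, count] shape as the rows in the XXCentury files.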
Then compare the result to the individual XXCentury files. You could try transforming the word counts to probabilities and then apply some measure for comparing probability distributions such as Kullback-Leibler divergence.
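To make the idea concrete, here is a rough Python sketch of the comparison. Everything in it is an assumption for illustration: the word counts are invented, the helper names `to_probabilities` and `kl_divergence` are hypothetical, and add-one smoothing is just one possible way to avoid zero probabilities for words missing from a file.

```python
import math

def to_probabilities(counts, vocabulary, smoothing=1.0):
    """Turn {word: count} into {word: probability} over a shared vocabulary."""
    # Smoothing keeps every probability above zero so the log is defined.
    total = sum(counts.get(w, 0) + smoothing for w in vocabulary)
    return {w: (counts.get(w, 0) + smoothing) / total for w in vocabulary}

def kl_divergence(p, q):
    """D_KL(P || Q) = sum over words of p(w) * log(p(w) / q(w))."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)

# Invented counts; real values would come from the document and the CSV files.
doc = {"thou": 4, "the": 10, "hath": 3}
centuries = {
    "16Century": {"thou": 30, "the": 90, "hath": 25},
    "19Century": {"the": 120, "railway": 15, "thou": 1},
}

# Build one vocabulary so all distributions are defined over the same words.
vocab = set(doc)
for counts in centuries.values():
    vocab |= set(counts)

p_doc = to_probabilities(doc, vocab)
scores = {
    name: kl_divergence(p_doc, to_probabilities(counts, vocab))
    for name, counts in centuries.items()
}

# The century with the smallest divergence is the closest match.
best = min(scores, key=scores.get)
print(best, scores)
```

Note that Kullback-Leibler divergence is not symmetric, so comparing the document to a century gives a different number than comparing the century to the document. Pick one direction and use it for all five files.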
Thanks a lot, Philipp. I will try to do it by myself. If I get stuck somewhere or run into a problem, I will ask for your help :)
Feel free to do so. Quite an interesting task, btw :)
A book such as "Data Science for Business" may be of guidance. It will not provide you with the KNIME workflow (or with any other code), but it will give you the general approach needed for text mining (as well as an example use case very similar to yours).
I want to compare doc1 with doc2 and doc3. In the attached workflow,
1. doc1 is compared with doc2 and doc3, but it is also compared with itself (doc1).
2. doc1 shows the same similarity (0.707) with doc2 and doc3, even though doc2 and doc3 are totally different.
I hope you understand my questions.
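A small Python sketch of how such a result can arise, assuming the workflow computes cosine similarity on term vectors (that is a guess; the measure is not stated above). The vectors are invented: doc1 contains two words, while doc2 and doc3 each contain only one of them.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Invented vectors over a two-word vocabulary [word_a, word_b].
doc1 = [1, 1]  # contains both words
doc2 = [1, 0]  # contains only word_a
doc3 = [0, 1]  # contains only word_b

print(round(cosine_similarity(doc1, doc2), 3))  # doc1 vs doc2
print(round(cosine_similarity(doc1, doc3), 3))  # doc1 vs doc3
print(round(cosine_similarity(doc2, doc3), 3))  # doc2 vs doc3
```

In this made-up case doc1 is equally similar to doc2 and doc3 (1/sqrt(2), about 0.707) even though doc2 and doc3 share nothing. A full pairwise comparison also includes each document against itself, which always gives 1, so that row has to be filtered out separately.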