Comparing words with a text file.

alamsaqib · May 17, 2016, 5:24am

Hi all. I need some help. I have 5 CSV files named 15Century, 16Century, 17Century, 18Century and 19Century. Each file contains the 200 most frequently used words from that century. I have another text file named "unanimous doc.txt". I want to compare all 5 century files with this one file (unanimous doc.txt) and have the result show which century this document belongs to. I am attaching all files. I need a workflow for this.

Best Regards

Alam

example.zip

qqilihq · May 17, 2016, 3:00pm

Where specifically are you facing problems?

Philipp

alamsaqib · May 18, 2016, 5:37am

Hi Philipp. Thanks for your reply. Actually I am not very good with KNIME, I am just a beginner. I need a workflow for this problem. I have not created one yet.

Alam

qqilihq · May 18, 2016, 10:03am

Building the workflow yourself is the best way of learning how things work :)

Some suggestions to get you started: Have a look at the Text Processing extension to transform your text document into the same structure as the XXCentury files (lines with [word, count]).

Then compare the result to the individual XXCentury files. You could try transforming the word counts to probabilities and then applying some measure for comparing probability distributions, such as Kullback-Leibler divergence. The century with the smallest divergence from the document would be the best match.
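To make the idea concrete outside KNIME, here is a minimal Python sketch of that comparison. It is only an illustration under assumptions: the toy word counts, the add-one smoothing, and the helper names (`to_probabilities`, `kl_divergence`, `closest_century`) are all hypothetical and not taken from the attached files. Smoothing is used so that a word missing from a century list does not produce a division by zero.

```python
import math
from collections import Counter


def to_probabilities(counts, vocabulary, smoothing=1.0):
    """Turn raw word counts into a smoothed probability distribution.

    Every word in the shared vocabulary gets `smoothing` added to its
    count, so no probability is exactly zero (an assumption of this sketch).
    """
    total = sum(counts.get(word, 0) + smoothing for word in vocabulary)
    return {word: (counts.get(word, 0) + smoothing) / total for word in vocabulary}


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) over a shared vocabulary."""
    return sum(p[word] * math.log(p[word] / q[word]) for word in p)


def closest_century(doc_counts, century_counts):
    """Return the century whose word distribution diverges least from the document."""
    vocabulary = set(doc_counts)
    for counts in century_counts.values():
        vocabulary |= set(counts)
    p = to_probabilities(doc_counts, vocabulary)
    scores = {
        name: kl_divergence(p, to_probabilities(counts, vocabulary))
        for name, counts in century_counts.items()
    }
    return min(scores, key=scores.get), scores


# Toy data, invented for illustration only.
doc = Counter("thou art kind and thou art fair".split())
centuries = {
    "16Century": {"thou": 50, "art": 40, "and": 30},
    "19Century": {"you": 50, "are": 40, "and": 30},
}

best, scores = closest_century(doc, centuries)
print(best, scores)
```

Note that KL divergence is not symmetric, so D(document || century) and D(century || document) generally differ. The sketch uses the first form, which is a choice made here, not something prescribed in the thread.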

Good luck,
Philipp

alamsaqib · May 19, 2016, 5:35am

Thanks a lot Philipp. I will try to do it by myself. If I get stuck somewhere or run into a problem, I will ask for your help :)

Alam

qqilihq · May 20, 2016, 11:59am

Hey Alam,

feel free to do so. Quite interesting task, btw :)

Best,
Philipp

Geo · May 20, 2016, 4:43pm

A book such as "Data Science for Business" may offer some guidance. It will not provide you with the KNIME workflow (or with any other code), but it will give you the general approach needed for text mining (as well as an example use case very similar to yours).

alamsaqib · May 29, 2016, 5:01am

I want to compare doc1 with doc2 and doc3. In the attached workflow,

1. doc1 is compared with doc2 and doc3, but it is also compared with itself (doc1).

2. doc1 shows the same similarity of 0.707 with doc2 and doc3, even though doc2 and doc3 are totally different.

I hope you understand my questions.

Regards

Alam

doc_similarity.zip
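
Both observations can be reproduced in a small Python sketch, assuming the workflow computes cosine similarity over term vectors (the attached workflow is not inspected here, so this is only one possible explanation). The three toy documents and the `cosine_similarity` helper are hypothetical. The sketch shows that two documents sharing no words at all can still each score 1/sqrt(2), about 0.707, against a third one, and that iterating over distinct pairs avoids comparing a document with itself.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-count vectors (dicts)."""
    dot = sum(a[word] * b[word] for word in set(a) & set(b))
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy documents, invented for illustration: doc2 and doc3 share no words,
# but each shares exactly one of doc1's two words.
docs = {
    "doc1": {"king": 1, "railway": 1},
    "doc2": {"king": 1},
    "doc3": {"railway": 1},
}

# combinations() yields each unordered pair once and never pairs a
# document with itself, so there is no doc1-vs-doc1 row.
pairs = {
    (left, right): cosine_similarity(docs[left], docs[right])
    for left, right in combinations(docs, 2)
}

for (left, right), score in pairs.items():
    print(left, right, round(score, 3))
```

In this toy case doc1 scores 0.707 against both doc2 and doc3, while doc2 against doc3 scores 0.0. Identical scores against doc1 therefore do not imply that doc2 and doc3 are similar to each other. With very short documents or binary term vectors, such ties are easy to hit.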