Some questions about Text Processing - Tag Cloud, Stop word Filter, Dict Replacer, Sentiment Analysis

Hi, 

I am new to text mining with KNIME and have some questions that I could not find answers to in this forum. I hope someone can give me some hints: 

1. Tag Cloud 

Can somebody explain how the Tag Cloud decides the order of the words? I mean especially how one word is placed relative to another, e.g. one word has a high frequency and another word has a low frequency, yet in the Tag Cloud both are arranged very close together. Can I influence this order, or is it random?

2. Stop word Filter 

Which stop word lists are used in the "Stop word Filter"? Can I find them somewhere? I especially need the German stop word list. 

3. Dict Replacer

Can somebody tell me where I can find a dictionary that I can use for the "Dict Replacer"? With this node I want to reduce the (ordinary) spelling mistakes - or is there any other possibility?

4. Sentiment Analysis

For the sentiment analysis I also need a list of positive and negative words. Do I need to create this on my own or is there any list I can use to make the sentiment analysis? 

Thanks a lot for your answers! I hope you understand my difficulties - thanks for your patience with a beginner :) 

Jasmin

Hi,

1. The order of the words in a Tag Cloud depends on the selected ordering. Alphabetical is self-explanatory. Inside out places the most frequent terms in the center and less frequent terms progressively further out.

2. A standard list is used for each language. If you want to use your own list, you can do that in the dialog of the node: simply specify the path to your custom stop word file, in which each word is stored on a separate line. The built-in stop word lists are in .../<Your KNIME Dir>/plugins/org.knime.ext.textprocessing_2.12.0.0046563/resources/stopwordlists

3. You have to provide the dictionary yourself. The node does not ship with one.

4. Yes, again you have to bring your own dictionary. For English, check out the MPQA Corpus subjectivity lexicon: http://mpqa.cs.pitt.edu/ and http://mpqa.cs.pitt.edu/lexicons/subj_lexicon/
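
To illustrate the custom stop word file from point 2 (one word per line), here is a minimal Python sketch of how such a list could be read and applied outside of KNIME. The function names and the sample words are made up for illustration, and this is not how the Stop word Filter node is implemented.

```python
import io

def load_stop_words(lines):
    """Read stop words from an iterable of lines, one word per line."""
    return {line.strip().lower() for line in lines if line.strip()}

def filter_stop_words(tokens, stop_words):
    """Keep only the tokens that are not in the stop word set."""
    return [t for t in tokens if t.lower() not in stop_words]

# Hypothetical file content; a real file would be read with
# open(path, encoding="utf-8") instead of io.StringIO.
sample_file = io.StringIO("der\ndie\ndas\nund\nist\n")
stop_words = load_stop_words(sample_file)

tokens = ["Das", "Wetter", "ist", "schlecht", "und", "kalt"]
filtered = filter_stop_words(tokens, stop_words)
print(filtered)  # ['Wetter', 'schlecht', 'kalt']
```

The comparison is case-insensitive here, which is an assumption of the sketch and may differ from the node's settings.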

Cheers, Kilian

Hi Kilian, 

thanks a lot for the fast and helpful answer :) 

1. Yes, this is what I understood. But is there any connection between the order and the combination of the words? I mean, if e.g. "weather" is mentioned very often in the document corpus and the word "bad" is placed right beside it, does this indicate a connection, too? Or does it depend only on the frequency? 

2. Yeah, I found it :) 

3. However, do you have any advice on which dictionary one could use? 

4. Do you know of a site that provides this for German? 

Thank you so much, it helped a lot!

Jasmin

Hi Paraguas,

great that the answers were useful to you.

3. This depends on the task you want to achieve ;-). What do you have in mind?

4. Maybe the German polarity clues are helpful: http://www.ulliwaltinger.de/sentiment/
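
Once lists of positive and negative words are available, a very simple dictionary-based score can be computed by counting matches. The following Python sketch shows the idea only. The tiny word sets are invented for the example and are not taken from the lexicons linked above, and the scoring rule (positive count minus negative count) is just one possible assumption.

```python
def sentiment_score(tokens, positive, negative):
    """Return (#positive matches) - (#negative matches) for a token list."""
    pos = sum(1 for t in tokens if t.lower() in positive)
    neg = sum(1 for t in tokens if t.lower() in negative)
    return pos - neg

# Hypothetical mini lexicons; a real lexicon would be loaded from a file.
positive_words = {"gut", "schön", "toll"}
negative_words = {"schlecht", "kalt", "furchtbar"}

doc = ["Das", "Wetter", "ist", "schlecht", "aber",
       "das", "Essen", "war", "toll", "und", "gut"]
score = sentiment_score(doc, positive_words, negative_words)
print(score)  # 1  (two positive words, one negative word)
```

A score above zero would be read as positive and below zero as negative. Negation ("nicht gut") is not handled in this sketch.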

Cheers, Kilian

Hi Jasmin,

about 1: No, if two words are placed next to each other, this does not imply a connection between them.

It only tells you that they have a similar frequency. 
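
As a toy illustration of why neighbours end up with similar frequencies, here is a one-dimensional Python sketch of an "inside out" arrangement. It is only an assumption about the general idea and not the actual layout algorithm of the Tag Cloud node. The term counts are invented.

```python
from collections import deque

def inside_out_order(frequencies):
    """Place terms by descending frequency, alternating right and left,
    so the most frequent terms end up in the middle."""
    ranked = sorted(frequencies.items(), key=lambda kv: (-kv[1], kv[0]))
    layout = deque()
    for i, (term, _count) in enumerate(ranked):
        if i % 2 == 0:
            layout.append(term)      # even ranks go to the right
        else:
            layout.appendleft(term)  # odd ranks go to the left
    return list(layout)

# Hypothetical term frequencies.
frequencies = {"weather": 50, "bad": 48, "rain": 20, "sun": 18, "umbrella": 3}
layout = inside_out_order(frequencies)
print(layout)  # ['sun', 'bad', 'weather', 'rain', 'umbrella']
```

In this sketch "bad" sits next to "weather" purely because their counts are close, not because the two words occur together in the documents.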

Best, Iris

 

Hi Kilian, 

about 3: With a dictionary I just want to get rid of spelling mistakes, so that the word frequencies are higher overall. I was thinking of something like the Microsoft Word spell checker in KNIME - is there any possibility to do so? 

@Iris: Thanks for this explanation! :)

Greetings, Jasmin

 

 

Hi Jasmin,

sorry, I don't know of any dictionary that can be used for that. However, I think the Dict Replacer node is not the right node for this. The node requires a dictionary of words to search for and their replacements. If you wanted to cover all possible spelling mistakes and their corrections, this would become a huge dictionary. I am afraid automatic correction of spelling mistakes is not possible with the text processing nodes. Do you know of any open source Java solution that can do this? After all, KNIME is extensible, and new nodes can easily be integrated.
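
To make the difference concrete: instead of listing every misspelling with its replacement, one could match each unknown token against a list of correct words by string similarity. The sketch below uses Python's standard difflib module for that. It is written in Python rather than Java purely for illustration, the vocabulary and the cutoff of 0.8 are assumptions, and it is not a KNIME node or a full spell checker.

```python
import difflib
from collections import Counter

def correct_tokens(tokens, vocabulary, cutoff=0.8):
    """Replace each unknown token by its closest vocabulary word,
    if one is similar enough; otherwise keep the token unchanged."""
    corrected = []
    for token in tokens:
        if token in vocabulary:
            corrected.append(token)
            continue
        matches = difflib.get_close_matches(token, vocabulary, n=1, cutoff=cutoff)
        corrected.append(matches[0] if matches else token)
    return corrected

# Hypothetical list of correctly spelled words.
vocabulary = ["weather", "bad", "forecast"]
tokens = ["weather", "wether", "weathr", "bad", "forcast", "xyz"]

fixed = correct_tokens(tokens, vocabulary)
print(Counter(fixed)["weather"])  # 3
```

The misspelled variants are folded into "weather", which raises its frequency as intended. A cutoff that is too low would start merging different words, so it needs tuning on real data.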

Cheers, Kilian
