Doubts on NGram analysis

Hi,

In the NGram node there are two options for the output table under the Input/Output settings: 1. NGram frequencies and 2. NGram bag of words. When I select NGram frequencies, I get three output columns: 1. Corpus frequency, 2. Document frequency, 3. Sentence frequency. Under Corpus frequency the word count is doubled even though there are fewer words. For example, if the total number of words is 50, the count shown under Corpus frequency is 100. Could anybody explain why this happens?

And when I select NGram bag of words, I get only one output column, Document frequency, and the word count also differs from the one given by NGram frequencies.

Could anybody explain NGram frequencies and NGram bag of words in some detail, and what each is used for?

 

Thanks,

Madan

Hi Madan,

The Corpus frequency returned by the NGram creator node is the total number of times an ngram occurs in your corpus, i.e. the whole table. Document frequency is the number of individual documents in which an ngram appears, and Sentence frequency is the number of sentences in which the ngram is found. Therefore, the corpus frequency can be higher than the number of words in an individual document (i.e. in one row of your table).
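
To make the difference between the three counts concrete, here is a rough Python sketch. It is not the node's actual implementation: it assumes word-level ngrams, lowercased whitespace tokenization, and sentences split on periods, and the function names are made up for illustration.

```python
from collections import Counter

def word_ngrams(tokens, n):
    """Return all word ngrams of length n as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_frequencies(documents, n=2):
    """Count corpus, document and sentence frequency for every ngram.

    Assumption: one string per document (one table row), sentences end at '.'.
    """
    corpus_freq = Counter()    # every occurrence in the whole table
    document_freq = Counter()  # number of documents containing the ngram
    sentence_freq = Counter()  # number of sentences containing the ngram
    for doc in documents:
        seen_in_doc = set()
        for sentence in doc.split("."):
            grams = word_ngrams(sentence.lower().split(), n)
            corpus_freq.update(grams)          # counts repeats
            sentence_freq.update(set(grams))   # once per sentence
            seen_in_doc.update(grams)
        document_freq.update(seen_in_doc)      # once per document
    return corpus_freq, document_freq, sentence_freq

docs = ["the cat saw the cat. the cat ran.", "the cat slept."]
corpus, document, sentence = ngram_frequencies(docs, n=2)
print(corpus["the cat"], document["the cat"], sentence["the cat"])  # 4 2 3
```

In this toy corpus "the cat" occurs four times overall, but in only two documents and three sentences, so the corpus frequency is the largest of the three numbers.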

The "NGram bag of words" option creates an output data table consisting of ngram and document tuples. Each tuple represents the occurrence of an ngram in a document. Additionally, the frequency column contains the number of occurrences of the ngram in that document.
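
As a rough illustration of that table layout, the sketch below builds one (ngram, document, frequency) row per ngram and document pair. Again, this is only an assumption of how such a table could be produced, with word-level ngrams, whitespace tokenization, and a hypothetical function name; the node itself may tokenize differently.

```python
from collections import Counter

def ngram_bag_of_words(documents, n=2):
    """Return (ngram, document index, frequency in that document) rows."""
    rows = []
    for doc_id, doc in enumerate(documents):
        tokens = doc.lower().split()
        # Count how often each ngram occurs within this single document
        counts = Counter(
            " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )
        for gram, freq in counts.items():
            rows.append((gram, doc_id, freq))
    return rows

docs = ["the cat saw the cat", "the cat slept"]
rows = ngram_bag_of_words(docs, n=2)
for row in rows:
    print(row)
```

Here "the cat" produces two rows, one per document, with frequency 2 in the first document and 1 in the second. The counts are per document, which is why they differ from the corpus-wide numbers of the "NGram frequencies" option.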

Cheers,

Roland

Thanks, Roland.