Doubts on NGram analysis

madanbehera · June 2, 2017, 10:39am

Hi,

In the NGram node there are two options for the output table under the Input/Output settings: 1. NGram frequencies and 2. NGram bag of words. When I select NGram frequencies, I get three output columns: 1. Corpus frequency, 2. Document frequency, and 3. Sentence frequency.

Under Corpus frequency the count seems to be doubled, even though there are fewer words. For example, if the total number of words is 50, the count under Corpus frequency shows 100. Could anybody explain why this happens?

And when I select NGram bag of words, I get only one column as output, Document frequency, and the counts also differ from those under NGram frequencies.

Could anybody explain in some detail what NGram frequencies and NGram bag of words are, and what they are used for?

Thanks,

Madan

RolandBurger · June 7, 2017, 11:05am

Hi Madan,

The Corpus Frequency returned by the NGram creator node is the total number of times an ngram occurs in your corpus, i.e. the whole table. Document frequency is the number of individual documents in which an ngram appears, and Sentence frequency is the number of sentences in which the ngram is found. Therefore, the corpus frequency can be higher than the number of words in an individual document (i.e. in one row of your table).
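To make the three counts concrete, here is a minimal Python sketch of the definitions above. It is not the node's actual implementation: the function names, the whitespace tokenization, and the representation of a document as a list of pre-split sentences are all assumptions made for illustration.

```python
from collections import Counter


def word_ngrams(tokens, n):
    """Return all word n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_frequencies(documents, n=2):
    """Count corpus, document, and sentence frequencies of word n-grams.

    `documents` is a list of documents, each given as a list of sentence
    strings (hypothetical input format, assumed for this sketch).
    """
    corpus_freq, doc_freq, sent_freq = Counter(), Counter(), Counter()
    for doc in documents:
        seen_in_doc = set()
        for sentence in doc:
            grams = word_ngrams(sentence.lower().split(), n)
            corpus_freq.update(grams)     # every occurrence counts
            sent_freq.update(set(grams))  # at most once per sentence
            seen_in_doc.update(grams)
        doc_freq.update(seen_in_doc)      # at most once per document
    return corpus_freq, doc_freq, sent_freq


documents = [
    ["the cat sat", "the cat ran"],  # document 1: two sentences
    ["the cat sat on the cat"],      # document 2: one sentence
]
corpus_freq, doc_freq, sent_freq = ngram_frequencies(documents, n=2)
print(corpus_freq["the cat"], doc_freq["the cat"], sent_freq["the cat"])
# prints: 4 2 3
```

Here "the cat" occurs four times in total, in two documents and in three sentences, so its corpus frequency is larger than both its document frequency and its sentence frequency.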

The "NGram bag of words" option creates an output data table consisting of ngram and document tuples. A tuple represents the occurrence of an ngram in a document. Additionally, the frequency column contains the number of occurrences of the ngram in that document.
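As a rough illustration of that table layout, the sketch below builds one (ngram, document, frequency) row per ngram and document. Again, this is only an assumed approximation of the idea, not the node's code: the function name, the document ids, and the whitespace tokenization are hypothetical.

```python
from collections import Counter


def ngram_bag_of_words(documents, n=2):
    """Build (ngram, document id, frequency) rows, one per ngram per document.

    `documents` maps a document id to its text (hypothetical input format).
    """
    rows = []
    for doc_id, text in documents.items():
        tokens = text.lower().split()
        grams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        # Count occurrences within this document only.
        for gram, count in sorted(Counter(grams).items()):
            rows.append((gram, doc_id, count))
    return rows


documents = {
    "doc1": "the cat sat the cat ran",
    "doc2": "the cat sat",
}
for row in ngram_bag_of_words(documents):
    print(row)
```

Because each row is scoped to a single document, the frequency in this table is a per-document count, which is why it can differ from the corpus-wide numbers in the "NGram frequencies" output.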

Cheers,

Roland

madanbehera · June 9, 2017, 7:28am

thanks Roland...
