Corpus frequency in NGram Creator Node

Christian_Essen · June 10, 2022, 11:11am

Hello,

In my opinion there is something wrong with the "corpus frequency" in the NGram Creator node. I know there are already forum posts with answers on this topic (https://forum.knime.com/t/ngram-creator/11493, https://forum.knime.com/t/doubts-on-ngram-analysis/8824) but either I’m lost or, as I said, there is something wrong.
As an example: in the workflow for the self-paced course “L4-TP Introduction to Text Processing” (exercise: “Bag of Words and Frequencies”), two documents are analyzed, each a one-page agenda for the KNIME TS and TP courses. When I add the NGram Creator node, I get a very high value for corpus frequency. How can this be?

Thanks for your support and many greetings
Christian

badger101 · June 10, 2022, 3:14pm

@Christian_Essen Hi, can you share the link to the exercise? I tried looking here knime/Education – exercises – KNIME Hub but can’t find any entitled ‘Bag of Words and Frequencies’. (Or you can also upload the workflow directly here so I can have a look.)

Christian_Essen · June 10, 2022, 3:34pm

@badger101: Thanks for your support.
The exercise is here: knime/spaces/Education/Self-Paced Courses/L 4-TP Introduction to Text Processing/

And this is the workflow:
05 Bag of Words and Frequencies (with NGram).knwf (334.3 KB)

badger101 · June 10, 2022, 3:47pm

Thanks! I have looked at the workflow. It seems that the NGram Creator node treats each row coming out of the preceding BoW node as a separate document, meaning that it’s analyzing a corpus of 120 documents instead of just the 2 original documents.

Because of that, you see many duplicates of N-grams (e.g. ‘time series analysis’ as shown in your screenshot has 189 occurrences rather than 3.)

To validate what I’m saying, you can connect the NGram creator directly to the Concatenate Node without going through the BoW, and you’ll get this result:

In this case, the ‘time series analysis’ n-gram occurred 3 times within a corpus size of 2 (original) documents.
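To illustrate the effect outside KNIME, here is a minimal Python sketch. It is not how the NGram Creator node is implemented. It only assumes that a bag-of-words table holds one row per distinct term per document and that each row still carries the full document. The two documents and the helper names (`ngrams`, `corpus_frequency`) are made up for the illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous word n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def corpus_frequency(docs, n):
    """Count total occurrences of each word n-gram across all given documents."""
    counts = Counter()
    for doc in docs:
        counts.update(ngrams(doc.lower().split(), n))
    return counts


# Two made-up documents standing in for the two course agendas.
docs = [
    "time series analysis intro and time series analysis practice",
    "text processing and time series analysis",
]

# Case 1: n-grams counted directly on the original documents.
direct = corpus_frequency(docs, 3)

# Case 2: simulate a bag-of-words table (assumption: one row per distinct
# term per document, each row carrying the whole document again).
bow_rows = [doc for doc in docs for _term in sorted(set(doc.lower().split()))]
after_bow = corpus_frequency(bow_rows, 3)

key = ("time", "series", "analysis")
print(f"direct:    {len(docs)} documents, corpus frequency {direct[key]}")
print(f"after BoW: {len(bow_rows)} rows, corpus frequency {after_bow[key]}")
```

In this toy case the direct count is 3 over 2 documents, while the simulated BoW input gives 18 over 12 rows, because every document is counted once per distinct term it contains. The same kind of multiplication would explain a jump from 3 to 189.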

Christian_Essen · June 10, 2022, 4:18pm

@badger101: That makes sense. Thanks for the quick and very helpful reply!

One more note: If I’m not mistaken, the same problem (the NGram Creator node placed after the BoW node) also exists in the KNIME book “From Words to Wisdom” on page 87 (see screenshot) and in the associated workflow located here.
The corpus frequency there is absurdly high.

Perhaps this should be reported to the authors of the book?

badger101 · June 10, 2022, 4:28pm

I agree that it’s too high given the small forum size. Unless they have a specific reason, the NGram node shouldn’t be used after BoW.

Christian_Essen · June 10, 2022, 4:31pm

@badger101: Thanks again for your support!

badger101 · June 10, 2022, 4:31pm

Glad to help out! I’m off for today.
