Simple question about Term Frequency

RIchardC · May 11, 2021, 6:42pm

I have a table with a single column of text called Source. I have another table with a text column called Terms and a text column called Category.

I want to create a table that shows each Term, its Category, and the number of times that Term exists in the Source table.

There seem to be a lot of ways to approach this. Any ideas of best practice?

Thanks, Richard

ScottF · May 11, 2021, 8:35pm

Hi @RIchardC and welcome to the forum!

Without any sample data to go on, this is a general approach I would try:

Strings To Document node to convert Source to a KNIME document, making sure to configure the node to apply the Category metadata
Bag Of Words Creator to… create a Bag of Words
TF node to calculate the absolute term frequency

Then some subsequent aggregation and joining to compare to your original list of terms, but exactly how this is done will depend on the format of your data. If you have a small example dataset I could try to build a toy workflow for you to check out.
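
For readers who think more easily in code than in nodes, the counting-and-joining idea above can be sketched in plain Python. This is only an illustration of the logic, not what the KNIME nodes do internally; the sample rows, the dictionary entries, and the `tokenize` helper are all made up for the example.

```python
import re
from collections import Counter

# Hypothetical stand-ins for the Source table and the Terms/Category table.
source = [
    "Markets rally as inflation fears ease",
    "Inflation data pushes markets lower",
]
dictionary = [
    ("markets", "Finance"),
    ("inflation", "Economy"),
    ("election", "Politics"),
]

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

# Bag of words over all rows, with absolute counts per token.
counts = Counter(token for row in source for token in tokenize(row))

# Join the counts back to the dictionary; terms that never occur get 0.
result = [(term, category, counts[term.lower()]) for term, category in dictionary]

for row in result:
    print(row)
```

Note that this simple tokenization only matches single-word terms, which is the same limitation a plain bag-of-words join has.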

RIchardC · May 12, 2021, 3:48am

Thanks so much. I uploaded some sample data here

I started with an RSS workflow, which is also where I will eventually end up. Cheers.

aworker · May 12, 2021, 7:17am

Hi @RIchardC and welcome to the KNIME forum

I tried the link in your last post to get the data, but it doesn’t seem to work. Could you please check it and post it again?

Thanks & regards,

Ael

RIchardC · May 12, 2021, 12:32pm

Sorry, grabbed the wrong url. This should work:

https://hub.knime.com/richardc/spaces/Public/latest/RSS%20Frequency%20Sample%20Data

aworker · May 12, 2021, 2:10pm

Hi @RIchardC

Thanks for the data. Attached is a workflow that solves this in two different ways: the one suggested by @ScottF (thanks Scott) and another based on classic KNIME nodes, with a comparison of the results at the end. The frequencies are calculated per document. If you need global frequencies, the same workflow can produce them, but you need to either change the grouping of the GroupBy node (group only on column “description (filtered)_SplitResultList” and not on “Doc_Index”) or aggregate the documents into a single row first, whichever suits you best.
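
The difference between the two groupings can be sketched in plain Python as follows. This is a rough analogy under assumed data, not an export of the workflow: the `docs` dictionary is hypothetical, with its keys playing the role of “Doc_Index” and its lists playing the role of the split terms.

```python
from collections import Counter

# Hypothetical tokenized documents, keyed by a document index.
docs = {
    0: ["tax", "budget", "tax"],
    1: ["budget", "vote"],
}

# Per-document frequencies: group by (document index, term).
per_doc = Counter(
    (doc_index, term) for doc_index, terms in docs.items() for term in terms
)

# Global frequencies: group by term only, ignoring the document index.
global_counts = Counter(term for terms in docs.values() for term in terms)

print(per_doc)
print(global_counts)
```

Dropping the document index from the grouping key is what turns the per-document counts into global ones.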

20210512 Pikairos Simple question about Term Frequency.knwf (841.3 KB)

Please get back in touch if more explanation is needed.

Hope this helps.

best

Ael

RIchardC · May 12, 2021, 5:42pm

Hi @aworker

This is fantastic. I love how I can study a KNIME workflow step by step and it makes perfect sense.

Now that I see how it’s done, I see a flaw in my question, which is that all my dictionary terms were one word. In reality, I’ll need to search for multi-word terms, e.g. “Boris Johnson”.

Since the join node is where the dictionary comes into play, I’ve studied the docs for that, as well as Cell Replacer, but I’m not seeing a way forward.

If I were programming this in a different language, my instinct would be to iterate through each dictionary item with a “contains” test on the description column. If it makes it any easier, I don’t want to count how many times the dictionary term is mentioned in each article – I only need to count how many articles mention the term.

Is there a way for KNIME to do that?

Thanks, Richard

RIchardC · May 13, 2021, 2:29pm

I found the trick of using the Dictionary Tagger to create multi-word tags. Then I realized I needed Document Frequency rather than Term Frequency.
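
To make the term frequency versus document frequency distinction concrete, here is a minimal Python sketch of the “contains” test described earlier in the thread, applied to multi-word terms. It does not reproduce the Dictionary Tagger; the articles, the terms, and the helper names `mentions` and `occurrences` are assumptions made up for illustration.

```python
import re

# Hypothetical article descriptions and dictionary terms.
articles = [
    "Boris Johnson met ministers today. Boris Johnson later spoke to the press.",
    "The budget vote was delayed.",
    "Ministers expect Boris Johnson to back the budget.",
]
terms = ["Boris Johnson", "budget", "election"]

def term_pattern(term):
    """Build a case-insensitive whole-word pattern for a (multi-word) term."""
    return re.compile(r"\b" + re.escape(term) + r"\b", flags=re.IGNORECASE)

def mentions(term, text):
    """True if the term appears at least once in the text."""
    return term_pattern(term).search(text) is not None

def occurrences(term, text):
    """Number of times the term appears in the text."""
    return len(term_pattern(term).findall(text))

# Document frequency: how many articles mention the term at all.
document_frequency = {t: sum(mentions(t, a) for a in articles) for t in terms}

# Term frequency (summed over articles): how many times the term appears.
term_frequency = {t: sum(occurrences(t, a) for a in articles) for t in terms}

print(document_frequency)
print(term_frequency)
```

With this sample data, “Boris Johnson” appears three times in total but in only two articles, which is exactly the gap between the two measures.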

Thanks again for the help.

aworker · May 13, 2021, 3:03pm

Hi @RIchardC

Excellent news!

Could you please upload your workflow so that other people can understand the solution and benefit from it?

Thanks

Ael

RIchardC · May 13, 2021, 3:59pm

Here’s the end result, plus I added a Date filter.
