I have a table with a single column of text called Source. I have another table with a text column called Terms and a text column called Category.
I want to create a table that shows each Term, its Category, and the number of times that Term exists in the Source table.
There seem to be a lot of ways to approach this. Any ideas of best practice?
Hi @RIchardC and welcome to the forum!
Without any sample data to go on, this is a general approach I would try:
1. Strings To Document node to convert Source to a KNIME document, making sure to configure the node to apply the Category metadata
2. Bag Of Words Creator to… create a Bag of Words
3. TF node to calculate the absolute term frequency
Then some subsequent aggregation and joining to compare to your original list of terms, but exactly how this is done will depend on the format of your data. If you have a small example dataset I could try to build a toy workflow for you to check out.
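For readers who think in code, the steps above (tokenize each document, build a bag of words, count absolute term frequency, then join against the term list) can be sketched roughly in Python. This is not what the KNIME nodes do internally. The sample rows, the `tokenize` helper, and the simple lowercase word splitting are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical sample data standing in for the Source and Terms/Category tables.
source = [
    "The election dominated the news. The election result surprised many.",
    "Markets rallied after the election.",
]
dictionary = {"election": "Politics", "markets": "Finance", "weather": "Science"}


def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberate simplification)."""
    return re.findall(r"[a-z']+", text.lower())


def term_frequencies(documents):
    """Return one Counter per document: the absolute frequency of each token."""
    return [Counter(tokenize(doc)) for doc in documents]


def join_with_dictionary(per_doc_counts, terms):
    """Keep only dictionary terms, as (doc_index, term, category, count) rows."""
    rows = []
    for doc_index, counts in enumerate(per_doc_counts):
        for term, category in terms.items():
            if counts[term] > 0:
                rows.append((doc_index, term, category, counts[term]))
    return rows


rows = join_with_dictionary(term_frequencies(source), dictionary)
for row in rows:
    print(row)
```

Terms that never occur in a document (here "weather") simply produce no row, which is roughly what an inner join against the term list would give.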
Thanks so much. I uploaded some sample data here
I started with an RSS flow, which is how I will eventually end up. Cheers.
Hi @RIchardC and welcome to the KNIME forum
I tried the link to the data in your last post, but it doesn’t seem to work. Could you please check it and post it again?
Thanks & regards,
Thanks for the data. Attached you’ll find a workflow that solves this in two different ways: the one suggested by @ScottF (thanks, Scott) and another based on classic KNIME nodes, with a comparison of the results at the end. The frequencies are calculated per document. If you need global frequencies, the same workflow can produce them, but you need to either change the grouping of the GroupBy node (group only on column “description (filtered)_SplitResultList” and not on “Doc_Index”) or aggregate the documents into a single row first, whichever suits you best.
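To illustrate the difference between per-document and global frequencies, here is a small Python sketch of the regrouping idea. It is not taken from the attached workflow. The rows and the `global_counts` helper are hypothetical, and the only point is that dropping the document index from the grouping key sums the counts across documents.

```python
from collections import Counter

# Hypothetical per-document rows: (doc_index, term, count).
per_document = [
    (0, "election", 2),
    (1, "election", 1),
    (1, "markets", 1),
]


def global_counts(rows):
    """Group by term only (ignoring doc_index), summing counts across documents."""
    totals = Counter()
    for _doc_index, term, count in rows:
        totals[term] += count
    return dict(totals)


totals = global_counts(per_document)
print(totals)
```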
20210512 Pikairos Simple question about Term Frequency.knwf (841.3 KB)
Please get back in touch if more explanation is needed.
Hope this helps.
This is fantastic. I love how I can study a KNIME flow step by step and it makes perfect sense.
Now that I see how it’s done, I see a flaw in my question, which is that all my dictionary terms were one word. In reality, I’ll need to search for multi-word terms, e.g. “Boris Johnson”.
Since the join node is where the dictionary comes into play, I’ve studied the docs for that, as well as Cell Replacer, but I’m not seeing a way forward.
If I were programming this in a different language, my instinct would be to iterate through each dictionary item with a “contains” test on the description column. If it makes it any easier, I don’t want to count how many times the dictionary term is mentioned in each article – I only need to count how many articles mention the term.
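As a rough illustration of that “contains” idea, a Python sketch might look like the following. The articles, the dictionary entries, and the `document_frequency` helper are made up for the example, and the case-insensitive substring test is an assumption about what counts as a mention.

```python
# Hypothetical sample data: article descriptions and a term -> category dictionary.
articles = [
    "Boris Johnson met EU leaders on Monday. Boris Johnson later spoke to reporters.",
    "The prime minister, Boris Johnson, announced new rules.",
    "Rain is expected across the north.",
]
dictionary = {
    "Boris Johnson": "Politics",
    "rain": "Weather",
    "Angela Merkel": "Politics",
}


def document_frequency(docs, terms):
    """For each term, count the documents that contain it at least once.

    Uses a case-insensitive substring test, so repeated mentions within one
    document still count only once.
    """
    lowered = [doc.lower() for doc in docs]
    result = []
    for term, category in terms.items():
        needle = term.lower()
        count = sum(1 for doc in lowered if needle in doc)
        result.append((term, category, count))
    return result


table = document_frequency(articles, dictionary)
for row in table:
    print(row)
```

A plain substring test also matches inside longer words (for example "rain" inside "train"), so a word-boundary check may be needed for short terms.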
Is there a way for KNIME to do that?
I found the trick of using the Dictionary Tagger to create multi-word tags. Then I realized I needed Document Frequency rather than Term Frequency.
Thanks again for the help.
Excellent news!
Could you please upload your workflow solution so that other people can understand it better and benefit from it?
Here’s the end result, plus I added a Date filter.