Getting percentage of words related to a dictionary

Hey there!

I am working on a project in which I have about 13,000 Instagram captions (each with a unique ID). I also have an Excel file with 194 preset dictionaries (e.g. “trust” with 30 words, “family” with 25 words, etc.). Each column in the Excel file represents a dictionary.
For each caption, I need to know what percentage of its words are related to each of the 194 dictionaries (e.g. the first caption contains 4% of words related to “family”).

Thanks in advance for your help!

Luca

Hi @glcasciorizzo -

Can you post some sample data here, both a few different captions, and a selection of the dictionary categories? Assuming the data is not confidential, this would entice people to actually take a stab at your problem :slight_smile:

Yes, here they are:

captions.xlsx (10.1 KB)
dictionaries.xlsx (193.3 KB)

Thanks!!

Hi @glcasciorizzo -

Here is one way to approach it, using nodes from the KNIME Textprocessing extension. Let me know if you have any questions.

DictionaryTermCounterExample.knwf (219.6 KB)

Sample results (I took the liberty of adding a test row to your original captions dataset):

[Screenshot: Join result - 3_24 - Joiner (Rejoin on ID)]
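
For anyone who wants to see the underlying calculation outside of KNIME, here is a minimal Python sketch. It is not what the workflow above does internally, just one plausible way to compute the same kind of percentage. The function name `dictionary_percentages`, the tokenization rule (lowercased runs of letters and apostrophes), and the sample captions and dictionaries are all assumptions made up for illustration, not taken from the attached files.

```python
import re
from collections import Counter


def dictionary_percentages(caption, dictionaries):
    """Return {dictionary name: percent of caption words found in that dictionary}."""
    # Assumption: a "word" is a run of letters/apostrophes, compared case-insensitively.
    words = re.findall(r"[a-z']+", caption.lower())
    if not words:
        # Avoid dividing by zero for empty captions.
        return {name: 0.0 for name in dictionaries}

    counts = Counter(words)
    result = {}
    for name, terms in dictionaries.items():
        term_set = {t.lower() for t in terms}
        # Count every occurrence of a caption word that appears in this dictionary.
        hits = sum(n for word, n in counts.items() if word in term_set)
        result[name] = 100.0 * hits / len(words)
    return result


# Hypothetical stand-ins for the real captions and dictionary columns.
dictionaries = {
    "family": ["mom", "dad", "sister"],
    "trust": ["honest", "loyal"],
}
captions = {
    "c1": "Dinner with mom and dad tonight",
    "c2": "Stay loyal",
}

# One row of percentages per caption ID, one entry per dictionary.
results = {cid: dictionary_percentages(text, dictionaries)
           for cid, text in captions.items()}
```

With the real files, the captions and the dictionary columns could be loaded with something like `pandas.read_excel` and passed in the same way. Multi-word dictionary entries and stemming are not handled here and would need extra logic.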

Thanks so much @ScottF.
Appreciate it a lot!
