This may be deceptively simple, but all I want to do is set up a counter for the number of times a selected word, or a selected list of words, appears in the body of a series of documents.
I don't want to count any other words but those I want to count.
Can someone please show me how this is done?
I understand this is kind of like tagging with a dictionary or a wildcard.
But, after setting up those functions, I can't seem to find a way to get a straight count.
My data is in CSV. I have defined only the body as the document.
The problem I have with the Dictionary and Wildcard Taggers is that their tags seem to get mixed up with other categorical tags in NEP/unknown, so they fail to isolate the words I am trying to target.
After the count happens I want to append the count to the corresponding row of the corresponding document.
Thanks for any help here.
To count specific words in a corpus or in individual documents, the combination of the Dictionary Tagger and the TF node is the right one. If you only want to count exact matches between dictionary terms and terms in the documents, check the "exact match" checkbox in the dialog of the Dictionary Tagger. If wildcard matches are acceptable too, the Wildcard Tagger may be of better use.
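To make the difference between exact and wildcard matching concrete, here is a rough sketch in plain Python. It is not how the KNIME taggers are implemented; the tokenizer, the function names and the shell-style `*` pattern syntax are all assumptions made for illustration.

```python
import re
from fnmatch import fnmatchcase


def tokenize(text):
    # Very simple tokenizer (an assumption): lowercase runs of letters, digits, apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())


def count_exact(text, dictionary):
    # Count only tokens that are identical to a dictionary term.
    terms = {t.lower() for t in dictionary}
    return sum(1 for tok in tokenize(text) if tok in terms)


def count_wildcard(text, patterns):
    # Count tokens that match any shell-style pattern such as "room*".
    pats = [p.lower() for p in patterns]
    return sum(1 for tok in tokenize(text) if any(fnmatchcase(tok, p) for p in pats))


review = "The room was clean, the rooms nearby were noisy, and roommates complained."
print(count_exact(review, ["room"]))      # 1: only "room" itself
print(count_wildcard(review, ["room*"]))  # 3: "room", "rooms", "roommates"
```

With exact matching only the literal word is counted, while the wildcard pattern also picks up plurals and compounds, which may or may not be what you want.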
To filter out the untagged terms easily later on, check the "set named entities unmodifiable" checkbox in the dialog of the tagger node.
After tagging, create a bag of words and filter it with the "Modifiable Term Filter". All terms that were not tagged (and thus not set unmodifiable) by the tagger beforehand will be filtered out. You will end up with a bag of words containing only the terms you are interested in.
Now use the TF node to count the term frequencies. These frequencies will be added in an extra column beside the corresponding term and the document the term is contained in. To count the overall term frequency in the corpus, use the GroupBy node.
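For comparison, the same logic (keep only the target terms, count them per document, append the counts to each row, then sum over the corpus) can be sketched outside KNIME in plain Python. This is only an illustration under assumptions: the column names `id` and `body`, the term list and the sample rows are made up, and the counts are absolute counts.

```python
import csv
import io
import re
from collections import Counter

# Hypothetical list of words to count; replace with your own dictionary.
TARGET_TERMS = ["pool", "breakfast"]

# Made-up sample data standing in for the real CSV file.
SAMPLE_CSV = """id,body
1,"The pool was great and breakfast was fine. Pool towels were free."
2,"No breakfast, no pool, no complaints."
3,"Quiet room with a nice view."
"""


def count_terms(text, terms):
    # Count only the wanted terms; every other word is ignored.
    wanted = {t.lower() for t in terms}
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(tok for tok in tokens if tok in wanted)


def append_counts(rows, text_column, terms):
    # Return copies of the rows with one count column per term,
    # plus the totals over all rows (the group-by step).
    out = []
    totals = Counter()
    for row in rows:
        counts = count_terms(row[text_column], terms)
        totals.update(counts)
        new_row = dict(row)
        for term in terms:
            new_row["count_" + term] = counts[term]
        out.append(new_row)
    return out, totals


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
counted_rows, corpus_totals = append_counts(rows, "body", TARGET_TERMS)

for r in counted_rows:
    print(r["id"], r["count_pool"], r["count_breakfast"])
print(dict(corpus_totals))
```

To run this on a real file, the `io.StringIO(SAMPLE_CSV)` part would be replaced by an opened CSV file, and the rows could be written back out with `csv.DictWriter`.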
Attached you will find an example workflow that counts some dictionary terms in a set of TripAdvisor reviews.