Hello,
KNIME offers a Text Processing feature via the labs update site. For a detailed description of the feature, have a look at the Text Processing page in the labs section. To convert the survey text into a document, you can use the String To Document node. Once the text is converted into a KNIME document, you can perform standard text processing analyses. For a good start, have a look at the examples available here.
Bye,
Tobias
Hi,
As Tobias already mentioned, the KNIME Text Processing plugin is able to do what you are looking for. Use the "Strings to Documents" node to convert your strings into documents. Then use the "BoW" node to create a bag of words for each document. Once the bag of words is created, the frequency nodes can be used to compute term frequencies. The frequency node you need is named "TF" (term frequency). Use the option "absolute" in the node's dialog. The TF values are the frequencies of a term in a given document. If you want the frequency of terms in the corpus, over all documents, you need to group over the terms ("Group By" node) and aggregate the TF values, for example by summing them. Hope this will help you.
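To make the difference between per-document TF and corpus-wide frequency concrete, here is a minimal sketch in plain Python. It is not KNIME code and does not use any KNIME API. The tokenizer, the function names and the two sample sentences are assumptions made up for illustration. It only mirrors the idea of absolute TF per document followed by a "Group By" over terms with a sum.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase and extract runs of letters (a simplistic stand-in for a real tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def absolute_tf(documents):
    """Return one Counter per document: term -> absolute count in that document."""
    return [Counter(tokenize(doc)) for doc in documents]


def corpus_frequency(tf_per_doc):
    """Sum the per-document counts per term, like grouping by term and summing TF."""
    total = Counter()
    for tf in tf_per_doc:
        total.update(tf)
    return total


# Hypothetical survey answers, invented for this example.
docs = [
    "The menu is confusing and the search is confusing",
    "Search is difficult",
]
tf = absolute_tf(docs)
corpus = corpus_frequency(tf)

print(tf[0]["confusing"])  # count of "confusing" in the first document only
print(corpus["search"])    # count of "search" across all documents
```

The per-document counters correspond to the absolute TF column, and the summed counter corresponds to the aggregated output of the grouping step.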
Cheers, Kilian
Thanks very much!
Can the Bag of Words approach be modified to group strings of text defined as records in a dictionary of phrases that are associated with a user-defined set of topics?
If not, is there a workflow to do this? For example, ...
Word / Phrase | Topic |
---|---|
Confusing | Complex |
Not Intuitive | Complex |
Not easy | Complex |
Difficult | Complex |
Complicated | Complex |
This table of Phrases and Topic Assignments would be refined and updated manually to improve results.
In case the table inserted above doesn’t render properly, the topic "Complex" would be assigned to all rows if the target field contained "Confusing, Not Intuitive, Not easy, Difficult, Complicated."
A bag of words works best when word order is not important. In your case it appears that word order is important, and groups of up to two words appear to be meaningful. N-grams may work better than a bag of words: set them on words (not characters), with n max at 2 and n min at 1. You would then extract unigrams and bigrams from the documents, and each term or term group can be matched against the dictionary. Is that what you meant?
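As a rough illustration of that idea outside KNIME, the sketch below extracts word unigrams and bigrams and looks each one up in the phrase table from the earlier post. It is plain Python, not a KNIME node configuration. The function names, the tokenizer and the lowercasing of dictionary keys are assumptions for the example.

```python
import re

# Phrase dictionary from the table above, keys lowercased for matching.
PHRASE_TO_TOPIC = {
    "confusing": "Complex",
    "not intuitive": "Complex",
    "not easy": "Complex",
    "difficult": "Complex",
    "complicated": "Complex",
}


def word_ngrams(text, n_min=1, n_max=2):
    """Return all word n-grams with n_min <= n <= n_max, shortest n first."""
    words = re.findall(r"[a-z']+", text.lower())
    return [
        " ".join(words[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(words) - n + 1)
    ]


def assign_topics(text, phrase_to_topic=PHRASE_TO_TOPIC):
    """Return the set of topics whose phrases occur as a unigram or bigram in the text."""
    return {phrase_to_topic[g] for g in word_ngrams(text) if g in phrase_to_topic}


print(word_ngrams("Not easy to use"))
print(assign_topics("The layout is not intuitive"))
print(assign_topics("It was easy"))
```

Because "easy" alone is not in the dictionary, a comment like "It was easy" gets no topic, while "Not easy to use" matches through the bigram "not easy". That is the case a plain bag of words would miss.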
Geo
Yes. N-grams sound like they will work for part of my project. I will give it a try. Thank you.
Scott