Can KNIME Be Used for Text Analysis of Open-Ended Survey Questions?

VikR · December 7, 2012, 11:26am

I do survey market research. We have questions such as, "What are the most important subjects to be discussed?" The results are contained in a tab-text or csv file where field 1 is the respondent ID, and field 2 is the text of the response. Example in csv format:

294,technology

296,"poverty, pollution, laziness, video gaming"

315,Ecology

322,"the economy, discrimination, poor schools"

324,"global warming, terrorism, national debt"

I'd like to analyze the data to find out the number of total respondents who mention "global warming", or the number who mention "national debt." Is this possible to do using Knime, and if so, is there a link to a web page detailing how to do it?

I have looked at the docs and run the software, but so far I haven't seen the correct procedures.

Thanks very much in advance to all for any info.

tobias.koetter · December 7, 2012, 1:46pm

Hello,

KNIME offers a Text Processing feature via the labs update site. For a detailed description of the feature, have a look at the Text Processing page in the labs section. To convert the survey text into a document, you can use the Strings To Document node. Once the text is converted into a KNIME document, you can perform standard text processing analyses. For a good start, have a look at the examples available here.

Bye,

Tobias

kilian.thiel · December 7, 2012, 2:45pm

Hi,

As Tobias already mentioned, the KNIME Text Processing plugin is able to do what you are looking for. Use the "Strings To Document" node to convert your strings into documents. Then use the "BoW" node to create a bag of words for the documents. Once the bag of words is created, the frequency nodes can be used to compute term frequencies. The frequency node you need is named "TF" (term frequency). Use the option "absolute" in the node's dialog. The TF values are the frequencies of a term in a given document. If you want the frequency of terms in the corpus, over all documents, you need to group (with the "GroupBy" node) over the terms and aggregate the TF values. Hope this will help you.

Cheers, Kilian
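For readers who want to check the counting logic outside KNIME, here is a minimal Python sketch of the same idea: one entry per term per respondent, then aggregation over all respondents. It is not the KNIME workflow itself. It assumes each comma-separated item in a response is one term (a simplification of the word-level bag of words), and the function name respondents_per_phrase is made up for illustration. The sample data is taken from the original question.

```python
import csv
import io
from collections import Counter

# Sample rows from the original question: respondent ID, free-text response.
SAMPLE = '''294,technology
296,"poverty, pollution, laziness, video gaming"
315,Ecology
322,"the economy, discrimination, poor schools"
324,"global warming, terrorism, national debt"
'''

def respondents_per_phrase(csv_text):
    """Count how many respondents mention each comma-separated phrase.

    Assumption: every comma-separated item in a response is one term.
    """
    counts = Counter()
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        response = row[1]
        # Lower-case and trim each phrase; the set counts a respondent
        # at most once per phrase, even if they repeat it.
        phrases = {p.strip().lower() for p in response.split(",") if p.strip()}
        counts.update(phrases)
    return counts

counts = respondents_per_phrase(SAMPLE)
print(counts["global warming"], counts["national debt"])
```

Real responses are rarely this tidy, so in practice the splitting step would need the tokenization and preprocessing that the Text Processing nodes provide.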

VikR · December 7, 2012, 8:33pm

Thanks very much!

sccardais · December 7, 2015, 2:43pm

Can the Bag of Words approach be modified to group strings of text defined as records in a dictionary of phrases that are associated with a user-defined set of topics?

If not, is there a workflow to do this? For example, ...

Sample Topic / Phrase Assignment

Word / Phrase	Topic
Confusing	Complex
Not Intuitive	Complex
Not easy	Complex
Difficult	Complex
Complicated	Complex

This table of Phrases and Topic Assignments would be refined and updated manually to improve results.

In case the table inserted above doesn’t render properly, the topic "Complex" would be assigned to all rows if the target field contained "Confusing, Not Intuitive, Not easy, Difficult, Complicated."

Geo · December 7, 2015, 10:20pm

A bag of words works best when word order is not important. In your case it appears that word order is important and groups of up to two words appear to be meaningful. N-grams may work better than a bag of words: set them on words (not characters), with n max set to 2 and n min set to 1. You would then extract unigrams and bigrams from the documents, and each term or term group can be matched against the dictionary. Is that what you meant?
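To make the n-gram matching concrete, here is a rough Python sketch (not a KNIME workflow). It extracts word unigrams and bigrams from a response and looks each one up in the phrase/topic table from the post above. The names word_ngrams and topics_for, the simple regex tokenizer, and the sample sentences are assumptions made for illustration only.

```python
import re

# Phrase-to-topic table from the post above; keys are lower-cased for matching.
PHRASE_TOPICS = {
    "confusing": "Complex",
    "not intuitive": "Complex",
    "not easy": "Complex",
    "difficult": "Complex",
    "complicated": "Complex",
}

def word_ngrams(text, n_min=1, n_max=2):
    """Return all word n-grams of length n_min..n_max, lower-cased."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [
        " ".join(words[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(words) - n + 1)
    ]

def topics_for(text, phrase_topics=PHRASE_TOPICS):
    """Return the set of topics whose phrases occur as n-grams in text."""
    return {phrase_topics[g] for g in word_ngrams(text) if g in phrase_topics}

print(topics_for("The menu is not intuitive"))
print(topics_for("Works fine"))
```

Because the dictionary is a plain lookup table, it can be refined by hand as described above without touching the matching code. Phrases longer than two words would need a larger n max.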

sccardais · December 8, 2015, 11:56am

Geo

Yes, n-grams sound like they will work for part of my project. I will give it a try. Thank you.

Scott