I’m looking to do some text processing in KNIME but I am completely new to it.
I have a preliminary question, but may have more to follow:
- Is there a way for me to count punctuation marks and capital letters in a string?
I want to analyze text for sentiment, and I’ve found the sentiment analysis workflow. I also want to do a series of counts: occurrences of specific phrases, and uses of words from predetermined categories. I then want to do some statistical categorical analysis of these counts, which I may do in RStudio as I am more comfortable with it.
What I need to know is whether there are any workflows I haven’t seen that cover these tasks. Also, I’d like to be able to build a table and store the numbers as I collect these counts. What is the best way of going about this? I know there are many options for table creation.
Thanks in advance!
Regarding simple character counts, you can use the String Manipulation node. Among others, it has a count and a countChars function. For example, to count the number of punctuation characters, you can use:
This will count the number of occurrences of any of the given characters in the second argument.
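In case it helps to see the same idea outside the node, here is a minimal plain-Java sketch of this kind of counting, covering both punctuation marks and capital letters. It is not the String Manipulation node's own implementation; the class name, the method names, and the punctuation set ".,;:!?" are all assumptions made up for illustration.

```java
public class CharCounts {

    // Counts how many characters in text are members of the set given in chars.
    // (Hypothetical helper, written to mirror the behaviour described above.)
    public static int countAnyOf(String text, String chars) {
        int n = 0;
        for (int i = 0; i < text.length(); i++) {
            if (chars.indexOf(text.charAt(i)) >= 0) {
                n++;
            }
        }
        return n;
    }

    // Counts upper-case letters using Java's built-in character classification.
    public static int countUpperCase(String text) {
        int n = 0;
        for (int i = 0; i < text.length(); i++) {
            if (Character.isUpperCase(text.charAt(i))) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        String text = "Hello, World! Is KNIME fun?";
        System.out.println("punctuation: " + countAnyOf(text, ".,;:!?"));
        System.out.println("capitals:    " + countUpperCase(text));
    }
}
```

Inside a Java Snippet node, the loop bodies could be pasted in with the input column in place of text, but treat that as a sketch to adapt rather than a tested snippet.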
For more sophisticated features, you can use some of the “Snippet” nodes, e.g. a Java Snippet (I used this in the example below to count the number of tokens by doing a RegEx split on whitespace characters):
out_tokenCount = c_text.split("\\s+").length;
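One thing to watch with a plain split: in Java, an empty string still yields one (empty) token, and leading whitespace produces an empty first token, so the count can be off by one. A slightly more defensive sketch, written as a standalone class so it can be tried outside KNIME (the class and method names are made up for this example):

```java
public class TokenCount {

    // Counts whitespace-separated tokens, returning 0 for null, empty,
    // or whitespace-only input instead of 1.
    public static int countTokens(String text) {
        if (text == null) {
            return 0;
        }
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return 0;
        }
        return trimmed.split("\\s+").length;
    }

    public static void main(String[] args) {
        System.out.println(countTokens("  two   tokens "));
        System.out.println(countTokens(""));
    }
}
```

In the Java Snippet itself, the same check would go in front of the assignment to out_tokenCount.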
Here’s an example workflow which counts several metrics:
That was extremely helpful, thank you.
Now, is there a way I can write my counts to a table with the rest of my data?
The data I’m collecting has columns for news article id, headline, dateline, url, and content. I want my final table to have all of these columns alongside counts for certain specs I’ve gotten using KNIME.
Is there a way I can write these values to one big table? Better yet, is there a way I can write these results to my Excel file?
I have a couple wordsets I would like to search the documents for and get counts for occurrences of specific phrases.
Is there a way I can establish a node to cycle through the file containing the terms I would like to count and reference my documents to then give the number of times every phrase occurs?
I think I may be having problems with my Strings To Document node; could you please look this over? After I use the Strings To Document node, my content appears as “”, which I don’t think is what I want. Please help!
The workflow I’m talking about is labeled “Sentiment analysis R2” but all of the strings to document nodes
text analysis pbt1.knwf (46.8 KB)
There is no data in your WF. Could you re-post it?
Here’s the workflow where I’m trying to conduct the same procedure as in the lexicon based sentiment analysis workflow. I’m not receiving counts that make sense. Here’s the file:
upload 1.knwf (56.1 KB)
Could you please tell me if you determine what’s going wrong with it?
On another note, I have three categories of terms. I have the lists stored in Excel files. I want to count the occurrences of those terms in my documents and total them for each category. Is this done using the Dictionary Tagger, as in this example workflow, or some other way?
Please let me know as soon as you can!
Hi @pkaren626 -
I looked at your workflow, but I can’t see what’s strange about the counts because you haven’t included the R2.xlsx input file. It’s understandable if the data is proprietary and you can’t share it, so I just ran the workflow using the usual IMDB data and I’m not noticing anything odd.
But maybe I can help just by talking in general terms about what the nodes do.
All the Dictionary Tagger does is assign a particular tag to words in a document, based on the list you provide. This doesn’t have to be sentiment - it could be anything. We provide several different tag types for you to select here such as parts of speech, named entities, and others.
The Bag of Words Creator breaks the document down into its constituent words (really, tokens) and their associated terms.
The TF node does the word frequency calculation for each document.
The aggregation metanode uses a combination of nodes to pull out only the tagged words, and count those.
So to answer your last question about counting tagged words, you’ll want to look at the output of the aggregation metanode.
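If it helps to see the logic without the nodes, here is a rough plain-Java analogue of that chain: tag tokens against per-category word lists, then total the tagged tokens per category for one document. This is only a sketch and not how the KNIME nodes are implemented. The class name, method name, category names, and word lists are invented for illustration, and it assumes single-word, lower-case dictionary entries (multi-word phrases would need a different matching step).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

public class CategoryCounts {

    // For one document, counts how many tokens fall into each category's word list.
    // dictionaries maps a category name to its (lower-case, single-word) terms.
    public static Map<String, Integer> countByCategory(
            String document, Map<String, Set<String>> dictionaries) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String category : dictionaries.keySet()) {
            counts.put(category, 0);
        }
        // Crude tokenisation: lower-case, then split on anything that is not a letter or digit.
        String[] tokens = document.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+");
        for (String token : tokens) {
            if (token.isEmpty()) {
                continue;
            }
            for (Map.Entry<String, Set<String>> entry : dictionaries.entrySet()) {
                if (entry.getValue().contains(token)) {
                    counts.merge(entry.getKey(), 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical word lists; in practice these would come from the Excel files.
        Map<String, Set<String>> dictionaries = new LinkedHashMap<>();
        dictionaries.put("positive", new HashSet<>(Arrays.asList("good", "great")));
        dictionaries.put("negative", new HashSet<>(Arrays.asList("bad", "awful")));

        String doc = "Good plot, great cast, bad ending. Good!";
        System.out.println(countByCategory(doc, dictionaries));
    }
}
```

The result is one count per category per document, which is the same shape of table you should see coming out of the aggregation metanode.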
Does that help?