Categorizing Unstructured Text

I would like to assign topics to unstructured text submitted in the form of Software Support Tickets. The overall goal is to determine what each ticket is "about."

We receive about 3,500 tickets each month. The field we want to analyze contains an average of 50 characters of unstructured text. Because the data is unstructured, customers describe similar topics using different words and phrases. The challenge (obviously) is grouping unstructured text into one or more topics in order to rank topics by frequency.

In this situation, could KNIME suggest common topics by evaluating the text on its own (without a training set)? Is this realistic? Which node(s) would we use in this case?

If we wanted to create our own categorization rules, is there a node (or nodes) that would allow us to assign custom "topics" or categories based on the presence of words or phrases we define? For example, tickets containing string 1, string 2, or string 3 would be assigned to Topic A, where the strings and topics are defined by us. If so, which nodes would we use to do this?

Finally, are there books containing sample workflows and results that explain the analysis and categorization of unstructured text to identify patterns (not just sentiment) in more detail? I've purchased KNIME Beginner's Luck by Dr. Rosaria Silipo and am working my way through it, but I'd also like to read books or papers that focus specifically on unstructured text analysis.

Thanks in advance for any guidance and suggestions.


Text can be categorized in two ways: supervised and unsupervised. The unsupervised approach is similar to clustering: you don't know in advance exactly how many clusters/topics are in the data. For unsupervised topic extraction, use the Topic Extractor node. As with k-means, you need to specify the number of topics that you want to extract.
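
To make the clustering analogy concrete, here is a minimal Python sketch of plain k-means over bag-of-words vectors. It is not what the Topic Extractor node does internally and it is not KNIME code; it only illustrates that you choose the number of groups up front and then read off the heaviest terms in each group as a rough "topic." The ticket texts and all function names are made up for illustration.

```python
from collections import Counter

# Hypothetical ticket texts, invented for this sketch.
TICKETS = [
    "login password expired",
    "password reset login fails",
    "printer offline print fails",
    "printer print error",
]

def vectorize(text):
    """Bag-of-words vector: term -> count."""
    return Counter(text.lower().split())

def dist2(a, b):
    """Squared Euclidean distance between two sparse term vectors."""
    return sum((a.get(t, 0.0) - b.get(t, 0.0)) ** 2 for t in set(a) | set(b))

def mean(vectors):
    """Average of several sparse term vectors."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return {t: c / len(vectors) for t, c in total.items()}

def kmeans(vectors, k, iterations=20):
    """Plain k-means; k (the number of groups) must be chosen by the caller."""
    # Deterministic start: first vector, then repeatedly the vector farthest
    # from the centroids picked so far.
    centroids = [dict(vectors[0])]
    while len(centroids) < k:
        farthest = max(vectors, key=lambda v: min(dist2(v, c) for c in centroids))
        centroids.append(dict(farthest))
    labels = None
    for _ in range(iterations):
        new_labels = [min(range(k), key=lambda j: dist2(v, centroids[j])) for v in vectors]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [v for v, l in zip(vectors, labels) if l == j]
            if members:
                centroids[j] = mean(members)
    return labels, centroids

def top_terms(centroid, n=2):
    """Heaviest terms of a centroid, ties broken alphabetically."""
    ranked = sorted(centroid.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:n]]

labels, centroids = kmeans([vectorize(t) for t in TICKETS], k=2)
```

With real tickets you would normally remove stop words and try several values of k, since a poor choice of k merges or splits topics.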

For supervised topic assignment you need training data. This is a classification-based approach in which you train a predictive model to predict the class/topic of a document based on the terms it contains. A node that creates rules to classify documents (e.g. if A or B and C is contained, then class A) would be a decision tree learner.
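
As a rough illustration of how a decision tree turns labeled documents into "if term is contained" rules, here is a small Python sketch of an ID3-style tree over term presence. It is not KNIME's Decision Tree Learner and makes no claim about how that node works internally; the training examples, labels, and function names are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical labeled tickets, invented for this sketch.
TRAINING = [
    ("cannot login password expired", "Login"),
    ("password reset link broken", "Login"),
    ("printer not printing pages", "Printing"),
    ("printer offline again", "Printing"),
    ("report export is slow", "Performance"),
    ("dashboard slow to load", "Performance"),
]

def tokens(text):
    """Set of lowercased terms contained in a document."""
    return set(text.lower().split())

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def build_tree(docs, labels, vocab):
    """Return a label (leaf) or a tuple (term, subtree_if_present, subtree_if_absent)."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not vocab:
        return majority

    def gain(term):
        # Information gain of splitting on "document contains term".
        yes = [l for d, l in zip(docs, labels) if term in d]
        no = [l for d, l in zip(docs, labels) if term not in d]
        if not yes or not no:
            return 0.0
        n = len(labels)
        return entropy(labels) - len(yes) / n * entropy(yes) - len(no) / n * entropy(no)

    best = max(sorted(vocab), key=gain)
    if gain(best) <= 0.0:
        return majority
    yes_idx = [i for i, d in enumerate(docs) if best in d]
    no_idx = [i for i, d in enumerate(docs) if best not in d]
    rest = vocab - {best}
    return (
        best,
        build_tree([docs[i] for i in yes_idx], [labels[i] for i in yes_idx], rest),
        build_tree([docs[i] for i in no_idx], [labels[i] for i in no_idx], rest),
    )

def train(examples):
    docs = [tokens(text) for text, _ in examples]
    labels = [label for _, label in examples]
    return build_tree(docs, labels, set().union(*docs))

def predict(tree, text):
    """Walk the tree: each inner node is one 'term is contained' rule."""
    doc = tokens(text)
    while not isinstance(tree, str):
        term, if_present, if_absent = tree
        tree = if_present if term in doc else if_absent
    return tree

tree = train(TRAINING)
```

Note that a tree like this always returns some class, even for text that shares no terms with the training data, so in practice you would want enough labeled tickets per topic and a held-out set to check accuracy.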

Manual rules can be created with, e.g., the Rule Engine node. To do this, you first need to transform the documents into vectors and then apply the rules to those vectors.
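
The two steps (turn each document into a vector, then apply rules to the vector) can be sketched in plain Python as follows. This is not Rule Engine syntax; the topics, phrases, and function names are invented for illustration, and matching is a simple case-insensitive substring test.

```python
from collections import Counter

# Hypothetical rules: topic -> phrases that trigger it.
RULES = {
    "Login": ["password", "log in", "login", "locked out"],
    "Printing": ["print", "printer"],
    "Performance": ["slow", "timeout"],
}

def to_vector(text, rules=RULES):
    """Step 1: one 0/1 entry per phrase, 1 if the phrase occurs in the ticket."""
    lowered = text.lower()
    return {phrase: int(phrase in lowered)
            for phrases in rules.values() for phrase in phrases}

def assign_topics(text, rules=RULES):
    """Step 2: a topic applies if any of its phrases is present in the vector."""
    vector = to_vector(text, rules)
    topics = [topic for topic, phrases in rules.items()
              if any(vector[p] for p in phrases)]
    return topics or ["Unassigned"]

def rank_topics(tickets, rules=RULES):
    """Count how often each topic is assigned across all tickets."""
    counts = Counter()
    for ticket in tickets:
        counts.update(assign_topics(ticket, rules))
    return counts
```

A ticket can match several topics here, which fits the "one or more topics" goal in the question; the "Unassigned" bucket shows how many tickets the current rules fail to cover.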

The only white papers about text processing that are online are the following two:

Please note that they are a bit out of date. Unfortunately, we don't have a book about text processing so far.

Cheers, Kilian