Text classification starting from Excel

PeterDoomen · May 4, 2020, 1:44pm

I have an Excel file with 700 rows and 2 columns. The first column contains the category, the second the content of an email message. I want to use this file to train a Naive Bayes learner and then classify another file with new, unclassified input.

What are the minimum components I need (apart from the file reader and the Naive Bayes nodes)? How do I connect and configure them? Any help is greatly appreciated!

Peter.

Martyna · May 4, 2020, 2:51pm

Hi @PeterDoomen

You will definitely need the Textprocessing Extension.

I would recommend doing some preprocessing on the email text column.
For machine learning in particular, it often makes sense to filter out punctuation, numbers, and stop words, and possibly to apply lemmatization or stemming.

If you look at this example that we share on the hub: https://kni.me/w/UlQZYUQlD2_jinwF
You can see what the whole process could look like. Just drag and drop it.
The important thing is that, after reading the Excel table, you need to use the String to Document node to transform your data into the “Document” type. The Textprocessing nodes do not work with the “String” type, so this conversion is required.
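
For readers who want to see what these preprocessing steps amount to outside KNIME, here is a minimal Python sketch. It is not what the Textprocessing nodes run internally; the function name `preprocess` and the tiny English stop word list are made up for illustration, and stemming or lemmatization is left out.

```python
import re

# Hypothetical, deliberately tiny stop word list (illustration only).
STOP_WORDS = {"the", "a", "an", "and", "or", "is", "are", "to", "of", "in", "for"}

def preprocess(text, stop_words=STOP_WORDS):
    """Lowercase the text, drop punctuation and numbers, then remove stop words."""
    # [^\W\d_]+ matches runs of word characters that are neither digits nor
    # underscores, so accented letters (e.g. in Dutch text) are kept.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [tok for tok in tokens if tok not in stop_words]

print(preprocess("The invoice #4521 is due in 3 days, please pay!"))
# ['invoice', 'due', 'days', 'please', 'pay']
```

A custom stop word list (for example a Dutch one) could be passed in through the `stop_words` argument.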

Please let me know if there are further questions coming up!

Best,
Martyna

PeterDoomen · May 5, 2020, 9:37am

Thanks, I already installed the extension but I did not know about the workflow example. It seems to work. I have compiled a stop word list for Dutch if anyone is interested.

PeterDoomen · May 5, 2020, 2:39pm

The model works well. Now I have a list of unclassified emails in Excel. What is the fastest way to classify these using the model? Should I copy the workflow from the beginning right through the decision tree learner? Or is there a better way?

ScottF · May 5, 2020, 4:19pm

Sounds like your case may be a good candidate for using the Document Vector Applier node. Here’s another workflow from the Text Processing course that demonstrates how it’s used:
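
Conceptually, the point of an applier step is that new documents are turned into vectors over the same feature space (the same terms) that the model saw during training, so terms that only occur in the new emails are ignored. The Python sketch below illustrates that idea under simplifying assumptions (binary bag-of-words, already tokenized input); the function names are hypothetical and this is not the node's actual implementation.

```python
def build_vocabulary(token_lists):
    """Collect the sorted set of terms seen in the training documents."""
    return sorted({tok for tokens in token_lists for tok in tokens})

def to_vector(tokens, vocabulary):
    """Binary bag-of-words vector over a fixed vocabulary; unseen terms are ignored."""
    present = set(tokens)
    return [1 if term in present else 0 for term in vocabulary]

# Toy training documents (already tokenized), made up for illustration.
train_docs = [["invoice", "payment", "due"], ["meeting", "agenda", "monday"]]
vocab = build_vocabulary(train_docs)

# A new, unclassified email: "reminder" is not in the training vocabulary.
new_email = ["payment", "reminder", "invoice"]

print(vocab)                        # ['agenda', 'due', 'invoice', 'meeting', 'monday', 'payment']
print(to_vector(new_email, vocab))  # [0, 0, 1, 0, 0, 1]
```

Because the vector layout is fixed by the training data, the trained model can be applied to the new rows without retraining.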

ipazin · May 6, 2020, 3:35pm

Hi there @PeterDoomen,

welcome to KNIME Community!

If you like, you can share your workflow together with the stop word list on KNIME Hub (or, if the data is confidential, just an example that uses the list).

Br,
Ivan

qqilihq · May 6, 2020, 5:28pm

Hi Peter,

I definitely suggest also trying the Palladian Text Classifier nodes. They are (at least) a strong baseline and, compared to the rather sophisticated and heavyweight Text Processing nodes from KNIME, super simple to set up (two nodes: one learner, one predictor) and fast. The preprocessing can be configured for different n-gram settings, and the classifier uses an optimized NB scoring algorithm.

More details here:

We (and our customers) are using this classifier for a wide variety of text classification tasks (e.g. sentiment analysis, product classification, language identification, …). It’s for sure not the right tool if you want to win today’s Kaggle challenges where you optimize for a per mille accuracy, but definitely a pragmatic tool for real-world use cases.
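
To give a rough feel for the "n-grams plus Naive Bayes" idea mentioned above, here is a small Python sketch. It is a textbook multinomial Naive Bayes over character trigrams with Laplace smoothing, not Palladian's optimized scoring; all names and the toy training data are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of a lowercased string."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def train(samples, n=3):
    """samples: iterable of (category, text) pairs. Returns the counts used for scoring."""
    ngram_counts = defaultdict(Counter)  # category -> n-gram -> count
    doc_counts = Counter()               # category -> number of training documents
    for category, text in samples:
        doc_counts[category] += 1
        ngram_counts[category].update(char_ngrams(text, n))
    vocab = {gram for counts in ngram_counts.values() for gram in counts}
    return ngram_counts, doc_counts, vocab

def predict(model, text, n=3):
    """Return the category with the highest log-probability (Laplace smoothing)."""
    ngram_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best, best_score = None, -math.inf
    for category, counts in ngram_counts.items():
        score = math.log(doc_counts[category] / total_docs)  # log prior
        denom = sum(counts.values()) + len(vocab)
        for gram in char_ngrams(text, n):
            score += math.log((counts[gram] + 1) / denom)    # smoothed likelihood
        if score > best_score:
            best, best_score = category, score
    return best

# Toy training data, made up for illustration.
samples = [
    ("billing", "invoice payment overdue"),
    ("billing", "please pay the invoice"),
    ("meeting", "meeting agenda for monday"),
    ("meeting", "schedule a meeting next week"),
]
model = train(samples)
print(predict(model, "unpaid invoice"))
print(predict(model, "agenda for the meeting"))
```

Character n-grams need no tokenizer or stop word list, which is one reason this kind of setup is quick to get running; the trade-off is a larger feature space than word-level features.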

If you have questions about the classifier, let me know.

– Philipp

PS: I built a workflow to train a simple language detection model a while ago. It’s still available here:
https://www.knime.com/book/text-classifier

PeterDoomen · May 8, 2020, 6:09am

Thanks! That way I have two options…

mlauber71 · June 20, 2020, 10:59am

@PeterDoomen maybe this upcoming free online course is relevant:

Text Mining Techniques

June 25, 2020 - Online

https://www.knime.com/about/events/text-mining-techniques-online-june-25-2020

metinergoktas · August 24, 2020, 6:22am

Hi Martyna
The link you posted is no longer valid. Could you please update it so that I can see the example workflow?
Thanks a lot.

ScottF · August 24, 2020, 3:49pm

Hi @metinergoktas -

I’m not sure if this is precisely the workflow Martyna linked above, but it covers the same concepts. Give it a try:

The top branch of the workflow uses the text preprocessing nodes, but maybe the rest of the workflow is of interest as well?
