I have an Excel file consisting of 700 rows and 2 columns. The first column contains the category, the second the contents of an email message. I want to use this file to train a naive Bayes learner to classify another file with new (unclassified) input.
What are the minimum components I need (apart from the file reader and the Naive Bayes nodes)? How do I connect and set them up? Any help is greatly appreciated!
You will definitely need the Textprocessing Extension.
What I would recommend is doing some preprocessing on the email text column.
Especially for machine learning, it often makes sense to filter out punctuation, numbers, and stop words, and possibly to apply lemmatization or stemming.
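To make those preprocessing steps a bit more concrete outside of KNIME, here is a minimal Python sketch of what they do to a single email text. It is only an illustration, not what the Textprocessing nodes run internally: the `STOP_WORDS` set and the `crude_stem` helper are hypothetical stand-ins for a real stop word list and a real stemmer.

```python
import re

# Hypothetical, tiny stop word list for illustration only.
STOP_WORDS = {"the", "a", "an", "and", "is", "to", "of", "in"}

def crude_stem(token):
    """Very crude suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip if at least three characters remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Lowercase, drop punctuation and numbers, remove stop words, stem."""
    # Keep only runs of letters; punctuation and digits are discarded.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("The invoices are attached, sending 3 reminders!"))
# ['invoic', 'are', 'attach', 'send', 'reminder']
```

Note that "are" survives only because it is missing from the toy stop word list, which is why a proper language-specific list matters.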
Have a look at this example that we share on the Hub: https://kni.me/w/UlQZYUQlD2_jinwF
It shows what the whole process could look like. Just drag and drop it.
The important thing is that, after reading the Excel table, you need to use the String to Document node to transform your data into the “Document” type. The Textprocessing nodes do not work with the “String” type, so a conversion is needed here.
Please let me know if there are further questions coming up!
Thanks, I already installed the extension but I did not know about the workflow example. It seems to work. I have compiled a stop word list for Dutch if anyone is interested.
The model works well. Now I have a list of unclassified emails in Excel. What is the fastest way to classify these using the model? Should I copy the workflow from the beginning right through the decision tree learner? Or is there a better way?
Sounds like your case may be a good candidate for using the Document Vector Applier node. Here’s another workflow from the Text Processing course that demonstrates how it’s used:
Hi there @PeterDoomen,
welcome to KNIME Community!
If you want, you can share your workflow together with the stop word list (or, if the data is confidential, just an example that uses the list) on KNIME Hub.
I definitely suggest also trying the Palladian Text Classifier nodes. They are (at least) a strong baseline and, compared to the rather sophisticated and heavyweight Text Processing nodes from KNIME, super simple to set up (two nodes: one learner, one predictor) and fast. The preprocessing can be configured for different n-gram settings, and the classifier uses an optimized NB scoring algorithm.
More details here:
We (and our customers) are using this classifier for a wide variety of text classification tasks (e.g. sentiment analysis, product classification, language identification, …). It’s for sure not the right tool if you want to win today’s Kaggle challenges where you optimize for a per mille accuracy, but definitely a pragmatic tool for real-world use cases.
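For anyone curious what an n-gram naive Bayes text classifier does in principle, here is a rough Python sketch using only the standard library. To be clear, this is not Palladian's implementation or its scoring algorithm. It is a plain multinomial naive Bayes over character trigrams with add-one smoothing, and the function names `char_ngrams`, `train`, and `predict`, as well as the example rows, are made up for illustration. It also shows the learner/predictor split: the model is built once from the labelled rows and then applied to new, unlabelled text.

```python
import math
from collections import Counter, defaultdict

def char_ngrams(text, n=3):
    """Split text into overlapping character n-grams."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def train(rows, n=3):
    """Learner step. rows is an iterable of (category, text) pairs."""
    class_counts = Counter()
    feature_counts = defaultdict(Counter)
    vocabulary = set()
    for category, text in rows:
        grams = char_ngrams(text, n)
        class_counts[category] += 1
        feature_counts[category].update(grams)
        vocabulary.update(grams)
    return {"n": n, "classes": class_counts,
            "features": feature_counts, "vocab": vocabulary}

def predict(model, text):
    """Predictor step. Returns the category with the highest log score."""
    total_docs = sum(model["classes"].values())
    vocab_size = len(model["vocab"])
    best, best_score = None, -math.inf
    for category, doc_count in model["classes"].items():
        counts = model["features"][category]
        total = sum(counts.values())
        score = math.log(doc_count / total_docs)  # class prior
        for gram in char_ngrams(text, model["n"]):
            if gram not in model["vocab"]:
                continue  # ignore n-grams never seen during training
            # Add-one (Laplace) smoothed likelihood.
            score += math.log((counts[gram] + 1) / (total + vocab_size))
        if score > best_score:
            best, best_score = category, score
    return best

# Made-up training rows in the (category, email text) layout from the question.
rows = [
    ("invoice", "please pay the attached invoice"),
    ("invoice", "invoice payment overdue"),
    ("support", "my login does not work"),
    ("support", "cannot reset my password"),
]
model = train(rows)
print(predict(model, "payment for invoice"))      # invoice
print(predict(model, "reset my login password"))  # support
```

Character n-grams need no tokenizer, stop word list, or stemmer, which is one reason such a setup can be quick to get going as a baseline.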
In case of questions regarding the classifier – let me know
PS: I’ve built a workflow to train a simple language detection model a while ago. It’s still available here:
Thanks! That way I have two options…
@PeterDoomen maybe this upcoming free online course might be of relevance:
Text Mining Techniques
June 25, 2020 - Online
The link you posted is no longer valid. Could you please update it so that I can see the example workflow?
Thanks a lot.
Hi @metinergoktas -
I’m not sure if this is precisely the workflow Martyna linked above, but it covers the same concepts. Give it a try:
The top branch of the workflow uses the text preprocessing nodes, but maybe the rest of the workflow is of interest as well?