Document classification

Dear KNIME Community,
I’m currently writing my master’s thesis using KNIME. It is mainly about comparing different classification algorithms. The email categories are already predefined, and the data set contains 877 emails. I built my workflow based on your sample workflow ‘Document_Classification’. I do not have much time left, and unfortunately my professor cannot help me because he has not worked with KNIME. Would it be possible for someone to look at my workflow and tell me whether it is set up correctly? With the KNN and SVM algorithms I always get an error message, but I do not know why. I would be extremely grateful if you took the time to check my workflow. I also looked at the results of the decision tree algorithm, and they are very good. Is there something wrong?

Thank you very much in advance and kind regards,
Canan

Masterarbeit 2.knwf (654.0 KB)

Hi @anon33357744 -

Could you also upload your Excel input file, or is that data proprietary?

Hey Scott,

Unfortunately it’s proprietary. The Excel file has three columns: the first is the ID, then the category of the email, and then the email body. Do you think anything is missing in my workflow?
Are the results of the algorithms correct? I have to compare them with each other.

Thanks and kind regards,
Canan

That’s OK. Can you post more information about the errors you’re seeing with KNN and SVM?

Unfortunately, since you posted a reset workflow, we can’t see the results from this end - so it will be hard to offer any advice on that front.

Hey Scott,
What do you mean by a reset workflow? Should I upload my workflow again?
Thanks, Canan

Is it possible to send it to you by email?
Thanks,
Canan

When you export a workflow, by default, all the nodes are reset. You can skip the reset step during export, but then that would leave your proprietary data in place.

Perhaps you could post a few screenshots of your errors and results?

The first result is the result of the Random Forest algorithm… [image]

The workflow is as follows: [image]

Why is the result so bad? [image]

Thank you,
Canan

Hi Canan,

How many classes do you have, and how many examples do you have for each class?
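As a rough illustration of why the class counts matter, here is a minimal Python sketch (outside KNIME; the category names are made up and not taken from your data). It counts the examples per class and computes the accuracy you would get by always predicting the most frequent class. If one class dominates, that baseline is already high, so a classifier's accuracy should be read against it.

```python
from collections import Counter


def class_distribution(labels):
    """Return (label, count, share) tuples sorted by descending count."""
    counts = Counter(labels)
    total = sum(counts.values())
    return [(label, n, n / total) for label, n in counts.most_common()]


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent class."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / sum(counts.values())


# Hypothetical category labels, used only for illustration.
example_labels = ["invoice"] * 6 + ["complaint"] * 3 + ["other"] * 1

for label, n, share in class_distribution(example_labels):
    print(f"{label:10s} {n:3d} {share:.0%}")
print("majority baseline:", majority_baseline_accuracy(example_labels))
```

Inside KNIME, a GroupBy node on the category column with a count aggregation should give the same per-class numbers.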

Best, Iris

Hi Iris,

The attached list shows the categories and the number of documents in each.

Thanks and best regards,
Canan

Email_Kategorien_Anzahl.xlsx (9.5 KB)

MA-Neueste Version.knwf (659.7 KB)

I do not know whether something is missing in my workflow, or whether the configuration of each node is correct.
Here are some results of the algorithms.

Decision tree:


SVM:

KNN:

Neural Network:

KNN with cosine distance:

KNN with numeric distance:

Tree ensemble:

Gradient boosted Tree:

Do you think my workflow is complete? I would like to compare and evaluate the results of the algorithms on the basis of this workflow.

In the literature I have read that SVM models usually perform better than the other algorithms, but in my case the decision tree performs better. Is something wrong here?

Thanks and kind regards,
Canan