Classifying Documents doesn't work using Naive Bayes

Hi, I am new to KNIME and trying to build a workflow for document classification. The classifier should be a Naive Bayes model, but it does not seem to classify my documents properly: it simply assigns every document to the class that contains the most documents.
Before passing the data to the Naive Bayes Learner, I transform my documents into document vectors using the TF*IDF values calculated earlier. But then the Naive Bayes Learner shows the following warning: “The following columns are not supported and thus will be ignored: Document, Document Vector.”
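
For readers unfamiliar with the term, here is a minimal sketch of what "document vectors with TF*IDF values" means. It assumes relative term frequency and a plain logarithmic IDF; the weighting variant used by the KNIME nodes may differ, and the function name and sample documents are made up for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, vectors): one TF*IDF vector per tokenized, non-empty document."""
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        # Relative term frequency times log inverse document frequency.
        vectors.append(
            [(counts[t] / len(doc)) * math.log(n_docs / df[t]) for t in vocab]
        )
    return vocab, vectors


# Hypothetical, already tokenized documents.
docs = [["cheap", "pills", "cheap"], ["meeting", "agenda"], ["cheap", "meeting"]]
vocab, vectors = tfidf_vectors(docs)
```

Each document becomes one row with one numeric entry per vocabulary term, so most entries are zero for any realistic vocabulary.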

How do I have to configure the Naive Bayes Learner so that it uses the Document Vector column and the TF*IDF values to build a classification model?

I already tried the workflow without the Document Vector, but then the Learner works on every single term rather than on the documents, and it builds the model from the given Category column, so every single document is classified correctly.

It would be awesome if anyone had an idea how to solve my problem.

Hi josifrs,

could you maybe attach a workflow showing your problem? What data type does the Document Vector column have? I think you have to convert it first in order to use it as a feature column.

Cheers,
David

Hi DaveK,

Should have included the workflow right at the beginning, sorry.

Textclassification.knwf (37.8 KB)

I think the document vector includes numerical data, at least this was my plan.

Cheers,
Josi

Hi josifrs,

the Naive Bayes Learner does not understand Document Vector columns at the moment. In order to fix your problem, you can simply use a Split Collection Column node before the Partitioning, which will create a separate column per document vector entry. Then the Learner will use these columns for the model.
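
Conceptually, the split turns one list-valued column into many scalar columns. The following is only a plain-Python sketch of that transformation, not KNIME's implementation; the function name, the `term_` column prefix, and the sample rows are assumptions made for illustration.

```python
def split_collection_column(rows, column, prefix="term_"):
    """Replace a list-valued column with one scalar column per list entry."""
    result = []
    for row in rows:
        # Keep every column except the collection column itself.
        flat = {key: value for key, value in row.items() if key != column}
        for i, value in enumerate(row[column]):
            flat[f"{prefix}{i}"] = value
        result.append(flat)
    return result


# Hypothetical rows: a category plus a document vector of TF*IDF values.
rows = [
    {"Category": "spam", "Document Vector": [0.27, 0.0, 0.37]},
    {"Category": "ham", "Document Vector": [0.0, 0.55, 0.0]},
]
flat_rows = split_collection_column(rows, "Document Vector")
```

After the split, every vector entry is an ordinary numeric column that a learner can treat as a feature.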

Thank you very much! First problem solved.

But still all documents are classified into the same category. Do you have any idea what the problem might be?
Could it be that the TF*IDF values aren’t significant enough?

You could try using a different classifier like Random Forest, which should work better than Naive Bayes.

Hey @josifrs

would it be possible to share your updated workflow containing the training1.txt data with us (execute your workflow, save it, and uncheck reset workflow in the export dialog)?

Hey Mark,
here it is. Unfortunately it was too big to upload directly.
https://drive.google.com/open?id=1nPCti1oi44VUTupJ9oBoMrevz3q6K9F9

@josifrs,

Thanks. At first sight, the problem seems to be related to the fact that approx. 50% of the entries in the statistics table have a standard deviation equal to 0.
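
To illustrate one way a zero standard deviation can hurt, here is a minimal sketch. It assumes numeric columns are modelled with a per-class Gaussian, as is common for Naive Bayes, and it is not a description of what the KNIME Predictor does internally. The statistics, class names, and the `min_std` floor are hypothetical.

```python
import math


def gaussian_log_density(x, mean, std, min_std=0.0):
    """Log of the normal density; min_std is an optional floor for the std."""
    std = max(std, min_std)
    if std == 0.0:
        # Treated as a point mass (illustrative convention): any value
        # other than the mean gets zero density, i.e. log density -inf.
        return 0.0 if x == mean else float("-inf")
    return -math.log(std * math.sqrt(2 * math.pi)) - (x - mean) ** 2 / (2 * std ** 2)


def class_scores(x, stats, priors, min_std=0.0):
    """Log prior plus summed per-feature log densities for each class."""
    return {
        label: math.log(priors[label])
        + sum(gaussian_log_density(v, m, s, min_std) for v, (m, s) in zip(x, stats[label]))
        for label in stats
    }


# Hypothetical per-class (mean, std) pairs for two TF*IDF columns.
stats = {"sports": [(0.0, 0.0), (0.3, 0.1)], "politics": [(0.2, 0.1), (0.0, 0.0)]}
priors = {"sports": 0.5, "politics": 0.5}
x = [0.1, 0.1]

raw = class_scores(x, stats, priors)                    # every class is vetoed
floored = class_scores(x, stats, priors, min_std=0.05)  # scores stay finite
```

With sparse TF*IDF columns, a term that never occurs in a class has mean 0 and standard deviation 0 there, so a single non-zero value can rule out that class entirely. Flooring the standard deviation is one common workaround, but whether it helps on this data set is untested.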

I’ll get back to you as soon as possible.

@josifrs

sorry that it took me so long to get back to you.

While investigating your problem we came across a corner case that KNIME’s Naive Bayes Predictor handles slightly differently from other implementations, and it took some time to adapt this (please download 3.7.1 once it is released, which should be very soon).

Anyway, any other Naive Bayes implementation I tested always predicts the same class, or nothing at all :). I’ll spare you the details.

You could try another classifier, as proposed by @DaveK, instead.

I’m very sorry that I can’t provide you with a proper solution to your problem.

@Mark_Ortmann

Okay, but I really thank you and @DaveK for investigating and trying to solve my problem.

So just one last question: which classifier would you recommend for this kind of problem? I tried SVM, but I guess there is too much data or my computer can’t handle it, because the predictor node aborts every time.

For a no-brainer, I’d suggest giving the PalladianTextClassifier a try. You won’t need any preprocessing nodes. Just connect it to a table which provides two columns (document text and category, both as plain strings).

You can fine-tune the preprocessing and prediction settings directly in the node configuration. That means you get learning and prediction with basically two nodes.

https://nodepit.com/category/community/palladian/textclassifier

Explanation and some samples here:
https://www.knime.com/book/text-classifier

Sophisticated workflows with custom, domain-specific settings and more sophisticated classifiers (e.g. random forest, SVM, DL, …) will certainly give you more opportunities for fine-tuning (if you’re willing to invest the time), but the Palladian classifier is definitely a good baseline.

PS: The internal algorithm of the PTC is basically Naive Bayes based. More details should be available at the link above and/or in the node documentation.
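
To give a rough idea of what a Naive Bayes classifier working directly on text and category strings can look like, here is a small word-count sketch with add-one smoothing. It is not the Palladian implementation, whose features and settings may differ considerably; all function names and sample texts are made up.

```python
import math
from collections import Counter, defaultdict


def train(samples):
    """Count words and documents per category from (text, category) pairs."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, category in samples:
        doc_counts[category] += 1
        word_counts[category].update(text.lower().split())
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, doc_counts, vocab


def predict(text, model):
    """Return the category with the highest smoothed log posterior."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best, best_score = None, float("-inf")
    for category in doc_counts:
        total_words = sum(word_counts[category].values())
        score = math.log(doc_counts[category] / total_docs)
        for word in text.lower().split():
            # Add-one smoothing keeps unseen words from zeroing out a class.
            count = word_counts[category][word]
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best, best_score = category, score
    return best


# Hypothetical training table with two string columns: text and category.
samples = [
    ("the team won the match", "sports"),
    ("a late goal decided the match", "sports"),
    ("parliament passed the budget", "politics"),
    ("the minister defended the budget", "politics"),
]
model = train(samples)
label = predict("the match was decided by a goal", model)
```

Because this variant works on smoothed word counts rather than Gaussian statistics of TF*IDF columns, it does not run into zero standard deviations.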
