Classifying Documents doesn't work using Naive Bayes

josifrs · January 19, 2019, 12:23pm

Hi, I am new to KNIME and trying to build a workflow for document classification. The classifier should be a Naive Bayes model. But Naive Bayes does not seem to classify my documents; it just assigns every document to the class that most of the documents belong to.
Before passing the data to the Naive Bayes Learner, I transform my documents into document vectors using the TF*IDF values calculated earlier. But then the Naive Bayes Learner shows the following warning: “The following columns are not supported and thus will be ignored: Document, Document Vector.”

How do I configure the Naive Bayes Learner so that it uses the Document Vector column and the TF*IDF values to build a classification model?

I already tried the workflow without the Document Vector, but then the Learner works on every single term rather than on the documents, and it builds the model based on the given Category, so every single document is classified correctly.

It would be awesome if anyone had an idea how to solve my problem.

DaveK · January 21, 2019, 9:38am

Hi josifrs,

could you maybe attach a workflow showing your problem? What data type does the Document Vector column have? I think you have to convert it first in order to use it as a feature column.

Cheers,
David

josifrs · January 21, 2019, 4:30pm

Hi DaveK,

Should have included the workflow right at the beginning, sorry.

Textclassification.knwf (37.8 KB)

I think the document vector contains numerical data; at least that was my plan.

Cheers,
Josi

DaveK · January 21, 2019, 4:47pm

Hi josifrs,

the Naive Bayes Learner does not understand Document Vector columns at the moment. In order to fix your problem, you can simply use a Split Collection Column node before the Partitioning, which will create a separate column per document vector entry. Then the Learner will use these columns for the model.
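
To illustrate what splitting a collection column does conceptually, here is a minimal Python sketch. It is not KNIME code: the row layout, the function name `split_collection_column`, and the `term_` column prefix are made up for illustration, and it assumes every vector has the same length.

```python
# Hypothetical rows: each document has a category and one vector-valued cell.
rows = [
    {"category": "sports", "vector": [0.0, 1.2, 0.4]},
    {"category": "politics", "vector": [0.9, 0.0, 0.0]},
]

def split_collection_column(rows, column="vector", prefix="term_"):
    """Replace one list-valued column with one numeric column per entry."""
    result = []
    for row in rows:
        # Keep all other columns unchanged.
        flat = {key: value for key, value in row.items() if key != column}
        # Add one numeric column per vector entry.
        for index, value in enumerate(row[column]):
            flat[f"{prefix}{index}"] = value
        result.append(flat)
    return result

flat_rows = split_collection_column(rows)
print(flat_rows[0])
```

After the split, each vector entry is an ordinary numeric column, which is the kind of input a learner that ignores vector cells can work with.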

josifrs · January 21, 2019, 5:58pm

Thank you very much! First problem solved.

But all documents are still classified into the same category. Do you have any idea what the problem might be?
Could it be that the TF*IDF values aren’t significant enough?

DaveK · January 21, 2019, 6:28pm

You could try using a different classifier like Random Forest, which might work better than Naive Bayes here.

Mark_Ortmann · January 22, 2019, 8:11am

Hey @josifrs

would it be possible to share your updated workflow containing the training1.txt data with us (execute your workflow, save it, and uncheck reset workflow in the export dialog)?

josifrs · January 22, 2019, 2:58pm

Hey Mark,
here it is. Unfortunately it was too big to upload directly.
https://drive.google.com/open?id=1nPCti1oi44VUTupJ9oBoMrevz3q6K9F9

Mark_Ortmann · January 22, 2019, 5:26pm

@josifrs,

thanks. At first sight, the problem seems to be related to the fact that approx. 50% of the entries in the statistics table have a standard deviation equal to 0.
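
A minimal sketch of why a standard deviation of 0 is awkward for a Gaussian Naive Bayes model, assuming numeric attributes are modelled with a normal density per class. This is plain Python, not KNIME's implementation; `safe_gaussian_pdf` and the `min_std` floor of 1e-4 are illustrative assumptions.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution; undefined when std == 0."""
    exponent = -((x - mean) ** 2) / (2 * std ** 2)
    return math.exp(exponent) / (std * math.sqrt(2 * math.pi))

def safe_gaussian_pdf(x, mean, std, min_std=1e-4):
    """Same density with a floor on std (min_std is an illustrative value)."""
    return gaussian_pdf(x, mean, max(std, min_std))

# A term that never occurs in one class: all its values are 0 there,
# so the per-class statistics are mean = 0 and std = 0.
try:
    gaussian_pdf(0.0, mean=0.0, std=0.0)
    zero_std_ok = True
except ZeroDivisionError:
    zero_std_ok = False

print(zero_std_ok)                        # the raw density cannot be evaluated
print(safe_gaussian_pdf(0.0, 0.0, 0.0))   # very large density exactly at the mean
print(safe_gaussian_pdf(0.5, 0.0, 0.0))   # underflows to 0.0 away from the mean
```

With a floor on the standard deviation, the density is either huge or effectively zero, so a handful of such attributes can dominate the product of likelihoods for a class. That may be one way a model ends up predicting a single class, or no class at all.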

I’ll get back to you as soon as possible.

Mark_Ortmann · January 28, 2019, 1:43pm

@josifrs

sorry that it took me so long to get back to you.

While investigating your problem, we came across a corner case that KNIME’s Naive Bayes Predictor handles slightly differently from other implementations, and adapting this took some time (please download 3.7.1 once it is released, which should be very soon).

Anyway, every other Naive Bayes implementation I tested also always predicts the same class, or nothing at all. I’ll spare you the details.

You could try another classifier, as proposed by @DaveK, instead.

I’m very sorry that I can’t provide you with a proper solution to your problem.

josifrs · January 28, 2019, 3:32pm

@Mark_Ortmann

Okay, but I really thank you and @DaveK for investigating and trying to solve my problem.

So just one last question: which classifier would you recommend for this kind of problem? I tried SVM, but I guess there is too much data or my computer can’t handle it, because the predictor node aborts every time.

qqilihq · January 28, 2019, 4:02pm

For a no-brainer, I’d suggest giving the PalladianTextClassifier a try. You won’t need any preprocessing nodes. Just connect it to a table which provides two columns (document text and category, both as plain strings).

You can fine-tune the preprocessing and prediction settings directly in the node configuration. That means you get learning and prediction with basically two nodes.

https://nodepit.com/category/community/palladian/textclassifier

Explanation and some samples here:
https://www.knime.com/book/text-classifier

Sophisticated workflows with custom, domain-specific settings and more sophisticated classifiers (e.g. random forest, SVM, DL, …) will for sure give you more opportunities for fine-tuning (when you’re willing to invest time for that), but the Palladian classifier is definitely a good baseline.

PS: The internal algorithm of the PTC is basically Naive Bayes based. More details should be available at the link above and/or in the node documentation.
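
For readers who want to see what a Naive Bayes classifier working directly on text and category strings can look like, here is a minimal word-count (multinomial) sketch in plain Python. It is not the Palladian implementation, whose details are in its documentation; the function names, the whitespace tokenization, and the toy training data are all made up for illustration.

```python
import math
from collections import Counter, defaultdict

def train(docs):
    """docs: list of (text, category). Collect word and document counts per class."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for text, category in docs:
        class_counts[category] += 1
        word_counts[category].update(text.lower().split())
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return word_counts, class_counts, vocabulary

def predict(text, model):
    """Return the category with the highest log-probability score."""
    word_counts, class_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best, best_score = None, -math.inf
    for category in class_counts:
        total_words = sum(word_counts[category].values())
        score = math.log(class_counts[category] / total_docs)  # class prior
        for word in text.lower().split():
            # Laplace (add-one) smoothing keeps unseen words from zeroing the score.
            count = word_counts[category][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best, best_score = category, score
    return best

# Made-up toy data, only for illustration.
training = [
    ("the team won the match", "sports"),
    ("a late goal decided the game", "sports"),
    ("parliament passed the new law", "politics"),
    ("the minister announced an election", "politics"),
]
model = train(training)
print(predict("who won the game", model))
print(predict("a new election law", model))
```

Because it works on word counts with smoothing rather than on a Gaussian fit of TF*IDF values, this variant does not run into the zero standard deviation issue discussed earlier in the thread.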
