Workflow problem: Naive Bayes doesn't predict despite 96% model accuracy

Hello everyone,

I have a text classification problem and used preprocessing techniques combined with a Naive Bayes Learner to get 96% accuracy in the X-Validation. In the X-Validation everything looks fine and almost everything is classified correctly, BUT for my unseen (and very similar) data it outputs only the most frequently occurring class over and over again. What did I do wrong?

The input CSV file contains labeled as well as unlabeled rows and looks like this:

Unique ID # Column with some text # Class, or "?" for unlabeled rows
1 # This is a test         # TESTLABEL
2 # Description of a topic # TOPIC
3 # Another test           # TESTLABEL
4 # Similar Topic          # ? (unlabeled, but should be predicted as class "TOPIC")

First I import the file, convert the strings to text, create a bag of words, do some preprocessing, calculate the absolute term frequency, and define the category as the class.

After that I create a document vector and use the "Document Data Extractor" to add the class name from the input CSV.

Up to that point, the labeled and the unlabeled data are processed by the same workflow. Two row filters then separate the training data from the unlabeled data.
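
For anyone who wants to check the split outside KNIME, here is a minimal Python sketch of the two row filters. It is only an illustration: the `#` separator and the `?` marker are taken from the sample above, the trailing annotation on row 4 is left out, and the names `parse_rows` and `split_rows` are made up for this example.

```python
# Sample rows in the "ID # text # class" layout shown above (assumed format).
SAMPLE = """\
1 # This is a test # TESTLABEL
2 # Description of a topic # TOPIC
3 # Another test # TESTLABEL
4 # Similar Topic # ?
"""

def parse_rows(text):
    """Turn each non-empty line into a dict with id, text and label."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        row_id, doc, label = (part.strip() for part in line.split("#"))
        rows.append({"id": int(row_id), "text": doc, "label": label})
    return rows

def split_rows(rows, unlabeled_marker="?"):
    """Mimic the two row filters: labeled rows for training, the rest for prediction."""
    labeled = [r for r in rows if r["label"] != unlabeled_marker]
    unlabeled = [r for r in rows if r["label"] == unlabeled_marker]
    return labeled, unlabeled

rows = parse_rows(SAMPLE)
labeled, unlabeled = split_rows(rows)
```

Comparing the row IDs in `labeled` and `unlabeled` against the input file is a quick way to confirm that every row ends up in exactly one of the two branches.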

The training data is then fed into the Naive Bayes Learner and the unlabeled data into the Naive Bayes Predictor. The labeled data is also fed into an X-Validation loop in which the same Naive Bayes Learner and Predictor are used. The resulting score is 96% accuracy.
Unfortunately, as I said, ALL unlabeled rows are assigned the most frequently occurring class.
In the example above that would be "TESTLABEL" for thousands of rows, although it should be an easy task to label row 4, for example, as class "TOPIC".
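
To make the expected behaviour concrete, here is a small multinomial Naive Bayes sketch in plain Python, trained on the three labeled sample rows. It is not the KNIME node's implementation; the whitespace tokenizer, the add-one smoothing and the choice to skip words unseen in training are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    # Assumed preprocessing: lowercase and split on whitespace.
    return text.lower().split()

def train(docs):
    """docs is a list of (text, label) pairs; returns the counts the predictor needs."""
    class_docs = Counter()              # number of documents per class
    word_counts = defaultdict(Counter)  # term frequencies per class
    vocab = set()
    for text, label in docs:
        class_docs[label] += 1
        for tok in tokenize(text):
            word_counts[label][tok] += 1
            vocab.add(tok)
    return class_docs, word_counts, vocab

def predict(model, text):
    """Return the class with the highest log prior plus log likelihood."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    scores = {}
    for label, n_docs in class_docs.items():
        total_terms = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)  # class prior
        for tok in tokenize(text):
            if tok not in vocab:
                continue  # words never seen in training carry no evidence
            # Add-one (Laplace) smoothed term probability.
            score += math.log((word_counts[label][tok] + 1) / (total_terms + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

training = [
    ("This is a test", "TESTLABEL"),
    ("Description of a topic", "TOPIC"),
    ("Another test", "TESTLABEL"),
]
model = train(training)
```

In this sketch `predict(model, "Similar Topic")` picks "TOPIC" because the word "topic" was seen in that class. If a document shares no terms at all with the training vocabulary, only the class prior is left and the most frequent class wins. That is one way the symptom described above can appear in a toy setup like this, though it is not necessarily what happened in the KNIME workflow.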

What did I do wrong? What is the minimum workflow required to achieve this task? Can I get rid of some nodes in order to make it work?

Many thanks in advance

Stefan

Hi Stefan, 

That certainly looks suspect.  Can you post an example of your workflow or at least a screenshot of the contents of your cross validation node?

Aaron

Hi Aaron,
thanks a lot for your reply. Out of desperation, after days of disappointing trial and error without finding the error, I finally deleted the complete workflow, added an ID to my input data, and created the whole workflow again, and it worked.

I think KNIME is a genius piece of open source software with endless power packed into it, but there are definitely some rabbit holes you can easily get lost in for several days if something doesn’t work in such highly developed expert software. Figuring out why something doesn’t work as expected (especially wrong data types) is really hard. For example, the data that works with the KNIME Naive Bayes doesn’t work with the Weka Naive Bayes.
It reported that it cannot handle string attributes. As recommended in the forum, I added a Domain Calculator and added ALL columns (including Document and Document class) to the include list. I removed the restriction on the number of possible values and let it run.

Now the learner finished without an error message, but the Weka predictor stated: "WARN Config Could not write DataCell: ""
After this it ends with ERROR 7) Unable to clone data at port 1 (Weka model): null

That’s something I don’t understand at the moment, but I hope that after buying some books and discovering more of KNIME this will get better one day.

Stefan

Hello everyone. Can I train my learner on a complete folder containing a list of files? I actually want to build one from the bytes of the files. Is there a way to just give it the folder path and have it fetch all the files from it?