I have a text classification problem. Using preprocessing techniques combined with a Naive Bayes Learner, I get 96 % accuracy in the X-Validation. There everything looks fine and almost every row is classified correctly, BUT for my unseen (and very similar) data the predictor outputs only the most frequently occurring class, over and over again. What did I do wrong?
The input CSV file contains labeled as well as unlabeled rows and looks like this:
Unique ID # Column with some text # Class / "?" or unlabeled datasets
1 #This is a test #TESTLABEL
2 #Description of a topic # TOPIC
3 #Another test #TESTLABEL
4 #Similar Topic # ? (unlabeled, but should be predicted as class "TOPIC")
First I import the file, convert the strings to text, create a bag of words, do some preprocessing, calculate the absolute term frequency, and map the category to the class.
After that I create a document vector and use the "Document Data Extractor" to add the class name from the input CSV.
Up to that point, the labeled and the unlabeled data are processed by the same workflow. Two row filters then separate the training data from the unlabeled data.
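To make the split concrete outside of KNIME, here is a minimal Python sketch of what the two row filters are meant to do. It assumes the "#"-separated layout from the example and "?" as the marker for a missing class; the sample rows and the function names parse_rows and split_rows are made up for illustration and are not part of the workflow.

```python
# Hypothetical sample rows in the "#"-separated layout from the example.
RAW = """1 #This is a test #TESTLABEL
2 #Description of a topic # TOPIC
3 #Another test #TESTLABEL
4 #Similar Topic # ?"""

def parse_rows(raw):
    """Turn each non-empty line into an (id, text, label) tuple."""
    rows = []
    for line in raw.splitlines():
        if not line.strip():
            continue
        # At most two splits, so the line yields exactly three fields.
        doc_id, text, label = (part.strip() for part in line.split("#", 2))
        rows.append((doc_id, text, label))
    return rows

def split_rows(rows, missing="?"):
    """Mimic the two row filters: labeled rows vs. rows marked as missing."""
    labeled = [r for r in rows if r[2] != missing]
    unlabeled = [r for r in rows if r[2] == missing]
    return labeled, unlabeled

rows = parse_rows(RAW)
labeled, unlabeled = split_rows(rows)
print(len(labeled), "labeled,", len(unlabeled), "unlabeled")
```

The point of splitting only after the shared preprocessing is that both parts end up with the same term columns.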
The training data is then fed into the Naive Bayes Learner and the unlabeled data into the Naive Bayes Predictor. The labeled data is also fed into an X-Validation loop in which the same Naive Bayes Learner and Predictor are used. The resulting score is 96 % accuracy.
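For comparison, this is roughly the learner/predictor step I expect, written as a small pure-Python sketch. It is a multinomial Naive Bayes with add-one smoothing on raw term counts, which is an assumption on my side and not necessarily what the KNIME nodes compute internally; the names tokenize, train and predict and the training rows are hypothetical.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    # Very rough stand-in for the preprocessing: lowercase and split on spaces.
    return text.lower().split()

def train(docs):
    """docs: list of (text, label). Returns class counts, term counts, vocabulary."""
    class_counts = Counter()
    term_counts = defaultdict(Counter)
    vocab = set()
    for text, label in docs:
        class_counts[label] += 1
        for tok in tokenize(text):
            term_counts[label][tok] += 1
            vocab.add(tok)
    return class_counts, term_counts, vocab

def predict(model, text):
    """Return the class with the highest log posterior for one document."""
    class_counts, term_counts, vocab = model
    n_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        total = sum(term_counts[label].values())
        score = math.log(n / n_docs)  # class prior
        for tok in tokenize(text):
            if tok not in vocab:
                continue  # terms never seen in training carry no evidence
            # Add-one (Laplace) smoothing over the training vocabulary.
            score += math.log((term_counts[label][tok] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

training = [
    ("This is a test", "TESTLABEL"),
    ("Description of a topic", "TOPIC"),
    ("Another test", "TESTLABEL"),
]
model = train(training)
print(predict(model, "Similar Topic"))  # prints "TOPIC"
```

Note that a document whose terms are all unknown to the model is scored by the class prior alone in this sketch, so it would get the most frequent class.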
Unfortunately, as I said, ALL unlabeled rows are assigned the most frequently occurring class.
In the example above that would be "TESTLABEL" for thousands of rows, although it should be easy to label such a row as, for example, class "TOPIC".
What did I do wrong? What is the minimum workflow required to achieve this task? Can I get rid of some nodes in order to make it work?
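To put numbers on the symptom, here is a small Python sketch that compares the share of the most frequent class in the training labels with the shares of the predicted classes. The label lists are hypothetical placeholders, not my real data, and the function names are made up.

```python
from collections import Counter

def majority_baseline(train_labels):
    """Most frequent training class and the accuracy of always predicting it."""
    counts = Counter(train_labels)
    label, n = counts.most_common(1)[0]
    return label, n / len(train_labels)

def prediction_shares(predicted):
    """Fraction of the predictions that fall into each class."""
    counts = Counter(predicted)
    total = len(predicted)
    return {label: n / total for label, n in counts.items()}

# Hypothetical placeholder labels, only to show the shape of the check.
train_labels = ["TESTLABEL", "TOPIC", "TESTLABEL"]
predicted = ["TESTLABEL", "TESTLABEL", "TESTLABEL", "TESTLABEL"]

print(majority_baseline(train_labels))
print(prediction_shares(predicted))
```

If a single class takes a share of 1.0 of the predictions while the training labels are mixed, the predictor is effectively ignoring the terms in the unlabeled rows.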
Many thanks in advance