Document Classification

Hey everybody,

I am having a little trouble understanding the document classification workflow on the example server.

Is there a difference between document class and category?

In my case I have about 8 topics. For every topic I have about 25 test documents where the category is known.

Now I have many other documents where the category is unknown, so I want the workflow to assign each document to the right category.

I already used the example workflow and I tried 2 categories first (like the example). Everything worked. But what next? Where do I put the documents with the unknown category now?

I thought the example workflow from the server is just the "learner", but now I want to predict. I thought I just needed to feed the documents into the second input of the Decision Tree Predictor node (where the Partitioning node is connected in the example workflow), but it didn't work.

 

Maybe I don't understand the workflow correctly?

 

I am looking forward to hearing from you soon! :)

 

Thank you!

Vanessa

 

Hi Vanessa,

you need to preprocess the second set of documents (unlabeled) the same way as the first set (labeled, used for training). Also you need to create the document vectors the same way in order to get the same feature space. The easiest way to do this is to first create only one set of documents, labeled and unlabeled, preprocess this set, create the vectors and then split it up into labeled data (for training and testing) and unlabeled for prediction.
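As a rough illustration of why the combined set matters, here is a minimal plain-Python sketch (not a KNIME workflow). The toy documents, the stop word list and the helper names `preprocess` and `to_vector` are all made up for this example. The point is only that the vocabulary is built once over all documents, so labeled and unlabeled vectors end up with identical columns:

```python
from collections import Counter

# Hypothetical toy corpus: (text, label); label is None for unlabeled documents.
documents = [
    ("the engine and the brakes of the car", "cars"),
    ("a new car engine was presented", "cars"),
    ("the team won the football match", "sports"),
    ("football players train before the match", "sports"),
    ("the car needs new brakes", None),
]

# Assumed stop word list, only for this example.
STOP_WORDS = {"the", "a", "and", "of", "was", "before"}


def preprocess(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [token for token in text.lower().split() if token not in STOP_WORDS]


# 1. Preprocess the combined set (labeled and unlabeled together).
tokenized = [preprocess(text) for text, _ in documents]

# 2. Build one vocabulary so that every vector has the same columns.
vocabulary = sorted({term for tokens in tokenized for term in tokens})


def to_vector(tokens):
    """Term-count vector over the shared vocabulary."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


vectors = [to_vector(tokens) for tokens in tokenized]

# 3. Only now split into labeled (training/testing) and unlabeled (prediction).
labeled = [(vec, label) for vec, (_, label) in zip(vectors, documents) if label is not None]
unlabeled = [vec for vec, (_, label) in zip(vectors, documents) if label is None]
```

If the two sets were vectorized separately, each would get its own vocabulary, and a model trained on one could not be applied to the other.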

Does this help?

Cheers, Kilian

Hi Kilian,

thank you for your help - it worked! :-) I tried it with 3 sets of labeled data (approx. 25 documents per set) and one small set of unlabeled data, so that I could check whether the prediction is right. Unfortunately the prediction is not always correct.

What can I do to optimize the learning part to get correct predictions? Is increasing the training data the only option, or is there another way to get better predictions?

 

Another question: is it possible to select individual words and give them a higher weight? For example, if a special word appears, the document always belongs to a specific category?

 

Thank you, everybody!

 

Cheers, Vanessa

Hi Vanessa,

if your model is always correct on your test data (100% accuracy), I would be very suspicious about that ;-). To optimize your models you can, for example, try different models / learners. I recommend the Tree Ensemble node. With this node you can create ensembles of decision trees, including boosting and bagging. You could also start to optimize the parameters of the learner you are using. What model / learner are you currently using?

Weighting features for a learner node is not possible. You could do it indirectly by oversampling specific data records.
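To make the oversampling idea concrete, here is a small plain-Python sketch. The function name `oversample`, the keyword and the factor are hypothetical and only serve the illustration. Records whose text contains the chosen word are duplicated, so the learner sees them more often:

```python
def oversample(records, keyword, factor=3):
    """Return a new list in which every record whose text contains
    `keyword` appears `factor` times; all other records appear once."""
    result = []
    for text, label in records:
        copies = factor if keyword in text.lower().split() else 1
        result.extend([(text, label)] * copies)
    return result


# Hypothetical training records: (text, label).
training = [
    ("invoice for the last order", "billing"),
    ("question about my order", "support"),
    ("please send the invoice again", "billing"),
]

boosted = oversample(training, keyword="invoice", factor=3)
```

This should only be applied to the training partition. Duplicating records before splitting would put copies of the same document into both training and test data and make the accuracy look better than it is.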

Cheers, Kilian

 

Hi Kilian, thank you very much for your help.

I tried the Decision Tree Learner first. It works quite well for the moment.

 

I have another question:

Consider the following scenario: I have 4 categories, and the predictor should assign each document to the relevant category. But a document does not necessarily fit one of these categories. Is it possible to set a minimum similarity value, so that a document below it isn't assigned to a specific category? The category would then be something like "other topics".

 

Thank you for your help! :)

 

Cheers, Vanessa

Hi Vanessa,

some of the predictor nodes can append a column that contains the confidence of the prediction. You can filter by this confidence and assign a "don't know" class if the confidence is below a specific value. You can use the Rule Engine node for this.
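The logic of such a rule can be sketched in plain Python as follows. This is not Rule Engine syntax, and the threshold of 0.6, the function name `apply_threshold` and the sample predictions are assumptions for illustration only:

```python
def apply_threshold(predictions, min_confidence=0.6, fallback="other topics"):
    """Keep the predicted label if its confidence reaches `min_confidence`;
    otherwise replace it with the fallback class."""
    return [
        label if confidence >= min_confidence else fallback
        for label, confidence in predictions
    ]


# Hypothetical predictor output: (predicted label, confidence).
predictions = [("cars", 0.92), ("sports", 0.41), ("politics", 0.60)]

final = apply_threshold(predictions)
# final is ["cars", "other topics", "politics"]
```

A sensible threshold depends on the model and the data, so it is worth checking on labeled test documents how many correct predictions a given value would discard.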

Cheers, Kilian