Document classification: 50% classified as pos/neg and 50% unclassified -> how to predict the rest?

Dear KNIME community,

I am new to this forum and looking for help. This is my very first post.

Here is a short description of what I want to build:

I have 7,000 posts from social media in an Excel file. I added a column called "issue".

I read 3,500 posts and classified the ones I considered relevant (there is an issue) as "positive" and the ones that are not relevant (no issue) as "negative".

So I built a whole workflow for the classified posts (3,500) and found that the SVM gave better results than the decision tree or k-nearest neighbour.

The other 3,500 posts I didn't classify, because I thought the system would learn from the classified 3,500 posts and predict whether there is an issue or not.

BUT:

If I include the other 3,500 unclassified posts, they get the class "undefined", and the system treats that as a class too.

So the confusion matrix shows positive, negative and undefined values.

I want the system to apply what it has learned from the classified data to the undefined data.
Does anyone know which nodes I need to include?

Kind regards, Torsten

You need to apply a Row Splitter before learning and predicting: the classified rows go to the Learner node, the undefined rows go to the Predictor node. There is also a rule-based Row Splitter if you prefer to define the split on the fly.
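Outside KNIME, the same logic looks roughly like this in Python - a minimal sketch, assuming a pandas DataFrame where the "issue" column is empty for the unlabelled rows (file and column names are made up for illustration):

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# hypothetical file and column names, for illustration only
df = pd.read_excel("posts.xlsx")

# the "Row Splitter": rows with a label go to the learner,
# rows without a label go to the predictor
labelled = df[df["issue"].notna()]
unlabelled = df[df["issue"].isna()].copy()

# learn from the classified half only
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(labelled["text"])
model = LinearSVC().fit(X_train, labelled["issue"])

# apply the learned model to the undefined half
X_pred = vectorizer.transform(unlabelled["text"])
unlabelled["issue"] = model.predict(X_pred)
```

That way the confusion matrix only ever sees positive and negative, never "undefined".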

You also need to take care that the term selection (including TF calculation) is only performed on the labelled observations - so you might need an additional Row Splitter in the middle of your workflow, after the stemming. Otherwise, terms from the undefined set risk leaking into the term dictionary.
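In code terms the same precaution amounts to fixing the term dictionary on the labelled documents and reusing it unchanged for the rest - a sketch reusing the hypothetical `labelled`/`unlabelled` frames from above:

```python
from sklearn.feature_extraction.text import CountVectorizer

# build the term dictionary from the labelled documents only
train_vectorizer = CountVectorizer(min_df=2)   # term selection happens here
train_vectorizer.fit(labelled["text"])
term_dictionary = train_vectorizer.vocabulary_

# reuse that fixed dictionary for the undefined documents;
# terms that occur only in the undefined set never enter the feature space
pred_vectorizer = CountVectorizer(vocabulary=term_dictionary)
X_pred = pred_vectorizer.transform(unlabelled["text"])
```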

Why not give Naive Bayes a try? It usually performs well on sparse data and is at the very least a good baseline to compare against. For k-NN, I'm curious whether you applied it to the text column (with a string distance function) or to the sparse numeric features (with a numeric distance function). While k-NN applied to the single string column yields good results, its performance may suffer when applied to too many numeric features (-> curse of dimensionality). Finally, a single decision tree may not provide the big aha moment in text classification, but tree ensembles (e.g. random forest) are worth a try.
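For a quick comparison, a cross-validated Naive Bayes baseline takes only a few lines - again just a sketch, reusing the hypothetical objects from the snippets above:

```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

# cross-validated Naive Bayes baseline on the labelled half only
X = train_vectorizer.transform(labelled["text"])
y = labelled["issue"]
scores = cross_val_score(MultinomialNB(), X, y, cv=5)
print(f"Naive Bayes accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```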

Dear Geo,

Thanks for your help.

Your first piece of advice I have already put into practice, and I now understand the confusion matrix better. So it's working.

Your second idea is great! It is exactly what I wanted, but it doesn't work yet. The issue is the Document Vector node: the classified documents end up with different vector columns and output than the unclassified ones. That is logical, but none of the predictors like it, and they don't run because of the columns they learned from the classified documents. There is no setting to ignore columns that don't exist.

Does anyone know how to solve that?

 

Here is what you'll need to do:

- the grouped-by-term table (which contains the total TF by term -> the term dictionary) should serve as input for the prediction set: the easiest way is to transform your prediction set into a BoW (after stemming), then perform a Right Outer Join (right = your prediction set) on the Term column, keep only the terms that matched the term table, calculate whatever term weight you need and convert to a Document vector;

- in parallel, convert the term dictionary into a reference column list that you can use as input for the bottom port of Table Validator (Reference). In that node, check for column existence and insert missing columns, and connect the above Document vector to the top port. This way you ensure that every term column from your training set also exists in the prediction set. Recode the missing values in the newly inserted term columns to 0 using the Missing Value node;

Now the prediction set will have the same feature structure as the training set.
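If you want to verify the idea outside KNIME, here is a tiny pandas sketch of the same alignment, with made-up toy data:

```python
import pandas as pd

# toy document vectors: rows = documents, columns = term frequencies
train_df = pd.DataFrame({"outage": [1, 0], "delay": [0, 2], "refund": [1, 1]})
pred_df  = pd.DataFrame({"delay": [1, 0], "refund": [0, 3], "weather": [2, 0]})

# keep only the training terms, insert the missing ones, fill them with 0
# (the pandas equivalent of Table Validator + Missing Value)
pred_aligned = pred_df.reindex(columns=train_df.columns, fill_value=0)
print(pred_aligned)
#    outage  delay  refund
# 0       0      1       0
# 1       0      0       3
```

Note how "weather", which only occurs in the prediction set, is dropped, while "outage" is inserted and filled with 0.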

Here is an example workflow showing how to adjust the feature space from the training set for the test set. Make sure to apply the same preprocessing chain to both document sets. I hope this helps.

Cheers, Kilian