Classification gives wrong output

singing_bird · November 29, 2016, 8:10pm

Hi all,

I have a dataset that needs to be classified.

How can I classify the data?

I put the training and testing sets together, used the "Partitioning" node, and then used the Bayes learner and predictor.

The result is that all the testing data are classified into one class, which is not the desired result.

I would like to know the right way, or another way, to classify the data.

Thanks in advance.

qqilihq · November 29, 2016, 8:35pm

If you have separate training and testing sets, there's no need to combine them and split them again.

In case your prediction is not as expected, examine the created model and whether all your relevant features were considered.

If you have a highly imbalanced dataset, you probably need to resample or define a custom threshold, depending on your needs ... but that's just guessing and cannot be answered based on your information.
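To check whether imbalance could be the issue, it helps to compare the classifier's accuracy against a majority-class baseline. The following is a minimal Python sketch, not a KNIME workflow; the label list is invented for illustration and is not the poster's data.

```python
from collections import Counter

# Hypothetical class labels (invented for illustration only).
labels = ["class1"] * 8 + ["class2"] * 1 + ["class3"] * 1

# Count how many rows belong to each class.
counts = Counter(labels)

# Accuracy of a trivial model that always predicts the most frequent class.
majority_label, majority_count = counts.most_common(1)[0]
baseline_accuracy = majority_count / len(labels)

for label, count in sorted(counts.items()):
    print(f"{label}: {count} rows ({count / len(labels):.0%})")
print(f"always predicting {majority_label!r} gives {baseline_accuracy:.0%} accuracy")
```

If a classifier's accuracy is close to this baseline and it only ever predicts one class, it has probably learned little beyond the class frequencies.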

singing_bird · November 30, 2016, 10:36am

Can you please give me an example that produces an accurate classification result?

I need a sample training and testing set, and I will adjust my dataset to obtain an accurate classification result.

qqilihq · November 30, 2016, 12:00pm

A totally different example will likely not be helpful, but you may want to look at the various examples provided by KNIME. Instead, I suggest you post your workflow and highlight where you're having issues.

singing_bird · November 30, 2016, 1:19pm

/files/world/wiki/class0.png

This is the workflow. It gives me 43.75% accuracy.

singing_bird · November 30, 2016, 4:30pm

The problem is that the classifier assigns the whole testing set to only one of the 3 classes, which gives inaccurate results.

I have watched videos and tutorials, but I don't know why the results are inaccurate.

marco_ghislanzoni · November 30, 2016, 4:54pm

Dear Singing Bird,

Please understand that it is really really difficult for anyone to help further if you don't share your workflow (Export KNIME Workflow...), so we can look at your node configuration, and your data. Sharing a picture of your workflow, especially if it fully executes like in your case and there are no evident errors, does not help much.

As already pointed out by Philipp above, more often than not some characteristics of the data, together with the way the training/test sets were chosen out of them, can explain the low accuracy of a predictor.

This is why it is always a good idea to plot your data in many different ways and look at them straight in the face before attempting any further step.

Cheers,
Marco.

singing_bird · November 30, 2016, 6:25pm

This is my workflow.

Thanks so much for your help, Mr. Philipp and Mr. Marco.

knime-export.knar

marco_ghislanzoni · December 1, 2016, 2:34pm

Thanks but your workflow doesn't contain the data set. This makes it impossible to test.

Cheers,
Marco.

singing_bird · December 1, 2016, 8:24pm

This is the data set.

It is small.

training_data_keywords_xls.xlsx

marco_ghislanzoni · December 2, 2016, 1:58pm

Hi,

having looked at your data and at your original workflow, it is now clear to me that you are missing some crucial data preparation steps for what you are trying to achieve.

You cannot simply submit a bunch of comma-separated keywords as strings to the learner/predictor and hope they will work correctly. The learner will treat each whole string as a single value of one feature, so there is not enough variation and there are not enough distinct features to base any discrimination on. It is therefore not surprising that your classifier (predictor) has a low accuracy and behaves almost randomly. It doesn't have much to work on!

The right way to do it is to first assign each keyword to its own feature, then pass the resulting feature vector on to the learner/predictor. Let me explain in detail with an example.

Assume you have only 3 articles, each with the following keywords and class:

1: AAA, BBB, CCC --> class 1

2: BBB, DDD --> class 2

3: AAA, BBB, EEE --> class 3

You have 5 possible distinct values for the keywords (AAA, BBB, CCC, DDD, EEE), so the feature vector will have dimension 5 and will look like this for each article:

1: 1,1,1,0,0 --> class 1

2: 0,1,0,1,0 --> class 2

3: 1,1,0,0,1 --> class 3

Where 1 indicates the presence of that keyword in association with the article, 0 its absence.
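The encoding above can be sketched in a few lines of Python. This is only an illustration of the idea, not a KNIME node configuration; the helper name `to_feature_vector` is made up for this example.

```python
# Keyword lists for the three example articles from the post.
articles = {
    1: ["AAA", "BBB", "CCC"],
    2: ["BBB", "DDD"],
    3: ["AAA", "BBB", "EEE"],
}

# The vocabulary is the sorted set of all distinct keywords.
vocabulary = sorted({kw for kws in articles.values() for kw in kws})


def to_feature_vector(keywords, vocabulary):
    """Return a 0/1 list: 1 if the vocabulary keyword is present, else 0."""
    present = set(keywords)
    return [1 if kw in present else 0 for kw in vocabulary]


# One feature vector per article, in vocabulary order.
vectors = {aid: to_feature_vector(kws, vocabulary) for aid, kws in articles.items()}

for aid, vec in vectors.items():
    print(aid, vec)
# 1 [1, 1, 1, 0, 0]
# 2 [0, 1, 0, 1, 0]
# 3 [1, 1, 0, 0, 1]
```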

Now you can use the feature vector as input to a learner node, together with the class. The node will "learn" the association of each specific value of the feature vector to each specific class. With that classification model, you can then pass a new data set, in which the keywords of each article are also expressed as a feature vector, through a classifier (predictor) to predict the class of each new article.
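To make the learner/predictor step concrete, here is a small sketch of a Bernoulli naive Bayes classifier in plain Python. It is an assumption-laden illustration, not KNIME's implementation: the function names, the Laplace smoothing, and the "new article" are all chosen for this example.

```python
import math
from collections import defaultdict


def train_bernoulli_nb(vectors, labels):
    """Estimate class priors and Laplace-smoothed keyword probabilities per class."""
    n_features = len(vectors[0])
    counts = defaultdict(int)                     # number of rows per class
    ones = defaultdict(lambda: [0] * n_features)  # per class: how often each feature is 1
    for vec, label in zip(vectors, labels):
        counts[label] += 1
        for i, value in enumerate(vec):
            ones[label][i] += value
    model = {}
    for label, n in counts.items():
        prior = n / len(labels)
        # Add-one smoothing keeps every probability strictly between 0 and 1.
        probs = [(ones[label][i] + 1) / (n + 2) for i in range(n_features)]
        model[label] = (prior, probs)
    return model


def predict(model, vec):
    """Return the class with the highest log-probability for a 0/1 vector."""
    best_label, best_score = None, -math.inf
    for label, (prior, probs) in model.items():
        score = math.log(prior)
        for value, p in zip(vec, probs):
            score += math.log(p if value else 1.0 - p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Feature vectors and classes from the example above.
train_vectors = [[1, 1, 1, 0, 0], [0, 1, 0, 1, 0], [1, 1, 0, 0, 1]]
train_labels = ["class 1", "class 2", "class 3"]
model = train_bernoulli_nb(train_vectors, train_labels)

# Hypothetical new article with keywords BBB and DDD.
new_article = [0, 1, 0, 1, 0]
print(predict(model, new_article))  # class 2
```

With only one training article per class this is a toy, but it shows the split between a learning step that produces a model and a prediction step that applies it to new feature vectors.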

It is important to note that the feature vector has to be built on the whole data set (learning + test set); otherwise it may be incomplete.
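A short sketch of what "incomplete" means here, again in plain Python. The test article and its keyword FFF are invented for illustration, and `build_vocabulary` and `encode` are hypothetical helper names.

```python
train_keywords = [["AAA", "BBB", "CCC"], ["BBB", "DDD"], ["AAA", "BBB", "EEE"]]
test_keywords = [["BBB", "FFF"]]  # hypothetical test article; FFF never occurs in training


def build_vocabulary(rows):
    """Sorted list of all distinct keywords in the given rows."""
    return sorted({kw for row in rows for kw in row})


def encode(row, vocabulary):
    """0/1 vector marking which vocabulary keywords occur in the row."""
    present = set(row)
    return [1 if kw in present else 0 for kw in vocabulary]


train_only = build_vocabulary(train_keywords)
full = build_vocabulary(train_keywords + test_keywords)

# Keywords that a training-only vocabulary cannot represent.
missing = [kw for kw in full if kw not in train_only]
print(missing)                               # ['FFF']

print(encode(test_keywords[0], train_only))  # [0, 1, 0, 0, 0]     (FFF is silently dropped)
print(encode(test_keywords[0], full))        # [0, 1, 0, 0, 0, 1]
```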

With this in mind, you should now know how to modify your workflow to get a higher prediction accuracy. Even with such a limited dataset you should be able to reach around 90%.

Feel free to post here again if you get stuck.

Cheers,
Marco.

singing_bird · January 10, 2017, 7:21pm

Thank you so much, Mr. Marco, for your help.

It works well.

Thank you.

I have another question, please.

I want to apply classification to the results of association rules.

For example:

machine learning, data mining, information retrieval >>> class1

social network, network security >>> class2

I want to treat "machine learning" as one term, not as "machine" and "learning".

I tried the previous solution, but it computes the vector for each word separately, which is not what I need.

I want to compute the vector for "machine learning" as a whole.

How can I do this?
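To illustrate the requirement, here is a minimal Python sketch of the desired behavior, assuming the keywords are always separated by commas: the text is split on commas only, so each multi-word phrase becomes one feature. It is not a KNIME node configuration, and `split_keywords` is a hypothetical helper name.

```python
rows = [
    ("machine learning, data mining, information retrieval", "class1"),
    ("social network, network security", "class2"),
]


def split_keywords(text):
    """Split on commas only, so 'machine learning' stays a single term."""
    return [part.strip() for part in text.split(",") if part.strip()]


# Vocabulary of whole phrases, not individual words.
vocabulary = sorted({kw for text, _ in rows for kw in split_keywords(text)})
print(vocabulary)

for text, label in rows:
    present = set(split_keywords(text))
    vector = [1 if kw in present else 0 for kw in vocabulary]
    print(vector, label)
# [1, 1, 1, 0, 0] class1
# [0, 0, 0, 1, 1] class2
```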