A question about weighting...

Hi everybody

I want to classify some given sentences; this is a supervised learning project. The sentences are assigned to predefined categories. I have applied all the usual routines, but I couldn't get accuracies above 60%, which seems a little low!

There is one point: the sentences are not evenly distributed across the predefined categories, e.g. 59 sentences are in one category, 579 in another, 283 in a third, and so on.

What do you think about this case? Don't these different densities affect the classification accuracy? And if they do, I want to ask whether there is any way to weight my categories before supervised classification.

Thanks in advance

Hi,

classification of small pieces of text, such as sentences, can be really difficult. Since the text pieces are small, useful features are likely to be missing, i.e. words that occur in (almost) all sentences of one category but not in sentences of the other categories.

Have you tried to find out more about the distribution of words? Are there words that occur in most of the sentences? Are there words that occur (almost) only in sentences of certain categories? Which sentences are classified wrongly, and which words do they contain?

Maybe character n-grams would be good features on top of the word features.
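To illustrate the idea outside of KNIME: a character n-gram is just a sliding window of n characters over the text, so even a very short sentence yields many overlapping features. A minimal pure-Python sketch (the helper name `char_ngrams` is made up for this example, not a KNIME node):

```python
def char_ngrams(text, n=3):
    """Return all character n-grams of a string (lowercased)."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# A single 8-character word already produces six 3-gram features.
print(char_ngrams("classify"))  # ['cla', 'las', 'ass', 'ssi', 'sif', 'ify']
```

Because these grams overlap, two sentences sharing only parts of words (e.g. different inflections) can still share features, which is exactly what short texts often lack at the word level.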

The imbalanced class distribution is another problem. There is not really a way to weight classes in the learner nodes. However, you could oversample the data points assigned to underrepresented classes.

Cheers, Kilian

Hi,

oversampling means replicating certain data points in the data set to increase their "weight" or importance. Heavily imbalanced data sets (in terms of class distribution) can be difficult to classify. Learners may "ignore" data points of underrepresented classes due to their bias and focus only on the majority class.
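In KNIME this replication would be wired up with sampling nodes; purely as an illustration of what random oversampling does, here is a small Python sketch (the function `oversample` is invented for this example):

```python
import random

def oversample(rows, labels, seed=0):
    """Randomly replicate rows of minority classes until every
    class has as many rows as the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for row, lab in zip(rows, labels):
        by_class.setdefault(lab, []).append(row)
    target = max(len(group) for group in by_class.values())
    out_rows, out_labels = [], []
    for lab, group in by_class.items():
        # Keep all original rows, then draw random duplicates to fill up.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        out_rows += group + extra
        out_labels += [lab] * target
    return out_rows, out_labels

rows = ["s1", "s2", "s3", "s4", "s5"]
labels = ["A", "A", "A", "A", "B"]
new_rows, new_labels = oversample(rows, labels)
print(new_labels.count("A"), new_labels.count("B"))  # 4 4
```

After oversampling, both classes contribute the same number of rows, so the learner can no longer gain accuracy by simply predicting the majority class everywhere.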

Attached is an example workflow (with your data) in which one Tree Ensemble model is built on the original data set (62% accuracy, with 1- and 2-gram features) and another is built on an oversampled data set (86% accuracy, with 1- and 2-gram features).

To load and execute the workflow, the Tree Ensemble extension (Labs) is required.

Cheers, Kilian


Dear Dr. Kilian, 

Hi

Many thanks for your helpful information and tips.

I completely understand the difficulties of sentence classification, and now I know the reason for the low accuracy!

I tried the N-gram node, but nothing changed. I looked at the example workflow using the N-gram strategy, but it doesn't work in my case.

I want to ask about the correct order of nodes I should use to work with the N-gram node.

Regarding the second part of your answer, I'm a bit confused and have no idea how to do that. So I want to ask again whether there is an example oversampling workflow that can show me the correct order of nodes.

And my last question is about the decision tree ensemble: using its learner and predictor nodes, I get better accuracy than with a simple decision tree, and I want to ask about the reason for this difference.

In addition, my workflow and data are attached to provide more accurate information.

Thanks in advance