Classification with low signal (dummy variables and variable selection)

I need some help with a classification problem with low signal.

Case: I’ve got ~9,000 observations and a signal of ~3.3% (300 cases).

Most of my variables (>20) are categorical and come from a quiz, so the values are YES, NO, or UNKNOWN when there is no answer.

The best-filled variables are up to 65% complete, but most of them are less than 5% complete.

Is it a good idea to:

-Use a default label UNKNOWN for NA values (I plan to use a ColumnLoop and a RuleEngineNode)?

-Use dummy variables (node OneToMany, giving 2 booleans for each variable)?

-Use LowVarianceFilterNode to remove the least-filled variables?

-Use the CorrelationFilterNode to remove variables that are too highly correlated with others?
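
As a rough illustration of the first two steps (not the KNIME nodes themselves), here is a minimal plain-Python sketch. The toy rows, the column names `q1`/`q2`, and the helper names `fill_unknown` and `one_hot` are made up for the example.

```python
# Hypothetical toy data: one dict per respondent, None marks an unanswered question.
rows = [
    {"q1": "YES", "q2": None},
    {"q1": "NO", "q2": None},
    {"q1": None, "q2": "YES"},
    {"q1": "YES", "q2": None},
]

def fill_unknown(rows, label="UNKNOWN"):
    """Replace missing answers with an explicit category."""
    return [{k: (label if v is None else v) for k, v in row.items()} for row in rows]

def one_hot(rows):
    """Create one 0/1 indicator column per (question, answer) pair seen in the data."""
    pairs = sorted({(k, v) for row in rows for k, v in row.items()})
    return [{f"{k}={v}": int(row[k] == v) for k, v in pairs} for row in rows]

filled = fill_unknown(rows)
encoded = one_hot(filled)
print(encoded[0])
```

With UNKNOWN kept as its own category, a question yields three indicator columns when all three answers occur. One of them is redundant given the other two, which is presumably why two booleans per variable are enough.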

In the end, if I keep 50 variables for a signal of 300 cases, it won't work.

I have read in articles that most algorithms need at least 10 to 20 observations per variable.
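
To put numbers on that rule of thumb, the arithmetic can be written as a short Python snippet. It assumes the rule is read as cases of the rare class per variable, which is one common reading but is not stated in the post.

```python
positives = 300       # signal cases quoted in the post
candidate_vars = 50   # variables kept after filtering, as in the post

# Assumption: the rule of thumb counts rare-class cases per variable.
cases_per_variable = positives / candidate_vars
max_vars_at_10 = positives // 10   # variable budget at 10 cases per variable
max_vars_at_20 = positives // 20   # variable budget at 20 cases per variable

print(cases_per_variable, max_vars_at_10, max_vars_at_20)  # 6.0 30 15
```

Under that reading, 50 variables give 6 cases per variable, and the budget would be roughly 15 to 30 variables.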

Do you have other tricks to manage that?


Hi Fabrice JOURDAN, 

You might want to use a Missing Value Column Filter node to remove all columns from the input table which contain more missing values than a certain percentage. 
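
The idea behind that filter can be sketched in plain Python as follows. This is only an illustration of the logic, not the node's implementation; the toy rows, the 50% threshold, and the function names are assumptions made for the example.

```python
# Hypothetical toy data: None marks a missing answer.
rows = [
    {"q1": "YES", "q2": None},
    {"q1": "NO", "q2": None},
    {"q1": None, "q2": "YES"},
    {"q1": "YES", "q2": None},
]

def missing_fraction(rows, column):
    """Share of rows in which the column has no value."""
    return sum(1 for row in rows if row.get(column) is None) / len(rows)

def drop_sparse_columns(rows, max_missing=0.5):
    """Keep only columns whose missing share does not exceed max_missing."""
    columns = sorted({k for row in rows for k in row})
    keep = [c for c in columns if missing_fraction(rows, c) <= max_missing]
    return [{c: row.get(c) for c in keep} for row in rows], keep

filtered, kept = drop_sparse_columns(rows, max_missing=0.5)
print(kept)  # ['q1']  (q1 is 25% missing, q2 is 75% missing)
```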

You could also try to predict the missing values of the least complete variables from the most complete ones, using the Logistic Regression or Naive Bayes classifiers. Keep in mind, however, that in this case the predicted variable will be highly correlated with the variables used to predict it. 

Also, since you have a small number of records, it could help to reduce overfitting by using an ensemble method such as the Random Forest classifier, and by keeping the models simple (for Random Forest, limit the tree depth). If you're using Logistic Regression to predict your final class values, you can use regularization to train a less complex model. 
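
To show what regularization does to a logistic regression, here is a minimal plain-Python sketch with an L2 penalty, trained by gradient descent. It is not how the KNIME learner is implemented; the toy data, the function name `fit_logistic_l2`, and the penalty values are assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic_l2(X, y, l2, lr=0.5, epochs=1000):
    """Gradient descent on mean log-loss + (l2 / 2) * ||w||^2; the bias is not penalised."""
    n, d = len(X), len(X[0])
    positives = sum(y)
    w = [0.0] * d
    b = math.log(positives / (n - positives))  # start the bias at the base-rate log-odds
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            grad_b += err
            for j in range(d):
                grad_w[j] += err * xi[j]
        w = [wj - lr * (gw / n + l2 * wj) for wj, gw in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

# Hypothetical, deterministic toy data: x0 is informative, x1 is noise, positives are rare.
X, y = [], []
for i in range(200):
    x0, x1 = i % 2, (i // 8) % 2
    label = int(i % 8 == 1) if x0 == 1 else int(i % 40 == 0)
    X.append([x0, x1])
    y.append(label)

def norm(w):
    return math.sqrt(sum(v * v for v in w))

w_weak, _ = fit_logistic_l2(X, y, l2=0.0)
w_strong, _ = fit_logistic_l2(X, y, l2=1.0)
print("no penalty:    ", [round(v, 3) for v in w_weak])
print("strong penalty:", [round(v, 3) for v in w_strong])
```

A stronger penalty keeps the coefficients closer to zero, which is what makes the model less complex. In practice the penalty strength would be chosen by cross-validation rather than fixed by hand.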






Thanks Anna, I'll try.