I need some help with a classification problem that has a weak signal.
Case: I have ~9,000 observations and a signal of ~3.3% (300 cases).
Most of my variables (>20) are categorical and come from a quiz, so the values are YES/NO, or UNKNOWN when there is no answer.
The best-filled variables are up to 65% complete, but most of them are less than 5% complete.
Is it a good idea to:
-Use a default label UNKNOWN for NA values (I plan to use a ColumnLoop and a RuleEngineNode)?
-Use dummy variables (the OneToMany node, giving 2 booleans for each variable)?
-Use LowVarianceFilterNode to remove the least-filled variables?
-Use the CorrelationFilterNode to remove variables that are too strongly correlated with others?
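To make the four steps above concrete, here is a minimal sketch in Python with pandas. It is only an analogue of the idea, not the KNIME workflow itself: the column names, the toy data, and both thresholds (0.2 for variance, 0.9 for correlation) are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the quiz answers (hypothetical column names).
df = pd.DataFrame({
    "q1": ["YES", "NO", None, "YES", None, "NO"],
    "q2": ["YES", "NO", None, "YES", None, "NO"],  # same answers as q1
    "q3": [None, None, None, None, None, "YES"],   # almost never answered
})

# 1) Default label for missing answers.
filled = df.fillna("UNKNOWN")

# 2) Dummy variables: one 0/1 column per (variable, value) pair.
dummies = pd.get_dummies(filled, dtype=int)

# 3) Drop near-constant columns (the 0.2 threshold is an assumed value).
variances = dummies.var()
kept = dummies.loc[:, variances > 0.2]

# 4) For every pair with |correlation| above an assumed cutoff of 0.9,
#    drop the later column of the pair.
corr = kept.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
reduced = kept.drop(columns=to_drop)

print(list(reduced.columns))
```

On this toy data the q3 dummies are removed by the variance step and the q2 dummies by the correlation step, so only the q1 dummies remain.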
In the end, if I keep 50 variables for a signal of 300 cases, it won't work.
I read in articles that most algorithms need at least 10 to 20 observations per variable.
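The arithmetic behind this concern can be sketched as below. It assumes the 10-to-20 rule is counted against the 300 signal cases rather than all ~9,000 rows; that reading is an assumption, not something the articles are quoted on here.

```python
# Figures from the case description above.
n_rows = 9000
n_events = 300      # signal cases
n_variables = 50    # variables kept after filtering

# Share of rows that carry the signal.
event_rate = n_events / n_rows

# Signal cases available per variable (assumed reading of the rule).
events_per_variable = n_events / n_variables

# Largest number of variables that still satisfies the rule at 10 and at 20.
max_variables_at_10 = n_events // 10
max_variables_at_20 = n_events // 20

print(f"event rate: {event_rate:.1%}")
print(f"events per variable: {events_per_variable}")
print(f"max variables: {max_variables_at_20} to {max_variables_at_10}")
```

Under that reading, 50 variables leave 6 signal cases per variable, and the rule would allow roughly 15 to 30 variables.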
Do you have other tricks to handle this?