I am working on a prediction task for my data set and I am confused by Backward Feature Elimination. My data set has 1140 records and 33 attributes; 3 attributes are integers and the rest are strings. I performed a linear correlation on the data set and used a Correlation Filter with a threshold of 0.36, which eliminated 4 attribute columns (2 strings, 2 integers). I then used Backward Feature Elimination with the Backward Feature Elimination Start, Cross Validation, Backward Feature Elimination End, and Backward Feature Elimination Filter nodes.

The class column in my data set is Alco, which represents whether a student consumes alcohol in large amounts; it has two values (P - positive and N - negative). In cross-validation, in the X-Partitioner, I used random sampling (once I also tried stratified sampling with class column Alco). In the Decision Tree Learner I used class column Alco (the goal is to predict whether a student could become addicted to alcohol; since only one target can be predicted in classification, I used my class column Alco). For the quality measure I used the Gini index, for the pruning method MDL, and I also checked Reduced error pruning. The remaining settings were: Min number of records per node 2, Number of records to store for view 10,000, Average split point checked, number of threads 4, Skip nominal columns without domain information checked, Binary nominal splits checked, Max #nominal 10, and Filter invalid attribute values in child nodes checked.

I did not change anything in the Decision Tree Predictor. In the X-Aggregator I chose Alco as the target column and predicted(Alco) as the prediction column; in the Backward Feature Elimination End I also chose target column Alco and prediction column predicted(Alco). Finally, after the loop finished, the Backward Feature Elimination Filter showed 23 outputs with error 0.066, two outputs with error 0.076, one output with error 0.067, and one output with error 0.069.
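To make it clear what I understand the loop to be doing, here is a minimal sketch of backward feature elimination with cross-validated error, written in scikit-learn instead of KNIME nodes. The data is synthetic and the decision-tree settings are defaults, not the ones from my dialog, so this is only an approximation of the workflow:

```python
# Sketch of a backward feature elimination loop with 5-fold CV error,
# approximating what the KNIME Start/End loop nodes do. Synthetic data;
# does not reproduce my real 1140-record data set or tree settings.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
features = list(range(X.shape[1]))
history = []  # (number of features, CV error) per loop iteration


def cv_error(cols):
    # error rate = 1 - mean cross-validated accuracy
    tree = DecisionTreeClassifier(random_state=0)
    return 1 - cross_val_score(tree, X[:, cols], y, cv=5).mean()


while len(features) > 1:
    history.append((len(features), cv_error(features)))
    # drop the feature whose removal hurts the CV error least
    worst = min(features,
                key=lambda f: cv_error([g for g in features if g != f]))
    features.remove(worst)

for n, err in history:
    print(n, round(err, 3))
```

Each iteration records the error for the current feature set, then removes the single feature whose removal costs the least accuracy, which is why the Filter node lists one error value per feature count.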
My questions are:
1. In the Backward Feature Elimination Filter I am confused by these error values. What does this error represent?
2. Is the error in the Backward Feature Elimination Filter the same as the error in cross-validation?
3. I got several feature sets with the same error 0.066, but the number of columns differs for the same error value. How do I decide how many features to choose? Do I need to run a random forest on every feature set with error 0.066 and then pick the best one? For example, I got an error of 0.066 with 3 features and the same error with 25 features; how can I decide which number of features to use?
4. Could somebody explain to me how cross-validation works inside the Backward Feature Elimination Start and Backward Feature Elimination End loop?
5. Can I see the trees that produced these particular errors in the Backward Feature Elimination Filter?
6. After Backward Feature Elimination I want to use the random forest algorithm to build the most accurate model. Do I need to use the Random Forest Learner and Predictor together, or can I again use cross-validation on the features selected by the Backward Feature Elimination Filter?
7. Does cross-validation represent the random forest algorithm?
I provide links to two images: one of my workflow and one of my Backward Feature Elimination Filter dialog.
Thank you in advance,
Does this article help you?
If not, please let me know and I will try to answer the remaining questions.
You should know that the correlation filter is only useful for numeric features; you cannot feed categorical variables into a correlation filter. If you want to know whether two categorical variables are related, you should use the chi-square test, which is available in the Crosstab node.
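To illustrate the chi-square test of independence outside KNIME, here is a small SciPy sketch. The contingency table below is invented for the example; with your data you would cross-tabulate each categorical attribute against the Alco class column:

```python
# Chi-square test of independence for two categorical variables,
# the same statistic the Crosstab node reports. The counts here are
# made up purely for illustration.
import numpy as np
from scipy.stats import chi2_contingency

# rows: two levels of a hypothetical categorical feature
# cols: the class column Alco (P, N)
table = np.array([[90, 30],
                  [40, 80]])

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.2f}, p={p:.4g}, dof={dof}")
```

A small p-value suggests the feature and the class are dependent, so the feature may be worth keeping; a large p-value suggests no detectable association.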