Influence of input variables on a classification output


I would like to know the influence of input variables on a classification output variable. Consider this small example.

I have a number of input variables, and I have binned the quality of the data set as <99 and >99.

My target is to know which input variables affect the quality. Which method would be suitable?

| Material | Temp1 | Temp2 | Temp3 | Cr    | Quality% | Quality (binned) |
|----------|-------|-------|-------|-------|----------|------------------|
| C10      | 1019  | 1025  | 1030  | 0.692 | 99.5     | >99              |
| C20      | 1100  | 1124  | 1156  | 1.068 | 98       | <99              |
| C10      | 990   | 1004  | 1160  | 0.17  | 99.2     | >99              |
| C15      | 1008  | 1044  | 1150  | 0.822 | 96       | <99              |
| C20      | 1026  | 1038  | 1160  | 1.42  | 91       | <99              |
| C10      | 1022  | 1041  | 1130  | 0.911 | 64       | <99              |
| C10      | 990   | 1022  | 1140  | 0.176 | 99.1     | >99              |
| C20      | 1038  | 1094  | 1130  | 0.556 | 89       | <99              |
| C30      | 969   | 973   | 1160  | 1.07  | 96.4     | <99              |

I would like a method that tells me which input variables increase or decrease the quality parameter.


I would strongly recommend having a look at the slides by Dean Abbott, who gave a talk at this year's KNIME Summit. They contain several ideas about variable importance assessment. The slides are linked here.


Thanks for the slides. I am new to KNIME, and from the slides it's difficult for me to get the complete idea. Is there any example workflow available?


Hi there,

I have to admit that I've done most of my variable assessment until today directly in code rather than in KNIME, and I don't know whether there are any ready-to-use example workflows out there.

You say you want to measure which input variables affect the "quality". Do I understand correctly that you want to measure the importance of your individual input features with respect to the classification?

Some assorted ideas:

  1. You might try training your classifier on one input variable at a time and measuring the performance. The classifiers which perform best are obviously those with the "best" input variable.
  2. KNIME's "Random Forest Learner" node provides an output table which tells you where in the tree each input variable occurs. The idea is that input variables which occur at the top of the tree (i.e. close to the root) are more "important", because they make a good split for most of the data. (It's available at the second output port, titled "Attribute Statistics".)
  3. The Palladian nodes have a dedicated "InformationGain" calculator node, which basically uses the same measure that is (often) used in decision trees and provides you with an output table giving the IG value for each input variable.
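If you ever want to try the same three ideas outside KNIME, here is a rough Python/scikit-learn sketch. The data is made up to resemble the example table (column names `Temp1`–`Temp3`, `Cr` and the rule "low Cr means quality >99" are assumptions for illustration only):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical data shaped like the example table: Temp1, Temp2, Temp3, Cr
X = rng.normal(loc=[1020, 1040, 1140, 0.7], scale=[40, 40, 30, 0.4], size=(200, 4))
# Synthetic rule: quality is ">99" (label 1) when Cr is low
y = (X[:, 3] < 0.5).astype(int)
features = ["Temp1", "Temp2", "Temp3", "Cr"]

# Idea 1: train on one input variable at a time and compare performance
for i, name in enumerate(features):
    clf = make_pipeline(StandardScaler(), LogisticRegression())
    score = cross_val_score(clf, X[:, [i]], y, cv=5).mean()
    print(f"{name}: accuracy {score:.2f}")

# Idea 2: random-forest (impurity-based) feature importances
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
print(dict(zip(features, rf.feature_importances_.round(2))))

# Idea 3: information gain, here via mutual information per feature
mi = mutual_info_classif(X, y, random_state=0)
print(dict(zip(features, mi.round(2))))
```

Since the synthetic label depends only on Cr, all three measures should agree that Cr is the dominant variable; on real data they can disagree, which is itself informative.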

Hope this gives at least some general ideas.


Hi, you can check this link, which measures variable importance. To be honest, I have not tried it myself.

"Variable Importance in Prediction (Classification or Regression) Molels"

Hope it helps



Thank you very much! I will have a look and try the different approaches


There is an example workflow on the KNIME public server named "Variable Importance".

After running the workflow I have this as output. I would like to know what exactly it shows. Does this mean Universe_1_3 has the maximum amount of errors in the prediction model? And then how do we find out which input variable has the maximum influence on the output classification?



With this small number of features (as shown in your example), I wonder whether you are really looking for variable assessment methods or whether they would even be useful. Such methods usually guide you when you do not have any backing theory / domain knowledge or when there are simply too many features to select manually from. In any other case, it is IMO better to perform the selection manually.

However, if you are rather looking for a way to assess the importance of each input on the target, while keeping the other inputs constant, then a good approach might be to use Logistic Regression Learner and to analyze the coefficients for each variable.

You could also directly estimate the target "Quality%" as a continuous variable instead of a binary variable. The natural choice would then be Linear Regression Learner.
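If scikit-learn is an option outside KNIME, both suggestions can be sketched roughly like this. The data is made up to resemble the example table, and the assumed effect of `Cr` on quality is for illustration only; standardizing the inputs first makes the coefficient magnitudes comparable across variables:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Hypothetical data shaped like the example table: Temp1, Temp2, Temp3, Cr
X = rng.normal(loc=[1020, 1040, 1140, 0.7], scale=[40, 40, 30, 0.4], size=(300, 4))
quality = 99.5 - 5.0 * X[:, 3] + rng.normal(0, 0.5, 300)  # synthetic Quality%
binned = (quality > 99).astype(int)                        # 1 = ">99"
features = ["Temp1", "Temp2", "Temp3", "Cr"]

# Standardize so coefficient magnitudes are comparable across inputs
Xs = StandardScaler().fit_transform(X)

# Binary target: logistic regression coefficients
# (sign = direction of effect on the ">99" class, magnitude = strength)
logit = LogisticRegression().fit(Xs, binned)
print(dict(zip(features, logit.coef_[0].round(2))))

# Continuous target: linear regression on Quality% directly
lin = LinearRegression().fit(Xs, quality)
print(dict(zip(features, lin.coef_.round(2))))
```

With this synthetic data, Cr should get a clearly negative coefficient in both models while the temperatures stay near zero, which is exactly the kind of direction-plus-strength reading you asked about.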

Hi Geo.

The example I have shown is just for reference. I am actually working on a large data set with a lot of variables.

Thanks for your suggestion.

OK, in that case I would add that you might try dimensionality reduction such as PCA on your inputs, followed by a linear regression. Another approach to explore.
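A minimal scikit-learn sketch of that pipeline, on made-up wide data (200 rows, 50 variables, with some deliberately redundant columns standing in for correlated sensor readings — all assumptions for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical wide data set: 200 rows, 50 process variables
X = rng.normal(size=(200, 50))
# Make columns 1-9 near-copies of column 0, like redundant sensors
X[:, 1:10] = X[:, [0]] + 0.1 * rng.normal(size=(200, 9))
quality = 99.0 - 2.0 * X[:, 0] + rng.normal(0, 0.5, 200)  # synthetic Quality%

# Standardize, reduce to a few principal components, then regress on them
model = make_pipeline(StandardScaler(), PCA(n_components=5), LinearRegression())
model.fit(X, quality)
print(f"R^2 on training data: {model.score(X, quality):.2f}")

# Share of input variance captured by each retained component
pca = model.named_steps["pca"]
print(pca.explained_variance_ratio_.round(2))
```

The trade-off to be aware of: the components are linear mixtures of the original inputs, so you regain interpretability only by inspecting `pca.components_` to see which raw variables load on each component.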