Influence of input variables on a classification output


I would like to know the influence of input variables on a classification output variable. Consider this small example.

I have a number of input variables, and I have binned the quality of the data set as <99 and >99.

My target is to know which input variables affect the quality. Which method would be suitable?

| Material | Temp1 | Temp2 | Temp3 | Cr    | Quality% | Quality (binned) |
|----------|-------|-------|-------|-------|----------|------------------|
| C10      | 1019  | 1025  | 1030  | 0.692 | 99.5     | >99              |
| C20      | 1100  | 1124  | 1156  | 1.068 | 98       | <99              |
| C10      | 990   | 1004  | 1160  | 0.17  | 99.2     | >99              |
| C15      | 1008  | 1044  | 1150  | 0.822 | 96       | <99              |
| C20      | 1026  | 1038  | 1160  | 1.42  | 91       | <99              |
| C10      | 1022  | 1041  | 1130  | 0.911 | 64       | <99              |
| C10      | 990   | 1022  | 1140  | 0.176 | 99.1     | >99              |
| C20      | 1038  | 1094  | 1130  | 0.556 | 89       | <99              |
| C30      | 969   | 973   | 1160  | 1.07  | 96.4     | <99              |

I would like a method that tells me which input variables increase or decrease the quality parameter.


I would strongly recommend having a look at the slides by Dean Abbott, who gave a talk at this year's KNIME Summit. They contain several ideas about variable importance assessment. The slides are linked here.


Thanks for the slides. I am new to KNIME, and from the slides it's difficult for me to get the complete idea. Is there any example workflow available?


Hi there,

I have to admit that I've done most of my variable assessment until today directly in code rather than in KNIME, and I don't know whether there are any ready-to-use example workflows out there.

You say you want to measure which input variables affect the "quality". Do I understand correctly that you want to measure the importance of your individual input features with respect to the classification?

Some assorted ideas:

  1. You might try training your classifier on one input variable at a time and measuring the performance. The classifiers which perform best are obviously those with the "best" input variable.
  2. KNIME's "Random Forest Learner" node provides an output table which tells you where in the tree each input variable occurs. The idea is that input variables which occur at the top of the tree (i.e. close to the root) are more "important", because they make a good split for most of the data. (It's available at the second output port, titled "Attribute Statistics".)
  3. The Palladian nodes have a dedicated "InformationGain" calculator node, which basically uses the same measure that is (often) used in decision trees and provides you with an output table giving the IG value for each input variable.
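If you ever want to try the same three ideas outside KNIME, here is a rough Python/scikit-learn sketch. The data is made up to resemble the example table (column names `Temp1`–`Temp3`, `Cr` and the rule "low Cr means quality >99" are assumptions for illustration only):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical data shaped like the example table: Temp1, Temp2, Temp3, Cr
X = rng.normal(loc=[1020, 1040, 1140, 0.7], scale=[40, 40, 30, 0.4], size=(200, 4))
# Synthetic rule: quality is ">99" (label 1) when Cr is low
y = (X[:, 3] < 0.5).astype(int)
features = ["Temp1", "Temp2", "Temp3", "Cr"]

# Idea 1: train on one input variable at a time and compare performance
for i, name in enumerate(features):
    clf = make_pipeline(StandardScaler(), LogisticRegression())
    score = cross_val_score(clf, X[:, [i]], y, cv=5).mean()
    print(f"{name}: accuracy {score:.2f}")

# Idea 2: random-forest (impurity-based) feature importances
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
print(dict(zip(features, rf.feature_importances_.round(2))))

# Idea 3: information gain, here via mutual information per feature
mi = mutual_info_classif(X, y, random_state=0)
print(dict(zip(features, mi.round(2))))
```

Since the synthetic label depends only on Cr, all three measures should agree that Cr is the dominant variable; on real data they can disagree, which is itself informative.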

Hope this gives at least some general ideas.


Hi, you can check this link, which measures variable importance. To be honest, I have not tried it myself.

"Variable Importance in Prediction (Classification or Regression) Molels"

Hope it helps



Thank you very much! I will have a look and try the different approaches


There is an example workflow on the KNIME public server named "Variable Importance".

After running the workflow I have this as output. I would like to know what exactly it shows. Does this mean Universe_1_3 has the maximum amount of errors in the prediction model? And then how do we find out which input variable has the maximum influence on the output classification?



With this small number of features (as shown in your example), I wonder whether you are really looking for variable assessment methods or whether they would even be useful. Such methods usually guide you when you do not have any backing theory / domain knowledge or when there are simply too many features to select manually from. In any other case, it is IMO better to perform the selection manually.

However, if you are rather looking for a way to assess the importance of each input on the target, while keeping the other inputs constant, then a good approach might be to use Logistic Regression Learner and to analyze the coefficients for each variable.

You could also directly estimate the target "Quality%" as a continuous variable instead of a binary variable. The natural choice would then be Linear Regression Learner.
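If scikit-learn is an option outside KNIME, both suggestions can be sketched roughly like this. The data is made up to resemble the example table, and the assumed effect of `Cr` on quality is for illustration only; standardizing the inputs first makes the coefficient magnitudes comparable across variables:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Hypothetical data shaped like the example table: Temp1, Temp2, Temp3, Cr
X = rng.normal(loc=[1020, 1040, 1140, 0.7], scale=[40, 40, 30, 0.4], size=(300, 4))
quality = 99.5 - 5.0 * X[:, 3] + rng.normal(0, 0.5, 300)  # synthetic Quality%
binned = (quality > 99).astype(int)                        # 1 = ">99"
features = ["Temp1", "Temp2", "Temp3", "Cr"]

# Standardize so coefficient magnitudes are comparable across inputs
Xs = StandardScaler().fit_transform(X)

# Binary target: logistic regression coefficients
# (sign = direction of effect on the ">99" class, magnitude = strength)
logit = LogisticRegression().fit(Xs, binned)
print(dict(zip(features, logit.coef_[0].round(2))))

# Continuous target: linear regression on Quality% directly
lin = LinearRegression().fit(Xs, quality)
print(dict(zip(features, lin.coef_.round(2))))
```

With this synthetic data, Cr should get a clearly negative coefficient in both models while the temperatures stay near zero, which is exactly the kind of direction-plus-strength reading you asked about.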

Hi Geo.

The example I have shown is just for reference. I am actually working on a large data set with a lot of variables.

Thanks for your suggestion.

OK, in that case I would add that you might try dimensionality reduction such as PCA on your inputs, followed by a linear regression. Another approach to explore.
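A minimal scikit-learn sketch of that pipeline, on made-up wide data (200 rows, 50 variables, with some deliberately redundant columns standing in for correlated sensor readings — all assumptions for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical wide data set: 200 rows, 50 process variables
X = rng.normal(size=(200, 50))
# Make columns 1-9 near-copies of column 0, like redundant sensors
X[:, 1:10] = X[:, [0]] + 0.1 * rng.normal(size=(200, 9))
quality = 99.0 - 2.0 * X[:, 0] + rng.normal(0, 0.5, 200)  # synthetic Quality%

# Standardize, reduce to a few principal components, then regress on them
model = make_pipeline(StandardScaler(), PCA(n_components=5), LinearRegression())
model.fit(X, quality)
print(f"R^2 on training data: {model.score(X, quality):.2f}")

# Share of input variance captured by each retained component
pca = model.named_steps["pca"]
print(pca.explained_variance_ratio_.round(2))
```

The trade-off to be aware of: the components are linear mixtures of the original inputs, so you regain interpretability only by inspecting `pca.components_` to see which raw variables load on each component.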