I had a great time at last week’s conference, and I have seen strong predictive results using the Random Forest Learner, but I am looking for guidance on something else.
Given a fairly small set of rows (records) but a large number of columns, what node would be best for finding the columns that have the highest correlation to the target field?
Thanks in advance for any guidance - I’m really enjoying KNIME and hope to make this a tool that can greatly aid my organization in 2020 and beyond.
First of all, welcome to the KNIME Forum.
We are excited to hear that you enjoyed the summit!
Now to your question:
Are you interested in identifying the variables that best explain your target variable?
If so, then sparse linear models might be one way to go.
In a linear model, the target variable is modeled as a weighted sum of the predictor variables. In a sparse linear model, the weights of most predictor variables are 0, so those variables have no effect on the prediction.
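To make that concrete, the model has roughly this form (generic symbols, not tied to any particular column names):

$$\hat{y} = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_p x_p$$

In the sparse case, most of the weights $w_j$ are exactly 0, so only a handful of columns actually contribute to the prediction.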
Depending on the type of target variable (numerical in regression tasks, nominal in classification tasks), KNIME offers different nodes to train such a model.
For classification, you can use the Logistic Regression Learner with a Laplace prior, which needs to be enabled in the Advanced tab of the node dialog. The degree of sparsity is controlled by the prior’s variance parameter: a smaller variance corresponds to a sparser model.
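Outside of KNIME, the same idea looks roughly like this in scikit-learn (purely an illustration of the concept, not what the node runs internally; the file and column names are made up):

```python
# Sketch only: a Laplace prior corresponds to an L1 penalty, which drives
# most coefficients to exactly 0. "my_data.csv" and "target" are placeholders.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("my_data.csv")                      # hypothetical data set
X = StandardScaler().fit_transform(df.drop(columns=["target"]))
y = df["target"]                                     # binary target assumed

# Smaller C = stronger penalty = sparser model (analogous to a smaller prior variance).
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
clf.fit(X, y)

selected = df.drop(columns=["target"]).columns[clf.coef_[0] != 0]
print("Predictors kept by the sparse model:", list(selected))
```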
For regression tasks, you can use the H2O Generalized Linear Model Learner (Regression), which offers a plethora of options, including several ways to train a sparse model. Perhaps the easiest is to set the “Set maximum active predictors” option to the number of predictors you are looking for, i.e. if you believe your target should be explainable by 10 variables, set this value to 10 and the node will learn a model that uses the 10 most significant variables. Note that the H2O Generalized Linear Model Learner node can also be used for classification tasks.
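For reference, here is a rough sketch of the same setting in H2O’s Python API (again only an illustration; the data file, target column name, and the value 10 are placeholders):

```python
import h2o
from h2o.estimators import H2OGeneralizedLinearEstimator

h2o.init()
frame = h2o.import_file("my_data.csv")            # hypothetical data set
predictors = [c for c in frame.columns if c != "target"]

glm = H2OGeneralizedLinearEstimator(
    family="gaussian",            # regression; "binomial" for classification
    lambda_search=True,           # search over regularization strengths
    max_active_predictors=10,     # stop once 10 predictors are active
    standardize=True,             # z-score normalization inside H2O
)
glm.train(x=predictors, y="target", training_frame=frame)
print(glm.coef())                 # coefficient dictionary; most entries will be 0
```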
All of the mentioned nodes provide a table of the coefficients (i.e. the weights assigned to the predictor variables) as an additional output. Due to the structure of a linear model, the magnitude of these weights gives you an indication of how important each variable is.
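If you export that coefficient table (for instance with a CSV Writer node), ranking the columns by absolute weight is a one-liner; here is a small pandas sketch with made-up file and column names:

```python
# Rank predictors by the absolute size of their coefficients.
# "coefficients.csv" and its column names are placeholders for whatever
# the learner node's coefficient table looks like after export.
import pandas as pd

coefs = pd.read_csv("coefficients.csv")                 # columns: "variable", "coefficient"
coefs["abs_coefficient"] = coefs["coefficient"].abs()
ranking = coefs.sort_values("abs_coefficient", ascending=False)
print(ranking[["variable", "coefficient"]].head(10))    # top 10 most influential columns
```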
One disclaimer, though: to get the best results from these nodes, you should make sure to normalize your data properly, for example with the Normalizer node set to z-score normalization.
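For reference, z-score normalization simply rescales each column to zero mean and unit variance:

$$z = \frac{x - \mu}{\sigma}$$

where $\mu$ and $\sigma$ are the column’s mean and standard deviation.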