Data Science Strategy Advice

smithcreed · October 13, 2021, 3:24pm

My goal is to narrow several hundred variables down to those with the highest impact on the market value of several hundred homes. Each variable is a quality that a home either has or does not have, so I used a simple binary score for each home and each variable (1 = has the quality; 0 = does not have it).

OneDataSet.xlsx (93.5 KB)

For the initial step, my theory is that if the p-value of a variable is <= 0.05, it should go on the initial list of variables, since it at least has some probability of making an impact on the homes within the group.

I set up a Linear Regression loop to compute the p-value of each variable.
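Outside of KNIME, the loop could be sketched roughly as below. This is only an illustration in plain Python: the column names and prices are made up, each variable is fitted on its own against price, and the p-value uses a normal approximation to the t distribution, which is an assumption that is only reasonable with a few hundred rows (the eight toy rows here just show the mechanics).

```python
import math
from statistics import NormalDist


def slope_p_value(x, y):
    """Two-sided p-value for the slope of the simple regression y ~ a + b*x.

    Assumption: uses a normal approximation instead of the exact t distribution.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if n < 3 or sxx == 0:
        return float("nan")  # constant column or too few rows: slope undefined
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se = math.sqrt(sse / (n - 2) / sxx)  # standard error of the slope
    if se == 0:
        return 0.0  # perfect fit
    z = slope / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def screen_variables(columns, price, alpha=0.05):
    """Return {name: p_value} for every variable with p_value <= alpha."""
    selected = {}
    for name, values in columns.items():
        p = slope_p_value(values, price)
        if p <= alpha:  # a nan p-value never passes this check
            selected[name] = p
    return selected


# Hypothetical toy data: 1 = home has the quality, 0 = it does not.
columns = {
    "pool": [1, 1, 1, 1, 0, 0, 0, 0],
    "fence": [1, 0, 1, 0, 1, 0, 1, 0],
}
price = [310, 305, 320, 315, 250, 245, 260, 255]

selected = screen_variables(columns, price)
print(sorted(selected))
```

With a few hundred variables, a 0.05 cutoff will let roughly 5% of the irrelevant ones through by chance, so a list built this way is a first filter rather than a final answer.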

From a data science perspective, is my logic sound, and should I be considering other strategies to discover the qualities with the highest impact on value for each set of homes?

I know this is outside of the typical KNIME question, but I would appreciate your thoughts. Thanks

PS: I also ran a random forest regression and used the KNIME suggestions on deriving the variable importance. I am now looking into how the linear regression and random forest outcomes align, or don't, and am open to any and all further suggestions. Thanks
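One possible way to check that alignment is to rank the variables under each method and compare the two rankings. The sketch below is plain Python with invented p-values and importances. It uses Spearman's rank correlation, which is my own choice of measure here rather than anything KNIME prescribes, and the simple formula assumes there are no tied scores.

```python
def ranks(scores, descending):
    """Map each variable name to its rank (1 = strongest)."""
    ordered = sorted(scores, key=scores.get, reverse=descending)
    return {name: position + 1 for position, name in enumerate(ordered)}


def spearman(rank_a, rank_b):
    """Spearman rank correlation; this formula assumes no tied ranks."""
    n = len(rank_a)
    d2 = sum((rank_a[name] - rank_b[name]) ** 2 for name in rank_a)
    return 1 - 6 * d2 / (n * (n ** 2 - 1))


# Hypothetical results from the two methods.
p_values = {"pool": 0.001, "garage": 0.02, "fence": 0.60, "deck": 0.30}
rf_importance = {"pool": 0.45, "garage": 0.15, "fence": 0.10, "deck": 0.30}

linear_rank = ranks(p_values, descending=False)    # small p-value = strong
forest_rank = ranks(rf_importance, descending=True)  # large importance = strong

rho = spearman(linear_rank, forest_rank)
disagree = sorted(n for n in linear_rank if linear_rank[n] != forest_rank[n])
print(round(rho, 3), disagree)
```

A value near 1 means the two methods order the variables similarly. The names in `disagree` are the ones worth a closer look.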

mlauber71 · October 13, 2021, 5:16pm

@smithcreed using the variable importance to reduce the number of variables is one way to go, and it may well be a robust one.

For further ideas you might want to check out the data preparation links from my meta collection about machine learning. Another typical approach is to exclude variables that are highly correlated with each other.
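As a rough illustration of the correlation idea, here is a minimal sketch in plain Python. The column names, the data, and the 0.9 threshold are all assumptions for the example. It walks through the variables in order and keeps one only if it is not strongly correlated with anything already kept.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # constant column: treated as uncorrelated in this sketch
    return cov / math.sqrt(var_a * var_b)


def drop_correlated(columns, threshold=0.9):
    """Keep a variable only if |r| <= threshold against every kept variable."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical binary columns; "pool_heater" duplicates "pool".
columns = {
    "pool": [1, 1, 0, 0, 1, 0],
    "pool_heater": [1, 1, 0, 0, 1, 0],
    "garage": [1, 0, 1, 0, 0, 1],
}

kept = drop_correlated(columns)
print(kept)
```

Which member of a correlated pair survives depends on the column order, so it can help to put the variables you care most about first.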

You could also explore all kinds of clustering techniques, but you might lose immediate interpretability along the way.

Daniel_Weikert · October 13, 2021, 5:33pm

Hi,
There is a feature selection node that could help you figure out the most important features for your target variable.
Maybe have a look at it.
br

smithcreed · October 13, 2021, 5:38pm

@Daniel_Weikert, thank you. I have pulled these nodes and will check them out.

smithcreed · October 13, 2021, 5:43pm

@mlauber71, thank you. I used the low variance and correlation nodes when reducing the variables from approximately 230 down to 40 or so under different methods. The p-value in Linear Regression was something new I was trying because I did not like the value contribution results I was seeing. It's tricky stuff, as what I am really doing is isolating the contribution values of four categories within the sale data, one of which you see here in the 230 or so variables.

I will check out all of your suggested links. Thanks
