Regression Machine Learning

Hi there,

I am trying to fit a regression model in KNIME, and I have several questions for you; I am quite confused.

My model has the following independent variables:

- Vehicle Type: categorical (5 categories)
- Company Type: categorical (6 categories)
- Volume: numerical

The dependent variable is:
- Unloading time per volume unit (numerical, exponentially distributed)

I could use a random forest etc., but my priority is fitting a polynomial regression and obtaining its coefficients. I have these questions for you:

1- I used the X-Partitioner node for partitioning, and at the end I want to see the model coefficients for the aggregated data. But if I right-click the Polynomial Regression Learner node, it seems to show only the coefficients from the last partition, not the aggregated model. How can I get the coefficients of the aggregated model; is it even possible?

2- My dependent variable follows an exponential distribution. Should I transform it with a LOG transformation? The Math nodes in the image are there for that task.

3- I also have problems combining parameter optimization with the X-Partitioner. I want to optimize the parameters of a random forest, but the X-Partitioner's recursion interferes with the loop, I think. Depending on the orientation of the variable ports, it says either “Wrong Loop Start Node Connected” or “Can’t merge flowvariable stacks (likely a loop problem)”.

4- Moreover, even if the third issue is solved, there is a big question: parameter optimization must move to the next parameter set only after the X-Partitioner completes all 10 partitions.
In summary, the parameter optimization node should be triggered only after the 10th partition ends.
As far as I can tell, the X-Aggregator sends the flow back to the beginning each time and does not let it continue until the last partition. But the optimization loop start node is a problem: where and how should I place it?

Thanks for your help, have a nice day!

Hello kilincali35,

  1. Cross-validation is usually used to find good hyperparameters, while the final model should be trained on the full dataset. If you want to perform bagging (i.e. create multiple models as an ensemble), the cross-validation nodes are not well suited, but you can easily build such ensembles with the more generic KNIME loop nodes.

  2. Yes, if you know your target distribution, it is advisable to bring it closer to a normal distribution, since most regression algorithms work best on approximately normal targets (see the sketch after this list).

  3. You need to connect the flow variable ports of the loop start nodes; otherwise KNIME can’t identify which one should be the outer loop.

  4. If I understand your point correctly, the parameter optimization loop should be the outer loop (it wouldn’t make much sense otherwise). In this case, you will have to connect the flow variable port of the Parameter Optimization Loop Start to the X-Partitioner node. KNIME will then perform one full cross-validation per parameter configuration (the sketch after this list shows the same nesting in code).
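KNIME itself is graphical, of course, but points 2 and 4 are easier to see in code. Here is a minimal sketch of the same idea in Python/scikit-learn, not the KNIME nodes themselves; the file name, column names, and grid values are made-up assumptions. The target is log-transformed, the parameter grid is the outer loop, and a complete 10-fold cross-validation runs inside each parameter setting:

```python
# Minimal sketch of points 2 and 4 OUTSIDE of KNIME (Python/scikit-learn).
# File name, column names, and grid values are illustrative assumptions.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

df = pd.read_csv("unloading_times.csv")  # hypothetical input file
X = pd.get_dummies(df[["vehicle_type", "company_type", "volume"]])
y = np.log(df["unloading_time_per_volume"])  # point 2: log-transform the exponential target

best_score, best_params = -np.inf, None
for n_trees in [50, 100, 200]:  # point 4: the parameter grid is the OUTER loop
    model = RandomForestRegressor(n_estimators=n_trees, random_state=0)
    # ...and one complete 10-fold cross-validation runs INSIDE each setting
    scores = cross_val_score(
        model, X, y,
        cv=KFold(n_splits=10, shuffle=True, random_state=0),
        scoring="neg_mean_squared_error",
    )
    if scores.mean() > best_score:
        best_score, best_params = scores.mean(), {"n_estimators": n_trees}

print(best_params, best_score)
```

The nesting mirrors the flow variable connection in KNIME: the outer loop fixes one parameter set, the inner cross-validation scores it on all 10 folds, and only then does the outer loop advance.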

I hope this helps!

Cheers,

nemad

Thank you nemad for your detailed answer; it is a good step forward for me. I am still learning, trying everything possible in KNIME, and mostly getting stuck :)

Your answer perfectly solves questions 3 and 4. I learned something new and am very happy right now.

Your answer to question 2 is also clear.

I have a follow-up to question 4, an extension; let's call it 4-b.

4-b) The final result shows me the last grid parameters, but I want to see the results for the “optimum parameter set”. What should I do?

Should I somehow get the “Best Parameters” from the optimization loop end and then somehow feed them into another random forest node as input? How? (1st approach)

Or is there a way to record all iterations and then filter out the best one? (2nd approach)

Also, regarding the 1st question:

1-) I am actually trying to use k-fold cross-validation as a basic precaution against overfitting. I just want to use k-fold with a polynomial regression, not a bagging model, but I can't get the final coefficients for the full data. When I right-click the learner node, it only shows the coefficients of the kth iteration.
It is a problem similar to 4-b. Isn't there a way of recording the coefficients for each of the k iterations and then averaging each coefficient? (It sounds statistically silly; is it even possible?)

Hello again,

I am glad my answer did help you out =)

Regarding your additional question, I am not entirely sure what you mean. The first output of the Parameter Optimization Loop End provides the best parameter setting, i.e. the setting that achieved the best objective value. The second output contains all parameter settings together with the values they achieved.
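That second output is essentially your “2nd approach” already. In plain pandas terms (a sketch only; the column names below are assumptions, not the node's actual output columns), filtering the best setting out of the table of all iterations looks like this:

```python
# Sketch of the "record everything, then filter the best" idea in pandas.
# Column names are illustrative; they are not the node's actual output names.
import pandas as pd

# Stand-in for the loop end's second output: one row per parameter setting.
all_runs = pd.DataFrame({
    "n_estimators": [50, 100, 200],
    "objective":    [0.42, 0.35, 0.37],   # e.g. cross-validated error
})

best_row = all_runs.sort_values("objective").iloc[0]  # smallest error wins
print(best_row)
```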

The screenshot below shows how you can automatically retrain the model with the best parameter configuration.

If you want, you can build a loop that does what you are describing, but I believe there is a slight misunderstanding regarding the purpose of cross-validation. The idea of cross-validation is to give a robust estimate of model performance for a specific set of hyperparameters (e.g. the number of trees in a random forest).
Once you have found the best hyperparameters, it is common to retrain the model on the full dataset.
This is because the models inside the cross-validation are trained on only a subset of the data, and since more data usually improves model quality, it is sensible to retrain the model on the full dataset once you have found a good hyperparameter configuration.
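This also answers your question 1 about the polynomial coefficients. Here is a hedged sketch of the same workflow in Python/scikit-learn (the toy data and the polynomial degree are purely illustrative): cross-validation only estimates performance, and the single coefficient set you report comes from one final fit on all the data, not from any individual fold and not from averaging fold coefficients.

```python
# Sketch: cross-validation ESTIMATES performance; the final coefficients come
# from ONE refit on ALL the data (question 1). Data and degree are toy choices.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))                    # stand-in for the inputs
y = np.exp(-0.3 * X[:, 0]) + rng.exponential(0.1, 200)   # toy positive target

poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())

# 10-fold CV: a robust performance estimate for this model family...
cv_mse = -cross_val_score(poly, X, y, cv=10, scoring="neg_mean_squared_error").mean()
print("estimated MSE:", cv_mse)

# ...then one final fit on the FULL dataset yields the coefficient set to report,
# rather than the last fold's coefficients or an average over folds.
poly.fit(X, y)
lr = poly.named_steps["linearregression"]
print("coefficients:", lr.coef_, "intercept:", lr.intercept_)
```

Collecting and averaging the per-fold coefficients, as you describe, is mechanically possible with a loop, but the refit-on-everything approach above is the standard one.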

More or less the same applies to parameter optimization, especially if you combine it with cross-validation.

Cheers,

nemad
