Hello,
I am new to machine learning and am currently trying out different models.
I have noticed the following:
When I train a model such as linear regression, I get the weights (coefficients and intercept) at the end of the training.
Now if I repeat the training without changing the training data, not even the order, I get exactly the same weights. I don’t understand why.
Shouldn’t the result be at least slightly different, since the initial weights and the optimization of the weights are not exactly the same?
The same thing happens with other models like boosted trees etc.
What I also don't understand is that the learner outputs not only the weights but also an R^2. How? What data does it use to calculate that? The only input to the learner is the training data.
Not all machine learning techniques use randomness for learning/fitting. For instance, there are algorithms for Linear Regression that are deterministic: given the same training data, they always produce the same weights.
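To illustrate the deterministic case, here is a minimal sketch of ordinary least squares for a single feature, solved in closed form. There are no initial weights and no random steps, so fitting twice on the same data gives identical results. The function name `fit_line` and the sample data are made up for this example, and this is not necessarily how the KNIME node computes its solution.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training data, used unchanged for both fits.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

first = fit_line(xs, ys)
second = fit_line(xs, ys)

# Same data, same arithmetic, same weights.
print(first == second)
```

Because the solution is computed directly from the data rather than searched for from a random starting point, there is nothing that could differ between runs.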
For other techniques that actually use randomness, like Gradient Boosted Trees, a common way to make results reproducible is to fix the seed of the random number generator. As a result, the random number generator produces the same sequence of pseudo-random numbers on every run. The Gradient Boosted Trees Learner node (see its page on the KNIME Community Hub) has a setting for this in the Advanced tab, which is turned on by default.
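As a rough sketch of what fixing the seed does, the snippet below draws a random subset of row indices, similar in spirit to the row sampling that tree ensembles may use. The function `sample_rows` and its parameters are hypothetical and only show the general idea, not the KNIME implementation.

```python
import random


def sample_rows(n_rows, fraction, seed=None):
    """Pick a random subset of row indices; a fixed seed makes it repeatable."""
    rng = random.Random(seed)  # seed=None falls back to a system-provided seed
    k = int(n_rows * fraction)
    return sorted(rng.sample(range(n_rows), k))


# Two runs with the same fixed seed select exactly the same rows.
a = sample_rows(100, 0.5, seed=42)
b = sample_rows(100, 0.5, seed=42)
print(a == b)

# Without a seed, the selection will usually differ between calls.
c = sample_rows(100, 0.5)
d = sample_rows(100, 0.5)
print(c == d)
```

With a fixed seed, every downstream step that depends on the sampled rows sees the same input, which is why the trained model comes out the same.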
Regarding your third point: yes, the Linear Regression Learner node (see its page on the KNIME Community Hub) outputs R² in its view. It is computed on the training data itself and measures how well the estimated model fits that data. An unexpectedly low value of R² can indicate poor model fit, while an unexpectedly high value may be a sign of overfitting.
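For reference, R² is commonly defined as 1 - SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the sum of squared deviations of the targets from their mean. Here is a small sketch that computes it from training targets and the model's predictions on the same rows. The helper `r_squared` and the numbers are invented for illustration.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical training targets and the fitted model's predictions for them.
y_train = [2.1, 3.9, 6.2, 7.8, 10.1]
y_fitted = [2.0, 4.0, 6.0, 8.0, 10.0]

r2_train = r_squared(y_train, y_fitted)
print(r2_train)

# Sanity checks: perfect predictions give 1, always predicting the mean gives 0.
mean_y = sum(y_train) / len(y_train)
r2_perfect = r_squared(y_train, y_train)
r2_mean = r_squared(y_train, [mean_y] * len(y_train))
```

Since only the training targets and the fitted values are needed, the learner can report this number without any additional data.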
I have exactly the same results when I turn the “static random number” option on and off in Gradient Boosted Trees. The R^2 and the MSE or MAE do not change at all. There is only a change when I randomize the training data before retraining. How can that be? My workflow is as simple as you can imagine: CSV Reader → Shuffle → Partitioning (from top, not randomizing again) → Learner → Predictor → Scorer.
Another question is where I can see which optimizer and which loss function each learner uses. This can be selected in hardly any learner, which is why I assume that MSE is used by default. Is this correct? And which optimization method is used?
This is interesting, and I can reproduce that the node apparently ignores the random seed setting. I created a ticket for the developers to investigate further (internal reference AP-22377). Thank you for reporting.