I’m trying to test some AI algorithms in KNIME using scikit-learn nodes.
As input I’m giving a simple dataset where the data lie on a line (the equation is y = 0.5*x + 5, or even simpler, y = x).
It is very strange: the predicted values of the target variable y are constant.
I’m saving the trained model to a binary file (Model Writer node) and using it in another workflow to make inferences. My input data contain only timestamps and the y values (which is strange, because without the y column I get an error).
In this case I want to predict the y variable, but the model predicts constant values.
If the input file contains both timestamps and the y variable, the node runs (and predicts constant values); if the input file contains only timestamps, the node fails.
In that case the predictor gives an error: it does not find the target variable y.
Can someone help me understand what the problem is? Is there some error in my data?
Attached is my workflow with different input data. retta_gaussian.knwf (113.6 KB)
@mlauber71 I read the suggested material, but none of it solves my problem.
At the moment I’m using a dummy variable created with the Category To Number node, and it seems to work, but I don’t think this is a good solution.
Attached you can see my current solution. gaussian_forecast.knwf (399.0 KB)
If you have another idea, it is welcome.
Thanks.
Hi @giuseppeR
I took a look at your workflow; I think the issue here is related to your kernel choice.
A little background: Gaussian process regression is a non-parametric model and behaves quite differently from most machine learning models we usually talk about. Instead of fitting an equation with parameters over the input features to generate the output, a probability distribution is generated and used for prediction. This probability distribution is fit by machinery I won’t elaborate on here, under the assumption that similar input data should produce similar output data.
The kernel function defines what “similar” means. In your workflow the Learner node is configured to use the White kernel, which treats two data points as either completely the same if they’re identical, or completely different if they’re not. It’s binary.
Because all of your data points are unique, they’re all marked as equally dissimilar to each other, and this forces the model to generate a predictive distribution that is just the mean value of the training set.
I’d recommend swapping your kernel to either the default (I think that is a modified RBF) or the plain RBF, which is a good all-purpose kernel when working with smooth functions.
In summary: swap your kernel to either the default or RBF and it should work fine.
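If it helps to see this outside of KNIME, here is a small sketch reproducing the effect with scikit-learn directly (I’m assuming the KNIME nodes wrap sklearn’s `GaussianProcessRegressor`; the data is the toy line from your post). The White kernel collapses to a constant prediction, while the RBF kernel recovers the line:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# The toy line from the question: y = 0.5*x + 5
X = np.linspace(0.0, 10.0, 25).reshape(-1, 1)
y = 0.5 * X.ravel() + 5.0

# White kernel: zero covariance between distinct points, so the
# posterior mean collapses to the training mean everywhere.
gpr_white = GaussianProcessRegressor(kernel=WhiteKernel(), normalize_y=True).fit(X, y)

# RBF kernel: covariance decays smoothly with distance, so nearby
# inputs share information and the line is recovered.
gpr_rbf = GaussianProcessRegressor(kernel=RBF(), alpha=1e-8, normalize_y=True).fit(X, y)

X_new = np.array([[2.0], [5.0], [8.0]])
print(gpr_white.predict(X_new))  # constant: 7.5 everywhere (the training mean)
print(gpr_rbf.predict(X_new))    # approximately [6.0, 7.5, 9.0], i.e. 0.5*x + 5
```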
Oh, and you’ll probably also want to normalize your timestamp column; such huge numbers so close together seem to be throwing something off as well, probably in the same way as the White kernel.
You can use the Normalizer node on the training set and the Normalizer (Apply) node on the test set.
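The Normalizer / Normalizer (Apply) pattern corresponds to fitting scaling statistics on the training set only and reusing them on the test set. A minimal sketch in scikit-learn terms (the timestamp values below are hypothetical epoch milliseconds, just to illustrate huge, closely spaced numbers):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical epoch-millisecond timestamps: huge values, close together.
train_ts = np.array([[1_600_000_000_000.0],
                     [1_600_000_060_000.0],
                     [1_600_000_120_000.0]])
test_ts = np.array([[1_600_000_090_000.0]])

# "Normalizer": fit the min/max statistics on the training set only.
scaler = MinMaxScaler().fit(train_ts)
train_norm = scaler.transform(train_ts)  # [[0.0], [0.5], [1.0]]

# "Normalizer (Apply)": reuse the training statistics on the test set.
test_norm = scaler.transform(test_ts)    # [[0.75]]
```

The key point is that the test set is transformed with the training set’s statistics, never re-fit, so train and test land on the same scale.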