Partial Least Squares (PLS) regression is a widely used modeling algorithm with broad applications in many data science fields. While it can be accessed via R or possibly Python scripting, I think it deserves native KNIME nodes. Besides training and prediction, being able to calculate and output PLS loadings and scores is critical when implementing this.

I have used an R-language implementation for many years; I will see if I can get permission from my employer to share it. I also have PLS-DA, OPLS and PCA/SIMCA (built on the Bioconductor pcaMethods package, which handles missing data).

It is out! Sklearn: the KNIME Nodes for Scikit-Learn (sklearn) Algorithms, developed by our KNIME Python team.
The extension contains the following learners:

Partial Least Squares (PLS) Learner

Lasso Regression Learner

Gaussian Process Regression Learner

Gaussian Process Classification Learner

as well as the corresponding predictors.

Download it on the Hub for KNIME Analytics Platform 4.7 and give us feedback!

Great to see PLS and LASSO nodes, two of the most useful regression methods. HOWEVER, the scikit-learn PLS implementation is not the "industry standard" NIPALS PLS algorithm and is incapable of handling missing data, which is one of the strengths of the original PLS algorithm as implemented in most multivariate software packages.
I see in the configuration dialog that the only options are "Skip rows with missing data" or "Fail on observing missing values". This is a completely alien concept to most users of PLS.
Time and permission permitting, I am planning to try to write a NIPALS PLS node; first I need to get my head around writing KNIME nodes in Python. I will try the new nodes out on a dataset without missing values shortly. Thanks anyway for working on this, it is a step in the right direction.

OK, I've tried to run the PLS node using the "gasoline" data from the R package 'pls'. It reads the data OK, but the node refuses to configure and I get this error in the console:

I tried it on some of my own data and it works! But it seems that the number of components cannot be higher than the number of response variables / targets. This is the first time I have seen this in a PLS algorithm; the "pls" package in R does not have this limitation. According to the sklearn documentation there is a difference between "PLSCanonical", which has this limitation, and "PLSRegression", which does not. See section 1.8.3 on https://scikit-learn.org/stable/modules/cross_decomposition.html#cross-decomposition So I guess the node uses PLSCanonical?

Could you please clarify what you mean by "the node refuses to configure"? Does the configuration dialog show up at all? Can you connect an input table and run the node without opening the configuration dialog (even though the settings might not make sense)?

The warning about the "unknown format" should not be the reason for any failures.

And about the selection of the PLS method: scikit-learn states "The PLSRegression estimator is similar to PLSCanonical with algorithm='nipals', ...", and we are using PLSRegression, so it should be the NIPALS algorithm. Why do you say that it is not NIPALS?

I am not an expert in PLS methods; could you please outline how missing data is handled by other packages providing PLS methods?

Hi Carsten,
The reason I say this is that the original implementation of NIPALS inherently copes with missing data. When calculating the scores, loadings and weights it skips over any missing values (provided that there is a minimum number of data points, as set as an option in the software). I have heard (but may be mistaken) that the scikit-learn PLS does not cope with missingness and so clearly is not implementing the original algorithm. The NIPALS algorithm is well described in "Partial Least Squares Regression: A Tutorial" by Paul Geladi and Bruce R. Kowalski, Analytica Chimica Acta 185 (1986) 1-17.
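To make the missing-data point concrete, here is a minimal numpy sketch (not production code, and not any package's actual implementation) of one NIPALS component for a single response y, where every sum simply skips the missing (NaN) entries of X, in the spirit of the original algorithm:

```python
import numpy as np

def nipals_pls1_component(X, y):
    """One NIPALS PLS component for a single response y.
    Every regression sum runs only over the observed (non-NaN)
    entries of X, so missing values are skipped, not imputed."""
    mask = ~np.isnan(X)                # which entries of X are observed
    Xf = np.where(mask, X, 0.0)        # NaNs contribute nothing to sums
    u = y.astype(float)                # for a single y there is no inner loop: u = y
    # weights: column-wise regression of X on u over observed entries
    w = (Xf * u[:, None]).sum(axis=0) / (mask * u[:, None] ** 2).sum(axis=0)
    w /= np.linalg.norm(w)
    # scores: row-wise regression of X on w over observed entries
    t = (Xf * w).sum(axis=1) / (mask * w ** 2).sum(axis=1)
    # loadings: column-wise regression of X on t over observed entries
    p = (Xf * t[:, None]).sum(axis=0) / (mask * t[:, None] ** 2).sum(axis=0)
    return t, w, p

# demo: a matrix with a missing value still yields finite scores/loadings
rng = np.random.default_rng(0)
X = rng.normal(size=(12, 4))
X[3, 1] = np.nan
y = rng.normal(size=12)
t, w, p = nipals_pls1_component(X, y)
```

Further components would be extracted after deflating X (X = X - t p'), and a real implementation would also enforce the minimum-observations option mentioned above.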

Regarding the number of components in PLS: this frequently exceeds the number of Y variables (so for a single Y you may end up with 2, 3 or 4 components), as they are successive approximations by the model in order to account for variation in X unrelated to Y. In practice this may be caused by light scattering in spectroscopy or systematic experimental effects in other applications. Another reason is that there may be mild non-linearities that are modelled better by multiple components than one. This may be overcome by using the more modern variant of PLS called OPLS; see the papers by Johan Trygg if you want to understand this. In OPLS the extraneous "orthogonal" information that is in X is partitioned out and only the information in X related to Y is used in the model. Another alternative is target projection PLS; see the papers of Olav Kvalheim. It is important to note that these orthogonal PLS methods are no better at prediction; they just help interpretation. So in the current context, having a working PLS that handles missing data will be sufficient.
Now coming on to the error message: I have an R Source node from which I am extracting the dataset gasoline from the pls package. The output is a matrix of 401 X variables and 1 Y variable (Octane). Each column is rendered as a standard Double in the KNIME table. I then use a Partitioning node to split into train and test sets, with 50 rows for training and 9 for testing (following the R example). In the PLS node I select the wavelengths as the Features and octane as the Target, choose "Skip rows with missing values" (I don't think there are any in this dataset) and 2 components to keep. Then the node just stays "red" and won't configure. AH-HA, that's the problem: in fact, if I select 1 component it works, so it must be the limitation mentioned above, the algorithm not allowing more PLS components than Y variables.

That is a strange limitation for anyone coming from a chemometrics background. It is normal to expect several PLS components even in a single-Y model, for the reasons mentioned above.

I will test the PLS node given this limitation on components and get back to you. Obviously that's a big limitation and won't help PLS reach the optimum model.

I have made a model with one PLS component and it successfully predicts using the Regression Predictor. Model performance is poor with just a single component, giving an R2 of just 0.295 and an RMSE of 1.269.

If instead I use my own R-based PLS node I get a 2-component model with R2 = 0.968 and RMSE of 0.27.

If I use my home-made R-based PLS node and restrict it to 1 PLS component I get an R2 of 0.294 and RMSE of 1.3, so it would seem your PLS node is giving similar results to the R implementation of PLS.
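For reference, the R2 and RMSE figures quoted above can be computed with a few lines of numpy. A generic sketch, where y_true and y_pred stand for hypothetical arrays of measured and predicted octane values:

```python
import numpy as np

def r2_rmse(y_true, y_pred):
    """R-squared and root-mean-square error of a set of predictions."""
    residuals = y_true - y_pred
    ss_res = np.sum(residuals ** 2)                     # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)      # total sum of squares
    return 1.0 - ss_res / ss_tot, np.sqrt(np.mean(residuals ** 2))
```

Perfect predictions give R2 = 1 and RMSE = 0, while predicting the mean of y gives R2 = 0, which makes the single-component figures above easy to interpret.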

So if you can overcome the limitation on the number of components and the missing-data limitation, it could be useful.

I am hoping to get permission to release my R-based PCA and PLS nodes to the community within a few months. The PCA one can cope with missing data, as it's built on the R library pcaMethods, but the pls package does not handle missing data. If I can brush up on my Python I may try to write a native Python node in due course. Another component I have made deals with the pre-processing that is commonly applied to chemometric models, and I'm hoping to release that as well.

I will try to take a look at the LASSO regression as well this week.

Thanks so much for the detailed explanation and your experiments, this is really valuable feedback!

Regarding the number of components, we relied on the documentation of scikit-learn. We will investigate whether we can simply disregard this limitation and still use their implementation and will let you know about the results very soon.

We have tried the sklearn implementation of the PLS algorithm a bit more and indeed, the limit that the number of components must be less than or equal to the number of targets seems to be an artefact of their documentation. We have removed this limitation in the PLS node and started a new build of the extension. You should be offered an update of the KNIME sklearn nodes to build 202305121304 later tonight. Or wait for next week; it should definitely show up on Monday.

Would you be so kind as to try whether the results match your R PLS implementation also with 2 components in the scenario you mentioned above?