Node suggestion: PLS regression

Partial Least Squares (PLS) regression is a widely used modeling algorithm with broad applications across data science. While it can be accessed via R or Python scripting, I think it deserves native KNIME nodes. Besides training and prediction, being able to calculate and output the PLS loadings and scores is critical for any implementation.

Thanks/Evert

Hi Evert,

You are right. We are working on some nodes in this regard and will let you know when we publish them.

Best regards
Steffen


I have used an R-language implementation for many years. I will see if I can get permission from my employer to share it. I also have PLS-DA, OPLS and PCA/SIMCA (based on the Bioconductor pcaMethods package, which handles missing data).

Hi Steffen, It would be great if this could be a NIPALS implementation that handles missing data also. Mark

Using KNIME has allowed me to avoid coding so far, so I would strongly prefer native nodes.

Hi Mark,

The extension we are developing will provide algorithms from sklearn for now.

Best regards
Steffen


Dear @steffen_KNIME ,

PLS regression / PLS-DA would indeed be incredibly useful. Non-negative matrix factorization (NMF) is another one on my wishlist.

Best,
Aswin


Hi @evert.homan_scilifelab.se, @Mark_Earll, @Aswin and everybody else!

It is out: the KNIME Nodes for Scikit-Learn (sklearn) Algorithms, developed by our KNIME Python team!
The extension contains the following learners:

  • Partial Least Squares (PLS) Learner
  • Lasso Regression Learner
  • Gaussian Process Regression Learner
  • Gaussian Process Classification Learner

as well as the corresponding predictors.

Download it on the Hub for KNIME Analytics Platform 4.7 and give us feedback!

Best regards
Steffen


Great to see PLS and LASSO nodes, two of the most useful regression methods. However, the scikit-learn PLS implementation is not the "industry standard" NIPALS PLS algorithm and is incapable of handling missing data, which is one of the strengths of the original PLS algorithm as implemented in most multivariate software packages.
I see that the only options in the configuration dialog are "Skip rows with missing data" or "Fail on observing missing values"; this is a completely alien concept to most users of PLS.
Time and permission permitting, I am planning to try to write a NIPALS PLS node; first I need to get my head around writing KNIME nodes in Python. I will try the new nodes out on a dataset without missing values shortly. Thanks anyway for working on this; it is a step in the right direction.


OK, I’ve tried to run the PLS node using the ‘gasoline’ data from the R package ‘pls’. It reads the data OK, but the node refuses to configure and I get this error in the console:

WARN CEFNodeView unknown format “int32” ignored in schema at path “#/properties/model/properties/algorithm_settings/properties/n_components” (source: http://org.knime.core.ui.dialog/dialog_org.knime.python3.nodes.extension.ExtensionNodeSetFactory$DynamicExtensionNodeFactory/NodeDialog.umd.min.js; line: 1)


Dear @steffen_KNIME

I tried it on some of my own data and it works! But it seems that the number of components cannot be higher than the number of response variables / targets. This is the first time I have seen this in a PLS algorithm; the “pls” package in R does not have this limitation. According to the sklearn documentation there is a difference between “PLSCanonical”, which has this limitation, and “PLSRegression”, which does not. See section 1.8.3 at https://scikit-learn.org/stable/modules/cross_decomposition.html#cross-decomposition. So I guess the node uses PLSCanonical?

Strangely, the above is contradicted by https://scikit-learn.org/stable/modules/generated/sklearn.cross_decomposition.PLSRegression.html#sklearn.cross_decomposition.PLSRegression, which says that the number of targets DOES limit the number of components in the PLSRegression algorithm.

Best
Aswin


Thanks @Mark_Earll and @Aswin for the feedback!

We will have a look and I will come back to you.

Best regards
Steffen


Hi @Mark_Earll,

Thanks for trying the PLS node!

Could you please clarify what you mean by “the node refuses to configure”? Does the configuration dialog show up at all? Can you connect an input table and run the node without opening the configuration dialog (even though the settings might not make sense)?

The warning about the “unknown format” should not be the reason for any failures.

And about the selection of the PLS method: scikit-learn states “The PLSRegression estimator is similar to PLSCanonical with algorithm='nipals', ...”, and we are using PLSRegression, so it should be the NIPALS algorithm. Why do you say that it is not NIPALS?

I am not an expert in PLS methods; could you please outline how missing data is handled by other packages providing PLS methods?

Thanks,
Carsten

Hi Carsten,
The reason I say this is that the original implementation of NIPALS inherently copes with missing data. When calculating the scores loadings and weights it skips over any missing values, (provided that there are minimum number of data points as set as an option is software). I have heard (but may be mistaken) that the SciKit Learn PLS does not cope with missingness and so clearly is not implementing the original algorithm. The NIPALS algorithm is well described in "Partial Least Squares Regression: A Tutorial by Pula Geladi and Bruce R Kowalshi Analytica Chimica Acta 185 (1986) 1-17

Regarding the number of components in PLS: this frequently exceeds the number of Y variables, so for a single Y you may end up with 2, 3 or 4 components, as they are successive approximations by the model to account for variation in X unrelated to Y. In practice this may be caused by light scattering in spectroscopy or by systematic experimental effects in other applications. Another reason is that there may be mild non-linearities that are modelled better by multiple components than by one. This may be overcome by using the more modern variant of PLS called OPLS - see the papers by Johan Trygg if you want to understand this. In OPLS the extraneous ‘orthogonal’ information in X is partitioned off and only the information in X related to Y is used in the model. Another alternative is target projection PLS - see the papers of Olav Kvalheim. It is important to note that these orthogonal PLS methods are no better at prediction - they just help interpretation. So in the current context, having a working PLS that handles missing data will be sufficient.
Now coming to the error message: I have an R Source node from which I am extracting the ‘gasoline’ dataset from the pls package. The output is a matrix of 401 X variables and 1 Y variable (Octane). Each column is rendered as a standard Double in the KNIME table. I then use a Partitioning node to split into training and test sets, with 50 rows for training and 9 for testing (following the R example). In the PLS node I select the wavelengths as the Features and octane as the Target, choose “Skip rows with missing values” (I don’t think there are any in this dataset) and 2 components to keep. Then the node just stays “red” and won’t configure. AH-HA, that is the problem: in fact, if I select 1 component it works, so it must be the limitation mentioned above, where the algorithm does not allow more PLS components than Y variables.

That is a strange limitation for anyone coming from a chemometrics background. It is normal to expect several PLS components even in a single-Y model, for the reasons mentioned above.

I will test the PLS node within this limitation on components and get back to you. Obviously it is a big limitation and won’t let the PLS reach the optimum model.

Cheers,

Mark


Hi Carsten,

I have made a model with one PLS component and it successfully predicts using the Regression Predictor. Model performance is poor with just a single component, giving an R2 of only 0.295 and an RMSE of 1.269.

If instead I use my own R-based PLS node, I get a two-component model with R2 = 0.968 and an RMSE of 0.27.

If I restrict my homemade R-based PLS node to 1 PLS component I get an R2 of 0.294 and an RMSE of 1.3, so it would seem your PLS node is giving similar results to the R implementation of PLS.
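For completeness, the R2 and RMSE figures being compared here can be computed from any predictor's output with scikit-learn's metrics (a generic sketch with made-up numbers, not the node's internal code):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

def fit_stats(y_true, y_pred):
    """R2 and root-mean-square error for a set of predictions."""
    r2 = r2_score(y_true, y_pred)
    rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))
    return r2, rmse

r2, rmse = fit_stats([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
print(r2, rmse)
```

This makes it straightforward to compare the sklearn-based node with any other implementation on the same test split.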

So if you can overcome the limitations on the number of components and on missing data, it could be useful.

I am hoping to get permission to release my R-based PCA and PLS nodes to the community within a few months. The PCA one can cope with missing data as it is built on the R library pcaMethods, but the pls package does not handle missing data. If I can brush up on my Python I may try to write a native Python node in due course. Another component I have made deals with the pre-processing that is commonly applied to chemometric models, and I am hoping to release that as well.

I will try to take a look at the LASSO regression as well this week.

Cheers,

Mark


Hi Mark,

Thanks so much for the detailed explanation and your experiments; this is really valuable feedback :pray: !

Regarding the number of components, we relied on the documentation of scikit-learn. We will investigate whether we can simply disregard this limitation while still using their implementation, and will let you know about the results very soon.

Best,
Carsten


Hey Mark,

We have tried the sklearn implementation of the PLS algorithm a bit more and indeed, the limit of the number of components to be less than or equal to the number of targets seems to be an artefact of their documentation. We have removed this limitation in the PLS node and started a new build of the extension. You should be offered an update of the KNIME sklearn nodes to build 202305121304 later tonight. Or wait for next week; it should definitely show up on Monday.

Would you be so kind as to try whether the results also match the R PLS implementation with 2 components, in the scenario you mentioned above?

Have a nice weekend!
Carsten


This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.