t-SNE - how to use in production pipeline?

Hi All,

Not sure if this is the right section to post a question about the KNIME Labs Statistics node package.

I'm trying out the t-SNE node, but I'm having some trouble wrapping my head around how to use it in a production setting, where I want to apply the transformation to fresh incoming data.

I was actually looking for an approach similar to the PCA Compute / Apply setup, where you compute the transformation, export it as PMML, and import that into the production pipeline for the Apply step.

Is a similar approach possible? Or how would I go about applying the transformation to new data, e.g. outside of a train-test workbench?

Thx a bunch!

Herman

I do not think this is easily possible. t-SNE is mostly there to explore data structures and get new ideas, but it does not work in a traditional train-and-deploy sense. One suggestion (machine learning - python tsne.transform does not exist? - Stack Overflow) is to fit t-SNE on the new data together with the old data and use the combined result; another approach comes from the developer himself (see the FAQ below, and the sketch after it). But I do not think this will be implemented in a node in the near future:

FAQ (t-SNE – Laurens van der Maaten)

Once I have a t-SNE map, how can I embed incoming test points in that map?

t-SNE learns a non-parametric mapping, which means that it does not learn an explicit function that maps data from the input space to the map. Therefore, it is not possible to embed test points in an existing map (although you could re-run t-SNE on the full dataset). A potential approach to deal with this would be to train a multivariate regressor to predict the map location from the input data. Alternatively, you could also make such a regressor minimize the t-SNE loss directly, which is what I did in this paper.
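
Purely as an illustration of those two workarounds, a minimal sketch (assuming scikit-learn, e.g. inside a KNIME Python Script node; data, shapes and hyperparameters are placeholders, not a recommendation):

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.neural_network import MLPRegressor

X_old = np.random.rand(500, 20)   # data the original map was built on
X_new = np.random.rand(50, 20)    # fresh incoming rows, same columns

# Workaround 1: re-run t-SNE on old and new data together.
# All coordinates change, so the old map is not preserved.
combined = TSNE(n_components=2, random_state=42).fit_transform(
    np.vstack([X_old, X_new]))
new_points = combined[len(X_old):]

# Workaround 2 (the FAQ's regressor idea): fit the map once on the old
# data, then train a multivariate regressor that predicts map locations
# from the input features; use it to place fresh rows in the old map.
old_map = TSNE(n_components=2, random_state=42).fit_transform(X_old)
mapper = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000,
                      random_state=42).fit(X_old, old_map)
approx_new_points = mapper.predict(X_new)
```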

What you can do is use t-SNE to select relevant groups and check them out. I have an older workflow somewhere that still uses the R t-SNE implementation.
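
A rough sketch of that group-selection idea (again assuming scikit-learn; DBSCAN and its parameters are just one possible choice):

```python
# Cluster the 2-D t-SNE map and pull one group of rows back out for
# inspection or focused ML training on the original features.
import numpy as np
from sklearn.manifold import TSNE
from sklearn.cluster import DBSCAN

X = np.random.rand(500, 20)                                # placeholder features
emb = TSNE(n_components=2, random_state=42).fit_transform(X)
labels = DBSCAN(eps=3.0, min_samples=10).fit_predict(emb)  # groups in the map
selected = X[labels == 0]                                  # rows of one group
```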


Hello @hermyknime,

Since t-SNE is a non-linear algorithm, it cannot be inverted, and its result always depends on the data and even on the initial random distribution, so you cannot use it the same way as PCA. PCA is a linear algorithm, which means you always get the same transformation matrix and can apply it to a new data set with the same structure.

That said, I believe you can still use t-SNE in your production pipeline; the only difference is that the vector basis will always be different, as will the meaning of the t-SNE projections. So if you are using t-SNE only for dimensionality reduction before clustering, ML, or manual analysis based on visualization, then it is completely fine. However, if you are trying to use the t-SNE projections to explain your models, then t-SNE cannot help here.
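
For contrast, a minimal sketch of the PCA case (assuming scikit-learn; this is the same compute-once / apply-later pattern as the PCA Compute / Apply nodes):

```python
# PCA learns a fixed projection matrix, so it can be computed once and
# re-applied to fresh data; scikit-learn's TSNE has no .transform().
import numpy as np
from sklearn.decomposition import PCA

X_train = np.random.rand(500, 20)
pca = PCA(n_components=2).fit(X_train)   # the reusable transformation matrix

X_new = np.random.rand(10, 20)           # new data with the same structure
Z_new = pca.transform(X_new)             # same matrix applied -> "Apply" step
```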

I hope this explanation is useful for you.


Hey @Artem, @mlauber71,

Thank you very much for the info and the suggestions.

It does make sense to use t-SNE in this case as a way of fine-tuning the selection of meaningful data cases, and then to do the ML training on those cases via the actual features.

I was hoping to use t-SNE as a feature-selection / feature-reduction approach, but it seems it is not that straightforward.

Thank you very much for clearing this up for me (and hopefully others)! :smile:

Herman


If you are interested in this, you could check out the preparation section of my machine learning collection, for example tools like vtreat (vtreat for KNIME! – Win Vector LLC). There is also a relevant KNIME workflow (Techniques for Dimensionality Reduction – KNIME Community Hub) which has a section about t-SNE, but I do not know whether the methods are stored in a way that makes them reproducible. The linked paper goes to a broken page …

There is also a (quite complete) set called “the poor man’s ML Ops” that I have built (s_601 - Sparkling predictions and encoded labels - "the poor man's ML Ops" (on a Big Data System) – KNIME Community Hub) and sort of explained in a video (in German), with slides in English: H2O.ai AutoML in KNIME for classification problems - #11 by mlauber71. It is specific to a big data environment, but the principles would apply. If you can do it on a laptop or a small server, vtreat or something similar might be more powerful.

Regarding dimensionality reduction, there is also featuretools, which I have been planning to build a KNIME workflow around for some time now …


In addition, these articles from the KNIME low-code blog on Medium:

Seven Techniques for Data Dimensionality Reduction

Three More Techniques for Data Dimensionality Reduction in ML

