Not sure if this is the right section to post a question about the KNIME Labs Statistics node package.
I’m trying out the t-SNE package, but I’m having trouble wrapping my head around how to use it in a production setting, where I want to apply the transformation to fresh incoming data.
I was actually looking for something similar to the PCA Compute / Apply setup, where you compute the transformation, export it as PMML, and import it into the production pipeline for the Apply step.
Is a similar approach possible? If not, how would I go about applying the transformation to new data, e.g. outside of a train-test workbench?
I do not think this is easily possible. t-SNE is mostly there to explore data structures and get new ideas; it does not work in a traditional train-and-deploy sense. One suggestion (machine learning - python tsne.transform does not exist? - Stack Overflow) is to run t-SNE on the new data together with the old data and use that result. Another approach comes from the developer himself (see the FAQ entry quoted here). But I do not think this will be implemented in a node in the near future. From the FAQ:
Once I have a t-SNE map, how can I embed incoming test points in that map?
t-SNE learns a non-parametric mapping, which means that it does not learn an explicit function that maps data from the input space to the map. Therefore, it is not possible to embed test points in an existing map (although you could re-run t-SNE on the full dataset). A potential approach to deal with this would be to train a multivariate regressor to predict the map location from the input data. Alternatively, you could also make such a regressor minimize the t-SNE loss directly, which is what I did in this paper.
What you can do is use t-SNE to select relevant groups and check them out. I have an older workflow somewhere that still uses the R t-SNE implementation.
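For what it is worth, the regressor idea from the FAQ could be sketched roughly like this in Python. This is only an illustration under my own assumptions, not something the KNIME node does: it uses scikit-learn's `TSNE` and `KNeighborsRegressor`, the data is made up, and the choice of a k-nearest-neighbour regressor (and the `perplexity` and `n_neighbors` values) is arbitrary. The predicted coordinates are an approximation of where a point would sit on the existing map, not a true t-SNE embedding.

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)

# Toy "historical" data: two well-separated blobs in 5 dimensions (made up).
X_old = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(60, 5)),
    rng.normal(loc=8.0, scale=1.0, size=(60, 5)),
])

# Step 1: compute the t-SNE map once on the historical data.
tsne = TSNE(n_components=2, perplexity=20, init="pca", random_state=0)
map_old = tsne.fit_transform(X_old)

# Step 2: learn a mapping from input space to map coordinates
# with a multi-output regressor (k-NN is an arbitrary choice here).
reg = KNeighborsRegressor(n_neighbors=5).fit(X_old, map_old)

# Step 3: place fresh points on the existing map (approximation only).
X_new = rng.normal(loc=8.0, scale=1.0, size=(10, 5))
map_new = reg.predict(X_new)
print(map_new.shape)
```

In a production pipeline you would store the fitted regressor (not the t-SNE object) and call `predict` on incoming rows. Whether the approximation is good enough depends on how well the regressor generalizes, so it is worth checking it on held-out points from the original map first.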
Because t-SNE is a non-linear algorithm, it cannot be inverted, and its result always depends on the data and even on the random initialization, so you cannot use it the same way as PCA. PCA is a linear algorithm, which means it gives you a fixed transformation matrix that you can apply to any new data set with the same structure.
That said, I believe you can still use t-SNE in your production pipeline; the only difference is that the basis vectors, and with them the meaning of the t-SNE projections, will be different on every run. So if you are using t-SNE only as a dimensionality reduction step for clustering, ML, or manual analysis based on visualization, that is completely fine. However, if you are trying to use the t-SNE projections to explain the models, t-SNE cannot help here.
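To make the PCA side of the comparison concrete, here is a minimal NumPy sketch of the compute/apply split. The function names `pca_compute` and `pca_apply` are my own and only mirror the idea of the two nodes; this is not the KNIME implementation, and the data is random. The point is that the learned model is just a mean vector and a matrix, so it can be stored and reused on new rows, which is exactly what t-SNE does not give you.

```python
import numpy as np

def pca_compute(X, n_components):
    """Learn the PCA model: column means and the top principal axes."""
    mean = X.mean(axis=0)
    # Rows of vt are the principal axes, ordered by explained variance.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]

def pca_apply(X, mean, components):
    """Project any data with the same columns using the stored model."""
    return (X - mean) @ components.T

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 4))  # data used to compute the model
X_new = rng.normal(size=(5, 4))      # fresh data arriving later

mean, components = pca_compute(X_train, n_components=2)
Z_new = pca_apply(X_new, mean, components)
print(Z_new.shape)  # (5, 2)
```

Applying the same stored `mean` and `components` to the same rows always gives the same coordinates, regardless of what other data arrives alongside them.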