Hello,
My dataset has 400 rows and 14 columns. I take rows 1-100 of the data and find clusters using PCA and t-SNE. Now I want to see whether any of rows 101-200 lie within any of the clusters I found for rows 1-100. How can I do that?
The following figure shows the clusters obtained for rows 1-100.
Hello @moriks,
Welcome to the community!
Typically t-SNE isn’t used for cluster assignment, because it does not preserve the relative distances between clusters. It also doesn’t give you the usual quantities, such as centroids, that you could compare new rows against, so it is hard to apply it as a model for assignment.
I would take a look at the following thread for more info on this:
So you have a couple of options here:
The first option is to pass your transformed data to a ‘K-means’ node, since the t-SNE plot already gives you the approximate number of clusters to expect.
For your case you can set the number of clusters to 6. The node should run on your whole input and label each row with the cluster it was placed in.
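If it helps to see the idea outside KNIME, here is a minimal sketch in plain Python of fitting k-means on one set of rows and then assigning other rows to the nearest fitted centroid. It is not what the ‘K-means’ node does internally. The function names `fit_kmeans` and `nearest`, the farthest-point initialisation, and the toy 2-D data are all assumptions made for illustration. It also assumes the new rows are expressed in the same coordinates as the fitted rows.

```python
import math

def nearest(point, centroids):
    """Index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

def fit_kmeans(points, k, n_iter=50):
    """Plain Lloyd's algorithm with a deterministic farthest-point start."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next seed: the point farthest from its nearest existing centroid.
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    for _ in range(n_iter):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # Recompute each centroid as the column-wise mean of its group;
        # an empty group keeps its previous centroid.
        updated = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else c
            for g, c in zip(groups, centroids)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids

# Toy stand-ins for "rows 1-100" (fitted) and "rows 101-200" (assigned only).
fitted_rows = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3),
               (5.0, 5.0), (5.2, 4.9), (4.9, 5.1)]
new_rows = [(0.1, 0.1), (5.1, 5.0)]

centroids = fit_kmeans(fitted_rows, k=2)
new_labels = [nearest(row, centroids) for row in new_rows]
print(new_labels)
```

For your data you would use k=6 and the real coordinates instead of the toy points.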
The next option could reuse your output from the t-SNE. You can use the ‘Similarity Search’ and ‘Numeric Distances’ nodes: first calculate the centroid of each cluster found in the t-SNE (I am not sure how your data looks, but I assume it has cluster labels) by using a GroupBy and taking the average of each column. These centroids should go to the reference table port, and the rest of your data to the query table port. You will need to decide how close a point must be to a centroid to count as part of that cluster.
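The centroid-plus-cutoff logic can be sketched in plain Python as follows. This is only an illustration of the idea, not the nodes' implementation. The names `cluster_centroids` and `assign_with_cutoff`, the labels "A" and "B", the cutoff of 1.0, and the toy points are all hypothetical. One caveat: the new rows have to be in the same coordinates as the centroids, and standard t-SNE cannot place new rows into an existing embedding, so in practice this would be done on the PCA scores or the original columns.

```python
import math
from collections import defaultdict

def cluster_centroids(points, labels):
    """Group points by label and return {label: column-wise mean}."""
    groups = defaultdict(list)
    for point, label in zip(points, labels):
        groups[label].append(point)
    return {
        label: tuple(sum(col) / len(members) for col in zip(*members))
        for label, members in groups.items()
    }

def assign_with_cutoff(point, centroids, max_distance):
    """Label of the nearest centroid, or None if it is farther than the cutoff."""
    label, distance = min(
        ((lab, math.dist(point, c)) for lab, c in centroids.items()),
        key=lambda pair: pair[1],
    )
    return label if distance <= max_distance else None

# Toy reference rows with cluster labels (hypothetical values).
ref_points = [(0.0, 0.0), (0.2, 0.2), (5.0, 5.0), (5.2, 4.8)]
ref_labels = ["A", "A", "B", "B"]
ref_centroids = cluster_centroids(ref_points, ref_labels)

# Query rows: one near cluster A, one near B, one far from both.
query_points = [(0.1, 0.2), (5.1, 5.0), (2.5, 2.5)]
assigned = [assign_with_cutoff(p, ref_centroids, max_distance=1.0)
            for p in query_points]
print(assigned)
```

The cutoff is the part you have to choose yourself. A value based on how spread out each cluster is around its centroid is one reasonable starting point.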
I would also take a look at some popular clustering algorithms for other alternatives. I believe DBSCAN would be a good pick to run initially on the first 100 rows, but it may be overkill if this is meant to be a simple workflow.
Hope this helps,
TL