I have a general question about sampling. My data set contains ~50 million rows. My goal is to cluster the data with k-means. I have already cleaned the data.
My next step would be to sample the data set, in order to make the clustering more effective and intuitive. The sample of course has to be representative so that it covers all relevant clusters.
1.) Does row sampling make sense for my purpose, or should I use the whole data set?
2.) If row sampling makes sense, which options do I have in KNIME? (I guess these: https://www.knime.org/files/nodedetails/_manipulation_row_row_transform_Row_Sampling.html).
Thanks in advance for the answers!
1.) Yes, it does make sense, as processing this huge number of rows will take quite some time.
2.) Row Sampling would give you a random subset. All of our other sampling nodes are mainly designed for data with an additional class column (like Equal Size Sampling).
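To illustrate what "a random subset" means outside of KNIME, here is a minimal Python sketch of uniform row sampling in a single pass (reservoir sampling). This is not how the Row Sampling node is implemented; the function name `reservoir_sample`, the sample size, and the stand-in data are all assumptions made for the example.

```python
import random

def reservoir_sample(rows, k, seed=None):
    """Draw k rows uniformly at random from an iterable in one pass.

    Useful when the table is too large to hold in memory at once.
    If the iterable has fewer than k rows, all rows are returned.
    """
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Keep the new row with probability k / (i + 1),
            # replacing a uniformly chosen reservoir slot.
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = row
    return sample

# Hypothetical stand-in for the real table: one million integer "rows".
rows = range(1_000_000)
sample = reservoir_sample(rows, k=10_000, seed=42)
print(len(sample))
```

Every row has the same chance of ending up in the sample, so large clusters should be represented roughly in proportion to their size. Very small clusters can still be missed by chance, which is worth checking before relying on the sampled result.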