The programming behind stratified sampling in KNIME

Error404 · October 25, 2013, 11:45am

Hi all,

I am working with a dataset from which I need to sample every now and then. I normally use the "Row Sampling" node and choose the "stratified sampling" option to get my samples. I have recently been reading about this and found out that there are 2 different approaches that both produce a sample under the term "stratified sampling" (from Wikipedia).

One of them depends on the "sampling fraction": the sample sizes (assume we are sampling from 2 groups, A and B) have to reflect the original distribution. The other one depends on the standard deviation, making strata that are similar (or close) in standard deviation to the original data. My question is: which of these 2 approaches does the "Row Sampling" node use for the "stratified sampling" option?

Many thanks,

Error404

thor · October 25, 2013, 12:40pm

The first one: the proportions of the different values in the column are retained.
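
For illustration only, proportional stratified sampling can be sketched in a few lines of Python. This is not KNIME's source code. The function name stratified_sample, its parameters, the example rows, and the rounding of each stratum's size with round() are all assumptions made for this sketch.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(rows, key, fraction, seed=None):
    """Draw roughly `fraction` of the rows from every stratum, at random.

    Hypothetical helper: `key` maps a row to its nominal value (the stratum).
    """
    rng = random.Random(seed)

    # Group the rows by the value of the nominal column.
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)

    sample = []
    for members in strata.values():
        # Assumed rounding rule; the real node may round differently.
        k = round(fraction * len(members))
        sample.extend(rng.sample(members, k))
    return sample


# Made-up data: 60 "red", 30 "green" and 10 "blue" rows.
rows = (
    [("red", i) for i in range(60)]
    + [("green", i) for i in range(30)]
    + [("blue", i) for i in range(10)]
)
picked = stratified_sample(rows, key=lambda r: r[0], fraction=0.5, seed=1)
counts = Counter(color for color, _ in picked)
print(counts)  # 30 red, 15 green, 5 blue: the 60/30/10 split is kept
```

Each stratum is sampled independently, so the proportions in the result follow directly from the per-stratum sizes rather than from chance.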

Error404 · October 25, 2013, 12:48pm

Thanks for your reply. I now understand how this works, but what if the column I set as the "nominal column" contains only 1 nominal category?

thor · October 26, 2013, 10:29am

Then it's a purely random sampling.

Error404 · October 27, 2013, 7:36pm

Thanks for your replies, Thor. Just to make sure I understand this correctly (as this step is important for me): suppose I have a dataset whose rows belong to 2 groups, where group A represents 80% and group B represents 20% of the data. If I perform this "stratified sampling", the sample will have the same composition, 80% items from A and 20% items from B, and all these items will be RANDOMLY drawn from the original A and B groups. Is that correct?

thor · October 28, 2013, 9:10am

Correct.
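
A small Python check of the 80/20 case described above, under stated assumptions: the table of 800 A rows and 200 B rows, the 10% sampling fraction, and the helper draw() are invented for this example and are not taken from KNIME. Two draws with different seeds should give the same group counts while picking different rows.

```python
import random
from collections import Counter

# Hypothetical table: 800 rows in group A, 200 rows in group B (80% / 20%).
table = [("A", i) for i in range(800)] + [("B", i) for i in range(200)]


def draw(seed, fraction=0.1):
    """Sample `fraction` of each group at random (illustrative helper)."""
    rng = random.Random(seed)
    sample = []
    for group in ("A", "B"):
        members = [row for row in table if row[0] == group]
        sample += rng.sample(members, round(fraction * len(members)))
    return sample


first = draw(seed=1)
second = draw(seed=2)

first_counts = Counter(group for group, _ in first)
second_counts = Counter(group for group, _ in second)
print(first_counts)   # 80 rows from A, 20 rows from B
print(second_counts)  # same counts again
print(set(first) == set(second))  # the chosen rows themselves vary with the seed
```

The group counts are fixed by the proportions, while the membership of each group's share is left to the random draw.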