The programming behind stratified sampling in KNIME

Hi all,

I am working with a dataset from which I need to draw samples every now and then. I normally use the "row sampling" node with the "stratified sampling" option to get my samples. I have recently been reading up on this and found that two different approaches go under the name "stratified sampling" (according to Wikipedia).

One of them depends on the "sampling fraction": the sizes of the samples drawn from each group (say we are sampling from two groups, A and B) have to reflect the original distribution. The other depends on the standard deviation, making strata that are similar (or close) in standard deviation to the original data. My question is: which of these two approaches does the "row sampling" node use with the "stratified sampling" option?

Many thanks,

Error404

The first one: the proportions of the different values in the column are retained in the sample.
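
For anyone who wants to see the idea in code, here is a minimal Python sketch of proportionate stratified sampling. It is not the node's actual implementation: the function name `stratified_sample`, its `key` and `fraction` parameters, and the rule of rounding each group's share to the nearest integer are all assumptions made for illustration.

```python
import random
from collections import defaultdict

def stratified_sample(rows, key, fraction, seed=None):
    """Draw about `fraction` of the rows from each stratum, at random."""
    rng = random.Random(seed)
    # Group the rows by the value of the nominal column.
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)
    sample = []
    for members in strata.values():
        # Assumption: each stratum's share is rounded to the nearest integer.
        k = round(len(members) * fraction)
        # Rows inside a stratum are picked at random, without replacement.
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical data: 60 rows with value "A" and 40 rows with value "B".
rows = [("A", i) for i in range(60)] + [("B", i) for i in range(40)]
sample = stratified_sample(rows, key=lambda r: r[0], fraction=0.5, seed=1)
counts = {g: sum(1 for r in sample if r[0] == g) for g in ("A", "B")}
print(counts)  # {'A': 30, 'B': 20}
```

The 60/40 split of the input shows up again as 30/20 in the sample, which is what "the proportions are retained" means here.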

Thanks for your reply. I now understand how this works, but what if the column I set as the "nominal column" contains only one nominal category?

Then it's purely random sampling.

Thanks for your replies, Thor. Just to make sure I get this correctly (as this step is important for me): suppose my data belongs to two groups, where Group A represents 80% and Group B represents 20% of the data. If I perform this "stratified sampling", the sample will also consist of 80% items from A and 20% items from B, and all these items will be randomly drawn from the original A and B groups. Is that correct?

Correct.
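
To make the 80/20 case concrete, here is a small Python sketch. It is only an illustration under stated assumptions, not what the node runs internally: the 1000-row dataset, the 10% sampling fraction, and the helper name `draw` are all made up for this example.

```python
import random
from collections import Counter

# Hypothetical data: 800 rows in group A (80%) and 200 rows in group B (20%).
data = [("A", i) for i in range(800)] + [("B", i) for i in range(200)]
fraction = 0.1  # assumed sampling fraction: 10% of each group

def draw(seed):
    """Pick `fraction` of each group at random, without replacement."""
    rng = random.Random(seed)
    picked = []
    for group in ("A", "B"):
        members = [row for row in data if row[0] == group]
        picked += rng.sample(members, round(len(members) * fraction))
    return picked

first = draw(seed=1)
second = draw(seed=2)
first_counts = Counter(group for group, _ in first)
second_counts = Counter(group for group, _ in second)
print(first_counts)   # Counter({'A': 80, 'B': 20})
print(second_counts)  # Counter({'A': 80, 'B': 20})

# How many rows the two draws have in common; this depends on the seeds.
print(len(set(first) & set(second)))
```

Both draws keep the 80/20 split, while the particular rows chosen inside A and B depend on the random seed.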