Is there a mechanism already built into KNIME to extract a fixed sample size from a table by class column? I want to go through a table with a column that assigns each row to one of several classes and select 50 samples from each class, leaving the rest, regardless of the total class size, to pass through to the output. I can do this with a stack of Row Splitter/Partitioning nodes, but it is definitely a brute-force method. Thanks!
At first thought, if the class column has roughly an equal number of examples of each class, then you can use the stratified sampling option in the Partitioning node, and select the class column.
So if you have 3 classes and each makes up roughly 1/3 of the rows, you can choose stratified sampling and enter 150 in the absolute box. This will then give approximately 50 of each class in the output. Is this any use?
Basically, stratified sampling by definition keeps the class distribution of the output the same as that of the input. Hence if the classes in the input are 1:1:1, the classes in the output will be 1:1:1 as well.
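To make the proportional behaviour concrete, here is a minimal Python sketch of stratified sampling outside KNIME. It is only an illustration of the idea, not how the Partitioning node is implemented; the function name `stratified_sample` and its parameters (`class_of`, `total`, `seed`) are made up for this example.

```python
import random
from collections import defaultdict

def stratified_sample(rows, class_of, total, seed=0):
    """Draw about `total` rows while keeping each class's share of the input.

    Hypothetical helper for illustration; not a KNIME API.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[class_of(row)].append(row)
    n = len(rows)
    sample = []
    for members in groups.values():
        # Proportional allocation: a class holding 1/3 of the rows
        # receives about 1/3 of the requested sample.
        k = round(total * len(members) / n)
        sample.extend(rng.sample(members, min(k, len(members))))
    return sample

# 900 rows split evenly over three classes A, B and C.
rows = [(i, "ABC"[i % 3]) for i in range(900)]
picked = stratified_sample(rows, lambda r: r[1], total=150)
counts = {c: sum(1 for r in picked if r[1] == c) for c in "ABC"}
print(counts)  # {'A': 50, 'B': 50, 'C': 50}
```

With unevenly sized classes the same call returns unequal counts per class (and rounding can make the total differ slightly from `total`), which is why this approach only gives 50 per class when the classes are balanced.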
Hope it makes sense,
Simon.
You can also get exactly 50 per class, even if the classes are not evenly distributed, but it is a little more complex to do.
Firstly, connect a GroupBy node to your data and group on the Class column, so that you get one row per class. Connect the GroupBy node to a TableRow to Variable Loop Start node, and connect its variable out-port to the variable in-port of an Inject Variables node. Also connect your data to the other in-port of the Inject Variables node.
Now add a Row Filter node after the Inject Variables node and filter on the class column, so that each iteration keeps only the rows of the current class. Take the pattern from the class flow variable, either by choosing it for the Pattern box in the flow variables tab or by selecting it with the little button to the right of the pattern box in the main filter criteria tab. Then connect a Row Sampling node, enter 50 in absolute and choose to draw randomly, and finally finish with a Loop End node.
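For anyone who finds the loop easier to follow as code, the workflow above can be sketched roughly like this in Python. This is an analogy under my own assumptions, not KNIME code: `sample_per_class`, `per_class` and `seed` are hypothetical names, and taking a whole class when it has fewer than 50 rows is a choice made in this sketch.

```python
import random
from collections import Counter, defaultdict

def sample_per_class(rows, class_of, per_class=50, seed=0):
    """Draw up to `per_class` random rows from every class (illustrative helper)."""
    rng = random.Random(seed)
    # "GroupBy": collect the rows belonging to each distinct class value.
    groups = defaultdict(list)
    for row in rows:
        groups[class_of(row)].append(row)
    sampled = []
    # "Loop Start": one iteration per class value.
    for members in groups.values():
        # "Row Filter" has already narrowed the table to this class;
        # "Row Sampling" then draws a fixed absolute number at random.
        k = min(per_class, len(members))  # assumption: small classes are kept whole
        sampled.extend(rng.sample(members, k))
    # "Loop End": the per-class samples are concatenated into one table.
    return sampled

# Unevenly distributed classes: 500 x A, 120 x B, 30 x C.
rows = ([(i, "A") for i in range(500)]
        + [(i, "B") for i in range(120)]
        + [(i, "C") for i in range(30)])
picked = sample_per_class(rows, lambda r: r[1])
print(Counter(label for _, label in picked))
```

Here A and B each contribute 50 rows regardless of their size, while C, which only has 30 rows, comes through whole.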
Hope this makes sense,
Simon.
Yepp, that should work. In v2.4 we will add an "Equal Size Sampling" node -- that will make it a bit easier.
Well, if the class is not nominal but you have discrete integers as classes, then use the "Number To String" node to enable stratified sampling with your class. If the class is numeric, use the Auto-Binner node, convert to string if required, and use row sampling again!
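As a rough picture of what that preprocessing does, here is a small Python sketch that turns integer classes into strings and bins a continuous column into nominal labels. It assumes simple equal-width bins and made-up label names ("Bin 1", "Bin 2", ...); the function name `equal_width_bin` is hypothetical, and the actual binning node offers its own options.

```python
def equal_width_bin(values, n_bins=3):
    """Map numeric values to nominal labels using equal-width bins (illustrative)."""
    lo, hi = min(values), max(values)
    # Fall back to a width of 1.0 when all values are identical.
    width = (hi - lo) / n_bins or 1.0
    labels = []
    for v in values:
        # The maximum value would land one past the last bin, so clamp it.
        idx = min(int((v - lo) / width), n_bins - 1)
        labels.append(f"Bin {idx + 1}")
    return labels

# Discrete integer classes only need converting to strings.
int_classes = [1, 2, 2, 3]
nominal = [str(c) for c in int_classes]
print(nominal)  # ['1', '2', '2', '3']

# A continuous column is binned first; the labels then act as the class.
binned = equal_width_bin([0.0, 1.0, 4.9, 5.0, 9.9, 10.0], n_bins=2)
print(binned)  # ['Bin 1', 'Bin 1', 'Bin 1', 'Bin 2', 'Bin 2', 'Bin 2']
```

Once the column holds nominal labels like these, it can be used as the class column for the sampling described earlier in the thread.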
Thanks! Your second reply, richards99, is exactly what I was looking for. Well explained. I'm looking forward to the "Equal Size Sampling" in 2.4.