New feature request: Partitioning and Row Sampling nodes performing stratified sampling with continuous variables

gcincilla · February 7, 2016, 3:18pm

Hello guys,

The Partitioning and Row Sampling nodes are very useful for selecting meaningful subsets of a row set. Nevertheless, it seems that their “Stratified sampling” option can be used only with categorical variables. It would be great if this option were also available for continuous variables in these two nodes, as it is in the Cross Validation X-Partitioner node.

Would that be possible?

Thank you

Gio

thor · February 7, 2016, 3:36pm

Stratified sampling is only defined for nominal/categorical values. How should it work with continuous values anyway?

aborg · February 7, 2016, 6:57pm

You might want to use the Auto-Binner node and then do stratified sampling based on the bins.
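
For readers who want to see the idea outside of KNIME, here is a minimal Python sketch of "bin first, then stratify on the bins". It is not how the Auto-Binner or the Partitioning node are implemented; the helper names `quantile_bins` and `stratified_sample`, the choice of equal-frequency bins, and the 20% sampling fraction are all assumptions made for illustration.

```python
import random
from collections import defaultdict

def quantile_bins(values, n_bins):
    """Assign each value an equal-frequency bin index in [0, n_bins).

    Bins are assigned by rank, so tied values may land in different bins.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins

def stratified_sample(strata, fraction, seed=0):
    """Return sorted row indices, drawing `fraction` of the rows of each stratum."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for i, s in enumerate(strata):
        groups[s].append(i)
    picked = []
    for rows in groups.values():
        k = round(len(rows) * fraction)  # rows to draw from this stratum
        picked.extend(rng.sample(rows, k))
    return sorted(picked)

# Hypothetical continuous column: 1000 draws from a standard normal.
data_rng = random.Random(42)
values = [data_rng.gauss(0.0, 1.0) for _ in range(1000)]

bins = quantile_bins(values, n_bins=5)          # 5 bins of 200 rows each
sample = stratified_sample(bins, fraction=0.2)  # 20% of every bin
```

Because every bin contributes the same share of rows, the sample covers the low, middle and high ranges of the continuous column in the same proportions as the full table.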

gcincilla · February 9, 2016, 6:38pm

Thank you for your answers.

Thor, I don't know exactly what internal procedure is used, but stratified sampling on a continuous variable is what the Cross Validation X-Partitioner node offers. Have I misunderstood something there?

The strategy suggested by aborg seems very reasonable to me.

thor · February 9, 2016, 10:17pm

This option doesn't make any sense, not even in the X-Partitioner node. In principle it would work if you had only a few distinct values in the numeric column, but that usually isn't the case, and then stratified sampling essentially turns into random sampling. We should probably remove the possibility of selecting just any column in the X-Partitioner node.
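
A small Python sketch of why this happens, assuming each distinct value is treated as its own stratum (an assumption for illustration, not a description of the node's internals; `stratum_sizes` is a hypothetical helper):

```python
import random
from collections import Counter

def stratum_sizes(column):
    """Count how many rows share each distinct value (one stratum per value)."""
    return Counter(column)

rng = random.Random(0)

# A truly continuous column: practically every value is distinct.
continuous = [rng.random() for _ in range(1000)]

# A numeric column with only a few distinct values.
few_levels = [rng.choice([1.0, 2.0, 3.0]) for _ in range(1000)]

n_strata_continuous = len(stratum_sizes(continuous))
n_strata_few = len(stratum_sizes(few_levels))

print("strata in continuous column:", n_strata_continuous)
print("strata in few-level column: ", n_strata_few)
```

In the continuous column almost every stratum holds a single row, so "take the same fraction of each stratum" has nothing to balance and the result is effectively a random sample. In the few-level column the strata are large enough for the proportions to be preserved.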