How to split the dataset/table using criteria?

Hello,

I’m trying to split my dataset/table based on some criteria. For example, the data is 80% users “A” and 20% users “B”, and I need to change it to 60% users “A” and 40% users “B”. In SPSS there is a node called “Balance” that allows us to do this, but I couldn’t find an equivalent node in KNIME… Do you guys know if there is a way to do it in KNIME?

Hello @fivescar and welcome to the KNIME community

I think that the node that you need is called ‘Partitioning’

It allows you to split your data based on a row count or a percentage of rows.

You can split by position, by linear sampling, or randomly…

BR


Churn.knwf (170.4 KB)

Hi @fivescar

I don’t believe there is a single node that does the balancing like SPSS Modeler, but you can achieve a similar effect by using a combination of nodes and some light algebra.
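
To illustrate the "light algebra", here is a minimal Python sketch, not the attached workflow itself. It assumes you keep every row of the smaller class and randomly downsample the larger one; the function name `downsample_fraction` and the counts 800/200 are made up for the example.

```python
def downsample_fraction(n_major, n_minor, target_major_share):
    """Fraction of majority-class rows to keep so that the majority class
    makes up `target_major_share` of the result, keeping all minority rows.

    Assumption: the target share is lower than the current share. If it is
    not, the result is capped at 1.0 and the minority class would have to be
    downsampled instead.
    """
    if not 0 < target_major_share < 1:
        raise ValueError("target_major_share must be between 0 and 1")
    # Solve major / (major + n_minor) = target_major_share for `major`.
    wanted_major = n_minor * target_major_share / (1 - target_major_share)
    return min(1.0, wanted_major / n_major)


# Hypothetical counts: 800 users "A" (80%) and 200 users "B" (20%).
fraction_a = downsample_fraction(800, 200, 0.60)
kept_a = round(800 * fraction_a)
```

With these numbers you would keep about 37.5% of the "A" rows (300 rows) and all 200 "B" rows, which gives the 60/40 split. The resulting fraction is the kind of value you could feed into a sampling node's relative percentage.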


Hello,

Thanks for helping me! This one is pretty close to what I need. In my case, I have several segments (1, 2, 3 etc) and I need to apply this partition by segment, like:

Segment 1: randomly select 70% of its rows
Segment 2: randomly select 30% of its rows

Maybe I’ll have to split my dataset and then apply partitioning to each one separately! Thanks!

Hello @fivescar
There’s no single node that can hold your custom per-segment configuration.
As mentioned before, Partitioning is your node. You can start with a two-column table, $segment$ (string) and $fraction$ (double), and connect this table to a ‘Table Row to Variable Loop Start’ node.

In every loop iteration, these two variables control the nodes in your workflow:

  • $${Ssegment}$$ controls the ‘Row Filter’ that selects the current segment of your dataset.
  • $${Dfraction}$$ controls Partitioning’s Relative [%] setting.

Then you can end your workflow by connecting the two output ports of the Partitioning node to a two-port Loop End.
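
The loop logic described above can be sketched outside KNIME in plain Python, just to show what each iteration does. This is an illustration under assumptions, not an export of the workflow: `partition_by_segment`, the `segment` key, and the sample rows are hypothetical names and data.

```python
import random


def partition_by_segment(rows, fractions, key="segment", seed=42):
    """For each (segment, fraction) pair, filter the rows of that segment and
    randomly split them into a selected part and a remaining part.

    Rows whose segment is not listed in `fractions` are skipped entirely.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    selected, remaining = [], []
    for segment, fraction in fractions.items():       # one loop iteration per table row
        subset = [r for r in rows if r[key] == segment]   # the row-filter step
        k = round(len(subset) * fraction)                 # relative size of the first partition
        picked = set(rng.sample(range(len(subset)), k))   # draw randomly, without replacement
        for i, row in enumerate(subset):
            (selected if i in picked else remaining).append(row)
    return selected, remaining                         # two collected outputs, like a two-port loop end


# Hypothetical data: 100 rows in segment "1" and 100 rows in segment "2".
rows = [{"id": i, "segment": "1" if i < 100 else "2"} for i in range(200)]
selected, remaining = partition_by_segment(rows, {"1": 0.70, "2": 0.30})
```

Here `selected` holds 70 rows from segment "1" and 30 rows from segment "2", and `remaining` holds the other 100 rows.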

BR

@fivescar
If I understand correctly, you could also try a group loop to iterate over your segments, create the split inside the loop for each segment, and combine the results at the end.
(This is not exactly what you described, but you could also check out the stratified sampling option in Partitioning, stratified by segment. It would give you the same distribution of segments in your train and test splits, in case this is the direction you are going.)
br
