Segmentation on % split

Hello everyone,

I have following problem, I need to segment players based on specific KPIs on a percental split.
So I get buckets with low, medium, high, with the size of 60%,30%,10% of the sample depending on the KPI.
I’m not quite sure what would be the best approach here, I kind of found some workarounds with using an auto-binner and just creating 10bins, the first six, mid 3 and top 1 but as I have to do this for 20+ KPIs it will be pretty tedious.
Is there an easy way I didn’t think of yet?
Goal at the end is to go from this:
image

to this:
image
(segment column is only filled exemplary of course)

I’ve attached a sample file in case it’s needed.
I would be very grateful for any help :slight_smile: Thanks!

KNIME_test.xlsx (17.2 KB)

Hello,

Could you attach your workflow and explain what part is tedious? I think if you have a set up with flow variables and/or loops, this can be automated. But I’m curious to see what you did in the workflow and with this sample data.

Hey @victor_palacios

Thanks for your reply - I was a bit busy so my apologies for the late reply.
I’ve found a solution, which feels a bit “unclean”

I’m running through each column that needs the binning (group loop is because the dataset needs to be segmented by country) I segment into 10 bins, then assign Bin 1-6 as Bin 1 , Bin 7-9 as Bin 7 and Bin 10 as Bin 10 with the second Binner. That’s how I get the uneven split.

It works so I am not too mad about it, but I also need to know the limits for the bins, is there a way to extract it? I see that there is a port for the PMML model fragment which should that information but it doesn’t seem like I can extract that data.

Do you know a good solution?

KR Chaos

And additionally - How can I now segment the customers based on all the thresholds I assigned through the bins?
So let’s say I have 5 columns with doubles and one with a boolean and I want to define my bins by all 6 of these values. Again into 60%, 30% and 10% group based on their “performance” And then see what the limits for each KPI is

The boolean can be converted into number. It’s like 0 and 1

I couldn’t find a way to get the information about the bin borders from auto-binner, but since you’ve already segmented, couldn’t you get the min and the max for each bin you have (bin 1, bin 7 and bin 10) using groupby?

I’ve tried using the group by method but it doesn’t seem to work correctly as I get a different number of entries per criterial. And technically it should always have the same spread. That’s why I wanted to confirm what the limits were.

Perhaps the GroupBy node requires different configurations? I’ve not seen the groupby node not work before. Can you share a workflow with dummy data perhaps?

@Chaosprinzip I think you could employ a counter and produce a rank and then get the thresholds for high, mid and low.

My problem is that it seems the case is not well defined. Are we talking about a rank or would the 60% be from the total amount of the value column? Maybe you could explain that further.

Here is a quick and dirty approach to use a Counter to create a relative Rank and automatically determine the threshold of the NGR column. But I think it needs a more clear definition from your side: