Quantile Feature Requests

Hi Everyone,

Working extensively with Quantiles in larger data sets so some feature requests.
My analysis primarily looks at the quantile rank of a specific row’s value in the context of a specific column.

If I am simply missing something that addresses these, please let me know but I have looked at a number of the older threads on quantiles.

Auto-Binner Node

  • Please enable selection of different quantile methods, like in the GroupBy node.
  • Custom Quantile lists should not be forced to include 0.0 and 1.0 automatically, especially as this does not seem possible to override.
  • Better Bin Naming options, for example the the Bin Naming is currently done based on the order of the items, so that a list of 0.0, 0.25, 0.33, 0.5, 0.75, 0.99 show up as Bin 1, Bin 2, Bin 3, Bin 4, Bin 5, please allow the custom quantile value to be passed as part of the bin name, so Bin 0.0, Bin 0.25, Bin 0.33, etc.

Math Expression Node and/or Column Expression Node

  • Please add quantile formulas, including the various method options in these nodes.

GroupBy Node or Rank Node?

  • It would be nice to have functionality in this type of node that would enable identifying easily what quantile a specific value is in relation to the rest of the values in the column, rather than calculating a quantile value.
  • It would be nice to be able to do this with a number of quantiles all at once.

Would also be great if some of these functions could be extended to the DB Group By / AutoBin nodes.

Thanks,

Nathan

1 Like

Hello Nathan,

thank you for your feedback and suggestions.
Regarding the Auto-Binner, we have open tickets for choosing quantile estimation methods (AP-17420) and adding more naming options for bins (AP-15077). I added a +1 for you to each.
I do not completely understand your point on “Custom Quantile lists should not be forced to include 0.0 and 1.0 automatically, especially as this does not seem possible to override.” Enforcing the lower and upper bounds to the minimum and maximum of the data values ensures that bins cover all values. Otherwise the lowest/highest values would not be part of any bin. Can you explain your case in which such behaviour makes sense?

Introducing quantile functions to Expression nodes makes sense to me. I added it to our list of feature requests for Expressions.

I hope that I understand what you are trying to do with the GroupBy node. If the value you are looking for is in the each group and the size of your data permits, you may try to aggregate using “List (sorted)”. Then use a column expression like arrayIndexOf(column("Sorted list*(<COLUMN_NAME>)"), <VALUE>) / arrayLength(column("Sorted list*(<COLUMN_NAME>)")) to get the value’s location relative to the group size. If this does not work, you may need to some more scripting. I created a features request ticket (AP-19058).

I will let you know about any progress on the tickets.

4 Likes

Hi Nan,

Thank you so much for your feedback and suggestions.

Re: 0.0 and 1.0 bounds

That makes sense. I guess historically in other tools I’ve only focused on what I needed to see, rather than what the tool needs to calculate the bins.