Quantiles vs Percentiles in GroupBy

lucian.cristian · August 26, 2021, 3:37pm

Hi,

Probably a dumb question, but here it goes. I wanted to calculate 25th/50th/75th Percentiles for a set of price points using the P^2 Percentile in Group by. It looked ok but i discovered that if I add a sorter before this node and sort values ascending or descending I get different values, which is something I didn’t expect. Calculating 0.25/0.5/0.75 Quantiles returns the same value no matter the sorter node. I would have expected small differences between Quantiles and Percentiles, but didn’t expect sorting to play any role in the differences.

I’m attaching a workflow I created to show different scenarios and results. My question would be why do I get different resulting depending on the sorter node when calculating p^2 percentiles? The differences in the actual data I was working with are significant.

thanks!
Lucian
percentiles vs quantiles.knwf (27.2 KB)

SamirAbida · August 27, 2021, 9:23am

@lucian.cristian,

The Psquare percentile algorithm is kind of peculiar.

ABSTRACT: A heuristic algorithm is proposed for dynamic calculation of the median and other quantiles. The estimates are produced dynamically as the observations are generated. The observations are not stored;

To answer your question :

The P2 algorithm produces estimates close to those obtained by order statistics. It evolved after a series of trials with other similar designs. […] The observations are not stored; therefore, the algorithm has a very small and fixed storage requirement regardless of the number of observations. […] We, therefore, choose MSE as our primary performance metric and compare a P2 estimate of a quantile with that obtained by sorting the observations. […] Another weakness of this design is that a single outlier observation may cause the error to jump to a large value. This is due to the fact that for many statistical distributions, the minimum, and maximum are statistically “unbounded” variables in the sense that they generally do not attain a stable value as the number of observations increases. Therefore, a single outlier can cause a sudden jump in the maximum or minimum value; that is immediately reflected in the predicted value of the quantile.

Reason to use Psquare :

The main problem with this (regular quantile estimation) and other alternative approaches is that all observations must be stored. In many situations, n can be very large; also, there may be many variables whose quantiles may be required. It is this space problem that we intend to solve in this article. Instead of storing the complete distribution function, we store only five points on it and update the points as more observations are generated. We show that a very good estimate of any quantile can be obtained by locating points at the minimum, the maximum, and at the current estimates of (p/2)-, p-, and (1 + p)/2-quantiles.

You can check, the fifth chapter of this paper about it to understand why you have different values : The P-Square Algorithm for Dynamic Calculation of Percentiles and Histograms without Storing Observations (wustl.edu)

Br,
Samir

lucian.cristian · August 27, 2021, 11:17am

Thanks for the answer! I looked up that paper as well. What I don’t understand is why would I get different results just by sorting data without any other intervention.

SamirAbida · August 27, 2021, 1:11pm

@lucian.cristian,

It’s because of the way the P2P is calculated. Keep in mind that the P2P is an estimation of the percentile “algorithmically calculated”. So the order of the values have an impact :

The algorithm consists of maintaining five markers […] For the algorithm to work correctly, the marker heights must always be in a nondecreasing order.

Br,
Samir

lucian.cristian · August 27, 2021, 3:26pm

Thanks Samir! Clear now.

system · September 3, 2021, 3:27pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.