Learning the merits of the GroupBy node.

Is there anyone willing to briefly describe the GroupBy node -- and how they use it, please?

I'm wondering how it differs from the Crosstab node.

Is GroupBy only really useful if the number of column categories is huge?  For example, say you're trying to count how many individuals live at each reported mailing code (Crosstab would generate a silly number of columns/rows for hundreds of mailing codes).

Hi Jack,

Does this help? https://www.youtube.com/watch?v=bDwF-TOMtWw

Best,

Christian

Thank you, Christian.  The link did help.

When I've worked with data before, it has involved using a tool to filter on a variable.  Another common activity has been using crosstabs to generate counts where two variables intersect.

I now understand that, as an aggregation method (i.e., dropping cases into specific buckets), the GroupBy node is powerful because it can handle more than two variables.  It also doesn't restrict the data to a single filtered value.

To illustrate, I've been working on a study where students are placed into one of three math classes, and one of three reading classes.  This yields 3 x 3 = 9 combinations of classes into which students fall.  If we're only interested in counting values for these two variables -- fine!  Use crosstabs.  But if we want to count student outcomes (success/failure) for each math-reading class combination, that adds a third variable, and we need the GroupBy node.
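For anyone who thinks in code rather than nodes, here is a rough pandas sketch of the same distinction. It is only an analogy, not what KNIME does internally, and the toy data below are invented for illustration (they are not from my study). The column names Math, Reading and Success match my table; "Count" is a name I made up.

```python
import pandas as pd

# Hypothetical toy data: one row per student (values invented for illustration).
students = pd.DataFrame({
    "Math":    [1, 1, 2, 2, 3, 3, 1, 2],
    "Reading": [1, 2, 1, 3, 2, 3, 1, 3],
    "Success": ["yes", "no", "yes", "yes", "no", "yes", "yes", "no"],
})

# Two variables: a crosstab gives a Math x Reading grid of counts.
grid = pd.crosstab(students["Math"], students["Reading"])

# Three variables: group by all three columns and count the rows in each bucket.
counts = (
    students.groupby(["Math", "Reading", "Success"])
    .size()
    .reset_index(name="Count")
)

print(grid)
print(counts)
```

The crosstab stays a two-way grid, while the grouped result is a long table with one row per observed Math/Reading/Success bucket, so adding a fourth grouping column would just add another column to the key.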

For anyone interested, I've attached two files to illustrate a basic workflow and the table output (limited to successes):

Math,Reading,Success
1,1,n
1,2,n
1,3,n
2,1,n
2,2,n
2,3,n
3,1,n
3,2,n
3,3,n

To answer my own question: the merits of GroupBy are (1) you can organize data into buckets based upon aggregation of multiple columns or "dimensions," and (2) all the aggregated data are retained so you can manipulate your grouped data downstream with other nodes such as Row Filter.

As I said, I found this to be a shift away from selecting cases using filters.
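The second merit (aggregate first, filter afterwards) can be sketched the same way in pandas. Again this is only an analogy to GroupBy followed by Row Filter, under the assumption that Success is stored as "yes"/"no"; the data and the "Count" column name are invented for illustration.

```python
import pandas as pd

# Hypothetical toy data: one row per student (values invented for illustration).
students = pd.DataFrame({
    "Math":    [1, 1, 2, 2, 3, 3, 1, 2],
    "Reading": [1, 2, 1, 3, 2, 3, 1, 3],
    "Success": ["yes", "no", "yes", "yes", "no", "yes", "yes", "no"],
})

# Step 1 (GroupBy analogue): count students in every Math/Reading/Success bucket.
counts = (
    students.groupby(["Math", "Reading", "Success"])
    .size()
    .reset_index(name="Count")
)

# Step 2 (Row Filter analogue): keep only the success rows of the grouped table.
# The full table in `counts` is untouched, so failures can still be pulled later.
successes = counts[counts["Success"] == "yes"]

print(successes)
```

Because the filter runs on the aggregated table rather than on the raw rows, the failure counts remain available in `counts` for any other downstream step.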