Groupby node use similar to R Groupby()

So I have R code that is a table of 12 variables.

The code groups the table by a number of variables:

group_by( day_snc_install, media_source, cohort_day, week_instal, name, app_id)

The resulting table is:

I then summarize values with the following code:

summarise(tot_day_earn=sum(value), users=sum(users), per_user=(tot_day_earn/users))

The resulting table is:

So I’m trying to replicate this in Knime. When I try in Knime, if I do all the grouping in one Groupby node:

I end up with this before the join to the original data:

But after the join, I end up with 25 million rows rather than the 15,364 rows from the R code.

The join configuration is:


Is my groupby with Knime nodes incorrect or is the joiner setup incorrect? Do I need to do smaller successive groupbys joining each with the original table?

Why are you trying to join only on week_instal?
This doesn’t strike me as a unique record identifier.

2 Likes

Well it’s not unique. After the group, the calculations are done on the groupings (summarize) and then the results are appended to all weeks of the same number. Like all week_4 would get the same max date, calculations etc.

The R code going further back is:

actual_earing ← retention %>%
group_by(week_instal) %>% ## group retention by week_instal
mutate(max_date=max(cohort_day)) %>% ## add a column max_date which is the max value of cohort_day for each of the groups
filter(max_date==cohort_day) %>% ## take only those rows where max_date = cohort_day
group_by(week_instal,max_date) %>% ## Group by week_instal, then by max_date
summarise(max_days=max(day_snc_install)) %>% ## find the max of day_snc_install and set for max_days for each of the groups
ungroup() %>%
inner_join(retention) %>% ## join the original retention table with the results. Wil join on week_instal
filter(ifelse(week_instal==max(week_instal),day_snc_install<6 ,day_snc_install<=max_days)) %>%
filter(!(app_id==‘Android’ & week_instal<4)) %>%
group_by( day_snc_install,
media_source,
cohort_day,
week_instal,
name,
app_id) %>%
summarise(tot_day_earn=sum(value),
users=sum(users),
per_user=(tot_day_earn/users))

After the
summarise(max_days=max(day_snc_install))
it looks like:
image

That is then joined with the original table “retention”

resulting in each week_instal row being appended with the data from the summarise

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.