Aggregations with GroupBy node

Dear Knimers,
Hi (again). I have a problem configuring my workflow on the GroupBy node.
I got:
a) I have a [CSV file] with Covid data preprocessed in Knime, with a list of about 1.325.000 cases. The link to this file (from my Google Drive) is here below:

Each case has one column for the city name and another for its “RegionCovid” (an official name for the regionalization (grouping) of the cities and their Covid cases distribution), amongst several other columns. These cases are to be joined by the municipality per every month (“Munic-yyyy/MM”) and by “RegionCovid”, also joined by the city per every month.

b) I have downloaded another file, an XLSX one, with the population for every city in our State:
recalc-DEE-RS_populacao-byPopTot-municipio-sexo-fx-etaria-2020.xlsx (148.6 KB)

c) Next, I need to group the number of cases, hospitalizations, and deaths by “RegionCovid”, and calculate their respective rates. I have already calculated (in Knime) the rates for every city. But I didn’t get to group cases and populations by Region, in order to calculate their respective rates because I don’t get to gather the sum of inhabitants per “RegionCovid”. I tried several configurations, most of the “Manual Aggregations”, but Knime simply sums their subtotals by month, which renders me an enormous (and worse, artificial and wrong) number for its subtotal.

Would someone lend me a hand on that?

Thanks for any help.
B.r.,
Rogério.

Hi @rogerius1st

Your description is not very clean to be honest.

Regarding a), is there a difference between city and municipality or are you using both terms interchangeable? I don’t see a city column. The join that you are looking for is not very clear either. You mention joined by but since your question is about aggregation I assume you mean group by? There is not such thing as joining in the groupBy node.

So should one output be based the grouping be on columns
MUNICIPALITY=NomeMunic=MUNICÍPIO
(“Munic-yyyy/MM”) and
REG_COVID
and then another separate grouping on city (or MUNICIPALITY) plus (“Munic-yyyy/MM”) again?

What are you looking to aggregate? I assume the cases are in these columns?

If this is the case, you can use something along the lines of (.*(?=Femin|Masc|[0-9]).*) as Pattern Based Aggregation in the groupBy.

Group:

Pattern:

Result:

Regarding C, again very confusing. Which columns are your deaths?

because I don’t get to gather the sum of inhabitants per “RegionCovid”.
This is about the populacao file right? If this is case, you could do a Column Aggregation again on (.*(?=Femin|Masc|[0-9]).*)

and then sum it.

If you want to join the population sum to the cases data, split the REG_COVID column with array 2 and use the [0] output.

From the population file, clean it up and normalize it with removeDiacritic(upperCase($NomeMunic$))

If you then left outer join it with the grouped covid cases, you’ll have the regional population associated with it.

Looking at it again, you might need a lookup from the population table the cases data and join on the cities to retrieve the region, and then group on the inhabitants subsequently.

A combination of such steps with minor tweaks should get you closer to something workable, but again your description leaves so many questions open that it’s hard to imagine your expected output and which columns should be used when, etc.

2 Likes

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.