Compare two Time Series (TS)

Dear Knimers,
I have a situation with Time Series. My questions and the data I’m working on are as follows:
1- a Covid Database (CSV, open access) with data from cases of all municipalities in my State (RS, Brazil). The original data are at:
< https://ti.saude.rs.gov.br/covid19/download >
Preprocessing data, sorted by confirmation dates (1 case by row), using 8 variables (one by column), brought me about 1,330,000 cases of interest.
I wrote the following Knime workflow (uploaded and image) for preprocessing data:
< KNIME_project12.knwf (79.7 KB) >
<

>

2- I used additional open datasets (XLXS) with other data (such as population age groups and regionalization of public health administration) for every specific city:
< RS-RegioesMacrorregioesSaude.xlsx (32.6 KB) >
< recalc-DEE-RS_populacao-municipio-sexo-fx-etaria-2020.xlsx (136.9 KB) >;

3- a file (XLXS) with a sample of results in a survey made with small businesses, containing 8 metrics for their customers and financial inflow/outflow. These metrics have normalized data for 16 months. This survey was applied by Google Forms, and the results were sent to me with 1 column by month for each metric. The file is:
< Anonim+Input+Valid - Respostas survey e dados Munic.xlsx (319.0 KB) >;

4- I need to compare cities or regions using rates (because of great differences in the number of inhabitants), I plan to use: a) monthly incidence (= (number of monthly new cases) / 10.000 inhab. (to evaluate dissemination); b) monthly hospitalization rates (number of monthly new cases) that lead to patient hospitalization (to evaluate severity); and c) lethality rates (number of monthly new cases) that lead to death. Can someone help with these calculations?

5- I wish also to compare both timelines: a) Covid’s vs b) my Survey results, searching for correlations between two time series, suggesting the possible influence of pandemics on business metrics (with or without lags between increases or decreases on Covid dissemination and severity of cases).

6- considering that: a) original data are not in English (they are in Portuguese(BR)), and b) my files are also in Portuguese, I wrote this additional DOCX file for translating variables and applied terminology:
< Terminology for columns headers - Covid Knime Project.docx (18.8 KB) >.

Can someone help me on comparing these two time series?

Thanks in advance.
Rogério.

hi @rogerius1st
to address the issue you mention in (4), I think you need to group rows. For example, if you want to calculate the overall incidence of cases by city, year, month you must add to your workflow

This is the complete workflow.

3 Likes

the issue (5) is a bit more complex. Google Forms returns a column for each question, be it single or multiple choice, so you need to “re-shape” the resulting table by adding some steps to your workflow

Recipe:

A) filter columns not “time-dependant”, together with IDResp. We’ll use them later
B) in order to have 1 row per response (22) per unit of time (16) per variable (8) unpivot the time-dependant columns. You’ll get 22 x 16 x 8 = 2816 rows
C) with 3 column expressions add month, year, target column name (that is, column name without month/year)
D) keep only useful columns and pivot them, then rename the resulting ones
E) join the “non time-dependant columns” table on IDResp
F) double check the result, since I ain’t no divine entity
G) let me know if it works
KNIME_covid_brazil.knwf (104.2 KB)

I forgot to add a “Column Auto Type Cast” node, to set the column types of the new columns
KNIME_covid_brazil.knwf (106.3 KB)

Dear @durisTef,
Sorry for the delay in my answers, but (being brand new in programming) I tried to understand the meaning of all commands integrating your nodes’ configuration until your thinking line meant something to me, so I could make the necessary arrangements to go further on my own.
Let’s see what I’ve done:

  1. I re-executed the workflow implementing your added nodes but I didn’t fully understand the configuration of GroupBy node. Could you explain these settings? Our State (as well as my data organization) has Health Administrative divisions called “Region_Covid”, each one grouping several municipalities (by neighborhood, and physical proximity). When you made ‘group’ by municipalities, what does it mean “to aggregate (counting) by Reg_Covid”? This “Manual Aggregation” seems to be at a higher (and not at a smaller) level of granularity than by municipalities… My results appeared (in a table, as you have shown), ordered by months and years, but not with a list of cities grouped by a Region_Covid. So, what I saw is a repeated list of cities, repeated once by month (i.e., each list repeated 16 times, in almost a total of almost 6800 rows with repeated names of cities, when we have 497 distinct cities → 497*16=79). I could not follow your configuration, aiming to reach: a) the Incidence calculus (number of new cases by month *10,000); b) hospitalization (percentual) rate [count of values “Sim” (= “Yes”) in column “Hospitalizado” / (number of new cases)] *100; and c) lethality (percentual) rate [count of values “Obito” (= “Death”) in column “Evolucao” / (number of new cases)] *100.
    I explain myself:
  2. I need to compare municipalities with large difference in total populations. This way I chose the rates as a tool to change uncomparable into comparable populations.
  3. To calculate rates, I need to have individual counting of city population, positive cases, hospitalizations and deaths.

Thank you once again.
B.R.,
Rogério.

let’s see if I understand correctly.

Each row of the dataset represents a case, which occurred in a certain year / month in a certain municipality. In the same row there is the population of the municipality, identical in each row that refers to that municipality.
We want to calculate, for each municipality, year and month

  • the incidence = (number of cases in the month / municipality population) * 10,000
  • hospitalization rate
  • lethality rate

A - First we have to count the total cases in each municipality in each month of each year. To do this we use the Groupby node and we group by year, month and municipality.
To count the number of rows we use a trick: we count the number of values ​​present in any column, for example Reg_COVID. The population of each municipality is always the same, so we take the first value in the group (we could take the max or min value as well)
B - In a parallel branch we filter the dataset in order to obtain only the “SIM” and then group in the same way, this time omitting the population
C - In a parallel branch we filter the dataset in order to obtain only the “Obitos” and then we group as in B
D - We combine the three tables. We now have all the data we need to calculate the rates
E - We calculate the three rates and rearrange the resulting dataset
This is the final part of the workflow

This is the result (part of)

KNIME_covid_brazil.knwf (124.0 KB)

3 Likes

Dear @duristef,

  1. Yes, You understood it very correctly. Thus, I followed your suggestions, readapted my workflow, and changed it to group by “Covid Regions” (instead of municipalities). Now, it worked out fine. I got to calculate the rates as I needed, and as you taught me. Thanks a lot. 2) Now, having the time series grouped as I expected, I need to plot them along time (having dates (YYYY-MM) on the x-axis and rates (maybe by color) on the y-axis). How can this be done? I tried to use the Color management and Line Plot nodes but didn’t get a graph with a clear representation. Also, reading the table, I see no direct correlation between pairs of rate columns (incidence vs hospitalizations or hospitalizations vs lethality and so on) → I tried to use statistic nodes for correlation (Rank correlation and Linear correlation), but didn’t understand or knew how to use those results nor how to extract information from them. Can you help me?
  • WARNING: in “DEE-RS - Age Groups - Cities” there’s a Municipality called “RIO GRANDE DO SUL” which doesn’t match any Municipality in the other two tables. When you join the tables it creates a row filled with missing values.
  • since you group by Region, you’d better aggregate Municipality pop data by REG_COVID before joining it to the join of the other two tables

I’ve rearranged the upper section of the original workflow to take into account the new type of aggregation. The “One to Many” node

transforms all possible values in a selected column each into a new column. The value is set as the new column’s name, the cell values in that column are either 1, if that row contains this possible value, or 0 if not.

This makes it easier to count Hospitalizations, Deaths and Cases
The workflow creates an Excel file. I’ll use it to plot the rates by time and REG_COVID in next post

KNIME_covid_brazil v2.0.knwf (66.1 KB)

2 Likes

I suggest you save your intermediate results in a spreadsheet and use it as an input to your subsequent processing. Here’s the spreadsheet I saved and a (very simple) example of line plot, showing the trend of incidence (calculated as % rate), death and hospitalization rates by Reg_Covid.
covid_brazil.xlsx (20.2 KB)
KNIME_covid_brazil plot.knwf (39.4 KB)

I suspect that, given the time lag between the phenomena detected, the aggregation on a monthly basis could be distorting.
I don’t think I can help you beyond this point. For further, detailed suggestions on the treatment of COVID-related time series, I refer you to these examples covid – KNIME Hub, especially this one

Dear @duristef,
I thank you, most of all, for all your time investment, and your expertise on Knime for helping me to overcome the important preprocessing steps I didn’t know how to surpass.
I also understand the limits ( both in time and availability) of your generous offer to solve my problems. And thank you for going so far on helping me to go further on my analysis.
BTW, I wish to publicly thank you (on my master thesis) for your priceless help (or even guidance) on my journey to translate my thoughts and needs into an efficient workflow to process my data. If you authorize me to do so, how should I mention your name?
B.R.,
Rogério.

2 Likes

@rogerius1st , that’s very kind of you, but “duristef” is enough. I truly hope I have been of help and I wish you that your thesis goes on well.

O.K., Duristef,

If you do not prefer to be mentioned otherwise…

But, yes, your help has been of enormous value for me to overcome those steps I was locked in during preprocessing my data.

Next, I have to select, configure and apply a few algorithms for:

  1. Clustering those raw data, in order to find appropriate classes to describe groups of cases (or cities or regions) with common points distinctive enough to split them into clearly separate and cohesive clusters; and
  2. Training ML models of cities (or regions, or clusters) which might suffer the influence of neighboring COVID rates (incidence, hospitalization and lethality rates) for predicting the intracluster effects on local metrics of inflow and outflow (of patients and cash).

I’m afraid I’m asking too much of you for any additional help. But, anyway, do you have any suggestion for me to follow?

Once again I thank you so, SO MUCH for all your precious time, knowledge and attention that you have offered to solve my problems.

I wish you all well.
Best regards,
Rogério.

1 Like

To answer your questions higher skills than mine are needed, but I believe you can find answers and suggestions in the posts of other members of the community much more competent than me

Dear @duristef,
Thank you once again for your inestimable contribution, which helped me to overcome some important steps. without it, on my own, I probably could not go any further, and would not have reached my current status in my research.
From now on, I will follow your suggestions and make additional questions on Knime Forum. Like you, I believe I will find other generous Knimers that have already faced problems similar to mine and will help me with my remaining unsolved questions.

I wish all the best for you.
Rogério.

1 Like

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.