Solutions to "Just KNIME It!" Challenge 4 - Season 3

:sun_with_face: Hi, folks, hope you’ve been having a good week so far. Shall we solve a fresh Just KNIME It! challenge?

:stethoscope: This week you’ll study some demographic data on causes of death in the European Union. :hospital: The central dataset, however, has column names and values that, at first, don’t make a lot of sense. To gather proper insights, you’ll then have to match its structure with some given XML metadata. :mag_right: What are your findings? What regions of the EU have similar patterns or unexpected driving causes of death?

Here is the challenge. Let’s use this thread to post our solutions to it, which should be uploaded to your public KNIME Hub spaces with tag JKISeason3-4.

:sos: Need help with tags? To add tag JKISeason3-4 to your workflow, go to the description panel in KNIME Analytics Platform, click the pencil to edit it, and you will see the option for adding tags right there. :slight_smile: Let us know if you have any problems!

3 Likes

FYI – upon inspecting the XML, I noticed something other KNIMErs might struggle with: colons in the attribute names.


To level the playing field: this has been flagged before, and there is a workaround available using indirect XPath expressions like @*[name()='<attribute:name>'], e.g.:
/rdf:Description/@*[name()='aux:Firmware'] (courtesy of @armingrudd :wink: ). Read more in this thread:

Enjoy the challenge!

Edit: Here is a sample XPath, as this aspect might be rather annoying: //*[local-name()='Name' and @xml:lang='en']/text()
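If you want to experiment with these expressions outside KNIME first, here is a minimal sketch using Python/lxml. The metadata file name is a placeholder, and the attribute/element names are just the examples quoted above, not the exact structure of the challenge XML.

```python
# Minimal sketch of the name()/local-name() workaround for prefixed (namespaced)
# attribute and element names; file and attribute names are placeholders.
from lxml import etree

tree = etree.parse("metadata.xml")  # hypothetical metadata file

# Match an attribute literally named 'aux:Firmware' without declaring its namespace
firmware = tree.xpath("//*/@*[name()='aux:Firmware']")

# Match any element locally named 'Name' whose xml:lang attribute is 'en'
english_names = tree.xpath("//*[local-name()='Name' and @xml:lang='en']/text()")

print(firmware[:5], english_names[:5])
```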

8 Likes

Here’s my go at this:

Dashboard with four sections:

  1. User Input and selections for certain Meta Data categories
  2. EU Map showing total deaths reported and top 5 causes of death
  3. Pie charts showing causes of death for the EU total as well as for the top and bottom 5
  4. Pie chart showing total deaths by country for a selected cause of death, plus a bar chart of the top and bottom 5 countries for that cause

First execution may take a moment because the OSM Boundary Map is looped over.



9 Likes

Impressive. Kudos to you!

4 Likes

Hi @alinebessa,

I got a question about the dataset. The column “geo” lists two values, which in English are:

  • All deaths reported in the country
  • All deaths of residents in or outside their home country

The difference between the two represents the number of deaths outside the country (citizens who died while travelling), doesn’t it?

Further inspection of the data shows that for a significant number of rows (246,156) both values are identical. Here is an excerpt:

| TEMP | Max*(obs_value) | Min*(obs_value) |
|---|---|---|
| Females, 65 years or over, Accidental drowning and submersion, Denmark, 2021, W65-W74 | 4 | 4 |
| Females, 65 years or over, Accidental drowning and submersion, Estonia, 2021, W65-W74 | 5 | 5 |
| Females, 65 years or over, Accidental drowning and submersion, Finland, 2021, W65-W74 | 19 | 19 |

This can mean two things:

  1. all reported deaths in the country happened outside of it
  2. misleading data representation as these deaths should only be reported once as “All deaths reported in the country”

Do you have more insights? Maybe I got something wrong. Hope you can shed some light on this.

Best
Mike

4 Likes

The way I read it is that if both are identical, all deaths happened inside the country:

  1. is only people in the country
  2. is people in the country + people outside the country

So 2 - 1 should give you the people who died travelling or living abroad. If both are the same, then no one died travelling or abroad.

  • All deaths reported in the country
  • All deaths of residents in or outside their home country

Reading through the metadata doesn’t make it any clearer either…

2 Likes

@MartinDDDD I interpreted it that way too, but then there are also properly reported deaths with only “All deaths reported in the country”. Anyway, let’s put that in the “weird data drawer” and forget about it, since it’s not “death threatening” (pun intended) :sweat_smile:

Update
Is my perception wrong, or is the data a huge mess, resulting in substantially fewer solution submissions because Exploratory Data Analysis (EDA), like any other method, only works with clean data?

I am still working to get the data into a workable shape and found sub-groups that make no sense. I noticed, though, that the more data I filter out, the easier the dataset is to understand, and the charts finally start to make sense. But I spend more time making the data usable than actually working on a solution. Take age, for example (see the sketch after this list):

  1. The string representation is not properly sortable, resulting in poor chart quality because related values are not grouped together
  2. “Less than 1 year” is not found among the from-to value ranges
  3. Does “Less than 15 years” include or exclude the “less than 1 year” group?
  4. Totals must be filtered out, otherwise they mess up the charts even more
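For the sorting issue in point 1, something like the following sketch (in Python rather than a KNIME node, with label spellings that are assumptions) could derive a numeric sort key from the age-group labels:

```python
# Minimal sketch: derive a numeric sort key from age-group labels so charts group
# related values. The exact label spellings are assumptions, not the decoded metadata.
import re

def age_sort_key(label: str) -> float:
    if label.strip().lower() == "total":
        return float("inf")            # push aggregates to the end (or filter them out)
    if "less than 1 year" in label.lower():
        return 0.0
    match = re.search(r"(\d+)", label)  # e.g. "From 15 to 19 years" -> 15
    return float(match.group(1)) if match else float("inf")

ages = ["From 65 to 69 years", "Less than 1 year", "Total", "From 15 to 19 years"]
print(sorted(ages, key=age_sort_key))
# ['Less than 1 year', 'From 15 to 19 years', 'From 65 to 69 years', 'Total']
```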

About what I found earlier regarding the geo column: that data is clearly messed up too, because (see the sketch after this list):

  1. Subtracting all deaths reported in the country from all deaths of residents in or outside their home country (TOT_RESID - TOT_IN) can result in negative values
  2. Some data samples are missing their zero-valued counterpart, while others have it present
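As a rough illustration of that check (outside KNIME, with file name, column names, and codes that are assumptions based on the Eurostat-style structure discussed here), a sketch could look like this:

```python
# Minimal sketch of the consistency check on the residence dimension.
# File name, column names and codes (TOT_IN / TOT_RESID) are assumptions.
import pandas as pd

df = pd.read_csv("hlth_cd_aro.csv")  # hypothetical export of the challenge dataset

keys = ["sex", "age", "icd10", "geo", "TIME_PERIOD"]
wide = (
    df.pivot_table(index=keys, columns="resid", values="OBS_VALUE", aggfunc="sum")
      .rename(columns={"TOT_IN": "reported_in_country",
                       "TOT_RESID": "residents_anywhere"})
)

# Deaths abroad = residents anywhere - reported in the country; negatives are suspicious
wide["abroad"] = wide["residents_anywhere"] - wide["reported_in_country"]
print((wide["abroad"] < 0).sum(), "groups with a negative difference")
print(wide["reported_in_country"].isna().sum(), "groups missing their counterpart")
```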

If others agree with my perception, I’d provide a sample workflow to clean the data so we can continue with the challenge. Overall, this shouldn’t be too challenging.

Please let me know your thoughts.

Update
Here are two options to either filter/split or mark the rows that represent the cumulative value groups messing up the process. Happy Kniming!

Rule Engine

```
$geo$ MATCHES "EU2.*" => "EU Countries"
// Deaths outside and inside of the country
// This data seems bad and represents an aggregation too
$resid$ = "TOT_RESID" => "Deaths in-/ and outside of the country"
// ICD 10 Grouped causes of Death like A15-A19_B90 OR B15-B19_B942
$icd10$ MATCHES "^[^_]+_[^_]+" => "ICD10 Grouped Causes of Death"
// Other aggregated values messing up the analysis
$age$ = "TOTAL" => "All ages"
$sex$ = "T" => "All genders"
```

Rule-based Row Filter

```
// Grouped Values: geo = EU27_2020 & EU28
$geo$ MATCHES "EU2.*" => TRUE
// Deaths outside and inside of the country
// This data seems bad and represents an aggregation too
$resid$ = "TOT_RESID" => TRUE
// ICD 10 Grouped causes of Death like A15-A19_B90 OR B15-B19_B942
$icd10$ MATCHES "^[^_]+_[^_]+" => TRUE
// Other aggregated values messing up the analysis
$age$ = "TOTAL" => TRUE
$sex$ = "T" => TRUE
```
6 Likes

Find herewith my submission for the challenge.


5 Likes

Hi,
This is the first part of my workflow, covering data preprocessing.

Thanks to @mwiegand and @MartinDDDD for sharing. I created a column loop to extract all the code descriptions and proceed with preliminary data processing, then output a data file for the second part, which processes the dataset.

Part 1: Data processing

1) Obtain the corresponding explanations for all codes

2) Analyze the basic relationship between the values of each column

3) Apply the filtering I think is necessary

4) Output the processed data file

result:


6 Likes

Hello JKIers
This is my take on the 4th challenge. Is it me becoming slower, or are the challenges increasing in data cleansing complexity?

From my workflow’s dashboard I got the following insights:

The 3 main causes of death within the EU27 (2021) are, in decreasing order of frequency:

  • Diseases of the circulatory system (I00-I99)
  • Malignant neoplasms (C00-C97)
  • COVID-19, virus identified

The first two are common to all countries; we have to look from the third cause onward to spot the differences.

Have a great coding weekend :vulcan_salute:t3:

5 Likes

I decided to calculate a cause-of-death frequency per country. This normalizes comparisons across countries, since raw counts do not account for population differences. The drawback is that I couldn’t find EU census data by sex or age group; if it were available, it would make this approach more robust. If all you’re interested in is comparing death causes within a single country, raw counts are perfectly fine, but not for comparing different countries.
On 6/8 I uploaded a corrected version with the charts in a component.
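A minimal sketch of that normalization (in Python/pandas rather than KNIME nodes, with file and column names that are assumptions) might look like this:

```python
# Minimal sketch of per-country normalization to deaths per 100k inhabitants.
# File names and column names ("geo", "deaths", "pop") are assumptions.
import pandas as pd

deaths = pd.read_csv("deaths_by_country_and_cause.csv")  # hypothetical: geo, icd10, deaths
population = pd.read_csv("eu_population.csv")            # hypothetical: geo, pop

merged = deaths.merge(population, on="geo")
merged["deaths_per_100k"] = merged["deaths"] / merged["pop"] * 100_000

# Per-100k rates make causes comparable across countries of very different size
print(merged.sort_values("deaths_per_100k", ascending=False).head(10))
```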

3 Likes

Hi,
This is the second part of my workflow, which uses the output of the first part as its data.

  1. There are 5 dimensions in the data; first, analyze the cause-of-death (icd10) dimension on its own. Note that the statistics have already been computed in the data, so we only need to filter out the corresponding statistical rows.


    result:

  2. Then, analyze the second dimension (country/geo).

  • Cluster analysis, default k=3

  • In the scatter plot, I selected the first cause (I) and the second cause (C) as the x-axis and y-axis, respectively. It can be seen that the data separates well along the I dimension.

3. Extract representative countries from two categories for comparison.
Please note the difference in the ranking of the main causes of death between the two countries.
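For reference, a minimal sketch of that clustering step (in Python/scikit-learn instead of the KNIME k-Means node, assuming a hypothetical table with one row per country and one column per cause share) could be:

```python
# Minimal sketch: k-means (k=3) on per-country cause-of-death shares.
# The input file and its columns are assumptions for illustration.
import pandas as pd
from sklearn.cluster import KMeans

shares = pd.read_csv("cause_shares_by_country.csv", index_col="geo")  # hypothetical

km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(shares)

# List which countries fall into each cluster, e.g. before plotting
# chapter I (circulatory) against chapter C (neoplasms)
for cluster_id in range(3):
    print(cluster_id, list(shares.index[labels == cluster_id]))
```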

4 Likes

Learnt about how to get tabular data from XML using XPath.


Quite detailed dataset, looking forward to exploring it more.

4 Likes

Just a quick one for now; it will get a new version.

2 Likes

Hi all,
Here is my solution.

For the data cleaning phase, I employed loops to generate XPath queries and conducted data replacement in batches (this process took some time to execute).
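A minimal sketch of that idea (in Python/lxml rather than the KNIME loop nodes, with element names that are assumptions and not the actual structure of the challenge metadata) could be:

```python
# Minimal sketch: loop over the code columns and generate one XPath query per column
# to look up its English labels. Element/attribute names here are assumptions.
from lxml import etree

tree = etree.parse("metadata.xml")  # hypothetical metadata file
code_columns = ["sex", "age", "icd10", "geo", "resid"]

lookups = {}
for col in code_columns:
    # local-name()/name() predicates avoid declaring every namespace prefix
    query = (f"//*[local-name()='CodeList' and @*[name()='id']='{col}']"
             f"//*[local-name()='Name' and @xml:lang='en']/text()")
    lookups[col] = tree.xpath(query)

print({col: labels[:3] for col, labels in lookups.items()})
```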

The dashboard was constructed mainly using a Choropleth Map and a Sunburst chart.
The Sunburst chart is useful because it enables easy drill-down functionality.





6 Likes

Hi @alinebessa :wave: :slightly_smiling_face:

I’ve uploaded my solution for “Just KNIME It!” Challenge 4 - Season 3.

5 Likes

My main disease filter was not written correctly.
The original regular expression was:

A_B|[C-R]|S_T|U |V01-Y89|Z

I changed it to:

[A-Z]|A_B|G_H|S_T|U.*|Z.*|V01-Y89

Cause of death filter conditions (see the quick check after this list):

  1. Statistics are based on major diseases, so capital letters A-Z are used.
  2. But there are special circumstances where some types of diseases are merged, such as A_B, G_H, and S_T.
  3. In 2021, the epidemic (COVID-19) was classified under category U, but there was no major category U, only sub-items, so the filtering condition is “U.*”.
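As a quick sanity check of that pattern (the sample codes below are illustrative, not taken from the dataset), one could run:

```python
# Minimal sketch: test which icd10 codes the revised pattern keeps when used as a
# full-match filter. Sample codes are illustrative only.
import re

pattern = re.compile(r"[A-Z]|A_B|G_H|S_T|U.*|Z.*|V01-Y89")
codes = ["C", "I", "A_B", "S_T", "U_COV_19", "V01-Y89", "A15-A19_B90"]
for code in codes:
    print(f"{code}: {'keep' if pattern.fullmatch(code) else 'drop'}")
```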

By the way, it seems that the codes in the metadata are not comprehensive.

2 Likes

Hi all,
Here is my solution. However, I didn’t have enough time this week to create an elaborate workflow.

4 Likes

Hi all,
Here’s my solution. I had a little trouble dealing with XML, but thanks to the workflows posted by everyone here, I managed to make it. Thank you.

6 Likes

Here is my solution for Challenge 4. It was big-time fun! Big dataset, interesting topic, XML parsing, visualization. What more could a KNIME enthusiast ask for? :smiley:

JKISeason3-4 – KNIME Community Hub

5 Likes