Solutions to "Just KNIME It!" Challenge 4 - Season 3

alinebessa · June 5, 2024, 1:33pm

Hi, folks, hope you’ve been having a good week so far. Shall we solve a fresh Just KNIME It! challenge?

This week you’ll study some demographic data on causes of death in the European Union. The central dataset, however, has column names and values that, at first, don’t make a lot of sense. To gather proper insights, you’ll then have to match its structure with some given XML metadata. What are your findings? What regions of the EU have similar patterns or unexpected driving causes of death?

Here is the challenge. Let’s use this thread to post our solutions to it, which should be uploaded to your public KNIME Hub spaces with tag JKISeason3-4.

Need help with tags? To add tag JKISeason3-4 to your workflow, go to the description panel in KNIME Analytics Platform, click the pencil to edit it, and you will see the option for adding tags right there. Let us know if you have any problems!

mwiegand · June 5, 2024, 3:13pm

FYI - Upon inspecting the XML I noticed something other Knimers might struggle with. Colons in the attribute names:

To level the playing field, this has been flagged before and there is a workaround available using indirect XPath declarations like @*[name()='<attribute:name>'] . e.g.:
/rdf:Description/@*[name()='aux:Firmware'] (in courtesy of @armingrudd ). Read more in this thread:

Enjoy the challenge!

Edit: Here is a sample XPath as this aspect might be rather annoying //*[local-name()='Name' and @xml:lang='en']/text()

MartinDDDD · June 5, 2024, 6:43pm

Here’s my go at this:

Dashboard with four sections:

User Input and selections for certain Meta Data categories
EU Map showing total deaths reported and top 5 causes of death
Pie Chart showing causes of death in a pie chart for EU Total as well as Top and Bottom 5
Pie Chart showing total deaths by country for a certain cause of death as well as top and bottom 5 countries for that cause in a bar chart

First execution may take a moment due to OSM Boundary Map being Looped Over.

rfeigel · June 6, 2024, 3:54am

Impressive. Kudos to you!

mwiegand · June 6, 2024, 10:45am

Hi @alinebessa,

I got aa question about the data set. The column “geo” lists two values which in English are:

All deaths reported in the country
All deaths of residents in or outside their home country

The difference of both represents the amount of death outside of the country (citizens during travel), doesn’t it?

Further inspecting the data shows that for a significant amount of rows (246,156) both values are identical. As follows an excerpt:

TEMP	Max*(obs_value)	Min*(obs_value)
Females, 65 years or over, Accidental drowning and submersion, Denmark, 2021, W65-W74	4	4
Females, 65 years or over, Accidental drowning and submersion, Estonia, 2021, W65-W74	5	5
Females, 65 years or over, Accidental drowning and submersion, Finland, 2021, W65-W74	19	19

This can mean two things:

all reported deaths in the country happened outside of it
misleading data representation as these deaths should only be reported once as “All deaths reported in the country”

Do you have more insights? Maybe I got something wrong. Hope you can shed some light on this.

Best
Mike

MartinDDDD · June 6, 2024, 11:32am

The way I read it is that if both are identical all deaths happened inside the country

is only people in the country
is people in the country + people outside the country

So 2 - 1 should give you any people that died travelling or living abroad. If both are the same then no one died travelling or abroad.

All deaths reported in the country
All deaths of residents in or outside their home country

Also nothing more clear when reading through the meta data…

mwiegand · June 6, 2024, 11:52am

@MartinDDDD I interpreted it that way too but then there are also properly reported deaths with only “All deaths reported in the country”. Anyways, let’s put that in the “weird data drawer” and forget about it since not “death threatening” (pun intended)

Update
Is my perception wrong or is the data a huge mess, resulting in substantially fewer solution submissions because Exploratory Data Analysis (EDA), as well as any other method, only works with clean data?

I am still working to get the data workable and found that there are sub-groups that make no sense. Though, I noticed that the more data I filter, the easier the data set is to understand and charts finally start to make sense. But I work more to get the data usable than actually on a solution. Like in age:

String representation is not properly sortable, resulting in poor chart quality as related values are not grouped

image1772×1149 200 KB
“Less than 1 year” is not found amongst the from to value ranges

image1171×869 246 KB
Does Less than 15 years in- or exclude the less than 1 year group
Totals must be filtered, otherwise they mess up the charts even more

About what I found earlier regarding the geo column. That data clearly is messed up too because:

Subtracting all deaths of residents in or outside their home country from all deaths reported in the country (TOT_RESID - TOT_IN) it can result in negative values
Some data samples miss their counterpart representing 0, others have it present

If others agree to my perception, I’d provide a sample workflow to clean the data so we can continue with the challenge. Overall, this isn’t shouldn’t be challenging.

Please let me know your thoughts.

Update
Here are the two options to either filter/split or mark the data which represent the cumulative value groups messing up the process. Happy Kniming!

Rule Engine

$geo$ MATCHES "EU2.*" => "EU Countries"
// Deaths outside and inside of the country
// This data seems bad and represents an aggregation too
$resid$ = "TOT_RESID" => "Deaths in-/ and outside of the country"
// ICD 10 Grouped causes of Death like A15-A19_B90 OR B15-B19_B942
$icd10$ MATCHES "^[^_]+_[^_]+" => "ICD10 Grouped Causes of Death"
// Other aggregated values messing up the analysis
$age$ = "TOTAL" => "All ages"
$sex$ = "T" => "All genders"```

Rule based Row Filter
```// Grouped Values: geo = EU27_2020 & EU28
$geo$ MATCHES "EU2.*" => TRUE
// Deaths outside and inside of the country
// This data seems bad and represents an aggregation too
$resid$ = "TOT_RESID" => TRUE
// ICD 10 Grouped causes of Death like A15-A19_B90 OR B15-B19_B942
$icd10$ MATCHES "^[^_]+_[^_]+" => TRUE
// Other aggregated values messing up the analysis
$age$ = "TOTAL" => TRUE
$sex$ = "T" => TRUE```

AnilKS · June 7, 2024, 2:40am

Find herewith my submission for challenge .

tomljh · June 7, 2024, 6:58am

Hi,
This is my first part of the workflow, completing data preprocessing.

Thank you for sharing with @mwiegand and @MartinDDDD . I created a column loop to extract all the code descriptions and proceed with preliminary data processing. Output a data file as the second part to process the dataset.

Part 1: Data processing

1).Obtain the corresponding explanations for all codes

2).Analyze the basic relationship between the values of each column

3).I think necessary filtering

4).Output processed data file

result:

003

gonhaddock · June 7, 2024, 12:02pm

Hello JKIers
This is my take for the challenge 4th. Is it me becoming slower? or are challenges increasing in data cleansing complexity?

From my workflow’s dashboard I got the following insights:

The 3 main causes of death within EU27 (2021) are, in decreasing frequency:

Diseases of the circulatory system (I00-I99)
Malignant neoplasms (C00-C97)
COVID-19, virus identified

The two first are common in all countries; we have to look at from the third cause and beyond, aiming to spot the differences.

Have a great coding weekend

rfeigel · June 8, 2024, 3:32am

I decided to calculate a cause of death frequency per country. This normalizes comparisons across countries. Using raw data counts does not account for population differences. The drawback is that I couldn’t find EU census data by sex or age groups. If its available it would make this approach more robust. If all you’re interested in is comparing death causes for a single country, raw counts are perfectly ok; but not for comparing different countries.
On 6/8 I uploaded a corrected version with the charts in a component.

tomljh · June 8, 2024, 4:41am

Hi,
This is my second part of the workflow, using the output from the first part for the data.

There are 5 dimensions in the data, first analyze the cause of death (icd10) dimension - a single dimension. Note that the statistics have already been completed in the data, so we only need to filter out the corresponding statistical rows.

0011824×213 47.3 KB

result:

0022340×1350 213 KB
Then, analyze the second dimension (country/geo).

Cluster analysis, default k=3

0031734×758 340 KB
In the scatter plot, I selected the first cause (I) and the second cause (C) as the x-axis and y-axis, respectively, in the overall plot. It can be seen that data can be well separated in the I dimension.

0042326×1352 78.5 KB

3.Extract representative countries from two categories for comparison
Please note the difference in the ranking of the main causes of death between the two countries.

Leo_Wynter · June 8, 2024, 8:15am

Learnt about how to get tabular data from XML using XPath.

Quite detailed dataset, looking forward to exploring it more.

veniapputrii · June 8, 2024, 4:38pm

just to make it quick, it will have a new version.

jm1950sjr · June 8, 2024, 4:59pm

Hi all,
Here is my solution.

For the data cleaning phase, I employed loops to generate XPath queries and conducted data replacement in batches (this process took some time to excecute).

The dashboard was constructed mainly using a Choropleth Map and a Sunburst chart.
The Sunburst chart is useful because as it enables easy drill-down functionality.

arief_rama · June 9, 2024, 12:01am

Hi @alinebessa

I’ve uploaded solution for “Just KNIME It!” Challenge 4 - Season 3.

tomljh · June 9, 2024, 4:46am

My main disease filtering was not written correctly,
The original regular string is:

A_B|[C-R]|S_T|U |V01-Y89|Z

Change to:

[A-Z]|A_B|G_H|S_T|U.*|Z.*|V01-Y89

Cause of death filter conditions:

Statistics are based on major diseases, so capital letters A-Z are used.
But there are special circumstances where some types of diseases are merged, such as A_B, G_H, and S_T.
In 2021, the special epidemic was classified into category U, but there was no major category U, only minor items. So the filtering condition is “U.*” .

By the way, it seems that the code in the metadata is not comprehensive.

sryu · June 9, 2024, 8:00am

Hi all,
Here is my solution. However, I didn’t have enough time this week to create an elaborate workflow.

tark · June 9, 2024, 3:30pm

Hi all,
Here’s my solution. I had a little trouble dealing with XML, but thanks to the workflows posted by everyone here, I managed to make it. Thank you.

berti093 · June 9, 2024, 7:27pm

Here is my solution for Challange 4. It was a fun challenge big time! Big dataset, interesting topic, XML parsing, visualization. What more could a KNIME enthusiastic ask for?

JKISeason3-4 – KNIME Community Hub