Solutions to “Just KNIME It!” Challenge 28 - Season 4

:sun_with_face: Hello, everybody! Today we have a new Just KNIME It! challenge on data profiling.

:necktie: Your company is planning to upgrade its office setup with new chairs, desks, monitors, and other essentials. To help with the decision, someone scraped product details and reviews to start an analysis, but before trusting this data you’ve been asked to assess its quality. Your task is to evaluate two datasets (product details and product reviews) and create a data quality profile for each. :mag: Can you trust this scraped data?

Here is the challenge. Let’s use this thread to post our solutions to it, which should be uploaded to your public KNIME Hub spaces with tag JKISeason4-28 .

:sos: Need help with tags? To add tag JKISeason4-28 to your workflow, go to the description panel in KNIME Analytics Platform, click the pencil to edit it, and you will see the option for adding tags right there. :blush: Let us know if you have any problems!

2 Likes

Hey @armingrudd :waving_hand: :slightly_smiling_face:

Just to confirm for Challenge 28, is there only one dataset file available right now? Because the task says we’re supposed to evaluate two datasets (product details and product reviews).

1 Like

Hi Arief… I noticed the Excel file has two tabs: product details and product reviews.

Those must be the two datasets.

4 Likes

Thanks, @garcbcpa Appreciate the clarification :folded_hands:

Turns out I wasn’t paying enough attention; it’s still 5 AM here in Jakarta, and I guess my brain wasn’t fully loaded yet.

Looks like I definitely need breakfast and a cup of coffee before checking datasets next time :hot_beverage::joy:

2 Likes

Hi all,

This is a challenging one, especially if you want to avoid the python nodes for calculation :wink:

Here is my solution: JKISeason 4-28 - Can You Trust Your Office Equipment Data – KNIME Community Hub (sorry, the workflow is a bit messy).

The steps:

  • Clean the data for product and review
  • Compute the Conformity, Duplicate, and Completeness measures per dataset. I wanted to avoid doing the calculation on the full data, as each product would appear many times once joined with the reviews. I had to process the data and do some pivoting to get the percentages.
  • Filter the data based on these criteria and keep “good data only”
  • Join the products and reviews to have a final full table.
  • Visualize the results
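The per-dataset-then-join order described above can be sketched in plain Python. This is only an illustration: the field names, sample rows, and the completeness rule are all invented, not taken from the challenge data.

```python
# A rough sketch of measuring quality per dataset first, then joining
# only the "good" rows. Field names and the completeness rule are invented.

products = [
    {"product_id": "p1", "name": "Chair", "price": 89.0},
    {"product_id": "p2", "name": "Desk",  "price": None},  # incomplete row
]
reviews = [
    {"product_id": "p1", "rating": 5},
    {"product_id": "p1", "rating": 4},
    {"product_id": "p2", "rating": 3},
]

# 1. Measure quality on each dataset separately, so a product is not
#    counted once per review row.
def is_complete(row):
    return all(v not in (None, "") for v in row.values())

good_products = [p for p in products if is_complete(p)]

# 2. Join products and reviews only after filtering out bad rows.
by_id = {p["product_id"]: p for p in good_products}
joined = [{**by_id[r["product_id"]], **r}
          for r in reviews if r["product_id"] in by_id]

print(len(joined))  # 2 — only reviews of complete products survive
```

Measuring on the joined table instead would weight each product by its review count, which is exactly the distortion the per-dataset order avoids.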

This is a simple visualization, only showing numbers in tables. It could be improved.
I decided for this one to show the stats per dataset: knowing the quality of each dataset allows the data scientist to improve the part that is not perfect. And where possible, I show the measure per column (again, only for some of them, for the exercise). It could be extended much further.

Happy to get your feedback, team!

Cheers

Jerome

4 Likes

Find my submission for the challenge : JKISeason4-28 – KNIME Community Hub

3 Likes

Made some use of the interactive data cleaning component at: Interactive Data Cleaning – KNIME Community Hub

…but the simple Statistics View node gave the best entry into the data.
Some columns are better treated as numbers rather than strings, so some processing of the ratings and prices is needed.
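That string-to-number processing might look roughly like this in Python; the input formats (`"$1,299.99"`, `"4.5 out of 5 stars"`) are assumptions about what scraped values could look like, not the actual challenge data.

```python
import re

# A sketch of turning scraped price/rating strings into numbers.
# The input formats are assumed, not taken from the real dataset.

def parse_price(s):
    """'$1,299.99' -> 1299.99; returns None for unparseable values."""
    m = re.search(r"[\d,]+(?:\.\d+)?", s or "")
    return float(m.group().replace(",", "")) if m else None

def parse_rating(s):
    """'4.5 out of 5 stars' -> 4.5."""
    m = re.search(r"\d+(?:\.\d+)?", s or "")
    return float(m.group()) if m else None

print(parse_price("$1,299.99"))            # 1299.99
print(parse_rating("4.5 out of 5 stars"))  # 4.5
print(parse_price("no price listed"))      # None
```

Returning `None` for unparseable values (rather than raising) keeps the missing-price rows visible for the completeness metric instead of crashing the pipeline.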


5 Likes

Greetings KNIMERs

Here is my solution to this week’s challenge.
JKISeason4-28

I have to agree with @trj that this challenge was a tough one. I definitely had to lean on Lesson 3 of the self-paced L4-DA Data Analytics and Visualization: Specialization and utilize the Data Quality Profile components that were built for exactly these data quality metrics (i.e., completeness, uniqueness, etc.). @jproudfoot111 also mentioned the interactive data cleansing component, and I leveraged that component in my workflow as well.

My approach to this challenge was to do a before/after data quality analysis based on the component tools provided.
The Data Quality Profile components were applied before the data cleansing steps; the Interactive Data Cleansing component was applied to the dataset after cleansing.
I then rolled both components up into one visualization for a comprehensive quality profile of the datasets.

I’m not sure if this is the correct approach but I had fun trying. Cheers

Here’s my solution. I have no previous experience with these quality metrics. I used the following definitions generated by Gemini:

To calculate data quality, you use a set of metrics. Completeness is calculated by dividing the number of fields with data by the total number of fields. Uniqueness is measured by dividing the number of unique records by the total number of records. Conformity (or validity) is calculated by dividing the number of records that conform to predefined rules by the total number of records.
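Those three definitions translate almost directly into code. A minimal sketch in plain Python, where the sample records and the conformity rule (price must be a positive number) are made up for illustration:

```python
# Minimal implementations of the three metrics defined above.
# Sample records and the conformity rule are invented for illustration.

def completeness(records, fields):
    """Share of non-empty fields across all records."""
    total = len(records) * len(fields)
    filled = sum(1 for r in records for f in fields if r.get(f) not in (None, ""))
    return filled / total

def uniqueness(records, fields):
    """Share of distinct records among all records."""
    distinct = {tuple(r.get(f) for f in fields) for r in records}
    return len(distinct) / len(records)

def conformity(records, rule):
    """Share of records that satisfy a predefined rule."""
    return sum(1 for r in records if rule(r)) / len(records)

products = [
    {"name": "Chair", "price": 89.0},
    {"name": "Desk",  "price": None},   # missing price
    {"name": "Chair", "price": 89.0},   # duplicate record
]
fields = ["name", "price"]

print(completeness(products, fields))   # 5 filled of 6 fields ≈ 0.83
print(uniqueness(products, fields))     # 2 distinct of 3 records ≈ 0.67
print(conformity(products, lambda r: isinstance(r["price"], float) and r["price"] > 0))
```

Note that conformity needs a rule passed in, which matches the point below: without predefined rules, there is nothing for the metric to check.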

Taken literally, there were no predefined rules provided, so I didn’t attempt to calculate Conformity. By manual inspection, all of the datasets were the same, so Conformity should be 100%. Although it wasn’t part of this challenge, I extracted the review locations and dates so they could be included in further modelling (see below). The biggest problem would appear to be that ~10% of the item prices are missing.

2 Likes

My solution to the challenge:

To be honest, I hadn’t done data quality analysis in KNIME before, but surely it’s real easy.

My approach was:

  • I read the two datasets
  • Changed the appropriate columns to the proper type
    • Price, rating, global rating, review score to number
  • Split the Review date and loc
    • To official date
    • To Location
  • After that I calculated the Completeness, Uniqueness, and Conformity scores
  • In the output I have an overall page and a per column page

As I had extracted the locations, I wanted to use some geo analytics (which could be expanded if I had more time), so I created a map of the countries (it’s just an extra :smiley: )

This challenge was real fun :slight_smile:

4 Likes

The location extraction was a real challenge for me as well. Ultimately I used the String Splitter (Regex) node, which can split the text based on rules.

I used the node with this regex: “^Reviewed in (?:the )?(.*?) on (.+)$”, which is as cryptic as can be, but ChatGPT has no problem understanding (or generating) it :smiley:
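For anyone curious how that regex behaves outside KNIME, here it is applied with Python’s `re` module. The sample review strings are invented to mimic the scraped format:

```python
import re

# The regex quoted above: optional "the ", lazy country capture,
# then the date after " on ". Sample strings are invented.
pattern = re.compile(r"^Reviewed in (?:the )?(.*?) on (.+)$")

samples = [
    "Reviewed in the United States on July 4, 2023",
    "Reviewed in Germany on 12 March 2024",
]

for s in samples:
    m = pattern.match(s)
    if m:
        country, date = m.groups()
        print(country, "|", date)
# United States | July 4, 2023
# Germany | 12 March 2024
```

The lazy `(.*?)` is what makes it work: it stops at the first " on ", so the date part never bleeds into the country capture, and the non-capturing `(?:the )?` drops the article from names like "the United States".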

2 Likes

Quite impressive, especially the consolidated view.

2 Likes

Thank you @AnilKS , you are too kind…

2 Likes

You folks really killed it this week, wow!

2 Likes

Here is my solution, I really did the bare minimum this week. I have simply copied the component from the L4 course and cleaned the data with the Expression node. It took me a while to understand the Bytefreq column from the component. I am quite happy to have this component now. It is also a great workflow to teach basic data management concepts as well as data stewardship. Many thanks for this challenge!

4 Likes

:sun_with_face: Our solution to last week’s Just KNIME It! challenge is out! :boom:

:nerd_face: This was a tough challenge, but you all went above and beyond with composite visualizations that really helped assess the datasets’ quality. :bar_chart: :chart_increasing: It was also interesting to notice that even simpler nodes, such as Statistics View, brought a lot of value to some solutions.

:police_car_light: Now an important announcement: there is no new Just KNIME It! challenge coming out tomorrow! Just like last year, we are giving ourselves and our community a Thanksgiving break. We will be back with a new challenge on December 3rd – stay tuned! :eyes:

2 Likes

Hi @armingrudd

Was that the last one of the season?

No, we’ll have a break this week and continue next week. There are 2 challenges left.

1 Like

Hi folks, thanks for sharing your solutions - great ones! :slight_smile:

I see that some of you shared different approaches to the country extraction - something not so trivial. I want to share (yet another) approach that uses a relatively new node in the Geospatial Analytics extension. The node is called Natural Earth Global Data ( Natural Earth Global Data – KNIME Community Hub ) and, among other things, retrieves the full names of sovereign countries.

In this way, you get a full list of country names in seconds and can perform a substring match with the Value Lookup node (check the advanced settings to find this option).
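The substring-match idea can be sketched in a few lines of Python. The three-country list here is a stand-in for the full Natural Earth list the node would retrieve:

```python
# A sketch of the substring-match idea: scan each review location string
# for any known country name. The short country list is a stand-in for
# the full Natural Earth list.

countries = ["United States", "Germany", "Japan"]

def match_country(text):
    """Return the first known country name found inside the text, else None."""
    for c in countries:
        if c in text:
            return c
    return None

print(match_country("Reviewed in the United States on July 4, 2023"))  # United States
print(match_country("Reviewed in Germany on 12 March 2024"))           # Germany
```

Matching against a complete reference list sidesteps regex edge cases entirely: no pattern has to anticipate articles, multi-word names, or date formats.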

Hope it helps :slight_smile:

4 Likes

Hi @armingrudd

I noticed a Java heap space error on the Data Quality Profile component for the “Review” dataset, sourced from the Metanode for the Frequency Bar Chart processing. I had to remove the Sorter node in that part of the workflow, and that seemed to work for me when running the workflow.

Just sharing some feedback here. Thanks, BG

2 Likes