Solutions to “Just KNIME It!” Challenge 28 - Season 4

:sun_with_face: Hello, everybody! Today we have a new Just KNIME It! challenge on data profiling.

:necktie: Your company is planning to upgrade its office setup with new chairs, desks, monitors, and other essentials. To help with the decision, someone scraped product details and reviews to start an analysis, but before trusting this data you’ve been asked to assess its quality. Your task is to evaluate two datasets (product details and product reviews) and create a data quality profile for each. :mag: Can you trust this scraped data?

Here is the challenge. Let’s use this thread to post our solutions to it, which should be uploaded to your public KNIME Hub spaces with the tag JKISeason4-28.

:sos: Need help with tags? To add tag JKISeason4-28 to your workflow, go to the description panel in KNIME Analytics Platform, click the pencil to edit it, and you will see the option for adding tags right there. :blush: Let us know if you have any problems!

Hey @armingrudd :waving_hand: :slightly_smiling_face:

Just to confirm for Challenge 28, is there only one dataset file available right now? Because the task says we’re supposed to evaluate two datasets (product details and product reviews).

Hi Arief, I noticed the Excel file has two tabs: product details and product reviews.

Those must be the two datasets.

Thanks, @garcbcpa! I appreciate the clarification :folded_hands:

Turns out I wasn’t paying enough attention. It’s still 5 AM here in Jakarta, and I guess my brain wasn’t fully loaded yet.

Looks like I definitely need breakfast and a cup of coffee before checking datasets next time :hot_beverage::joy:

Hi all,

This is a challenging one, especially if you want to avoid the Python nodes for the calculations :wink:

Here is my solution: JKISeason 4-28 - Can You Trust Your Office Equipment Data – KNIME Community Hub (sorry, the workflow is a bit messy).

The steps:

  • Clean the product and review data
  • Measure Conformity, Duplicates and Completeness for each dataset separately. I wanted to avoid running the calculation on the joined data, because each product appears many times once it is joined with its reviews. I had to process the data and do some pivoting to get the percentages.
  • Filter the data based on these criteria and keep “good data only”
  • Join the products and reviews to have a final full table.
  • Visualize the results
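For anyone who does want to go the Python route, the three measures from the steps above could be sketched roughly as follows. This is only an illustration and not a translation of the workflow: the column names, the sample rows and the price pattern used as the conformity rule are all assumptions, and the real challenge data may need different rules.

```python
import re

# Hypothetical sample rows; the column names are assumptions,
# not the real schema of the challenge file.
products = [
    {"product_id": "P1", "name": "Desk", "price": "199.99"},
    {"product_id": "P2", "name": "Chair", "price": None},
    {"product_id": "P2", "name": "Chair", "price": None},     # exact duplicate
    {"product_id": "P3", "name": "Monitor", "price": "abc"},  # non-conforming price
]

# Assumed conformity rule: digits with an optional 1-2 digit decimal part.
PRICE_PATTERN = re.compile(r"^\d+(\.\d{1,2})?$")


def is_missing(value):
    """Treat None and blank strings as missing."""
    return value is None or str(value).strip() == ""


def completeness_pct(rows):
    """Percentage of non-missing cells over all cells."""
    cells = [value for row in rows for value in row.values()]
    filled = sum(1 for value in cells if not is_missing(value))
    return 100.0 * filled / len(cells)


def duplicate_pct(rows):
    """Percentage of rows that exactly repeat an earlier row."""
    seen = set()
    duplicates = 0
    for row in rows:
        key = tuple(sorted(row.items()))
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return 100.0 * duplicates / len(rows)


def conformity_pct(rows, column, pattern):
    """Percentage of non-missing values in `column` that match `pattern`."""
    values = [row[column] for row in rows if not is_missing(row[column])]
    matching = sum(1 for value in values if pattern.match(str(value)))
    return 100.0 * matching / len(values)


profile = {
    "completeness": completeness_pct(products),
    "duplicates": duplicate_pct(products),
    "price_conformity": conformity_pct(products, "price", PRICE_PATTERN),
}
print(profile)
```

Running the same functions on the reviews table before joining keeps the percentages from being inflated by products that repeat once per review.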

This is a simple visualization that only shows numbers in tables. It could be improved.
For this one, I decided to show the stats per dataset: knowing the quality of each dataset lets the data scientist improve the parts that are not perfect. Where possible, I also show the measures per column, though only for some of them in this exercise. It could be extended much further.
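As a rough idea of what a per-column measure can look like outside KNIME, here is a small Python sketch of completeness per column. The review rows and column names are made up for illustration and are not taken from the challenge file.

```python
# Hypothetical review rows; column names are assumptions for illustration only.
reviews = [
    {"review_id": "R1", "product_id": "P1", "rating": "5", "text": "Great"},
    {"review_id": "R2", "product_id": "P1", "rating": "", "text": "OK"},
    {"review_id": "R3", "product_id": None, "rating": "4", "text": None},
    {"review_id": "R4", "product_id": "P2", "rating": "3", "text": " "},
]


def is_missing(value):
    """Treat None and blank strings as missing."""
    return value is None or str(value).strip() == ""


def completeness_by_column(rows):
    """Return {column: percentage of non-missing values} for a list of dict rows."""
    result = {}
    for column in rows[0].keys():
        filled = sum(1 for row in rows if not is_missing(row.get(column)))
        result[column] = 100.0 * filled / len(rows)
    return result


per_column = completeness_by_column(reviews)
for column, pct in per_column.items():
    print(f"{column}: {pct:.1f}% complete")
```

A table like this makes it easy to see which columns drag the overall score down, which is the point of reporting per column rather than only per dataset.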

Happy to get your feedback, team!

Cheers

Jerome