Data Exploration in #66daysofdata with KNIME

The #66daysofdata challenge is a self-motivational approach to learning or deepening any branch of data science: you dedicate roughly 5-15 minutes every day to completing a small task within a broader project.

Keen on deepening your knowledge of data exploration techniques? Fancy building interactive dashboards? Wonder no more and join the #66daysofdata challenge about data visualization with KNIME! You can find an overview of the project on the KNIME Blog (out on September 20th).

Feeling lost? Need help? Unsure how to proceed or configure a node? Feel free to post your questions in this forum thread or simply search the KNIME Forum - very likely someone else has had a similar question before!

The Moderators of the #66daysofdata with KNIME


“Perseverance is not a long race; it is many short races one after the other.”
– Walter Elliot

Eagerly waiting for the Challenge

The overview of the #66daysofdata journey is finally out! Find it on the KNIME Blog.

Some of you have been asking how to change a workflow’s metadata. This is very useful for documentation, especially for giving your workflow a meaningful title, description and tags. Have a look at the picture below if you want to know more.

Day 10. Using the Data Explorer node, what is the average popularity of the songs in the tracks.csv dataset? What is the average speechiness? And what can this mean?

Day 11. The histogram. This question came up on Twitter. Where to use the equal frequency…and what is the right bin size?

@rs1 Day 10. Using the Data Explorer node, what is the average popularity of the songs in the tracks.csv dataset? What is the average speechiness? And what can this mean?

The mean of those features is shown in the picture below (originally tweeted here by @humanoid_ivan)

The popularity of a song ranges from 0 to 100. An average value of 27.57 could sound low, but we probably have to consider the huge number of songs available out there and how few actually manage to break through.

What surprised me instead was the very low average speechiness (0.105). Apparently words are really needless sometimes.

@rs1 Day 11. The histogram. This question came up on Twitter. Where to use the equal frequency…and what is the right bin size?

In the Binning tab of the Histogram node we can set the number of bins we want and whether they should be divided by equal width or frequency. Notice that it is possible to name bins according to the range they represent (Bin Naming: Borders).

Equal Width Histogram

Equal width binning is the most common choice for visual representation: by dividing the data range into same-size bins, we can clearly see the data distribution. In the example below, we can see how most of the values fall into the [0, 0.139] range. Also notice the long tail due to a few very high values.

Equal Frequency Histogram

Equal frequency binning is less common when it comes to histograms, since it tends to generate a flatter visualization, like in the example below. However, information about the data distribution can still be read from the x-axis. Notice how the range of each bin is small, except for the very last one, which had to be “stretched” in order to include the higher values.

Although not particularly interesting for histograms per se, binning by equal frequency is still a valuable option when dealing with outliers.
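For anyone who wants to see the difference between the two binning strategies outside of KNIME, here is a minimal Python sketch using NumPy. It is only an illustration and not what the Histogram node does internally: the helper names `equal_width_edges` and `equal_frequency_edges` are made up for this post, and the data is a synthetic right-skewed sample, not the tracks.csv dataset.

```python
import numpy as np

def equal_width_edges(values, n_bins):
    """Bin edges that split the value range into n_bins same-size intervals."""
    values = np.asarray(values, dtype=float)
    return np.linspace(values.min(), values.max(), n_bins + 1)

def equal_frequency_edges(values, n_bins):
    """Bin edges at evenly spaced quantiles, so each bin holds roughly the same number of values."""
    values = np.asarray(values, dtype=float)
    return np.quantile(values, np.linspace(0.0, 1.0, n_bins + 1))

# Synthetic right-skewed sample (an assumption for illustration, NOT tracks.csv):
# many small values and a few large ones.
rng = np.random.default_rng(0)
sample = rng.exponential(scale=0.1, size=1000)

width_edges = equal_width_edges(sample, 5)
freq_edges = equal_frequency_edges(sample, 5)

# Count how many values fall into each bin for both strategies.
width_counts, _ = np.histogram(sample, bins=width_edges)
freq_counts, _ = np.histogram(sample, bins=freq_edges)

print("equal width counts:        ", width_counts)
print("equal frequency counts:    ", freq_counts)
print("equal frequency bin widths:", np.diff(freq_edges))
```

With skewed data like this, the equal width counts pile up in the first bin, while the equal frequency counts come out roughly even and the last bin is much wider than the others, which matches what we see in the two histograms.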

Number of bins

The number of bins varies according to the data you are dealing with and the information you want to convey. There is no fixed rule. The easiest way is to try out different values and visually inspect the results.
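If you like to experiment in code as well, the "try different values and look" approach can be sketched in Python as follows. This is just a rough illustration: `text_histogram` is a hypothetical helper written for this post, the bin counts 5, 10 and 20 are arbitrary choices, and the data is again a synthetic skewed sample rather than tracks.csv.

```python
import numpy as np

# Synthetic right-skewed sample (an assumption for illustration, NOT tracks.csv).
rng = np.random.default_rng(1)
sample = rng.exponential(scale=0.1, size=500)

def text_histogram(values, n_bins, width=40):
    """Return one text line per equal-width bin, with a bar scaled to the largest count."""
    counts, edges = np.histogram(values, bins=n_bins)
    lines = []
    for count, left, right in zip(counts, edges[:-1], edges[1:]):
        bar = "#" * int(round(width * count / counts.max()))
        lines.append(f"{left:.3f} to {right:.3f} {count:4d} {bar}")
    return lines

# Print the same data with a few different bin counts and compare the shapes.
for n_bins in (5, 10, 20):
    print(f"--- {n_bins} bins ---")
    print("\n".join(text_histogram(sample, n_bins)))
```

Too few bins hide the shape of the distribution, while too many make it noisy, so comparing a handful of settings side by side is usually enough to pick one.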
