How to pivot Tags of Time tracking records

schalex · October 6, 2022, 8:28pm

Good Evening to everyone,

I am working on some time tracking records of my personal daily activities.
I prepared a minimal example to show you the structure of the dataset:
Time Tracking Minimal.knwf (70.8 KB)

As you can see, each row has a Tracking_Day, Activity, Tags and Duration. For the tags, I am using up to four as a maximum in the full data set. In My minimal example, there are maximum two tags in the columns Row3, 9, 11, 13 & 14.

I would like to have at the end a table which has the Date and Activity separated, and the Tags pivoted with the Duration divided to each Tag (if a tracking has three tags, it will be divided by three for each tag column).
For example, the first 4 rows could look like this (of course differently formatted, just tried to put it to excel to show what I mean):

As you can see in my minimal workflow, I already tried to start with the pivot function, but I am not happy with the result so far because I do not know how to continue.
It would be great to have the time not summed up but separate in a collection or list, because I would like to later count also the elements before summing them up.

Therefore, At the final End, I would like to have a table with a ROW2 like this for Besserwerden:

My Challenge is right now, I do not know how to transform the Tags and I do not know how to transform the lists to a count and a sum of the Duration.

I am happy about every feedback and idea to solve this challenge

Have a good evening every one.

Greetings Alex

ArjenEX · October 6, 2022, 9:49pm

Hello @schalex

Some of the numbers that you present as expected output do not make a lot of sense at the moment.

Let’s take 2022-08-19. You have 4 activities:

If I consider the duration of Besserwerden, it’s ~50 minutes plus 30 minutes which is 1H20M like you have in your Besserwerden (Summe). Of those 4 activities, #health is also present in 3 of them. Yet you only account for a single record of 40 minutes (which presumable comes from the one that only has #health as tag.

This selectiveness appears quite contradicting to me considering what you wrote here.

Moreover, in your current workflow you actually account for the 3 instances of #health

Can you please confirm your expected output?

Regardless, would this be a workable output for you?

Note: applied a column filter to Health and Besserwerden for illustration purposes.

If so, I will elaborate on a potential way to approach it later.

schalex · October 7, 2022, 4:22am

Dear @ArjenEX,
Thank you so much for replying so fast and showing me this. I made a mistake with the example because I have not seen the #Health Tag in the other rows, when preparing the example output.

You are totally right with your example at the 2022-08-19.

In the current Workflow I posted, the time entries are correct shifted, only the time entry dividing by the number of tags in the row, is not working, as well as, the aggregation of tag columns after the join to one column of each tag.

Your example output looks very much what I would like to get at the end of the transformation.

Really, thank you so much for replying me so fast and helping me to solve this challenge.
I am so happy to have your support

Greetings Alex

ArjenEX · October 7, 2022, 8:35am

HI @schalex

Roger that!

To start, try to always name your nodes in a KNIME according to what they are doing. In smaller workflows it’s not a big issue but once they grow larger you’ll find yourself wasting a lot of time trying to find again where a particular transformation took place.

This is a way to do it. Perhaps can de done quicker but it gets the job done.

Some noteworthy items:

Since you want to apply a calculation to the duration, I changed the Date&Time difference from duration to granularity of seconds. This will make life a lot easier down the line since durations are a bit tricky to work with.

I changed your cell splitter from # to \n. That eliminates the additional column. Make sure to escape it. This keeps the hash, but it will be of good use later on.

Since you want to evaluate your information on an individual tag level, I opt to ungroup them so that all tags have their own data row and can be processed. For this, I use the Create Collection node with a wildcard filter on tags_* to ensure that it picks up the tags splitter earlier.

This has the additional benefit that you can apply a function to count the number of elements in the collection. As such,
with round(column("Duration") / arrayLength(column("AggregatedValues"))) you can achieve your desire to divide the duration by the number of tags of that record.

The duration is now normalized.

Next step is to ungroup the collection. Taken the earlier mentioned example, you’ll see that the duration of besserwerden and health is 1499 seconds instead of 2997.

Before:

After:

Moving on, I kept your double pivot more or less in place because it’s quite manageable at the moment. Here, I’m brining back the duration format with durationOfSeconds(column("Duration"))

After the pivoting, both dataflows need to come together. However, since they have the same names I apply a double Column Rename (Regex). This is an easy way to apply a suffix to all columns. Here (sum) and (count) respectively.

Finally bring them together with a Column Appender.

Noticeably, it’s not sorted very conveniently. This can be fixed with a Column Resorter but it has to be done dynamically since you don’t know what tags you in advance. As such, make it flow variable controlled (you have seen that before).

When working with flow variables, make sure that comply with the required data type. In this case, string list.

For this, I use a metanode to re-arrange it. Main node here is the Extract Column Header.

It ends with a dynamically created sorting of the columns.

The rest is pretty self-explanatory. You’ll quickly see what it’s doing.

Final result:

See WF:
Time Tracking Minimal - edited knime forum.knwf (139.5 KB)

Have fun!

schalex · October 7, 2022, 1:17pm

@ArjenEX Thank you so much for this amazing answer!
I started implementing it and need more time tonight for finishing.
I learned already so much with this post, it is totally amazing!!!

The learning started with the first recommendation of naming the nodes after what they are doing. I did not know till now that there I am way to name them
But how to name them? I looked for some help in the forum and only find this post about Meta nodes and their names: how to rename a node? - #3 by Geo

Maybe I have a thinking Mistake here.
I have to say it again, your explanation is excellent! Especially thank you so much for the extra explanations of the most important steps in this workflow and the WF file.

Maybe another little question about this forum, is there a way how I can send you a PM?

I will write here of course when I encounter something tonight when applying your feedback and suggestions.
I wish you a nice rest of the day.
Kind regards, Alex

ArjenEX · October 7, 2022, 2:22pm

Double click on the node description (Node xxx) below and it will pop-up the text editing box.

For metanodes and components, use the Reconfigure option.

Unfortunately you can’t.

schalex · October 8, 2022, 8:35am

@ArjenEX, Thank you so much for your support.
I started implementing the whole workflow, as you showed me, and really like so much what you explained me and I learned.

I have done one little adjustments during the implementation due to missing details from my side when explaining the dataset.

The Tags Element could also be empty. Therefore, I added a little If Clause in the Column Expressions Block to filter for this case and not divide by inf:

if(arrayIsEmpty(column("Tags"))) {
    column("Duration") 
} else {
    round(column("Duration")/arrayLength(column("Tags")))
}

I could follow all your explanations and really like the details you are indicating and showing. For example, in the Column Resorter you highlighted the indication of the String type which is expected from the flow variable. That is just amazing!
I never Used Ungroup so far and also see GropuBy now with a whole more power than before.

I changed a little bit the Column Header sorting flow, because I wanted to have the Date and Activity at the first column while sorting the Tags Alphabetically. This is my little adjustment:

I only Sort the Tags and afterwards add the date and activity again to it in the Concatenate block.

I like a lot the Metanode potential for making everything better manageable. Is there a way, to sync Metanodes? What I mean with sync is the following, I Used the Sort Alphabetically meta node at another place afterwards. Then I noticed that something should be changed inside of it. Of course, this change was afterwards only in this second instance of the Metanode and not in the first one. Is there a way, to have changes in one Metanode done automatically in the other one as well?

@ArjenEX please feel free to connect with me, you can find my details here for 12 hours: My LinkedIn contact details - Pastebin.com

I wish you a nice rest of the day. Kind regards, Alex

ArjenEX · October 8, 2022, 10:06am

You sure can. I would create a folder in your local workspace dedicated to this first and then use the share option of the metanode/component.

From this point, the metanode will be available in your Explorer and it’s a matter of dragging and dropping the metanode into other workflows.

If you want to update it, disconnect it, makes the changes and share it again (overwrite it). Next time you open a workflow that contains that meteanode, you will be prompted to update it in that workflow which will be done automatically.

More here:

schalex · October 9, 2022, 6:58pm

Good Evening,
I am working with the grouped Data, thanks to @ArjenEX for the help with the preparation of it.
During my work, I noticed that it is a bit inconvenient to plot the data in my current structure.
For example, when I want to plot Sport instead of the tags of sport, I have to group and pivot the data in several steps, which makes it very complicated to manage.
Also, if I want to plot only the # Columns which are available in Sport I need several nodes for preparing the table for plotting.

Therefore, I wanted to ask in general, what kind of data structure do you recommend me to built with this tracking data, for working with it better afterwards?
I want to wrangle with the data, plot sometimes the sport and then look again into the tags when evaluating in more detail some effects and relations.

Maybe as an extra note, I also join this table with sleep data from each night. In the current format of the table, the Joint table has for each tracking date the identical sleep data because of the matching by date and multiple dates on the tracking with single dates on the sleep data.
I have the feeling that I should find a way of having all data in Days in my file, before processing it, but the multiple tags for Activities like sport and Private make it a bit complicated for me to assess how this table should be structured for an easier wrangling afterwards.

I thank everyone helping me. I am doing right now the L1-DW course, but wanted to work parallel with my little dataset for applying my learnings
I wish you a nice rest of the day. Kind regards, Alex

badger101 · October 9, 2022, 10:00pm

That’s really helpful @ArjenEX , I had never used the Share function before. I can see how it will help me in the future.

system · October 16, 2022, 10:01pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.