Transform rows into start_date / end_date intervals

Hello!

I have these data in a table:

Id    type      timestamp
1234  active    2022-05-16T20:12:41
1234  inactive  2022-06-05T20:43:11
1234  active    2022-06-06T04:18
1234  inactive  2022-06-06T07:36:51
1234  active    2022-06-06T11:08:03
1234  inactive  2022-06-09T16:35:42
1234  active    2022-06-09T16:36:54
1234  inactive  2022-06-09T16:43:06
1234  inactive  2022-06-16T11:50:15
1234  inactive  2022-06-24T13:18:24
1234  inactive  2022-06-25T08:00:23
1234  inactive  2022-07-01T23:20:59
2345  active    2022-05-16T20:12:41
2345  inactive  2022-06-05T20:43:11
2345  active    2022-06-06T04:18
2345  inactive  2022-06-06T07:36:51
2345  active    2022-06-06T11:08:03
2345  inactive  2022-06-09T16:35:42
2345  active    2022-06-09T16:36:54
2345  inactive  2022-06-09T16:43:06
2345  inactive  2022-06-16T11:50:15
2345  inactive  2022-06-24T13:18:24
2345  inactive  2022-06-25T08:00:23
2345  inactive  2022-07-01T23:20:59

And I want to transform it into this structure:

Id    type      start_date           end_date
1234  active    2022-05-16T20:12:41  2022-06-05T20:43:11
1234  inactive  2022-06-05T20:43:11  2022-06-06T04:18
1234  active    2022-06-06T04:18     2022-06-06T07:36:51
1234  inactive  2022-06-06T07:36:51  2022-06-06T11:08:03
1234  active    2022-06-06T11:08:03  2022-06-09T16:35:42
1234  inactive  2022-06-09T16:35:42  2022-06-09T16:36:54
1234  active    2022-06-09T16:36:54  2022-06-09T16:43:06
1234  inactive  2022-06-09T16:43:06  2022-07-01T23:20:59
2345  active    2022-05-16T20:12:41  2022-06-05T20:43:11
2345  inactive  2022-06-05T20:43:11  2022-06-06T04:18
2345  active    2022-06-06T04:18     2022-06-06T07:36:51
2345  inactive  2022-06-06T07:36:51  2022-06-06T11:08:03
2345  active    2022-06-06T11:08:03  2022-06-09T16:35:42
2345  inactive  2022-06-09T16:35:42  2022-06-09T16:36:54
2345  active    2022-06-09T16:36:54  2022-06-09T16:43:06
2345  inactive  2022-06-09T16:43:06  2022-07-01T23:20:59

Any idea how I do that?

Thank you!
Razvan

Hi @razvancomanici,

What I take the requirement to be is this: you want to find blocks of rows, where consecutive rows with the same “type” belong to the same block. A block’s start_date is then the earliest timestamp in the block, and its end_date is the start_date of the next block.

To achieve this, the first step is to find the points at which the “blocks” start and end. We can use a Lag Column node to place the previous row’s “type” on each row. After that, a Rule Engine node can identify where a block starts (a block starts where the “type” differs from the previous row’s type).
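For readers who prefer to see the idea outside of KNIME, here is a minimal pandas sketch of those two steps. The data frame, column names and sample rows are made up for illustration only; the comments name the KNIME nodes they stand in for.

```python
import pandas as pd

# Illustrative stand-in for the table above (a few rows of Id 1234).
# Timestamps are kept as ISO-8601 strings, which also sort correctly as text.
df = pd.DataFrame({
    "Id":        [1234, 1234, 1234, 1234, 1234],
    "type":      ["active", "inactive", "active", "inactive", "inactive"],
    "timestamp": ["2022-05-16T20:12:41", "2022-06-05T20:43:11",
                  "2022-06-06T04:18", "2022-06-06T07:36:51",
                  "2022-06-16T11:50:15"],
})

# "Lag Column": put the previous row's "type" on the current row, per Id
df["type_lag"] = df.groupby("Id")["type"].shift(1)

# "Rule Engine": 1 where a new block starts (type differs from the previous row), else 0
df["type_changed"] = (df["type"] != df["type_lag"]).astype(int)
```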

What you then want to do is give each “block” a unique number.

There is a useful pattern for this: if the Rule Engine returns a 1 on each row where the type changes and a 0 where it doesn’t, a “Moving Aggregation” node can compute the cumulative sum of those 1s and 0s.

How does this help? This little trick gives us a “group number”: the “sum(type_changed)” column in the resulting table coincides exactly with the groupings of “type”.
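Continuing the illustrative pandas sketch from above, the cumulative sum is a single line:

```python
# "Moving Aggregation" (cumulative sum): running total of the change flags,
# which numbers the blocks of consecutive identical "type" values per Id.
# This corresponds to the "sum(type_changed)" column described above.
df["grouping_number"] = df.groupby("Id")["type_changed"].cumsum()
```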

Once we have this, we can “group by” this column (which I’ll rename to “grouping number” for clarity) and find the min(timestamp) as “start_date” and the max(timestamp) as “end_date” for each “grouping number”.
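A rough pandas equivalent of that GroupBy step, continuing the same sketch (and relying on the fact that ISO-8601 strings compare correctly as text):

```python
# "GroupBy": earliest and latest timestamp within each block
bounds = (
    df.groupby(["Id", "grouping_number"], as_index=False)
      .agg(start_date=("timestamp", "min"), end_date=("timestamp", "max"))
)
```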

Finally, join those start_dates and end_dates back to the full dataset (using “grouping number” as the join key), remove the columns you no longer need, and apply a Duplicate Row Filter to keep only the first row of each “grouping number”. That gives the output you need.
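And the final join and de-duplication, again only as a rough pandas illustration of the node sequence described above, not the workflow in the attached file:

```python
# "Joiner": attach start_date / end_date to every row of the block,
# then keep one row per block (the Duplicate Row Filter step)
result = (
    df.merge(bounds, on=["Id", "grouping_number"])
      .drop_duplicates(subset=["Id", "grouping_number"], keep="first")
      .loc[:, ["Id", "type", "start_date", "end_date"]]
)
print(result)
```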


Group data and transform from rows to columns.knwf (28.4 KB)


Yes, that helps! Thanks a lot!