Transform rows into start_date / end_date intervals

Hello!

I have these data in a table:

Id    type      timestamp
1234  active    2022-05-16T20:12:41
1234  inactive  2022-06-05T20:43:11
1234  active    2022-06-06T04:18
1234  inactive  2022-06-06T07:36:51
1234  active    2022-06-06T11:08:03
1234  inactive  2022-06-09T16:35:42
1234  active    2022-06-09T16:36:54
1234  inactive  2022-06-09T16:43:06
1234  inactive  2022-06-16T11:50:15
1234  inactive  2022-06-24T13:18:24
1234  inactive  2022-06-25T08:00:23
1234  inactive  2022-07-01T23:20:59
2345  active    2022-05-16T20:12:41
2345  inactive  2022-06-05T20:43:11
2345  active    2022-06-06T04:18
2345  inactive  2022-06-06T07:36:51
2345  active    2022-06-06T11:08:03
2345  inactive  2022-06-09T16:35:42
2345  active    2022-06-09T16:36:54
2345  inactive  2022-06-09T16:43:06
2345  inactive  2022-06-16T11:50:15
2345  inactive  2022-06-24T13:18:24
2345  inactive  2022-06-25T08:00:23
2345  inactive  2022-07-01T23:20:59

And I want to transform it into this structure:

Id    type      start_date           end_date
1234  active    2022-05-16T20:12:41  2022-06-05T20:43:11
1234  inactive  2022-06-05T20:43:11  2022-06-06T04:18
1234  active    2022-06-06T04:18     2022-06-06T07:36:51
1234  inactive  2022-06-06T07:36:51  2022-06-06T11:08:03
1234  active    2022-06-06T11:08:03  2022-06-09T16:35:42
1234  inactive  2022-06-09T16:35:42  2022-06-09T16:36:54
1234  active    2022-06-09T16:36:54  2022-06-09T16:43:06
1234  inactive  2022-06-09T16:43:06  2022-07-01T23:20:59
2345  active    2022-05-16T20:12:41  2022-06-05T20:43:11
2345  inactive  2022-06-05T20:43:11  2022-06-06T04:18
2345  active    2022-06-06T04:18     2022-06-06T07:36:51
2345  inactive  2022-06-06T07:36:51  2022-06-06T11:08:03
2345  active    2022-06-06T11:08:03  2022-06-09T16:35:42
2345  inactive  2022-06-09T16:35:42  2022-06-09T16:36:54
2345  active    2022-06-09T16:36:54  2022-06-09T16:43:06
2345  inactive  2022-06-09T16:43:06  2022-07-01T23:20:59

Any idea how I do that?

Thank you!
Razvan

Hi @razvancomanici,

What I take the requirement to be is this: you want to find blocks of rows, where consecutive rows with the same “type” belong to the same block. A block’s start_date is then the earliest timestamp in the block, and its end_date is the start_date of the next block.

To achieve this, the first step is to find the points at which the “blocks” start and end. We can use a Lag Column node to place the previous row’s “type” on each row. After that, a Rule Engine node can identify where a block starts (a block starts where the “type” differs from the previous row’s type).
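For readers who prefer to see the idea outside of KNIME, here is a minimal pandas sketch of those two steps. The data frame, column names and sample rows are made up for illustration only; the comments name the KNIME nodes they stand in for.

```python
import pandas as pd

# Illustrative stand-in for the table above (a few rows of Id 1234).
# Timestamps are kept as ISO-8601 strings, which also sort correctly as text.
df = pd.DataFrame({
    "Id":        [1234, 1234, 1234, 1234, 1234],
    "type":      ["active", "inactive", "active", "inactive", "inactive"],
    "timestamp": ["2022-05-16T20:12:41", "2022-06-05T20:43:11",
                  "2022-06-06T04:18", "2022-06-06T07:36:51",
                  "2022-06-16T11:50:15"],
})

# "Lag Column": put the previous row's "type" on the current row, per Id
df["type_lag"] = df.groupby("Id")["type"].shift(1)

# "Rule Engine": 1 where a new block starts (type differs from the previous row), else 0
df["type_changed"] = (df["type"] != df["type_lag"]).astype(int)
```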

What you then want to do is give each “block” a unique number.

There is a useful pattern for this: if the Rule Engine returns a 1 on each row where the type changes and a 0 where it doesn’t, a “Moving Aggregation” node can compute the cumulative sum of those 1s and 0s.

How does this help? This little trick gives us a “group number”: the “sum(type_changed)” column in the resulting table coincides exactly with the groupings of “type”.
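Continuing the illustrative pandas sketch from above, the cumulative sum is a single line:

```python
# "Moving Aggregation" (cumulative sum): running total of the change flags,
# which numbers the blocks of consecutive identical "type" values per Id.
# This corresponds to the "sum(type_changed)" column described above.
df["grouping_number"] = df.groupby("Id")["type_changed"].cumsum()
```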

Once we have this, we can “group by” this column (which I’ll rename to “grouping number” for clarity) and find the min(timestamp) as “start_date” and the max(timestamp) as “end_date” for each “grouping number”.
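A rough pandas equivalent of that GroupBy step, continuing the same sketch (and relying on the fact that ISO-8601 strings compare correctly as text):

```python
# "GroupBy": earliest and latest timestamp within each block
bounds = (
    df.groupby(["Id", "grouping_number"], as_index=False)
      .agg(start_date=("timestamp", "min"), end_date=("timestamp", "max"))
)
```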

Finally, join those start_dates and end_dates back to the full dataset (using “grouping number” as the join key), remove the columns you no longer need, and apply a Duplicate Row Filter to keep only the first row of each “grouping number”. That gives the output you need.
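And the final join and de-duplication, again only as a rough pandas illustration of the node sequence described above, not the workflow in the attached file:

```python
# "Joiner": attach start_date / end_date to every row of the block,
# then keep one row per block (the Duplicate Row Filter step)
result = (
    df.merge(bounds, on=["Id", "grouping_number"])
      .drop_duplicates(subset=["Id", "grouping_number"], keep="first")
      .loc[:, ["Id", "type", "start_date", "end_date"]]
)
print(result)
```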


Group data and transform from rows to columns.knwf (28.4 KB)


Yes, that helps! Thanks a lot!