How do I split a column when a value drops?

Hello everyone,
I am using KNIME version 4.6.1 to process data from a production process.
I have a dataset from a level transmitter. The level in the tank rises continuously until the content is pumped away, after which the process repeats.
What I want to do is split the dataset every time this happens, so that each fill-and-empty cycle becomes a separate block of data.
Here is an image for clarification. The arrows are where I would like to separate the data.


Thanks everyone for your time.
Kwinten

Hello @Kwinten,

and welcome to KNIME Community!

What’s the logic behind separating the blocks? I mean, how do you know when the tank has emptied? The arrows are confusing me. Is the average of 58.377 already after the tank was emptied, or is it 66.52?

Br,
Ivan

First of all, thanks for the reply @ipazin,

I will give a little more context to hopefully make it clearer. Filling the tank is a continuous process; the process downstream of the level transmitter is a batch process. So, every time the level value drops, for example from 71.331 to 58.377, a batch is created. The idea is that by separating the column into blocks I will be able to track the batches in a day’s worth of data.

For the image, I would like to cut the column after the first level drop. So, the first block in the image would be [Row36 – Row43] and the second block would be [Row44 – Row50].

I hope this clarifies my goal.

Kwinten

Just curious, what makes the drop from 81.115 to 71.331 different than the one from 71.331 to 58.377?

For the image, I would like to cut the column after the first level drop.

I believe you mean after the last level drop here; otherwise that rule would already apply to the drop from 81.115 to 71.331 and the cut should be made there, right?
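To make the rule concrete, here is a minimal Python sketch of that “cut after the last drop” logic. The series below is made-up dummy data arranged around the numbers quoted in this thread; the drop from 81.115 to 71.331 produces no cut because the level keeps falling afterwards:

# dummy series; only 81.115, 71.331, 58.377 and 66.52 come from the thread
levels = [62.0, 70.5, 81.115, 71.331, 58.377, 60.1, 66.52]

cuts = []
for i in range(1, len(levels) - 1):
    # a row ends a block only if it is lower than its predecessor
    # (still dropping) AND lower than its successor (refill has started)
    if levels[i] < levels[i - 1] and levels[i] < levels[i + 1]:
        cuts.append(i)

print(cuts)  # [4] -> one cut after 58.377, none after 71.331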


@ArjenEX
Yes, you are right. This is a mistake on my part. It should indeed be after the last level drop; otherwise it would create two batches instead of one.
Thanks for pointing it out.
Kwinten

If you upload a workflow with some dummy data embedded, it typically speeds up responses. Adding to a workflow solution is usually quick, but having forum members enter the dummy data themselves is prohibitively slow and often leads to solutions that don’t suit your aim.


Hi @Kwinten , from the info you gave, I think the 2nd block should be [Row44-46], if you look at the numbers carefully?

The Lag Column node can probably be of help here. As already pointed out, sample data is needed.

Noted @Kwinten!

Like others have mentioned, an uploaded dataset would do a lot in cases like this to avoid others having to recreate it from scratch.

You can approach this in various ways with the different nodes out there. It also depends a bit on how familiar you are with coding. Below is a way to solve it with the Column Expressions node.

For the sake of saving time, I treat your timestamps as a sequence of IDs.


I first want to determine whether the current value has increased or decreased compared to the previous row, to later establish the groups based on this. With the updated syntax of the Column Expressions node in 4.6, you can achieve this with:

if (column("Average",-1) <= column("Average")) {
    "Increase"
} else {
    "Decrease"
}

Since we established that the cut-off should be after the last “decrease”, I determine the end of each group by finding all rows where next_val = Decrease and the next row’s next_val = Increase.

if (column("next_val").equals("Decrease") && column("next_val",1).equals("Increase")) {
    md5Checksum(rand()) 
} else {
    null
}

To make the groups easier to identify later, I opted to use a hash based on a rand() value as the identifier. The group_id output now marks the endpoint of each group. Notably, @badger101 is right that there is another group hidden in there.

To complete the group_ids, you can use the Missing Value node with “Next Value” as the replacement.
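As a rough Python illustration (pandas assumed; the identifiers here are made up), the “Next Value” replacement is effectively a backward fill of the endpoint identifiers:

import pandas as pd

# endpoint hashes as produced above, None everywhere else (dummy values)
df = pd.DataFrame({"group_id": [None, None, "a1", None, None, "b2"]})
df["group_id"] = df["group_id"].bfill()  # what the "Next Value" replacement does
print(df["group_id"].tolist())  # ['a1', 'a1', 'a1', 'b2', 'b2', 'b2']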

If you then start a Group Loop based on this identifier, you can do whatever your use case desires (like writing it to a separate file, etc.).

Add a general Loop End and enable the “Add Iteration Column” option to visually see which records got processed within each group.
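In plain Python terms, the loop behaves roughly like a pandas groupby over the completed group_id column (a sketch only; values and file names are made up):

import pandas as pd

df = pd.DataFrame({
    "Average":  [62.0, 70.5, 81.115, 71.331, 58.377, 60.1],  # dummy values
    "group_id": ["a1", "a1", "a1", "a1", "a1", "b2"],
})
for iteration, (gid, block) in enumerate(df.groupby("group_id", sort=False)):
    # one loop iteration per batch; do whatever the use case needs,
    # e.g. write every block to its own file
    block.assign(iteration=iteration).to_csv(f"batch_{iteration}.csv", index=False)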

With a small clean-up, a test setup would look something like this:

See WF:
How do I split a column when a value drops.knwf (42.5 KB)

Hope this provides some inspiration!


Nice one @ArjenEX , although I still don’t get how faithful you are to Column Expressions :smiley:

Any reason why you need to use md5Checksum(rand())? This is not a unique value, although the chances of collision are verrrrrry tiny (not because rand() could give you the same value, but rather because hashing itself is not unique - only a limited number of combinations can exist).

Why not simply use the ID or column1? They’re unique values.

if (column("next_val").equals("Decrease") && column("next_val",1).equals("Increase")) {
    column("ID") // or column("column1") 
} else {
    null
}

It will be faster :wink:

I would definitely have used the same approach, but instead of the Column Expressions node, I’d use the Lag Column and Rule Engine nodes (personal choice of course :smiley: )
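Sketched in Python (pandas assumed, dummy values), that Lag + Rule Engine combination would boil down to:

import pandas as pd

df = pd.DataFrame({"Average": [62.0, 70.5, 81.115, 71.331, 58.377, 60.1]})
df["Average(-1)"] = df["Average"].shift(1)  # what the Lag Column node produces
# the Rule Engine equivalent of the comparison used earlier; note the
# first row has no predecessor and falls into the "Decrease" branch here
df["next_val"] = (df["Average(-1)"] <= df["Average"]).map(
    {True: "Increase", False: "Decrease"}
)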


Hello there!

Although I’m a fan of the low-code/no-code approach, in some cases scripting is just the way to go - especially if it gets too complicated with nodes and hard for someone else (and for you, after some time) to figure out what exactly is going on in the flow. Then it doesn’t matter whether one has to figure out code (which might even be easier) or 5-6 nodes…

Here is a solution that has it all in one (Column Expressions) node. It also ignores decreases smaller than 2%, i.e. it doesn’t count those as a “decrease”.

var i
// initialize the counter on the first row
if (rowIndex() == 1)
    i = 0
else
    i

// count a new batch at the last drop of a cycle: the value fell more
// than 2% below the previous row and rises again on the next row
if (column("Average") < column("Average", -1) * 0.98 && column("Average") < column("Average", 1))
    i++
else
    i
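For non-coders, here is a plain-Python rendering of the same counter (dummy values; the behaviour is assumed equivalent), including the 2% tolerance:

levels = [62.0, 70.5, 81.115, 71.331, 58.377, 60.1, 66.52]  # dummy series
i, batch_of_row = 0, []
for r in range(len(levels)):
    batch_of_row.append(i)  # the current row still belongs to batch i
    has_prev, has_next = r > 0, r + 1 < len(levels)
    # count a batch only when the value fell more than 2% below the
    # previous reading and rises again on the next row
    if has_prev and has_next and levels[r] < levels[r - 1] * 0.98 and levels[r] < levels[r + 1]:
        i += 1  # rows after this trough belong to the next batch

print(batch_of_row)  # [0, 0, 0, 0, 0, 1, 1]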

@Kwinten try changing the encoding in your Reader node to get the proper name of your first column.

Br,
Ivan


The Column Expressions node has never let me down, and it’s a bit of an old habit from the ETL tool we used internally before KNIME, where a rule engine wasn’t a thing :laughing:

That rand() thing was a late-night brainfart to make it stand out; an ID would have been a better fit.

@ipazin has a better solution anyway :slight_smile:


First, thank you all for responding.

I was already fearing it would come down to coding, as I have very little proficiency in it. But your ideas work perfectly, and now I can move on with my project. Next time I post, I will provide some data in a workflow so it’s easier to respond, as you all mentioned.

Thanks again,

Kwinten


Hi @Kwinten
Alternatively, you can use a no-/low-code solution:

20220916_split_column_when_value_drops.knwf (45.6 KB)

BR


PS: I’ve just realized that the cycle sequence is not ordered chronologically. Inserting a Moving Aggregation node (cumulative computation) on the random values will solve the issue; alternatively, applying the cumulative computation directly to the $Average$ column makes the Random Assigner node unnecessary as well.
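One possible Python reading of that cumulative computation (pandas assumed, dummy values): counting the detected troughs cumulatively yields chronologically ordered cycle IDs without any random identifiers:

import pandas as pd

df = pd.DataFrame({"Average": [62.0, 70.5, 81.115, 71.331, 58.377, 60.1, 66.52]})
trough = (df["Average"] < df["Average"].shift(1)) & (df["Average"] < df["Average"].shift(-1))
# cumulative computation: each trough bumps the cycle number for the rows after it
df["cycle"] = trough.cumsum().shift(1, fill_value=0)
print(df["cycle"].tolist())  # [0, 0, 0, 0, 0, 1, 1]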

