Row filtering by amount of change

Hi there,

I’ve got the following challenge:
I would like to filter a list of values such that a value is only retained if it has changed by at least a certain amount.
It is easy (using e.g. the Lag Column node) to calculate a deviation w.r.t. the direct predecessor row, but I need the deviation w.r.t. the last value that was not filtered out.
Here is an example in which I would like to keep values ONLY if they have changed by at least 50%:

Initial list:
10 <-- starting value ==> keep!
12 <-- last kept value (i.e. 10) +20% ==> filter out
11 <-- last kept value (i.e. 10) +10% ==> filter out
15 <-- last kept value (i.e. 10) +50% ==> keep!
16 <-- last kept value (i.e. 15) +7% ==> filter out
7 <-- last kept value (i.e. 15) -53% ==> keep!

So, my desired result would be:
10
15
7

Can anyone point me to a solution for this?

Thanks!
Andreas

Hi,

I’m not sure this is the best solution; I just put the workflow together quickly.

I used a recursive loop that keeps the first row and then, in each iteration, checks whether the next row meets the condition you asked for.

Here is the workflow:
filter-change.knwf (52.3 KB)

Please check the workflow and let me know if it works as you expect and don’t hesitate to ask any questions.

Best,
Armin

P.S. I’ll try to optimize the solution or in the meantime maybe someone else can come up with a better solution.

Hi Andreas
I have never used the Lag Column node and do not know what it does. But for filtering the data I would normally use the Java Snippet Row Filter node or the Java Snippet Row Splitter node with the expression “return Math.abs($Column$) >= 50.0;”,
where $Column$ stands for the column containing the deviation values.

Regards
Hermann

Thanks for the workflow!
It does indeed do the job that I needed.
However, the recursive loop makes it quite slow. In my real application I have >500,000 rows in the initial dataset. So looping (in KNIME) is not really feasible.
I was hoping that someone might come up with an idea like “oh, this is a standard problem in signal processing (and there is a node which does exactly what you need)” … :slight_smile:
Any other ideas?

Hi Hermann,
the challenge is not the filtering itself. The challenge is to find the right value to compare with. I always have to compare with the last value that was not filtered out. That could be the direct predecessor (i.e. “row-1”), but it could also be 1000 rows before. … You don’t know beforehand…
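
For illustration, the rule described above ("always compare with the last value that was not filtered out") can be written as a single sequential pass that carries one piece of state, the last kept value. The following is a minimal sketch in plain Java, not a KNIME node: the class name `ChangeFilter`, the method `filterByRelativeChange` and the parameter `minRelChange` are hypothetical names chosen for this example. Whether the same state can be carried across rows inside a KNIME Java Snippet is an assumption that is not tested here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of KNIME.
class ChangeFilter {

    /**
     * Keeps the first value, then keeps a value only if it differs from the
     * LAST KEPT value by at least minRelChange (0.5 means 50%).
     */
    static List<Double> filterByRelativeChange(List<Double> values, double minRelChange) {
        List<Double> kept = new ArrayList<>();
        if (values.isEmpty()) {
            return kept;
        }
        double lastKept = values.get(0); // the starting value is always kept
        kept.add(lastKept);
        for (int i = 1; i < values.size(); i++) {
            double v = values.get(i);
            // Written without a division so that a last kept value of 0 does not
            // divide by zero. Edge case: if the last kept value is 0, every
            // following value passes this test.
            if (Math.abs(v - lastKept) >= minRelChange * Math.abs(lastKept)) {
                kept.add(v);
                lastKept = v; // from now on, compare against this value
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Double> input = List.of(10.0, 12.0, 11.0, 15.0, 16.0, 7.0);
        System.out.println(filterByRelativeChange(input, 0.5)); // prints [10.0, 15.0, 7.0]
    }
}
```

Each row is visited exactly once, so the cost grows linearly with the number of rows, regardless of how many rows end up being kept.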

Hi there!

I gave it some thought, but I’m not sure you can avoid some kind of loop in your use case. What you can focus on is workflow optimization. There was a blog post about it, so take a look.

Additionally, I have created a workflow similar to the one Armin created, but I avoided the Column Expressions node as it can slow things down. The workflow has 100,000 rows and runs for about 45 seconds, which is not a lot I think, so take a look.
2019_04_25_Row_Filtering_By_Change.knwf (33.9 KB)

With your use case and my workflow design, execution time depends on two things: the number of rows that need to be kept (more kept rows means more loop iterations, which means more execution time) and the distribution of those rows (more kept rows near the beginning means more iterations on a larger data set, which again means more execution time).
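
To make this cost argument concrete, here is a rough model in plain Java. It rests on an assumption about the loop design that the attached workflow may not match exactly: each iteration evaluates the whole remaining table against the last kept value, keeps the first row that passes, and hands only the rows after it to the next iteration. The class `RecursiveLoopCostSketch` and the method `simulate` are hypothetical names for this sketch.

```java
import java.util.List;

// Hypothetical cost model for illustration; it does not reproduce the .knwf workflow.
class RecursiveLoopCostSketch {

    /** Returns {iterations, rowsEvaluated} under the assumed loop design. */
    static long[] simulate(List<Double> values, double minRelChange) {
        if (values.isEmpty()) {
            return new long[] {0, 0};
        }
        long iterations = 0;
        long rowsEvaluated = 0;
        double lastKept = values.get(0); // first row is kept before the loop starts
        List<Double> remaining = values.subList(1, values.size());
        while (!remaining.isEmpty()) {
            iterations++;
            // Assumption: the condition is computed for every remaining row.
            rowsEvaluated += remaining.size();
            int hit = -1;
            for (int i = 0; i < remaining.size(); i++) {
                double v = remaining.get(i);
                if (Math.abs(v - lastKept) >= minRelChange * Math.abs(lastKept)) {
                    hit = i;
                    break;
                }
            }
            if (hit < 0) {
                break; // no further row qualifies, the loop ends
            }
            lastKept = remaining.get(hit);
            remaining = remaining.subList(hit + 1, remaining.size());
        }
        return new long[] {iterations, rowsEvaluated};
    }

    public static void main(String[] args) {
        long[] cost = simulate(List.of(10.0, 12.0, 11.0, 15.0, 16.0, 7.0), 0.5);
        System.out.println(cost[0] + " iterations, " + cost[1] + " rows evaluated");
    }
}
```

In this model, every additional kept row adds one iteration, and a kept row that sits early in the table leaves a large remaining table for all later iterations, which matches both effects described above.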

What could be useful is to stop node execution as soon as you find the first row that satisfies your condition, but I’m not sure how to implement that in KNIME :confused:

Br,
Ivan

Good things come to those who wait! :wink:
I think the output that you asked for needs the loop unless I’m missing something. Check the solution by @ipazin, maybe it does the task faster.

Armin

Nice workflow! Thanks a lot. It really reduces the runtime to an acceptable amount.
I think I will go with this solution. :slight_smile:

Hi there,

have you ever thought about using ROWINDEX (Java Snippet: return $$ROWINDEX$$;) in combination with the GroupBy and Joiner nodes?
This should allow you to filter the last value without looping.

Regards
Hermann