Removing n-number of rows before and after missing value row(s)

Dear KNIMErs,

I need to remove a certain n-number of rows (with value) that come before and after missing value row(s).
Below is a rough descriptive visual of the problem.

The dataset is huge, consisting of > 3mln rows. I’m trying to figure out what kind of combination of nodes can help me achieve this.

Context of the problem: The removal of data would help to get rid of the noise in the data. As I observe, there are sudden drops in values around missing value rows, so these need to be removed. Later, I will do a linear interpolation.

Thank you!
Odko

Hi Odko,
first of all i would create a column with a row number ($rowindex$).
second build increasing groupid value for each change of true and false.
third group by the groupid column to get the max. row number for each group.
fourth join the grouped information to your source table.
fifthly you can now apply filter rules.

BR
Hermann

Hi Odko,
try this: removing.knwf (77.3 KB)


i’ve arbitrarily set the number of row to 3.

Luca

1 Like

Hi Luca,

Thanks a lot for helping build the knime flow!
The method looks very easy and useful!

I just realised that the way I formulated my question was not clear enough… The n-rows actually should not be removed, but the values should be replaced by missing values.

As shown below:

I’ve developed a knimeflow that gives what I need (see the output screenshot below: it shows that there was one False value (yellow highlight), and values were removed from 17 rows above and 17 rows below):

However, there is another challenge that I encountered: sometimes, there might be (less than 17 rows of True values surrounded by False values as shown below:

I wonder how a knimeflow can be improved to handle such challenge…

PS: The rule is that 17 rows above from the first False value, and 17 rows below the last False value should be replaced with a missing value cell.

Thanks!
Odko

Thank you Hermann for describing the flow design. I couldn’t literally implement the second point, but I used a workaround that helped to differentiate the first and last false values by using lag column node. However, while running loops, I encountered another problem, which is in detail described in my reply to Luca_Italy.

Thank you,
Odko

Hi @Odko

A nice challenge. See this workflow removing_2.knwf (131.5 KB) where you can modify the number of Counting Loops to determine how many records before and after a missing value sequence you want to make missing also.


gr. Hans

Hi Odko,
is it possible to uploade an example file and your current workflow?
If you can assign a specific ID (counter) for each group (based on my suggestion) you have the option with the group by node to determine the minimum row, maximum row, count of rows for each group and you are able to set your rules based on this information.

Q: what shold happen if there are equal or less than 17 rows available?

BR
Hermann

This topic was automatically closed 182 days after the last reply. New replies are no longer allowed.