Row Filter/Splitter Performance

davekalpak · May 24, 2018, 3:22pm

I am experiencing unnecessarily slow performance when filtering/splitting large datasets based on a row number range. In my particular case, I am splitting my table at the first row, i.e. the first row goes to the top port while the remaining rows go to the bottom port. Intuitively, filtering by row index should be a fast operation… but KNIME seems to be doing this in a way that is quite slow.

RolandBurger · May 25, 2018, 8:29am

Hi @davekalpak,

The problem with the Row Splitter is that even though only one row goes to the top output, the node still has to write a table with all the other rows to the bottom output. In other words, it still needs to write (almost) the full table, which is why splitting costs more than just filtering. Note that this doesn’t happen with the Row Filter, since the rows that are not selected are simply discarded. Hope that explains things a bit!
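To make the difference concrete, here is a minimal Python sketch of the two behaviours. It is a toy model, not KNIME's actual code: `CountingRows`, `row_filter_first`, and `row_splitter_first` are hypothetical names made up for illustration, and an in-memory list stands in for "writing" an output table.

```python
from itertools import islice


class CountingRows:
    """Hypothetical stand-in for an upstream table; counts how many rows are read."""

    def __init__(self, n):
        self.n = n
        self.rows_read = 0

    def __iter__(self):
        for i in range(self.n):
            self.rows_read += 1
            yield (i, f"value_{i}")


def row_filter_first(rows, n=1):
    """Filter: keep the first n rows and stop reading once they are collected."""
    return list(islice(rows, n))


def row_splitter_first(rows, n=1):
    """Splitter: first n rows go to `top`, every remaining row goes to `bottom`."""
    top, bottom = [], []
    for i, row in enumerate(rows):
        (top if i < n else bottom).append(row)
    return top, bottom


filter_input = CountingRows(100_000)
kept = row_filter_first(filter_input)

splitter_input = CountingRows(100_000)
top, bottom = row_splitter_first(splitter_input)

print(filter_input.rows_read, splitter_input.rows_read, len(bottom))
```

In this model the filter reads a single row, while the splitter has to read all 100,000 rows because every one of them must end up in one of the two outputs.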

Cheers,
Roland

davekalpak · May 25, 2018, 2:23pm

Thank you, Roland. I appreciate your prompt reply.

Very true. The Row Filter does complete as soon as the criteria are satisfied. However, my example may have obscured the intent of my question. Take the Row Filter: when I specify a range, say the middle third of a large table, the mechanism in the background appears to check every row up to the start of the range, then collect rows until it reaches the end of the range, at which point the results are returned (so the last third of the table is never read). Assuming that the data from the previous node is already loaded and available, if I’m filtering by row index, I imagine there are more efficient ways to do it than explicitly checking every row from the top down.
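The two strategies I have in mind can be sketched roughly as follows in Python. This is only an illustration of the idea, not how KNIME is implemented: the function names are hypothetical, and the second variant assumes the table supports random access by row index, which may not hold for a table that is stored or streamed sequentially.

```python
from itertools import islice


def filter_range_scan(row_iter, start, stop):
    """Sequential scan: read rows 0..stop-1 from the top, keep those in [start, stop)."""
    kept, visited = [], 0
    for i, row in enumerate(islice(row_iter, stop)):
        visited += 1
        if i >= start:
            kept.append(row)
    return kept, visited


def filter_range_slice(row_list, start, stop):
    """Index-based access: jump straight to `start` (assumes random access)."""
    kept = row_list[start:stop]
    return kept, len(kept)


table = [(i, f"value_{i}") for i in range(9)]

# Middle third of a 9-row table: rows 3, 4 and 5.
scan_rows, scan_visited = filter_range_scan(iter(table), 3, 6)
slice_rows, slice_visited = filter_range_slice(table, 3, 6)

print(scan_visited, slice_visited)
```

Both variants return the same rows, but the scan touches the first two thirds of the table while the slice only touches the rows it returns.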

Another point I would like to make, which you touch on in your response, concerns how data is “written” to the output. I will try to explain my confusion with an example.
Take the Column Filter. Regardless of how large my input is (in rows or columns), and regardless of the type of filtering I’m doing (column selection, pattern matching, etc.), the Column Filter executes extremely quickly. It does not appear to be “writing” new data to its output. I would assume it references the previous node’s output and transforms it on the fly, so there is only one copy of the data (from the previous node). Otherwise, it would somehow have to create a copy of the input immediately and drop the columns to be filtered out, and the Column Filter executes much too fast even for that.
I think that the Row Filter and Splitter could benefit from the methods that work so well for the Column Filter - especially when it comes to index-based filtering.
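As a rough illustration of what I mean by "referencing the previous node's output", here is a small Python sketch of a column filter implemented as a view. This is my guess at the general technique, not a description of KNIME internals; `ColumnView` and its parameters are hypothetical names.

```python
class ColumnView:
    """Hypothetical lazy column filter: keeps a reference to the source rows
    and projects the selected columns only when the rows are iterated."""

    def __init__(self, source_rows, header, keep):
        self._rows = source_rows  # reference to the upstream data, not a copy
        self._idx = [header.index(name) for name in keep]
        self.header = list(keep)

    def __iter__(self):
        for row in self._rows:
            yield tuple(row[i] for i in self._idx)


header = ["id", "name", "score"]
rows = [(1, "a", 0.5), (2, "b", 0.9)]

# Creating the view costs only as much as resolving the column names;
# no row is read or copied at this point.
view = ColumnView(rows, header, ["id", "score"])

print(list(view))
```

Creating such a view takes time proportional to the number of selected columns, independent of the row count, which would match how quickly the Column Filter finishes. A row-range filter could in principle use the same trick only if the underlying table can be addressed by row index.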

Thank you for your attention,
David