Ways to speed this up? 10M+ rows

I think you have to activate the “KNIME Columnar Table Backend” - it might be the default after a certain version, I'm not sure.

It is based on the Apache Arrow format, which is also used to communicate more efficiently with the Python nodes - although there you would often work with Pandas DataFrames and then convert them to Arrow.

Especially in the beginning I experienced some problems with the framework, in particular when it came to loops and chunks and the like, without always being able to pin them down exactly. Same with this task: at one point there was an error with the RowID node, but the next time it had disappeared.

Another option for such tasks might be to store the data in a Parquet file in chunks, do the filtering over those physically stored chunks, and then store the results. Keep in mind, though, that Parquet does not support all the data types that a KNIME table does.
