Readers which filter during read

We are playing with using KNIME to process big data in a distributed way without leveraging an external distributed processing platform. We have worked out several ways to distribute a single “master workflow” across multiple “child workflows” which each perform a portion of the work on different servers. For example, one could train models on different subsets of data in parallel on different machines.

One sticking point is how we enable each child workflow to interact with only the portion of the data (or other resources) it needs, when the data starts out as one (generally) monolithic chunk. Currently the default KNIME nodes require that we either (1) “map” (or chunk) the resources before calling the child workflows and assign each child the chunk it is to process, or (2) have each child workflow read the full data (or other resources) and then filter it down to what its process needs. Option (1) requires unnecessary replication, and option (2) adds unnecessary processing time.

What would be nice is the ability to filter during resource reading, for example applying a filter in the Parquet Reader so that it only returns matching records. Any chance there are ways to do this currently that I am not thinking of (like more advanced options than the row limit filters on some Readers)? If not, can I lobby to have this capability added to the development roadmap?

The Parquet Reader paired with a Row Filter could be streamed.


Interesting option @izaychik63, but this still requires that every record be streamed and won’t reduce I/O, correct? I am hoping for a solution where the filter can be applied before records have to be moved across the network into the KNIME process.

Hi there @bfrutchey,

couple of points:

  • to my understanding, streaming reduces both I/O and memory usage
  • I’m not a developer, but a filter inside the reader node would basically mean combining two operations into one; you would still have to go through every record and decide whether to keep it or not. With big data you might not be able to do that in memory and would thus have to write it out again… (not saying it can’t be faster!)
  • I like (1) and don’t see how it produces “unnecessary replication”

Br,
Ivan

Essentially I am asking for KNIME to take advantage of “predicate pushdown” to reduce I/O by identifying which portions of the serialized data need to be read to satisfy a query. This may require more nuanced management of Parquet file partitioning, which the user helps configure, but it is automatically available in many performance-oriented SQL-on-Parquet engines like Drill and Impala.
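To make the idea concrete (this is not a KNIME feature today, just a sketch of what predicate pushdown looks like in a library that supports it), here is how pyarrow reads a partitioned Parquet dataset while skipping files and row groups whose partition values or column statistics cannot match the filter. The path and column names below are hypothetical.

```python
# Illustrative sketch of Parquet predicate pushdown with pyarrow.
# Dataset path, partitioning scheme, and column names are made up for the example.
import pyarrow.dataset as ds

# Opening the dataset is lazy: only footers/partition metadata are touched here.
dataset = ds.dataset("/data/events", format="parquet", partitioning="hive")

# The filter is pushed down to the scan: partitions and row groups whose
# statistics rule out a match are skipped before any records are transferred.
table = dataset.to_table(
    filter=(ds.field("region") == "EU") & (ds.field("amount") > 100),
    columns=["region", "amount", "customer_id"],  # column pruning as well
)
print(table.num_rows)
```

A reader node with this kind of option would let each child workflow request only its slice of the data, instead of reading everything and filtering afterwards.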


Hi @bfrutchey,

I see now. I have checked, and this (column/row filtering within the reader node) is actually something that is planned for the new file handling framework that is currently being developed.

Thanks for bringing this up on the forum, and stay tuned!

Br,
Ivan
