Rule-based Row Filter parallel execution

Hi All,

Is there any way I can execute the Rule-based Row Filter node in parallel?
The requirement: I have 1 million records as the input table for the Rule-based Row Filter node, containing data for different rules, and the total number of rules is 100. I want to execute all 100 rules in parallel because running them serially takes approximately 50 minutes, and I want to reduce that time.

Hi,
can you give us a small example of your data and the type of filters you need to test?

Luca

Hello @sahil786,

50 minutes does sound like a lot, but so do 100 rules. Maybe you can try KNIME Streaming Execution (Beta) to speed it up.
See this blog post: https://www.knime.com/blog/streaming-data-in-knime

Another way would be to split your input data into smaller data sets using multiple Rule-based Row Splitter nodes and connect each smaller set to one Rule node. Finally, use Concatenate to bring the data back together.
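
To illustrate the split / filter / concatenate idea outside KNIME, here is a minimal Python sketch. It is not KNIME code: the sample rows, the two rules, and the helper names `apply_rules`, `split` and `run_partitioned` are assumptions made for illustration, and it splits by row chunks rather than by rule. It uses threads for simplicity; for CPU-heavy rules in Python a process pool would likely be needed to see a real speed-up.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical rules: (rule_id, predicate). Real rules would come from the rule sheet.
RULES = [
    ("R1", lambda row: row["Numbers"] == 1),
    ("R2", lambda row: row["Colors"] == "green"),
]

def apply_rules(chunk):
    """Keep every row of one chunk that matches at least one rule."""
    return [row for row in chunk if any(pred(row) for _, pred in RULES)]

def split(rows, n_chunks):
    """Split rows into at most n_chunks contiguous pieces."""
    size = max(1, -(-len(rows) // n_chunks))  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]

def run_partitioned(rows, n_chunks=4):
    """Filter each chunk in its own worker, then concatenate in the original order."""
    chunks = split(rows, n_chunks)
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        results = pool.map(apply_rules, chunks)  # map preserves chunk order
        return [row for part in results for row in part]

rows = [
    {"Numbers": 1, "Colors": "red"},
    {"Numbers": 2, "Colors": "green"},
    {"Numbers": 2, "Colors": "red"},
    {"Numbers": 1, "Colors": "green"},
]
filtered = run_partitioned(rows, n_chunks=2)
```

Because each chunk is filtered independently, the concatenated result is the same as filtering the whole table in one pass.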

Br,
Ivan

As @ipazin mentioned, try streaming. I would also assume that the Rule-based Row Filter (Dictionary) may be a better option in your case.

The core issue is that the Rule-based Row Filter most likely isn’t all that fast, and the intended use case was most likely a handful of rules, not as many as 100.

Depending on the rules, you could try to chain them (potentially using a splitter instead of a filter) and wrap that in a component with streaming execution.

Or code the rules into a Java Snippet.

Hi @izaychik63 @ipazin ,

The Rule-based Row Filter (Dictionary) does the same thing in 1 minute for all 1 million records and 100 rules, but now I am unable to map the rule_id to the filtered data. Is there any way to map the rule_id with the Rule-based Row Filter (Dictionary)?

In my existing design I also map the rule_id, with the help of a constant value column, by passing the rule id as a variable from the rule sheet (attached): rule_SHEET.xlsx (11.0 KB).

Attaching the sample input data: SAMPLE_DATA.xlsx (34.2 KB)

Attaching the workflow as well.

Please find the image of the existing design.

Hello @sahil786,

if you want to add rule_id, I would suggest a slightly different approach. First use the Rule Engine (Dictionary) node to add a Rule_id column. If no rule matches, you will get a missing value, which you can then filter on. Check the attached example:
rule_based_filter_dictionary_ipazin.knwf (49.9 KB)

The String Manipulation node is only needed to modify the existing rules. Additionally, see how you can include and reference data in your workflows using the data folder and the "relative to" option of the Excel Reader.
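
As a rough illustration of the "label first, then filter on missing values" idea, here is a small Python sketch. It only mimics the logic and is not how the KNIME nodes are implemented; the rule ids, the column names and the sample rows are hypothetical.

```python
# Hypothetical rule table: (rule_id, predicate), evaluated top to bottom.
RULES = [
    ("RULE_1", lambda row: row["RESIDENCY"] == "JOY" and row["UNQID"] in (1, 2)),
    ("RULE_2", lambda row: row["RESIDENCY"] == "JON"),
]

def label_row(row, rules):
    """Return the id of the first matching rule, or None (a 'missing value')."""
    for rule_id, predicate in rules:
        if predicate(row):
            return rule_id
    return None

def label_and_filter(rows, rules):
    """Add a Rule_id column, then drop the rows where no rule matched."""
    labelled = [{**row, "Rule_id": label_row(row, rules)} for row in rows]
    return [row for row in labelled if row["Rule_id"] is not None]

# Made-up sample rows for illustration only.
rows = [
    {"RESIDENCY": "JOY", "UNQID": 1},
    {"RESIDENCY": "JON", "UNQID": 5},
    {"RESIDENCY": "AMY", "UNQID": 2},
]
result = label_and_filter(rows, RULES)
```

The rows that survive the filter carry the id of the rule that matched them, which is the mapping asked for above.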

Br,
Ivan

Thank you @ipazin…I have tried this same approach and it worked for me…Thanks again…

Hi @ipazin,

I have found one issue in the Rule Engine (Dictionary) for the scenario below:

$RESIDENCY$ IN ("JOY","JON") AND $UNQID$ IN (1,2) => "P"
$RESIDENCY$ = "JOY" AND $UNQID$ IN (1,2) => "Q"

I have written the above two conditions in the Rule Engine (for testing). This can be a valid scenario, but the Rule Engine does not apply the second condition because it is already covered by the first condition (the IN operator), which should not be the case. Please suggest something; it is really important.

This is exactly the case, and it is supposed to work this way. In your case, just put the second line first.

I have tried that, but the Rule Engine is still not working.

Input Data:
1|red|
2|green|
2|red|
1|green|

Rule Engine logic:
$Numbers$ = 1 AND $Colors$ IN ("red","green") => "Q"
$Numbers$ IN (1,2) AND $Colors$ IN ("red","green") => "P"

Rule Engine output:
[image: Rule Engine output table]

Ideally, P should be populated for all four rows, which is not happening, and Q should be populated for two rows only, which is happening.

Hello @sahil786,

what should your output look like?

Br,
Ivan

Hi @ipazin,

I need this as an output:
|1|red|Q|
|1|green|Q|
|1|red|P|
|2|green|P|
|2|red|P|
|1|green|P|

The Rule Engine checks for the first match and returns that value, right? So that’s what I would expect. To get both, you probably need to run two Rule Engines. The structure you mention probably needs additional shaping afterwards.
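
To make the first-match behaviour concrete, here is a small Python sketch that replays the two rules and the four input rows posted above. It is only a model of the semantics described in this thread, not KNIME code; `first_match` and `all_matches` are hypothetical helper names.

```python
# Rules in the order posted above: the more specific "Q" rule comes first.
RULES = [
    (lambda r: r["Numbers"] == 1 and r["Colors"] in ("red", "green"), "Q"),
    (lambda r: r["Numbers"] in (1, 2) and r["Colors"] in ("red", "green"), "P"),
]

def first_match(row, rules):
    """Return the outcome of the first rule that matches (one value per row)."""
    for predicate, outcome in rules:
        if predicate(row):
            return outcome
    return None

def all_matches(row, rules):
    """Return the outcomes of every rule that matches (what the poster wants)."""
    return [outcome for predicate, outcome in rules if predicate(row)]

rows = [
    {"Numbers": 1, "Colors": "red"},
    {"Numbers": 2, "Colors": "green"},
    {"Numbers": 2, "Colors": "red"},
    {"Numbers": 1, "Colors": "green"},
]

for row in rows:
    print(row["Numbers"], row["Colors"], first_match(row, RULES), all_matches(row, RULES))
# prints:
# 1 red Q ['Q', 'P']
# 2 green P ['P']
# 2 red P ['P']
# 1 green Q ['Q', 'P']
```

With first-match evaluation each row gets exactly one value, so the rows with Numbers = 1 never receive "P" from a single pass.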
Br

I have 50,000 rules in my rule sheet. I am trying to loop the Rule Engine. It works, but with really poor performance, which I can't afford. So @izaychik63 @ipazin @Daniel_Weikert, please suggest a better approach if possible.

You can use


Also, the node is streamable.

Hi @izaychik63,

I already tried to implement this on 6 Oct but ran into an issue; please check my earlier comment in this same thread.

Hello @sahil786,

if you need two or more outputs (as multiple rows) from a single row, then a single Rule Engine obviously won’t work. You can try the following approach:

  • use modified rules wherever a row needs/can get multiple outputs. The rule should output all needed values separated by a comma or any other delimiter, for example: value1, value2,..., valuen
  • follow it with a Cell Splitter node using the specified delimiter
  • finish with an Unpivoting node, where the value columns are all the columns created by the splitting operation above and the retained columns are all those you wish to keep and “multiply”
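
The three steps above can be sketched in Python as follows. This is only a model of the idea, not the KNIME nodes themselves: the modified rule outcome "Q,P", the helper names, and the choice to drop unmatched rows are assumptions for illustration, using the Numbers/Colors example from earlier in the thread.

```python
# Modified rules: a row that should get both outcomes returns them in one string.
RULES = [
    (lambda r: r["Numbers"] == 1 and r["Colors"] in ("red", "green"), "Q,P"),
    (lambda r: r["Numbers"] in (1, 2) and r["Colors"] in ("red", "green"), "P"),
]

def rule_engine(row, rules):
    """First-match evaluation: return the outcome string of the first matching rule."""
    for predicate, outcome in rules:
        if predicate(row):
            return outcome
    return None

def split_and_unpivot(rows, rules, delimiter=","):
    """Split each outcome on the delimiter and emit one output row per value."""
    out = []
    for row in rows:
        outcome = rule_engine(row, rules)
        if outcome is None:
            continue  # assumption: rows without a matching rule are dropped
        for value in outcome.split(delimiter):  # "Cell Splitter" step
            # "Unpivoting" step: retained columns are repeated for each value
            out.append((row["Numbers"], row["Colors"], value.strip()))
    return out

rows = [
    {"Numbers": 1, "Colors": "red"},
    {"Numbers": 2, "Colors": "green"},
    {"Numbers": 2, "Colors": "red"},
    {"Numbers": 1, "Colors": "green"},
]
result = split_and_unpivot(rows, RULES)
```

For the four sample rows this yields six output rows: the two rows with Numbers = 1 appear once with Q and once with P, and the two rows with Numbers = 2 appear once with P.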

Br,
Ivan

Hi @ipazin,

Sorry, I did not understand correctly. How can we make use of modified rules?
Do we need to make a change in the rule sheet?

I am attaching a sample rule sheet for reference: rule_SHEET.xlsx (9.6 KB)

Also the sample data SAMPLE_DATA.xlsx (34.2 KB)

I am using the workflow provided by you. Please find the same.

Kindly give me a clearer picture of this.

I think you have not specified the outcome of your rule engine filter, something like “If condition A is true, what should happen?”
Br