Validate various values based on different rules

Dennis6713 · April 2, 2019, 2:24pm

Hello all,

I am quite new to Knime (1 week) and have a few questions.

I am trying to set up a workflow for checking multiple columns with multiple criterias, but I don’t know which nodes I should use for. The Task is the followng:

I have a table with 49 columns and ~500k rows. Based on the values of two of these columns, I Cluster the rows into Groups. According to these Groups, the values in the remaining columns need to have certain, predefined values.

So the logic must be:

If Column A has Value X and Column B has Value Y, then Columns C-XXX need to have predefined values.

Currently I tried to set up different row filters for the values to just Display the Errors, where the values do not fit, and afterwards combine everything in one Output file again. But this would mean, that I need 40+ nodes (row filters) to check every value. Also tried it with JAVA Snippet and IF-Then Statements, but this would also led to a lot of individual nodes.

Now the question: Is there a better Approach to my Problem?

Thanks in advance and I hope that my issue is understandable

Best,
Dennis6713

€: Example from post below:

	C1	C2	C3	C4	…	Cn
1	V1	X1	A1	…	…	…
2	V1	X2	A2	…	…	…
3	V2	X3	A3	…	…	…
4	V3	X4	A2	…	…	…
5	V4	X5	A1	…	…	…
6	V4	X5	A2	…	…	…
7	V4	X6	A3	…	…	…
8	V4	X7	A4	…	…	…
9	V5	X8	A1	…	…	…
…	…	…	…	…	…	…
n	Vn	Xn	An	…	…	…

What I want to do now is the following:

Check if there are duplicates in C1 (e.g. Row 1 and Row 2, Row 5 - Row 8)
–> If yes, then check if the values in C2 are also the same. (e.g. Row 5 and Row 6)
–> Then check the value in C3. All of the Columns C4…Cn in row 5 and row 6 need to match predefined values, based on the value in C3.

–> If the values in C2 are not the same, check C3 and all of the Columns C4…Cn in this row need to match predefined values, based on the value in C3.

The output should be an excel file where the Cells (Column C4…Cn) are highlighted, in which the values do not match the predefined values.

izaychik63 · April 2, 2019, 2:45pm

As you mentioned clusters, I would recommend to use

and / or

to assign the cluster number and then filter necessary one.

Dennis6713 · April 2, 2019, 2:51pm

Hello izaychik63,

thanks for your reply. I also tried to use the Rule Enigne node, but I dont think that this will work for me. because all of my checks, after the first 2 columns, have to be checked anyway, and not only in an order.

izaychik63 · April 2, 2019, 3:00pm

It would be easier to help if you describe the task on hand not just one piece of it.

Dennis6713 · April 2, 2019, 3:25pm

Hello izachik63,

Sorry if this was not clear enough. The whole Task is:

Read two Excel files, join the tables and then start to create the “Groups”. This is in Detail: If Column A has value X and Column B has value Y, then check all remaining columns if the values in each field is equal to the predefined value for this field. the next branch would be, If Column has value X and Column B has Value Z, then check the remaining columns of the values [see above].

The Output should be a file, with all data from the Input files and in Addition a remark/highlighting/what ever, if a value doesnt equal the predefined value.

Additional question: Is it possible to have the predefined values somewhere stored in anode or something so they can be easily changed?

Dennis6713 · April 4, 2019, 9:11pm

Can no one help me?

mlauber71 · April 4, 2019, 9:26pm

Could you create a simple example in Excel and upload it so it would be easier to understand and we would be able to work on a concrete case?

Dennis6713 · April 6, 2019, 2:53pm

Hi mlauber71,
hi all,

please find the following table for better explanation:

	C1	C2	C3	C4	…	Cn
1	V1	X1	A1	…	…	…
2	V1	X2	A2	…	…	…
3	V2	X3	A3	…	…	…
4	V3	X4	A2	…	…	…
5	V4	X5	A1	…	…	…
6	V4	X5	A2	…	…	…
7	V4	X6	A3	…	…	…
8	V4	X7	A4	…	…	…
9	V5	X8	A1	…	…	…
…	…	…	…	…	…	…
n	Vn	Xn	An	…	…	…

What I want to do now is the following:

Check if there are duplicates in C1 (e.g. Row 1 and Row 2, Row 5 - Row 8)
–> If yes, then check if the values in C2 are also the same. (e.g. Row 5 and Row 6)
–> Then check the value in C3. All of the Columns C4…Cn in row 5 and row 6 need to match predefined values, based on the value in C3.

–> If the values in C2 are not the same, check C3 and all of the Columns C4…Cn in this row need to match predefined values, based on the value in C3.

The output should be an excel file where the Cells (Column C4…Cn) are highlighted, in which the values do not match the predefined values.

I hope this is now easier to understand

Thanks in advance!

mlauber71 · April 7, 2019, 11:16am

I took your example and put it into a workflow. Checking the columns C1 and C2 if they contain duplicates (you can amend the second check by including C1 also if this is what you need). And then using collection columns to see if the unique values of C3 match all the unique values of columns C4 - C6.

You might adapt that according to your needs.

kn_example_mark_doubles_and_rules.knwf (81.2 KB)

Dennis6713 · April 8, 2019, 12:24pm

Hello mlauber71,

thank you very much. This helps a alot.

As far as I understand, you are using the Create Collection Column node to create 2 columns which you are then comparing with the Rule Engine node. With the Rule Engine node you are checking, if the columns (C4-C6) have the same Value as Column C3. But actually what I need is, that based on the value in Column C3, different rules have to be applied to check the values in C4…Cn. Also, for each Column (C4…Cn) i need to check different values.

For example:

C3 has the Value A1. This means that C4 Needs to have value 1, C5 = 9, C6=4…Cn=n. (the numbers are only place holders for the real values that need to be checked)

C3 has the Value A2. This would mean that C4 Needs to be 5, C5 = 2, C6= 9… Cn=n etc.

So is it possible to have somewhere this kind of rule set stored and then applied as values to be compared to according to the value in C3? Or do I need to build these Create Collection Columns for each Column I want to check?

And another quesion regarding the first Java Snippet and RowID. Why are they necessary?

Thanks in advance,
Dennis

mlauber71 · April 11, 2019, 5:09pm

The artificial ID is something that I use to preserve the original rank of rows. Later they might get other RowIDs and I want to preserve them.

I can think of several ways to include tests on different columns. Question is if you can construct an example in a way that one could build a general workflow for. Maybe you could put such rules you would actually want to use in amnestying Excel file even in some pseudo code.

Dennis6713 · April 24, 2019, 5:31pm

Hello,

sorry for my late reply. Please find attached a new example of the Input data, the rule set and how the Output should look like.

Input:

	Index1	Column_for_Rule_determination	Column_to_check_1	Column_to_check_2	Column_to_check_3
1	1111	S1	2	A	50
2	1111	S1	3	B	0
3	2222	S2	2	A	23
4	2222	S1	2	A	34
5	2222	S3	3	A	16
6	3333	S2	4	B	23
7	3333	S2	4	B	0
8	4444	S1	5	A	0
…	…	…	…	…	…

Rules:
$Column_for_Rule_determination$ = “S1” => Take Ruleset 1
$Column_for_Rule_determination$ = “S2” => Take Ruleset 2
$Column_for_Rule_determination$ = “S3” => Take Ruleset 3

Ruleset 1:
$Column_to_check1$ = 2 => “OK”
If not => “Not ok”
$Column_to_check2$ = “A” => “OK”
If not => “Not ok”
$Column_to_check3$ <=25 => “OK”
If not => “Not ok”
Ruleset 2:
$Column_to_check1$ = 4 => “OK”
If not => “Not ok”
$Column_to_check2$ = “B” => “OK”
If not => “Not ok”
$Column_to_check3$ =>25 => “OK”
If not => “Not ok”

Output:

	Index1	Column_for_Rule_determination	Column_to_check_1	Result	Column_to_check_2	Result	Column_to_check_3	Result
1	1111	S1	2	OK	A	OK	50	Not ok
2	1111	S1	3	Not ok	B	Not ok	0	OK
3	2222	S2	2	Not ok	A	Not ok	23	Not ok
4	2222	S1	2	OK	A	OK	34	Not ok
5	2222	S3	3	OK	A	OK	16	Not ok
6	3333	S2	4	OK	B	OK	23	Not ok
7	3333	S2	4	OK	B	OK	0	Not ok
8	4444	S1	5	Not ok	A	OK	0	OK
…	…	…	…	…	…	…	…	…

And so on… I have about 5-6 different rulesets and per rulesets about 40 rules. ALL rules have to be checked in each row, depending on the ruleset.

I found the Rule Engine Dictionary where you can upload rules from a file. Is it possible to use this node somehow in order to have all rules applied? You can see in my file, that I have plenty of rules, which all need to be checked in the different columns. With the rule engine node, it just checks the rules until a rule applies and the goes on with the next row. But I need, that ALL Rules have to be checked in ALL Rows. It would be very good to have this upload via File, because then the users later on only have to Change the rules and/or target values in an Excel file and not in the Node directly in KNIME. The perfect way would be, that you can have all Rulesets in 1 Excel file, just in different Tabs and according to the Value in the Column for the Rule Determination, the Programm is choosing the correct tab with the rules. Is this even possible? I think I saw some examples, where you can store the Excel tab Name in variables… maybethese can then be used somehow as Triggers?

I hope this is understandable

Dennis6713 · May 2, 2019, 2:22pm

Hi all,

I got a new idea, but I dont know in it is possible or how it is possible in KNIME.

So here the question:

Is it possible to have a Rule Engine Node in a Loop in a way, that in every Loop another rule is checked? In yes, how? I was thinking, to have all rules in an excel file. One row = one rule. But so far I didn’t found a solution on how to read only single Rows from an excel file, only the whole file.

Thanks in advance,
Dennis

Dennis6713 · May 2, 2019, 6:27pm

I solved the problem now with just reading one line from an excel file with the Chunk Loop Start Node. Right now, I want to add a column for each rule, meaning in each loop step, 1 column should be added with the name of the current rule an the content of this new column should be, if the rule outcome is true or false.

My problem currently is, that the loop is just adding 1 column in the first step and afterwards overwriting this column. I tried to somehow handle it via flow variables, but couldn’t manage it.

Anyone has some ideas?

Thanks in advance,
Dennis

izaychik63 · May 2, 2019, 6:38pm

Try this node

Dennis6713 · May 2, 2019, 7:24pm

Where do I have to insert this? Instead of the Joiner?

The next problem I figured out is, that in the second step of the loop, the flow variable changed the name and therefore I get an error. In the first step, it is Row0 but it is than changing to Row1, Row2 etc…

See my settings attached.

Add empty rows:

	C1	C2	C3	C4	…	Cn
1	V1	X1	A1	…	…	…
2	V1	X2	A2	…	…	…
3	V2	X3	A3	…	…	…
4	V3	X4	A2	…	…	…
5	V4	X5	A1	…	…	…
6	V4	X5	A2	…	…	…
7	V4	X6	A3	…	…	…
8	V4	X7	A4	…	…	…
9	V5	X8	A1	…	…	…
…	…	…	…	…	…	…
n	Vn	Xn	An	…	…	…

	C1	C2	C3	C4	…	Cn
1	V1	X1	A1	…	…	…
2	V1	X2	A2	…	…	…
3	V2	X3	A3	…	…	…
4	V3	X4	A2	…	…	…
5	V4	X5	A1	…	…	…
6	V4	X5	A2	…	…	…
7	V4	X6	A3	…	…	…
8	V4	X7	A4	…	…	…
9	V5	X8	A1	…	…	…
…	…	…	…	…	…	…
n	Vn	Xn	An	…	…	…

	C1	C2	C3	C4	…	Cn
1	V1	X1	A1	…	…	…
2	V1	X2	A2	…	…	…
3	V2	X3	A3	…	…	…
4	V3	X4	A2	…	…	…
5	V4	X5	A1	…	…	…
6	V4	X5	A2	…	…	…
7	V4	X6	A3	…	…	…
8	V4	X7	A4	…	…	…
9	V5	X8	A1	…	…	…
…	…	…	…	…	…	…
n	Vn	Xn	An	…	…	…

	C1	C2	C3	C4	…	Cn
1	V1	X1	A1	…	…	…
2	V1	X2	A2	…	…	…
3	V2	X3	A3	…	…	…
4	V3	X4	A2	…	…	…
5	V4	X5	A1	…	…	…
6	V4	X5	A2	…	…	…
7	V4	X6	A3	…	…	…
8	V4	X7	A4	…	…	…
9	V5	X8	A1	…	…	…
…	…	…	…	…	…	…
n	Vn	Xn	An	…	…	…

	C1	C2	C3	C4	…	Cn
1	V1	X1	A1	…	…	…
2	V1	X2	A2	…	…	…
3	V2	X3	A3	…	…	…
4	V3	X4	A2	…	…	…
5	V4	X5	A1	…	…	…
6	V4	X5	A2	…	…	…
7	V4	X6	A3	…	…	…
8	V4	X7	A4	…	…	…
9	V5	X8	A1	…	…	…
…	…	…	…	…	…	…
n	Vn	Xn	An	…	…	…

	C1	C2	C3	C4	…	Cn
1	V1	X1	A1	…	…	…
2	V1	X2	A2	…	…	…
3	V2	X3	A3	…	…	…
4	V3	X4	A2	…	…	…
5	V4	X5	A1	…	…	…
6	V4	X5	A2	…	…	…
7	V4	X6	A3	…	…	…
8	V4	X7	A4	…	…	…
9	V5	X8	A1	…	…	…
…	…	…	…	…	…	…
n	Vn	Xn	An	…	…	…