Dynamic, Scalable Column Validation Rules in KNIME (Handling Multiple Errors per Row)

I have a table where I need to detect errors across multiple columns. My goal is to end up with a column that indicates the error, the column number/name, and the reason for the error.

I’m currently trying to do this using the Column Expressions node, but the code is becoming too long and will probably be difficult for someone else to maintain later. The examples below are only to explain the idea; they are not real cases.

Example:

If column B is different from "Spanish", then it should return an error like:

Error - B - Different from "Spanish"

I also have conditions that evaluate multiple columns.

For example, if column C is "Complaint", then column D must be "Negative" and column E must be "Dissatisfied".

Expected result:

Error - C/D/E - When C is "Complaint", D must be "Negative" and E must be "Dissatisfied"

The issue is that a single row can have more than one error. So, for row 1, the final result could be something like:

Error - B - Different from "Spanish"
Error - C/D/E - When C is "Complaint", D must be "Negative" and E must be "Dissatisfied"

Is there a way to implement these validation rules dynamically? For example, by reading them from an external table instead of hardcoding everything inside Column Expressions?

I’m attaching the Column Expressions code I’m currently using as a reference.

code Column Expresions.txt (51.0 KB)

1 Like

what you are looking for is the the Rule Engine, and Rule Engine Dictionary nodes.

Usually, the dictionary version is the one to go for in this case. But, this node has one limitation in the sense that once a rule matches to a row, no other rule will be applied to that row.

For your use case, you want to do the following:

add a column “aggregated errors” to your data table.
second table with your rules.
pipe both into an iterative loop.
split the first rule from your rule table and use the rule engine dictionary to make your error check, return the error into a new column “error”.
add a Column Aggregation that Concats the result (or create an array etc.).
pipe back your data table and the remaining rules.
for the iterative loop end, you only want to collect the last result and depending on your amount of rules, you may need to increase the maximum iteration counter.

The benefit is that you can even write your rules outside of Knime and just load them.

Depending on the size of your table and the amount of rules you want to check, this approach with a loop might not be suitable from a performance point of view, and a different approach will be better.

an alternative approach (not necessarily better in the sense of writting everything together: using the Expression Node (not Column Expression node), you can append 1 column, and then do one operation per rule, all doing the similar approach you already did with the column expression (appending the previous result). but you have one “edit box” per rule instead.

2 Likes

Hy Guys,

I saw the file with the expressions that you will use and can say something about it to improve the validation.

1- Make the trasnformation / create new columns with the final values to be compare. It will make easier to split logical from data informaiton.

2- you can use a column expression legacy to create at once all condicional for each case, sort the size about your programming code and avoid problems with variabes and conditionals.

3- for eache validation (new coluns with return), you can create a rule engine validation to see each result and assign a value for erros that will be null or ‘0’ as empty problems or you can contact or just set a value that show that you have some problems (like a count).

It will help you for debug part of the validation and at the end, you can sum or concatenate string about erros returns.

Some “IF questions” can use regular expressions to validate with to sort data. Example:

(
contains(lowerCase(column(“Descripción”)), “oxxo”) ||
contains(lowerCase(column(“Descripción”)), “fruna”) ||
contains(lowerCase(column(“Descripción”)), “mayorista 10”) ||
contains(lowerCase(column(“Descripción”)), “acuenta”) ||
contains(lowerCase(column(“Descripción”)), “jumbo”) ||
contains(lowerCase(column(“Descripción”)), “tottus”) ||
contains(lowerCase(column(“Descripción”)), “unimarc”) ||
contains(lowerCase(column(“Descripción”)), “líder”) ||
contains(lowerCase(column(“Descripción”)), “lider”) ||
contains(lowerCase(column(“Descripción”)), “santa isabel”) ||
contains(lowerCase(column(“Descripción”)), “local savory”) ||
contains(lowerCase(column(“Descripción”)), “punto copec”) ||
contains(lowerCase(column(“Descripción”)), “alvi”) ||
contains(lowerCase(column(“Descripción”)), “ganga”)
)

will be:

regexMatcher(contains(lowerCase(column(“Descripción”))),“(oxxo|fruna|mayorista 10|acuenta|jumbo|tottus|unimarc|líder|lider|santa isabel|local savory|punto copec|alvi|ganga)”)

So you can short the lines to just words.

The result will be a false or true and you can set it into a new column output or replace a one.

Here you can create more expressions for each needs, just click at the plus sign (+) above expression column.

After that node, you can creat another to sum all columns values for the final answer.

Another tip about conditionals is use a single question to validade a conditional and the anser will be write into 2 parts. Example:

column(“Descripción”)

// Establecimiento: Si es Punto de compra indefinido, Descripción no debe mencionar tiendas conocidas

result = errores == “” ? join(“ERROR”, " - ", “V”, " - ", “Establecimiento”, " - ", “Punto de compra mencionado en Descripción”) : join(errores, " || ", join(“ERROR”, " - ", “V”, " - ", “Establecimiento”, " - ", “Punto de compra mencionado en Descripción”));

result => will return the answer

errores == “” ? => the question for the if situation

: => 2 answers, before the “:” the first for true, and the second part for false

I try to give you some tips to short the code and a information to make transformation before the validation to not need to do it inside the code

If I can help more, just call me here again ok?

If you can upload a sample about your data (fiction ones)

Thanks,

Denis

1 Like