Parsing/Grouping Column Values Into Variables For Rule-Based Row Filter Question

I’ve tried searching the forums for a solution and I’m probably blind or just really missing something, but I need help designing a method for comparing values.

I have an Excel spreadsheet that contains two sheets (MF1 and Validation Data). The MF1 sheet contains rows of data relating to employee records. The “Validation Data” tab contains a table of what are considered valid values for most of the columns in the MF1 sheet. Its columns are:

TabName - (the Excel sheet the valid codes are for, in this case “MF1”)
FieldCheck - (the column name the valid codes are for)
ValueCheck - (a valid code for that column)
Info - (additional values that can be used for validation)
Info2 - (other additional values used for comparing/validating codes)

For the Gender column on the MF1 tab, for example, there would be one row for each valid Gender code:

MF1, Gender, M
MF1, Gender, F

Some columns can have hundreds of valid codes/values that would need to be used for comparison in the Rule-based Row Filter.

In an ideal world, I would like to be able to write a Rule-based Row Filter expression that outputs the rows from the MF1 tab that don’t match the valid codes on the Validation Data tab. In addition, I would need each variable to be named after the column it validates, so that writing the expressions is straightforward.
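For the Gender example above, such a rule might look something like this (hand-maintaining value lists like this for columns with hundreds of codes is exactly what I want to avoid):

    // output rows whose Gender code is not one of the valid codes
    NOT $Gender$ IN ("M", "F") => TRUE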

I’m really sorry if I’m missing something super obvious (and I feel like I must be), but I’m just not getting why I can’t figure this out.

Any help you could offer would be greatly appreciated.

Hi @CoreJeremy

Welcome to the KNIME Community!

Could you share an anonymized example of your input data and expected output? People are generally more willing to help you out if you help them as much as possible :wink:

At face value I’d say the amount of rule-based row filtering needed should be fairly limited. Let’s assume I have an input table (left) and a validation table (right) like this:

For each column, you can perform a left outer join to determine if the value is “valid” according to the validation table.

The end result would be like this:

[image: joined result table with val_column1 and val_column2 columns]

Meaning, only the second row should be considered valid. The subsequent Rule-based Row Filter then only needs to keep the rows that have no missing values in the joined validation columns.
In this example, that would be through NOT MISSING $val_column1$ AND NOT MISSING $val_column2$ => TRUE.
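And since you want to output the rows that don’t match, inverting the conditions gives you exactly those:

    // keep rows where at least one validation column found no match
    MISSING $val_column1$ OR MISSING $val_column2$ => TRUE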

I’d say flow variables etc. are a step further, but let’s first get your use case covered :wink: Again, some more insight into your input and desired output goes a long way.


Thank you for your response ArjenEX, and I apologize for the radio silence. After a LOT of trial and error and mixing and matching, I was able to come up with a solution. There is probably some refinement that could still be done, but for now it works.

My “raw” data looks like this:
[image: raw input data table]

Here is what I came up with:

  1. I used GroupBy nodes and Rule-based Row Filters to split the soon-to-be variables by data type (string, numeric, etc.).

  2. Then I used a “Collection to String” node that I found here on the Community Hub to change how the collection is stored (different data types require different adjustments to be usable as a variable). This step takes a set like [a,b,c] and turns it into a literal string like "a","b","c".

  3. Then I transposed the table(s) and used the “Row to Column Names” node to apply a header, which then becomes the variable name I need access to.

  4. Finally, after appending the columns of the different streams together, I used a “Table Row to Variable Loop Start” node to iterate over the single row of collections. The result is a long list of variables I can use in a Rule Engine (see the sketch after this list) to compare against the data submitted to me, outputting error messages when the submitted values don’t match the valid values on my end.
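To give an idea of where the variables end up (the Gender column and its codes are just the example from my first post; in the real workflow the quoted value list comes from the collection-string variable built in step 2), a rule in the Rule Engine looks roughly like this:

    // flag records whose Gender code is not in the valid set
    NOT $Gender$ IN ("M", "F") => "Invalid Gender code"
    TRUE => "OK"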

This is what the finished metanode looks like:

The variables were now ready to be used in a Rule Engine on the incoming data set.

I could now compare the valid-value variables to the incoming rows. Where a value isn’t valid, the workflow stores the row along with the error message from the Rule Engine identifying what is wrong with the record.

To take this a step further, I’d want to find a way to concatenate multiple errors when more than one rule is violated. But that will likely take some more research.
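One idea I may try (untested on my end): give each rule its own Rule Engine writing to its own error column that defaults to an empty string, then merge the per-field error columns into a single message, e.g. with a Column Aggregator set to Concatenate:

    // per-field rule; empty string when the code is valid
    NOT $Gender$ IN ("M", "F") => "Invalid Gender code"
    TRUE => ""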

