Match with items in a list between columns

nightsky3 · January 7, 2022, 7:38am

I’m looking to see if there are matches between 2 specific columns I’m interested in.

I used the pivoting node to bring all the different values into a list. The second column also has a list. So basically comparing if any items within either lists match.

Now I’m trying to use the Rule Engine Node - so that any time that the item in the second column is found within the list of the 1st column - it appends a new column and says Yes or No.

I tried with the writing below in Rule Engine, but it doesn’t seem to be working for me.

$Column1+List$ IN $Column2+List$ => “Yes”
$Column1+List$ IN $Column2+List$ => “No”

Still quite new to KNIME, so any help is hugely appreciated!

omprakashjena · January 7, 2022, 7:54am

Here is the workflow attached.

columnMatch.knwf (11.6 KB)

nightsky3 · January 9, 2022, 4:22pm

Thank you, but its telling me the first column that I’m trying to match with is not a string. Both columns have been aggregated into lists.

“Expression before MATCHES is not a string”

Thyme · January 9, 2022, 4:57pm

That’s because the first argument has to be a string (or something that can be made into a string). The operator “IN” is asking the question “Does this String also appear in that List of other Strings?”.

It looks like you are asking “Does this List and that other List have any overlaping elements?” which requires a different approach. Do you happen to have any sample data you can share with us? Preferably in a state before you aggregate the Columns into Lists. Doesn’t have to be real, a few rows of made-up data is fine as well.
Oh, and do you want to compare the entire Column1 with Column2, or do you want to do that in groups? If so, what’s the condition?
BR, T

nightsky3 · January 10, 2022, 12:44am

Essentially my aim is to find whether for each country - do the crash causes between its car and bus crashes overlap. I want to remove any that don’t.

For example in the picture below - Japan has the same crash cause between Car & Bus being Pedestrian - so because the crash causes match between Car & Bus - I want to keep these rows.

Thailand and S.Korea don’t have matches between Car & Bus - I want to remove them.

That’s why I thought if I could make a new column saying Match - I could then remove any rows that said No.

Please find an image of sample data below.

Capture22

bruno29a · January 10, 2022, 5:19am

Hi @nightsky3 , since you are not just checking columns, but rows also this is better done via a Join, and best done via a Group By.

I created a workflow that does both methods and looks like this:

Both gives the same results, but the GroupBy is much more faster.

Explanation as follows.
GroupBy Method:
The idea here is to do a Group By Country and Crash Cause and just do a unique count on Transport. If the unique count is greater than 1, it means you have both Car and Bus as Transport for the same Country and same Crash Cause:

And since it is what we want, just need to filter on those that have unique count 2 via a Row Filter:

Then just join back with the original data, on Country and Crash Cause, and we get:

Join Method:
As with the GroupBy, we want to check which Country and Crash Cause combination have both Car and Bus as Transport. We just Split via the Row Splitter the data into 2 tables, one for Car records and one for Bus records:

We then join both tables on Country and Crash Cause. Whatever results we get will be the Country and Crash Cause that have both Car and Bus as Transport:

If you compare with the GroupBy Results, that’s what we got there too (2 x {Japan + Pedestrian}).

We do a de-duplication, as we just want to know that it’s Japan and Pedestrian.

After that, we just join back with the original data on Country and Crash Cause, as we did in the Group By method, and we get the same results:

Here’s the workflow: Match with items in a list between columns.knwf (21.8 KB)

nightsky3 · January 10, 2022, 5:35am

Thank you, this worked like a charm!!!

system · January 17, 2022, 5:35am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.