checking Values in 2 columns

Shadi · July 14, 2023, 11:34am

Hi all,

I am trying to compare 2 columns to make sure that all data in column A also exist in column B but not necessary in the same row

Column A does not contain any duplicates
Column B contain duplicates and that is fine
I just want to make sure that each value in column A also exist somewhere in column B.

Could anyone help me with this?
Thanks

AnotherFraudUser · July 14, 2023, 1:27pm

Hi @Shadi

i think you are looking for the reference row filter

just use the same table for both inputs

FrankColumbo · July 14, 2023, 1:39pm

Hello @Shadi

I am taking your question literally. “Does each value in Column A exist in Column B?” the only possible answers are True or False. Also, I am using the rule Column A does not contain duplicates.

I have made examples and have put the in the hub here: checking values in 2 columns – KNIME Community Hub

Because, this is data in, examine data, boolean out I have made a meta node that will return true or false.

The picture below show what is inside the metanode (in a badly draw red oval)

*Edit 1459 BST I realised I uploaded the wrong version. Correct version is now in the hub

Hope this helps

Frank

P.S.

If anyone has any suggestions on making my flow better / more efficient - please let me know. Every day is a learning day.

takbb · July 14, 2023, 2:32pm

Hi @Shadi

You could also see what is missing using just a joiner node:

Comparison of Column1 vs Column2

Perform outer Join as follows:

Return just Column1 from one table and Column2 from the other.

And then the middle and lower ports will show you which items exist in only one or other of the columns

FrankColumbo · July 14, 2023, 2:34pm

@takbb

you are The Man

Shadi · July 14, 2023, 2:54pm

Thank you all for your support

takbb · July 14, 2023, 2:55pm

lol, thank you @FrankColumbo you are too kind!

I am impressed at your flow which effectively emulates a sort-merge join. One thing though, is that there is a very small flaw in the logic, in that if it were presented with data such as this:

which still satisfies the “A doesn’t contain duplicates but B can” specification, this would say that there are items present in one but not in the other.

This is because of the missing value in A which by definition must be allowed since A must have missing values if it contains no duplicates, but B does, and also because if B contains duplicates then a sort without removing duplicates would always see rows being “offset” when they are appended and tested at your Rule Engine.

To avoid that, I think you would need to add a duplicate row filter on your lower branch and also remove “missing rows” with a row filter on both branches:

Then I think it would be good to go.

(all that said, I may be wrong in how we should treat missing values! Are they considered to be values in their own right, in which case my solution isn’t quite correct either as missing values don’t match in a joiner)

takbb · July 14, 2023, 2:59pm

You are welcome @Shadi

system · October 12, 2023, 3:00pm

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.