I am trying to compare 2 columns to make sure that all data in column A also exist in column B but not necessary in the same row
Column A does not contain any duplicates
Column B contain duplicates and that is fine
I just want to make sure that each value in column A also exist somewhere in column B.
Could anyone help me with this?
i think you are looking for the reference row filter
just use the same table for both inputs
I am taking your question literally. “Does each value in Column A exist in Column B?” the only possible answers are True or False. Also, I am using the rule Column A does not contain duplicates.
I have made examples and have put the in the hub here: checking values in 2 columns – KNIME Community Hub
Because, this is data in, examine data, boolean out I have made a meta node that will return true or false.
The picture below show what is inside the metanode (in a badly draw red oval)
*Edit 1459 BST I realised I uploaded the wrong version. Correct version is now in the hub
Hope this helps
If anyone has any suggestions on making my flow better / more efficient - please let me know. Every day is a learning day.
You could also see what is missing using just a joiner node:
Comparison of Column1 vs Column2
Perform outer Join as follows:
Return just Column1 from one table and Column2 from the other.
And then the middle and lower ports will show you which items exist in only one or other of the columns
Thank you all for your support
lol, thank you @FrankColumbo you are too kind!
I am impressed at your flow which effectively emulates a sort-merge join. One thing though, is that there is a very small flaw in the logic, in that if it were presented with data such as this:
which still satisfies the “A doesn’t contain duplicates but B can” specification, this would say that there are items present in one but not in the other.
This is because of the missing value in A which by definition must be allowed since A must have missing values if it contains no duplicates, but B does, and also because if B contains duplicates then a sort without removing duplicates would always see rows being “offset” when they are appended and tested at your Rule Engine.
To avoid that, I think you would need to add a duplicate row filter on your lower branch and also remove “missing rows” with a row filter on both branches:
Then I think it would be good to go.
(all that said, I may be wrong in how we should treat missing values! Are they considered to be values in their own right, in which case my solution isn’t quite correct either as missing values don’t match in a joiner)
This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.