Update null values that were marked as duplicates by Duplicate Row Filter

J_U · July 26, 2023, 4:00pm

Hello,

I have multiple columns of string data that I need to check for duplicates occurring within each column (i.e., not across the row or in a different row/column).

I used the Duplicate Row Filter in a Column List Loop Start and got the desired output, i.e. each data column has a “duplicate-type-classifier” and “duplicate-type-identifier” column next to it showing the unique, chosen, and duplicate values for each row.

The issue I’m running into is that null values are also identified as duplicate values. I didn’t see a way to do prevent that in the Duplicate Row Filter, so I have been looking for a workaround to remove them.

I have multiple columns so I am picturing the final solution would use a loop (and possibly a variable?) to check all of them (e.g. if the data_col is null , set the duplicate-row-identifier to null, else use its existing value) but I wasn’t sure how to string the tools together to check and replace all of the columns.

Here’s the result I have:

data_col	duplicate-type-classifier	duplicate-row-identifier	data_col (#01)	duplicate-type-classifier (Iter #1)	duplicate-row-identifier (Iter #1)
ABC123	chosen	?	XYZ000	unique	?
DEF456	unique	?	JKL111	unique	?
GHI891	unique	?	?	chosen	?
ABC123	duplicate	Row1	?	duplicate	Row 3

Here’s the result I would like to get:

data_col	duplicate-type-classifier	duplicate-row-identifier	data_col (#01)	duplicate-type-classifier (Iter #1)	duplicate-row-identifier (Iter #1)
ABC123	chosen	?	XYZ000	unique	?
DEF456	unique	?	JKL111	unique	?
GHI891	unique	?	?	chosen	?
ABC123	duplicate	Row1	?	duplicate	?

The duplicate-type-classifier will not be included in the final output so it doesn’t necessarily need to be corrected.

Apologies if this is a little convoluted, I can post additional details if needed if anyone has an idea on how to make this work. Thanks for taking a look.

takbb · July 26, 2023, 4:10pm

Hi @J_U ,

What is the duplicate-row-identifier (Iter #1) column and how is it being derived? Why does it show “Row 3” in the current result? You haven’t mentioned it in the text as far as I can see, but it appears to be the only difference between your current and required tables, so difficult to work out how to assist in what you are trying to do.

Would you be able to post the sample of data before the duplicate row filter is applied, so we can see more clearly what is going on.

One other question… in your data, how many columns will contain some missing values that you need to include as non-duplicates?

ArjenEX · July 26, 2023, 4:28pm

Hi @J_U

I think I get where you came from. Starting point:

To solve this:

Add a Row Filter node within the loop
Configure it to filter out empty rows based on a variable column.

image639×607 18.9 KB

The Column List loop automatically creates a flow variable for the current column it handles. Use this one to override the ColumnName.

Make sure to have the appending version of the loop end. Final result:

Would this work?

J_U · July 26, 2023, 4:39pm

Hi @takbb ,

Thanks for your quick response, I have broken out your questions below for ease of reply:

What is the duplicate-row-identifier (Iter #1) column and how is it being derived?
The duplicate-type-classifier and duplicate-row-identifier columns are generated by the Duplicate Row Filter - I have “Keep Duplicate Rows” “Add column showing duplicates”, and “Add column identifying the ROWID of the chosen row for each duplicate row” selected for the node in its Advanced tab.

Why does it show “Row 3” in the current result?
That’s the question I am trying to resolve, it is puling in the ROWID of the chosen row for each duplicate row per the node settings above, which in my case is correct for rows that are non-null, but not for rows that are null.

You haven’t mentioned it in the text as far as I can see, but it appears to be the only difference between your current and required tables, so difficult to work out how to assist in what you are trying to do.
Correct, that is the only difference, and the one I am trying to fix - by either preventing the Duplicate Row Filter from identifying nulls as duplicates, or by applying a workaround to refine it afterwards (or other possible solutions).

Would you be able to post the sample of data before the duplicate row filter is applied, so we can see more clearly what is going on.
Sure, here is a brief sample:

data_col	data_col (#01)
ABC123	XYZ000
DEF456	JKL111
GHI891	?
ABC123	?

One other question… in your data, how many columns will contain some missing values that you need to include as non-duplicates?
There will be multiple columns, the number varies each time. Most likely it will be in the low hundreds.

J_U · July 26, 2023, 5:05pm

@ArjenEX many thanks! That is exactly what was needed. Thanks for your help!

system · August 2, 2023, 5:05pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.