Remove duplicate lines

JCanelhas · November 18, 2016, 11:36pm

Hi,

I have a dataset that contains 3 columns:

Source	Target	Weight
1	2	2
1	3	1
3	2	2
2	1	2

These represent connections in a network, but since the direction of a connection is irrelevant, I consider rows 1 and 4 duplicates. How can I find all such duplicates and remove them from the table?

Thanks all in advance

Jorge

qqilihq · November 19, 2016, 2:15am

Hi,

implement a rule or a Java snippet to map the source and target column to a single identifier. Use the following rule:

if (source < target) {
    identifier = source + "-" + target;
} else {
    identifier = target + "-" + source;
}

So, for your table the following identifiers will be created:

1-2
1-3
2-3
1-2

Equal identifiers now denote duplicate rows per your definition. You can then use a GroupBy node, grouping on the identifier column, to eliminate the duplicates.
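
For anyone who wants to try the idea outside of KNIME first, here is a rough sketch in plain Java. It is not the Java Snippet node API; the class and method names (`EdgeDedup`, `identifier`, `dedupe`) are made up for illustration, and it assumes integer node ids as in the example table. It builds the identifier with the rule above and keeps the first row seen for each identifier.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class EdgeDedup {
    // Direction-independent identifier: the smaller endpoint always comes first.
    static String identifier(int source, int target) {
        if (source < target) {
            return source + "-" + target;
        } else {
            return target + "-" + source;
        }
    }

    // Rows are {source, target, weight}. Keeps the first row per identifier,
    // in the original row order (LinkedHashMap preserves insertion order).
    static List<int[]> dedupe(List<int[]> rows) {
        Map<String, int[]> firstByKey = new LinkedHashMap<>();
        for (int[] row : rows) {
            firstByKey.putIfAbsent(identifier(row[0], row[1]), row);
        }
        return new ArrayList<>(firstByKey.values());
    }

    public static void main(String[] args) {
        // The four rows from the question.
        List<int[]> rows = List.of(
            new int[] {1, 2, 2},
            new int[] {1, 3, 1},
            new int[] {3, 2, 2},
            new int[] {2, 1, 2});
        for (int[] row : dedupe(rows)) {
            System.out.println(row[0] + "\t" + row[1] + "\t" + row[2]);
        }
    }
}
```

With the table from the question, the row 2/1/2 is dropped because it maps to the same identifier as row 1/2/2.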

Philipp

JCanelhas · November 19, 2016, 3:41am

Hi, thanks for the prompt answer. That solution is fine for numbers, but my data also includes text.

But since the data came from a list, I created IDs and used your code.

Thanks once again

JC

ImNotGoodSry · December 5, 2016, 3:43pm

Hey Jorge,

you can connect a Column Aggregator node to your table and select your Source and Target columns as the aggregation columns. As the aggregation method, choose List (sorted), which works with both strings and numbers. Then add a GroupBy node and use the newly created column as the group column. To keep your "Weight" column, aggregate it with First.
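
To illustrate why the sorted list works for text as well, here is a small plain-Java sketch of the same logic. It only mimics what the two nodes do and is not KNIME code; the class name `SortedPairGroup`, the method names, and the sample names in `main` are all made up for the example. The sorted pair plays the role of the List (sorted) column, and `putIfAbsent` plays the role of the First aggregation.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SortedPairGroup {
    // Sorted two-element list: (a, b) and (b, a) give the same key.
    static List<String> sortedKey(String source, String target) {
        String[] pair = {source, target};
        Arrays.sort(pair);
        return Arrays.asList(pair);
    }

    // Rows are {source, target, weight}. Groups by the sorted key and
    // keeps the first weight seen for each group.
    static Map<List<String>, String> groupFirst(List<String[]> rows) {
        Map<List<String>, String> firstWeight = new LinkedHashMap<>();
        for (String[] row : rows) {
            firstWeight.putIfAbsent(sortedKey(row[0], row[1]), row[2]);
        }
        return firstWeight;
    }

    public static void main(String[] args) {
        // Hypothetical text data, not taken from the original dataset.
        List<String[]> rows = List.of(
            new String[] {"alice", "bob", "2"},
            new String[] {"alice", "carol", "1"},
            new String[] {"bob", "alice", "2"});
        groupFirst(rows).forEach((key, weight) ->
            System.out.println(key + "\t" + weight));
    }
}
```

Note that this sketch compares everything as strings; for numeric columns KNIME's own sorting applies instead.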

Best,
Marc

remove_duplicate_lines.knwf