Encountered duplicate row ID while using X-Aggregator

This is what my workflow looks like:

I checked the data itself, but there are no duplicates in it. DBSCAN on the same data was able to produce an output.

Any help is much appreciated.

When you use Table to Spark, the row IDs generated for each Spark table are new row IDs starting from 0, which leads to overlapping row IDs when the results are collected. If you are using Spark, that implies your data is quite large, so I recommend using the Spark Partitioning node instead. Cross-validation is great when your data is small to medium-sized, but for larger data it is unnecessary in my opinion. The setup would also be quite complex (going from table to Spark and Spark to table may be costly), so I would avoid the X-nodes.
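For reference, the Spark Partitioning node essentially performs a random train/test split on the Spark data. Outside of KNIME, the same idea looks roughly like this in PySpark (just a sketch; the file path and split ratio are placeholders, not part of your workflow):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder input; in KNIME this would be whatever arrives at the
# Spark Partitioning node's input port.
df = spark.read.parquet("data.parquet")

# Random 80/20 split with a fixed seed, roughly what the
# Spark Partitioning node does for you.
train, test = df.randomSplit([0.8, 0.2], seed=42)
```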


Thanks for your answer.

Actually my data is not that big; I'm using Spark because of the Collaborative Learner for my MovieLens recommendation engine.
Cross-validation would be helpful in this case. The Spark Partitioning node doesn't support a split into more than two parts, and after that I can only use one of them in the loop. Is there a way to replace the X-Partitioner in a Spark context?

Hello,

Unfortunately, we do not have dedicated nodes for this with Spark DataFrame Input/Output.
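If you want to build the cross-validation loop yourself outside KNIME's dedicated nodes, a manual k-fold split on a Spark DataFrame might look roughly like the sketch below. This is plain PySpark, not KNIME nodes; the file path and the column names userId, movieId, rating are assumptions based on the MovieLens data mentioned above:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator

spark = SparkSession.builder.getOrCreate()

# Assumed MovieLens ratings file with columns userId, movieId, rating.
ratings = spark.read.csv("ratings.csv", header=True, inferSchema=True)

k = 5
# Tag every row with a random fold index instead of relying on row IDs,
# so per-fold results can never collide.
folds = ratings.withColumn("fold", F.floor(F.rand(seed=42) * k).cast("int"))

evaluator = RegressionEvaluator(metricName="rmse",
                                labelCol="rating",
                                predictionCol="prediction")
rmse_per_fold = []

for i in range(k):
    train = folds.filter(F.col("fold") != i).drop("fold")
    test = folds.filter(F.col("fold") == i).drop("fold")

    # ALS is Spark ML's collaborative filtering learner.
    als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating",
              coldStartStrategy="drop", seed=42)
    model = als.fit(train)
    rmse_per_fold.append(evaluator.evaluate(model.transform(test)))

print(sum(rmse_per_fold) / k)
```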