Spark: Join two RDDs without key, just by sequence of rows

I would like to join two RDDs with the same number of rows, but I do not have a matching key.

First RDD:

DIM1 DIM2 DIM3
1.34 1.45 -3.45
3.4 5.6 -1.2

Second RDD:

A B C
abc abc1 abc2
qwer qwer1 qwer2

The resulting RDD should be:

DIM1 DIM2 DIM3 A B C
1.34 1.45 -3.45 abc abc1 abc2
3.4 5.6 -1.2 qwer qwer1 qwer2

The "Spark Joiner" node does not support this. Any other ideas?

Best, Frank

Hi,

Hm, I am not even sure this is possible in all cases. You can try using the Spark SQL node to add a generated ID column to each RDD (the IDs start at 0 and increase, but they are not guaranteed to be consecutive):

SELECT *, monotonically_increasing_id() AS id FROM #table#

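Then you can do an inner join on the id columns. Unfortunately, this only gives the desired result if both RDDs have the same number of partitions and the same number of rows in each partition, because each generated ID is built from the partition index and the row's position within that partition.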
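To make the caveat concrete, here is a small plain-Python sketch (no Spark needed) that imitates how the Spark documentation describes the current implementation of monotonically_increasing_id(): partition index in the upper 31 bits, row position within the partition in the lower 33 bits. The helper names simulated_ids and join_by_id and the partition layouts are made up for illustration; this is a model of the behaviour, not Spark code.

```python
# Plain-Python model of how Spark documents monotonically_increasing_id():
# partition index in the upper 31 bits, row position within the partition
# in the lower 33 bits. Helper names and layouts are hypothetical.

def simulated_ids(rows_per_partition):
    """Return the IDs that would be assigned, given each partition's row count."""
    ids = []
    for partition_index, row_count in enumerate(rows_per_partition):
        for offset in range(row_count):
            ids.append((partition_index << 33) + offset)
    return ids


def join_by_id(left_rows, left_layout, right_rows, right_layout):
    """Inner-join two row lists on their simulated IDs."""
    right_by_id = dict(zip(simulated_ids(right_layout), right_rows))
    return [
        left + right_by_id[row_id]
        for row_id, left in zip(simulated_ids(left_layout), left_rows)
        if row_id in right_by_id
    ]


dims = [(1.34, 1.45, -3.45), (3.4, 5.6, -1.2)]
labels = [("abc", "abc1", "abc2"), ("qwer", "qwer1", "qwer2")]

# Same layout on both sides (one partition with two rows): rows pair up in order.
same_layout = join_by_id(dims, [2], labels, [2])

# Different layouts (two rows in one partition vs. one row in each of two
# partitions): the second row gets ID 1 on the left but 2**33 on the right,
# so the inner join silently drops it.
different_layout = join_by_id(dims, [2], labels, [1, 1])

print(same_layout)
print(different_layout)
```

With identical layouts both rows are joined; with differing layouts only the first row survives, which is why the ID approach is fragile unless you control the partitioning of both inputs.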

Best,

Björn