Spark: Join two RDDs without key, just by sequence of rows

frank · November 5, 2017, 2:15pm

I would like to join two RDDs that have the same number of rows but no matching key.

First RDD:

DIM1	DIM2	DIM3
1.34	1.45	-3.45
3.4	5.6	-1.2

Second RDD:

A	B	C
abc	abc1	abc2
qwer	qwer1	qwer2

frank · November 5, 2017, 2:26pm

The resulting RDD should be:

DIM1	DIM2	DIM3	A	B	C
1.34	1.45	-3.45	abc	abc1	abc2
3.4	5.6	-1.2	qwer	qwer1	qwer2

The "Spark Joiner" node does not support this. Any other ideas?

Best, Frank

bjoern.lohrmann · November 8, 2017, 1:46pm

Hi,

Hm, I am not sure this is possible in all cases. You can try using the Spark SQL node to add a generated ID column (starting at 0) to each RDD:

SELECT *, monotonically_increasing_id() AS id FROM #table#

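Then you can do an inner join on the id columns. Unfortunately, this only gives the desired result if both RDDs have the same number of partitions and the same number of rows in each partition. The generated IDs are unique and increasing but not consecutive: each ID encodes the partition index together with the row's position within that partition.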
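
To make the partition caveat concrete, here is a minimal pure-Python sketch that imitates how monotonically_increasing_id() builds its values. It assumes the layout described in the Spark documentation for the current implementation (partition index in the upper 31 bits, row position within the partition in the lower 33 bits). The helpers simulated_ids and joined_positions are hypothetical names made up for this illustration; they are not part of Spark or KNIME, and no Spark installation is needed to run it.

```python
# Pure-Python model of the ID layout used by monotonically_increasing_id().
# Assumption (from the Spark docs): partition index in the upper 31 bits,
# row position within the partition in the lower 33 bits.

def simulated_ids(partition_sizes):
    """Return the IDs of a dataset whose partitions hold the given row counts."""
    ids = []
    for partition_index, row_count in enumerate(partition_sizes):
        for row_in_partition in range(row_count):
            ids.append((partition_index << 33) + row_in_partition)
    return ids


def joined_positions(left_sizes, right_sizes):
    """Simulate an inner join on id and return (left_row, right_row) pairs.

    Row numbers are 0-based positions in the overall row sequence.
    """
    left = {id_: pos for pos, id_ in enumerate(simulated_ids(left_sizes))}
    right = {id_: pos for pos, id_ in enumerate(simulated_ids(right_sizes))}
    return sorted((left[i], right[i]) for i in left.keys() & right.keys())


# Same partition layout on both sides: every row meets its counterpart.
print(joined_positions([2, 2], [2, 2]))
# [(0, 0), (1, 1), (2, 2), (3, 3)]

# Same row count (4) but a different layout: one row is lost and
# left row 2 is paired with right row 3.
print(joined_positions([2, 2], [3, 1]))
# [(0, 0), (1, 1), (2, 3)]
```

In the second call both sides have four rows, yet the join silently drops one row and mismatches another, which is why the partition layout has to be identical on both RDDs for this approach to work.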

Best,

Björn
