I would like to join two RDDs with the same number of rows, but I do not have a matching key.
First RDD:
DIM1 | DIM2 | DIM3 |
---|---|---|
1.34 | 1.45 | -3.45 |
3.4 | 5.6 | -1.2 |
Second RDD:
A | B | C |
---|---|---|
abc | abc1 | abc2 |
qwer | qwer1 | qwer2 |
The resulting RDD should be:
DIM1 | DIM2 | DIM3 | A | B | C |
---|---|---|---|---|---|
1.34 | 1.45 | -3.45 | abc | abc1 | abc2 |
3.4 | 5.6 | -1.2 | qwer | qwer1 | qwer2 |
The "Spark Joiner" node does not support this. Any other ideas?
Best, Frank
Hi,
Hm, I am not sure this is possible in all cases. You can try using the Spark SQL node to add a generated ID column to each RDD (the IDs start at 0 and are unique, but they are not guaranteed to be consecutive):
SELECT *, monotonically_increasing_id() AS id FROM #table#
Then you can do an inner join on the id columns. Unfortunately, this only gives the desired result if both RDDs have the same number of partitions and the same number of rows in each partition, because the generated ID encodes the partition index and the row's position within that partition.
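To illustrate why the partition layout matters, here is a small plain-Python sketch (no Spark needed). It mimics the ID layout described in the Spark documentation for monotonically_increasing_id(), with the partition index in the upper bits and the row offset in the lower 33 bits. The helper names `monotonic_ids` and `inner_join_on_id` and the way the rows are split into partitions are made up for this example; this is not how the KNIME nodes are implemented.

```python
def monotonic_ids(partitions):
    """Attach an ID to every row: (partition index << 33) + offset in partition.

    This imitates the documented layout of Spark's
    monotonically_increasing_id(); it is a simulation, not Spark itself.
    """
    return [
        [((p << 33) + i, row) for i, row in enumerate(part)]
        for p, part in enumerate(partitions)
    ]


def inner_join_on_id(left, right):
    """Inner join of two ID-tagged, partitioned tables on the generated ID."""
    right_by_id = {rid: row for part in right for rid, row in part}
    return [
        lrow + right_by_id[lid]
        for part in left
        for lid, lrow in part
        if lid in right_by_id
    ]


# First table from the question, assumed to be split into two partitions.
left = [[(1.34, 1.45, -3.45)], [(3.4, 5.6, -1.2)]]

# Second table with the same layout (two partitions, one row each).
right_same = [[("abc", "abc1", "abc2")], [("qwer", "qwer1", "qwer2")]]

# Second table with a different layout (one partition, two rows).
right_diff = [[("abc", "abc1", "abc2"), ("qwer", "qwer1", "qwer2")]]

joined_same = inner_join_on_id(monotonic_ids(left), monotonic_ids(right_same))
joined_diff = inner_join_on_id(monotonic_ids(left), monotonic_ids(right_diff))

print(len(joined_same))  # both rows are matched
print(len(joined_diff))  # the second row has no matching ID and is dropped
```

With matching layouts both rows are joined as desired. With the different layout the second row of the first table gets the ID 2^33 while the second row of the other table gets the ID 1, so the inner join silently loses that row.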
Best,
Björn