Implementing Merge on Pyspark Knime Node

oshin · October 21, 2020, 10:20am

Hello,
I am trying to implement an outer join between two data frames in a KNIME PySpark node. The lists of keys to join on, for df1 and df2 respectively, come from flow variables (since I am taking them as input from the user).
So for file A I have fileA_key1, fileA_key2 and fileA_key3;
similarly, for file B I have fileB_key1, fileB_key2 and fileB_key3 (I am using a fixed set of 3 keys).

What I am looking for is a PySpark statement equivalent to joindf = pd.merge(df1, df2, left_on=list1, right_on=list2, how='left'), so that I can define left_on and right_on based on user input. I referred to this link,
Pyspark Merge Stack Overflow Article, but to no avail. Please see the image attached below.

How could I address this scenario? Any help with PySpark for the KNIME node would be appreciated. I want to join two data frames on multiple columns, where the column names for each data frame come from a list that is built from flow variables (string type). A new approach or a code solution from scratch is also welcome.

Thanks!

gab1one · October 22, 2020, 11:55am

Hi @oshin,

This looks like a Python syntax error: you can't use curly braces { like that in Python. The post you linked uses Scala syntax, which is not valid Python.

I recommend developing the code in a Python IDE such as PyCharm, or VS Code with Python plugins, and then pasting it into the Python node. That way you will get better error messages.
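
For reference, here is a minimal sketch of how the pd.merge call with left_on and right_on could be written in valid PySpark. It relies on DataFrame.join accepting a list of column conditions for its on argument, which are combined with AND. The helper names key_pairs and merge_on_key_lists are made up for illustration, and how the flow variables end up in the two key lists depends on the node, so treat that part as an assumption.

```python
def key_pairs(left_keys, right_keys):
    """Pair left and right key names by position (hypothetical helper)."""
    if len(left_keys) != len(right_keys):
        raise ValueError("left and right key lists must have the same length")
    return list(zip(left_keys, right_keys))


def merge_on_key_lists(df1, df2, left_keys, right_keys, how="left"):
    """Join two Spark DataFrames on differently named key columns.

    Rough counterpart of
    pd.merge(df1, df2, left_on=left_keys, right_on=right_keys, how=how).
    """
    # One equality condition per key pair; join() ANDs a list of conditions.
    conditions = [df1[left] == df2[right]
                  for left, right in key_pairs(left_keys, right_keys)]
    return df1.join(df2, on=conditions, how=how)
```

With the key names from the question, a call might look like merge_on_key_lists(df1, df2, ["fileA_key1", "fileA_key2", "fileA_key3"], ["fileB_key1", "fileB_key2", "fileB_key3"], how="left"), with the two lists filled from the flow variables. Passing how="outer" gives a full outer join instead. Both sets of key columns are kept in the result, so if the two data frames share other column names you may need to rename or drop columns afterwards.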

best,
Gabriel

sascha.wolke · October 29, 2020, 3:14pm

Hi @oshin,

Do you need the PySpark node? Maybe you can use the Spark Joiner node instead.

Cheers,
Sascha
