How to export dict data to the output port in PySpark Script

Hello KNIME Support Team.

I am using the PySpark Script node and want to put all the data from a dictionary into one column. The PySpark Script node's output must be a DataFrame to pass through the output port, so I need to put the dictionary into a single column and turn it into a DataFrame.

I have some code that works in Python, but I am having difficulty applying it to PySpark.

How can I put the above values into a single column like below?

Any help would be appreciated.

Hi @JaeHwanChoi,

You might like to export the details as individual columns using something like this in the PySpark Script Source:

models = [
  { 'name': 'model_1', 'indicator_1': 0.65, 'indicator_2': 0.63, 'indicator_3': 0.88 },
  { 'name': 'model_2', 'indicator_1': 0.83, 'indicator_2': 0.76, 'indicator_3': 0.93 }
]
resultDataFrame1 = spark.createDataFrame(models)

This results in such a table:

+-----------+-----------+-----------+-------+
|indicator_1|indicator_2|indicator_3|name   |
+-----------+-----------+-----------+-------+
|0.65       |0.63       |0.88       |model_1|
|0.83       |0.76       |0.93       |model_2|
+-----------+-----------+-----------+-------+

And this schema:

root
 |-- indicator_1: double (nullable = true)
 |-- indicator_2: double (nullable = true)
 |-- indicator_3: double (nullable = true)
 |-- name: string (nullable = true)
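If the data starts out as a nested dictionary keyed by model name (as in the second variant below), it can be flattened into this list-of-dicts shape with plain Python before calling `spark.createDataFrame`. A minimal sketch, reusing the example values from above (the `models_by_name` and `rows` names are just for illustration):

```python
# Nested dict keyed by model name, with the example values from above
models_by_name = {
    'model_1': {'indicator_1': 0.65, 'indicator_2': 0.63, 'indicator_3': 0.88},
    'model_2': {'indicator_1': 0.83, 'indicator_2': 0.76, 'indicator_3': 0.93},
}

# Flatten into one row dict per model: the outer key becomes a 'name'
# column, the inner dict supplies the indicator columns
rows = [{'name': name, **indicators}
        for name, indicators in models_by_name.items()]

# In the PySpark Script node, this list can then feed the output port:
# resultDataFrame1 = spark.createDataFrame(rows)
```

This produces the same flat table as above, one row per model.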



If you really want to put everything into one column and row, you can do something like this:

models = {
	'model_1': { 'indicator_1': 0.65, 'indicator_2': 0.63, 'indicator_3': 0.88 },
	'model_2': { 'indicator_1': 0.83, 'indicator_2': 0.76, 'indicator_3': 0.93 }
}
resultDataFrame1 = spark.createDataFrame([{ 'dict_detail': models }])

This results in such a table:

+--------------------------------------------------------------------------------------------------------------------------------------------------------+
|dict_detail                                                                                                                                             |
+--------------------------------------------------------------------------------------------------------------------------------------------------------+
|{model_2 -> {indicator_1 -> 0.83, indicator_2 -> 0.76, indicator_3 -> 0.93}, model_1 -> {indicator_1 -> 0.65, indicator_2 -> 0.63, indicator_3 -> 0.88}}|
+--------------------------------------------------------------------------------------------------------------------------------------------------------+

And this is the schema:

root
 |-- dict_detail: map (nullable = true)
 |    |-- key: string
 |    |-- value: map (valueContainsNull = true)
 |    |    |-- key: string
 |    |    |-- value: double (valueContainsNull = true)

You can find more information about PySpark in the official documentation: PySpark documentation

Cheers,
Sascha


Thank you for your response. @sascha.wolke

I’m new to PySpark and it’s tricky to get it to work with KNIME, so I have one additional question.

When I pull your code into the PySpark Script node in KNIME, the “:” is replaced with “->”. Is this an inherent problem?

Thanks.

Hi @JaeHwanChoi,

When I pull your code into the PySpark Script node in KNIME, the “:” is replaced with “->”.

I guess you mean the second variant? The single column/row with the dict/struct cannot be used in KNIME, as there is no similar type to represent it there. You might be able to use it in other PySpark nodes, but not in KNIME itself. If you use the Spark to Table node or the preview, the value gets converted into a string in KNIME, and that string contains the arrows. It is not really usable in KNIME this way anymore, as it is only a string.

I suggest going with the first variant, as it outputs a flat table that can be used directly in KNIME.
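If the whole dictionary really has to travel through KNIME in a single column, one possible workaround (not discussed above, just a common pattern) is to serialize it to a JSON string first: a string column survives the Spark-to-KNIME conversion and, unlike the `->` rendering of the map type, stays machine-parseable. A minimal sketch with the example data (the `dict_json` column name is just an illustration):

```python
import json

# Same nested example dict as in the second variant above
models = {
    'model_1': {'indicator_1': 0.65, 'indicator_2': 0.63, 'indicator_3': 0.88},
    'model_2': {'indicator_1': 0.83, 'indicator_2': 0.76, 'indicator_3': 0.93},
}

# Serialize the nested dict to a JSON string; downstream, KNIME (or a
# later Python node) can parse it back with json.loads
models_json = json.dumps(models)

# In the PySpark Script node, the single-column output would then be:
# resultDataFrame1 = spark.createDataFrame([{'dict_json': models_json}])
```

The round trip `json.loads(models_json)` returns the original nested structure, so nothing is lost to the string conversion.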

Cheers,
Sascha

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.