PySpark Script issue due to limited number of ports

Hello KNIME Support.

The limited number of PySpark Script output ports is causing a major issue for a project we are working on for a client.

The PySpark Script node provides at most 2 output ports, but the ML algorithm inside our PySpark Script currently produces 4 tables that we need to export through those ports.

Each table holds a constantly changing dataset, so the number of rows and columns, and their respective data types, also change constantly.

I tried joining two tables together so they could share one port, but a table can hold 20 million rows, so I ruled that out as an inefficient use of resources.
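For illustration, one way of packing two unrelated tables through a single port is to tag each row and serialize it to JSON (a simplified sketch, not our actual code; tableA and tableB are placeholders, and resultDataFrame1 is the PySpark Script node's usual output variable):

```python
from pyspark.sql import functions as F

# tableA / tableB are placeholders for two unrelated result tables.
packed_a = tableA.select(
    F.lit("tableA").alias("source"),
    F.to_json(F.struct(*tableA.columns)).alias("payload"),
)
packed_b = tableB.select(
    F.lit("tableB").alias("source"),
    F.to_json(F.struct(*tableB.columns)).alias("payload"),
)

# Both tables now share one schema and fit through a single port;
# downstream, filter on `source` and parse `payload` with from_json().
resultDataFrame1 = packed_a.unionByName(packed_b)
```

With around 20 million rows, that JSON round trip on top of the union is exactly the kind of extra work we want to avoid.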

Also, persisting the data to storage with code such as write.csv is not acceptable in the current customer environment.

Therefore, we need to export the 4 tables through the PySpark Script ports alone. Is there any effective way to do this?

We have spent too much time on this issue and still haven’t found a solution. I hope you can help me with this.

Your response will be greatly appreciated.

Hi @JaeHwanChoi,

Do you have some example code? I don’t know of a Spark ML algorithm that produces four different data frames. Maybe you can split your code and run it in individual nodes?

Cheers,
Sascha

Thank you for your response, @sascha.wolke.

It doesn’t matter which 4 datasets the Spark ML algorithm produces.

In other words, the task boils down to exporting 4 independent datasets through 2 ports.

As a simple example, imagine that the well-known sample datasets iris, titanic, mpg, and diamonds have to be mixed together and exported through two ports.
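To make the example concrete, the four datasets could be loaded into Spark like this (a minimal sketch; they are actually sample datasets shipped with the seaborn library rather than Python built-ins):

```python
import seaborn as sns
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

frames = {}
for name in ["iris", "titanic", "mpg", "diamonds"]:
    pdf = sns.load_dataset(name)
    # Unwrap pandas 'category' columns and replace NaN with None so
    # Spark can infer a clean schema for every column.
    for col in pdf.select_dtypes(include="category").columns:
        pdf[col] = pdf[col].astype(object)
    frames[name] = spark.createDataFrame(pdf.where(pdf.notnull(), None))

for name, df in frames.items():
    print(name, df.count(), len(df.columns))
```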

Unfortunately, the code cannot be split and used as individual nodes.

Hi @JaeHwanChoi,

You can use four different PySpark nodes and read one dataset per node. Then you have four nodes, each with a single dataset output.
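Roughly, each of the four nodes would then contain something like this (a sketch assuming the node’s usual dataFrame1 input and resultDataFrame1 output variables; build_tables is a hypothetical helper holding your shared pipeline code):

```python
# PySpark Script node 1 of 4: run the shared pipeline, output table 1.
# build_tables is a hypothetical helper containing the common ML code;
# it returns the four result DataFrames without materializing them.
table1, table2, table3, table4 = build_tables(dataFrame1)

resultDataFrame1 = table1  # nodes 2-4 assign table2, table3 and table4
```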

Cheers,
Sascha

Hi @sascha.wolke
Sorry for not making this clearer.

I mentioned those Python sample datasets only to illustrate the idea of four independent datasets.

This means that when the analysis code runs inside the PySpark Script node, there are four output tables to pull from, and those 4 tables are completely unrelated to one another, just like the sample datasets in the example above.

Using multiple PySpark Script nodes should be avoided: running two PySpark Script nodes and pulling two tables from each, which is one of the few alternatives, would be very resource-inefficient.

Hi @JaeHwanChoi,

From a Spark perspective, it does not make any difference whether you use one or four PySpark snippets. The code gets converted into a Spark-internal dependency graph and optimized anyway. You can see this if you connect multiple Spark nodes: the nodes execute very quickly, and only the last node that produces some output triggers the final execution and takes some time to finish. Spark calls this lazy evaluation.
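You can see the effect in a plain PySpark session as well (a minimal sketch, independent of KNIME):

```python
import time
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Transformations only extend the logical plan; nothing is computed yet.
t0 = time.time()
heavy = (
    spark.range(10_000_000)
    .withColumn("x", F.sqrt("id"))
    .filter(F.col("x") > 100)
)
print(f"plan built in {time.time() - t0:.4f}s")  # near-instant

# Only the action at the end triggers execution of the whole plan.
t0 = time.time()
print(f"rows: {heavy.count()}, computed in {time.time() - t0:.2f}s")
```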

Cheers,
Sascha

