I used Python's built-in datasets to illustrate the idea of four independent datasets.
This means that when you run the analysis code inside the PySpark Script node, there are four output tables to pull from. These four output tables contain unrelated data, just like the four Python built-in datasets in the example above.
You shouldn't use multiple PySpark Script nodes: one of the few alternatives would be two PySpark Script nodes pulling two tables each, but that is very resource-inefficient.
From a Spark perspective, it makes no difference whether you use one or four PySpark snippets. The code is converted into Spark's internal dependency graph and optimized either way. You can see this when you connect multiple Spark nodes: the intermediate nodes execute very quickly, and only the last node that produces output triggers the final execution and takes some time to finish. Spark calls this lazy evaluation.
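To make the lazy-evaluation idea concrete, here is a minimal plain-Python sketch, not the real Spark API: transformations only record a plan, and nothing is computed until an action is called, which mirrors why intermediate Spark nodes finish instantly while the last output-producing node does all the work. The `LazyDataset` class and its methods are made up for illustration.

```python
# Toy sketch of lazy evaluation (plain Python, NOT the actual Spark API).
# "Transformations" only record a plan; the "action" triggers execution.

class LazyDataset:
    def __init__(self, data, plan=None):
        self.data = data
        self.plan = plan or []  # recorded transformations, not yet executed

    def map(self, fn):
        # Transformation: returns instantly, nothing is computed yet.
        return LazyDataset(self.data, self.plan + [("map", fn)])

    def filter(self, pred):
        # Transformation: also just extends the plan.
        return LazyDataset(self.data, self.plan + [("filter", pred)])

    def collect(self):
        # Action: only now is the recorded plan executed, in one pass.
        rows = list(self.data)
        for kind, fn in self.plan:
            if kind == "map":
                rows = [fn(r) for r in rows]
            else:
                rows = [r for r in rows if fn(r)]
        return rows


ds = LazyDataset(range(6)).map(lambda x: x * 10).filter(lambda x: x > 20)
# Up to this point nothing has run -- like the fast intermediate Spark nodes.
print(ds.collect())  # the action does all the work: [30, 40, 50]
```

Real Spark behaves analogously: `map` and `filter` on a DataFrame or RDD build up the dependency graph, and an action such as `collect()` or writing an output triggers the optimized execution.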