How to reduce table conversion time for Spark-related nodes

@JaeHwanChoi where is this Spark process running? On your local machine or on a server?

You need to take into account that Spark works with [lazy evaluation](https://en.wikipedia.org/wiki/Lazy_evaluation). That means processing only starts once you actually ask for a result, like writing the data to a table. The other Spark nodes that seem very fast are just 'plans' until execution is triggered.
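To make the lazy behaviour concrete, here is a minimal PySpark sketch (plain Spark, outside of any KNIME node) where the transformations return instantly and only the final action kicks off the actual work:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lazy-eval-demo").getOrCreate()

# Transformations return immediately -- Spark only records a plan.
df = spark.range(10_000_000)                            # plan: generate ids
doubled = df.withColumn("double_id", F.col("id") * 2)   # plan: add a column
filtered = doubled.filter(F.col("double_id") % 3 == 0)  # plan: filter rows

# Nothing has been computed yet. Only an *action* (count, collect,
# write, ...) triggers execution of the whole accumulated plan.
print(filtered.count())

spark.stop()
```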

Using Spark will always come at an additional cost: the machine has to set up the context and start the workers. The benefit is that you can process very large amounts of data by using the memory of your (hopefully many) distributed servers. Performance depends very much on the resources you have in terms of RAM and CPU, and on how many …
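For reference, this is roughly how executor resources are set when a Spark session is created by hand; the values below are placeholders to show the knobs, not tuning advice:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("resource-demo")
    # Memory and cores per worker (executor) determine how much data
    # can be held and processed in parallel; instances sets how many
    # executors the cluster should start.
    .config("spark.executor.memory", "4g")
    .config("spark.executor.cores", "2")
    .config("spark.executor.instances", "4")
    .getOrCreate()
)
```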

You can also write the result to a big data system, e.g. a table managed with Hive or Impala, which would be a standard procedure.
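A rough sketch of that last step in plain PySpark, assuming a Hive-enabled session; `result_df` and the table name are hypothetical placeholders (Impala would see the table after a metadata refresh):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-write-demo")
    .enableHiveSupport()   # requires a configured Hive metastore
    .getOrCreate()
)

# Placeholder DataFrame standing in for your actual result.
result_df = spark.range(100).withColumnRenamed("id", "value")

# saveAsTable registers the data in the metastore, so other tools
# such as Hive or Impala can query it afterwards.
result_df.write.mode("overwrite").saveAsTable("results")

spark.stop()
```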