Hello KNIME Support
I would like to ask about the resource requirements of the PySpark Script node.
We currently have a workflow that runs ML analysis on big data, with two PySpark Script nodes used for preprocessing and modeling.
The workflow connects to Spark via Livy, but when I feed about 3 million rows into the simplest model, a Random Forest, and run it, the Spark driver uses more than 65 GB of memory (1 executor, 2 cores).
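For reference, the modeling script is essentially just a standard Random Forest pipeline like the sketch below (simplified; `input_df` and the column names are placeholders for the DataFrame and schema coming from the preprocessing node, not the exact code):

```python
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

# input_df is the Spark DataFrame handed over from the preprocessing PySpark Script node
# (placeholder name; the real script uses the node's input port variable).
assembler = VectorAssembler(inputCols=["feat_1", "feat_2", "feat_3"], outputCol="features")
train_df = assembler.transform(input_df)

# Plain Random Forest on roughly 3 million rows, close to default settings.
rf = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=100)
model = rf.fit(train_df)

# Predictions are passed on to the next node through the output port.
result_df = model.transform(train_df)
```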
Datasets larger than 3 million rows would therefore require a very large amount of memory; is it normal for this to use so many resources?
Or is so much memory used because the output of a PySpark Script node is not released when that node finishes, but is kept until the whole analysis ends?
Can anyone tell me why the driver is using so much memory, or is there a way to reduce it?
Or is there any reference for the approximate Spark resources needed per amount of data?
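To illustrate the question about outputs being kept: would explicitly releasing intermediate DataFrames at the end of each script be the intended way to free that memory? Something along these lines (this is just my guess at what applies; `cached_df` is an illustrative name, not from the actual workflow):

```python
# At the end of the preprocessing PySpark Script node, after the output DataFrame is produced.
# cached_df is an illustrative name for an intermediate result that was cache()d earlier.
cached_df.unpersist()

# Or, more aggressively, drop everything cached in this Spark session.
spark.catalog.clearCache()
```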
Any help would be greatly appreciated.