Need Help with Optimizing Spark to Load Data Faster from On-Premises SQL Server

Hello KNIME Community,

I am relatively new to using Spark and need some guidance on optimizing my data loading process. I have an on-premises SQL Server database, and I am trying to use Spark to load data from it faster. Here is what I have done so far:

  1. I created a local big data environment with Spark.
  2. Connected Spark to my SQL Server database (in code terms, roughly the JDBC read sketched below).
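
For reference, here is a minimal PySpark sketch of what that connection amounts to; the host, database, table, and credentials are placeholders, not my real setup:

```python
from pyspark.sql import SparkSession

# Minimal sketch only -- host, database, table, and credentials are
# placeholders. Assumes the Microsoft SQL Server JDBC driver jar is
# on Spark's classpath.
spark = SparkSession.builder.appName("sqlserver-load").getOrCreate()

df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://myhost:1433;databaseName=mydb")
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .option("dbtable", "dbo.MyTable")
    .option("user", "my_user")
    .option("password", "my_password")
    .load()  # single connection, whole table in one unpartitioned read
)
```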

However, I have noticed that the running time for loading data remains the same, whether I use Spark or not. Additionally, when dealing with large datasets, I sometimes encounter deadlock errors.

I would appreciate any advice on how to use Spark more effectively to load data faster from my SQL Server. Specifically, I am looking for:

  1. Best practices for optimizing Spark for data loading.
  2. Any specific configurations or settings in Spark that could help reduce the running time.
  3. Strategies to avoid deadlock errors when handling large datasets.

I have attached a screenshot of my workflow for your reference.

Thank you in advance for your assistance!

@AliGad Using Spark will not result in faster processing here, especially not if you use the Local Big Data Environment, which exists to demonstrate the possibilities and is not meant for production use.

Spark always adds a layer of overhead to the job, which only makes sense if you have a cluster of big data servers (like a Cloudera system) and your data is so large it cannot be processed otherwise.

You should explore options like extracting the data in chunks and applying WHERE clauses as early as possible to limit the amount of data pulled, as in the sketch below.
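
Spark's JDBC source can do both: push a WHERE clause down to SQL Server and split the extraction into parallel chunks. A minimal sketch, where the table name, key column, bounds, and credentials are assumptions for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("chunked-load").getOrCreate()

# Push the WHERE clause down to SQL Server so only the needed rows
# ever leave the database (the subquery alias is required by Spark).
query = (
    "(SELECT id, col_a, col_b FROM dbo.MyTable "
    "WHERE load_date >= '2024-01-01') AS src"
)

df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://myhost:1433;databaseName=mydb")
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .option("dbtable", query)
    .option("user", "my_user")
    .option("password", "my_password")
    # Split the extraction into parallel chunks along a numeric key;
    # the bounds here are made up and should match the real key range.
    .option("partitionColumn", "id")
    .option("lowerBound", "1")
    .option("upperBound", "10000000")
    .option("numPartitions", "8")
    # Rows fetched per round trip; larger values cut network overhead.
    .option("fetchsize", "10000")
    .load()
)
```

A lower numPartitions also means fewer concurrent sessions hitting SQL Server, which can reduce the lock contention behind the deadlock errors mentioned above.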

Another option could be streaming, so that only a limited amount of data is processed at a time, but faster.
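
If streaming here means processing only a bounded window of rows at a time, the same idea can be sketched outside Spark with plain keyset pagination; pyodbc, the table, and the process() handler below are assumptions for illustration, not part of the original workflow:

```python
import pyodbc

def process(rows):
    # Placeholder for per-chunk work (e.g. write to a file or another table).
    print(f"processed {len(rows)} rows")

# Hypothetical DSN-less connection string; adjust driver, server,
# and credentials to the real environment.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myhost;DATABASE=mydb;UID=my_user;PWD=my_password"
)
cursor = conn.cursor()

CHUNK = 50_000
last_id = 0
while True:
    # Keyset pagination: each round trip grabs the next slice by key,
    # so only a bounded window of rows is held in memory at a time.
    cursor.execute(
        "SELECT TOP (?) id, col_a, col_b FROM dbo.MyTable "
        "WHERE id > ? ORDER BY id",
        CHUNK, last_id,
    )
    rows = cursor.fetchall()
    if not rows:
        break
    last_id = rows[-1].id  # advance the window to the last key seen
    process(rows)

conn.close()
```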

