Optimize Database Writer execution by splitting into multiple operations

Hi colleagues,

I have noticed that when I use the Database Writer node to write a huge volume of data, the workflow gets really slow. The problem is manageable when the volume of data is smaller.

So I have an idea, but I don't know whether it is possible to do it in KNIME.

What if I split this single huge volume of data across more than one Database Writer operation, and then execute them in parallel?

This would consume more resources, but the execution time should become reasonably lower.

Can someone explain to me whether this is possible or not? And if it is possible, can you suggest an example workflow that shows how to do it in practice?


Thanks in advance,

-Giulio  

Hi Giulio,

I don't think that your proposed solution will give any noticeable speedup, due to the way the database nodes are currently implemented. Probably the easiest way to get a noticeable speedup in database write performance is to increase the 'Batch Size' in the Database Writer node. If you are using e.g. Amazon Redshift, note that they recommend an alternative way to import large volumes of data (the COPY command). It is also possible to use the Database SQL Executor node to bulk-load directly from e.g. a CSV file.
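
For example, if the target were MySQL, the statement you put into the Database SQL Executor could look roughly like this (just a sketch; the file path and table name are placeholders, the CSV would first be written with a CSV Writer node, and the server must have LOCAL INFILE enabled):

    LOAD DATA LOCAL INFILE '/tmp/knime_export.csv'  -- file written beforehand by the CSV Writer node
    INTO TABLE my_target_table                      -- placeholder target table
    FIELDS TERMINATED BY ','
    OPTIONALLY ENCLOSED BY '"'
    LINES TERMINATED BY '\n'
    IGNORE 1 LINES;                                 -- skip the header row of the CSV

A statement like that lets the database ingest the whole file in one bulk operation instead of row-by-row INSERTs, which is usually much faster for millions of rows.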

Best,

Jon 

Thank you @Jon for your reply,

I have already tried increasing the batch size in the advanced settings, but it does not noticeably improve the execution time. For writing 3-4 million rows we are talking about several hours.

What exactly do you mean by "import a CSV file with the Database SQL Executor node"?

As far as I know, you cannot import a file with that node; you can only enter a SQL statement.

If I use a CSV Writer node the execution is much faster than writing to the database, but in my case I need to write a table to a MySQL database.

Can you show me an example, please?

Thanks in advance,

-Giulio

Not sure if this is relevant, but when I tried to increase the batch size, the writer started skipping rows by that increment: for example, with 500,000 rows of data and a batch size of 100,000, I ended up with a table of only 5 rows in the actual DB. Any idea whether I need to set something up differently to make this work? My database is on Microsoft Azure, in case that helps.