Parallel SQL statement execution

I have been struggling to get both a small memory footprint and parallel SQL statement execution. Using KNIME to orchestrate ETL for many sources, I need to process in parallel to stay inside the time window, while at the same time keeping the memory usage of the KNIME instances down: memory consumed by KNIME/Eclipse cannot be used by the database for caching and other efficiency gains.

Using the jTDS driver to connect to SQL Server, there is a hard-coded limitation: ‘one user’ equals ‘one statement’ at a time. There seems to be no way around this limitation (please let me know if there is one). That means that if you want to process in parallel, you need to start multiple KNIME instances (Java VMs), one parallel statement per instance. This drives up the memory footprint, especially for large workflows.

For connecting to SQL Server, the Microsoft JDBC driver (sqljdbc_6.0) allows a work-around. This driver also pools connections to the same server, but it compares more than just the host and username. Simply adding a dummy parameter with a unique value to the connection string allows multiple connections, and therefore multiple parallel statements, to be executed. Now one workflow running in a single KNIME instance can execute its statements within the time window and with ‘low’ memory consumption.

Connection string: jdbc:sqlserver://<host>\<instance>:<port>;<other options>;x=<chunk-id>

Where <chunk-id> is a unique value that triggers a new connection to be made.
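
To illustrate the idea outside of KNIME, here is a minimal Java/JDBC sketch; the host, instance, credentials, table and the four-way chunk split are placeholders, and the dummy parameter name x just mirrors the URL pattern above. In plain JDBC every DriverManager.getConnection() call already returns its own connection, so the unique value only matters where a caching layer (such as KNIME's) would otherwise reuse one connection per user/URL.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ParallelChunks {

    // Base connection string; fill in <host>, <instance>, <port>, <db>.
    private static final String BASE_URL =
        "jdbc:sqlserver://<host>\\<instance>:<port>;databaseName=<db>";

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int chunkId = 0; chunkId < 4; chunkId++) {
            final int id = chunkId;
            pool.submit(() -> {
                // Unique dummy value per task, mirroring the URL pattern above;
                // this is what keeps a caching layer from handing back the same
                // connection for every task.
                String url = BASE_URL + ";x=" + id;
                try (Connection con = DriverManager.getConnection(url, "<user>", "<password>");
                     Statement stmt = con.createStatement();
                     ResultSet rs = stmt.executeQuery(
                         "SELECT COUNT(*) FROM <table> WHERE chunk_id = " + id)) {
                    if (rs.next()) {
                        System.out.println("chunk " + id + ": " + rs.getLong(1));
                    }
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
        }
        pool.shutdown();
    }
}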

Please reply to this post if other solutions are available.

Hi,

KNIME caches the database connection based on the user name, password and JDBC URL, which is why you can only execute one DB node at a time when connecting with the same user to the same database. Your workaround of using different JDBC parameters results in as many connections as you have distinct parameters, e.g. chunk_ids. The reason for the connection caching is to avoid creating a new connection every time a node is executed. However, this comes at the cost of having only one connection available right now. We are aware of this limitation and plan to support connection pooling in the future.

What you can already do, if the used database driver supports concurrent operations per transaction, is to enable this in KNIME by adding the following flag to the knime.ini file, which is located in the KNIME installation directory:

-Dknime.database.enable.concurrency=false
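
For orientation, a sketch of where such a property sits in knime.ini: JVM system properties (-D...) must appear after the -vmargs line, below any existing entries there. The -Xmx value shown is only an example; keep whatever your installation already has.

-vmargs
-Xmx2048m
-Dknime.database.enable.concurrency=false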

Bye

Tobias