Critical Problem with the DB Reader Node (2.9.0 - 2.9.2) with MySQL JDBC

Dear Knimers,

I recently ran into a problem with the DB Reader. It does not occur very often and it is quite hard to track down, but when it does occur, it breaks my production jobs.

Here is the situation: one of our batch jobs needs to read 4 tables from our MySQL database. The first and second of these tables are quite large (the queries return about 20 million rows x 15 columns). From time to time, the second Database Reader node fails with an error message like

"ERROR     KNIME-Worker-6 Database Reader     Execute failed: Communications link failure The last packet successfully received from the server was xxx,xxx milliseconds ago"

The xxx figures vary, but mostly correspond to between 400 and 2,000 seconds. What I have already tried:

- Connecting the Database Reader nodes via flow variable ports, so that each one only gets executed once the previous one has finished. This may have added some stability, but the nodes still fail from time to time.
- Increasing the database timeout in knime.ini: "-Dknime.database.timeout=600" (see the snippet after this list). Oddly, the millisecond readings in the error message are often less than 600 seconds.
- Upgrading to the latest version of KNIME
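
For reference, this is roughly what the relevant part of my knime.ini looks like; any other JVM options are omitted, and only the timeout line is what I actually changed:

```
-vmargs
-Dknime.database.timeout=600
```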

The problem seems to occur more often when the database is used by other KNIME processes at the same time, or when it is used interactively (not in a batch job). When I execute the queries directly on the database, they always run without a problem. Table locking is not the issue either, as I have already checked that. It is also not the typical MySQL "wait_timeout" problem, as that is set to 28,800 seconds. The tables in question are InnoDB tables.
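
For what it's worth, these are roughly the server-side checks I ran (the table name is just a placeholder for one of our large tables):

```sql
-- server-side timeouts (wait_timeout etc.)
SHOW VARIABLES LIKE '%timeout%';
-- look for blocked or long-running statements at the time of the failure
SHOW FULL PROCESSLIST;
-- confirm the storage engine of the table in question
SHOW TABLE STATUS LIKE 'big_table_1';
```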

I have three questions regarding this: 

1) Might this be connected to the behaviour of the connection pool (i.e., no automatic reconnect)? Would modifying the database URL in the reader by appending "?autoReconnect=true" help?
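
Just so we are talking about the same thing, the URL I would be changing looks roughly like this (host and schema names are placeholders):

```
jdbc:mysql://db-host:3306/my_schema?autoReconnect=true
```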

2) Is there any way to bypass the connection pool so that a new connection is established for every request in KNIME? I suspect my problem is not with having many connections, but with the connection pool.

3) On a related topic: why is parallel reading from database nodes not possible by default? Most database engines are built to handle concurrent reads quite well. In my case, not even queries using two different connections (master and slave) run in parallel.

Thanks in advance and best regards,

Dennis

1) I agree, autoReconnect=true does not seem to solve the problem. I read through some forums and the option appears to have some drawbacks. The timeout is set on the connection, and we rely on the implementation of the underlying JDBC driver.

2) Similar to the database timeout, one can specify a flag "-Dknime.database.enable.concurrency=true/false" in knime.ini which switches concurrent access to the same database connection on or off. The default is true, i.e. all database accesses are synchronized on a single connection; false means the access is not synchronized, which may lead to database errors.
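
For example, to switch the synchronization off, the following would go below the -vmargs line in knime.ini (use with care, see the caveat above):

```
-vmargs
-Dknime.database.enable.concurrency=false
```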

3) This is by design, and we leave it to the KNIME workflow developer to switch parallel execution on or off. Just to be on the safe side, we decided to enable this "feature" by default. One reason is that the Database Reader might be used in parallel loop constructs and executed in parallel, which does not work for some databases.

Hi Gabriel,

thanks for your reply. What can I do to get the DB Reader working again? Will further increasing the DB timeout help? What does this timeout mean exactly? I don't think it is the time the DB server is allowed to take to complete the query, because I have had queries that took much longer than 600 seconds and they completed correctly.

I am really stuck on this at the moment and would appreciate any help.

Cheers, Dennis 

Asking the internet about this error message tells me that it's a server issue and not a client problem. The suggested solution is to increase the value of the "net_write_timeout" parameter on the server (!).
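
If you have the necessary privileges, checking and raising it on the server would look something like this (the 3,600-second value is just an example; pick something that covers your longest-running reads):

```sql
-- current value (in seconds)
SHOW VARIABLES LIKE 'net_write_timeout';
-- raise it for new connections; note this is not persisted across a server restart
SET GLOBAL net_write_timeout = 3600;
```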

Hi Thor,

thank you very much for that hint - it worked and solved the problem for now. I have to admit I really wrangled Google to tell me a solution, but my Google-fu seems to have weakened a bit lately.

Cheers,

Dennis