thrift.transport.TTransportException BIG DATA EXTENSIONS

Hi all,

Sorry, I haven't found the big data subforum.

We are encountering a problem with the Hive Connector (same behaviour with the Database Connector using the Hive JDBC driver). We work with Windows 7 and a cluster built on Cloudera 5.4.

If you leave your connection inactive for a short while (changing the KNIME timeout parameter has no effect on this), you get an error when launching a query:

ERROR Database Reader 3:78 Execute failed: org.apache.thrift.transport.TTransportException: java.net.SocketException: Connection reset by peer: socket write error

1 - If you reset your Database Reader or Database Table Selector node, the error remains.

2 - If you reset the Database or Hive Connector and re-execute it, it turns green without any problem, but the error remains with the following node (Database Reader or Database Table Selector).

3 - The only solution that works: close and restart KNIME. Not very practical...

We tested this three-step process ten times, so the behaviour is systematic.

(And of course everything is fine in the meantime on our cluster, where we can run the same queries via Hue.)
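For what it's worth, the failure pattern looks like a server-side idle timeout killing the session while the client still holds a now-stale handle. The toy sketch below is plain Python with entirely hypothetical names (KNIME itself goes through the Hive JDBC driver, not this code); it only illustrates that pattern: a connection the server silently closes after idling, and a retry wrapper that recreates it instead of reusing the dead one.

```python
import time


class StaleConnectionError(Exception):
    """Stands in for thrift's TTransportException / SocketException."""


class FakeHiveConnection:
    """Toy stand-in for a HiveServer2 session that the server drops after
    IDLE_LIMIT seconds of inactivity (hypothetical, for illustration only)."""

    IDLE_LIMIT = 0.2  # seconds; real clusters would use minutes

    def __init__(self):
        self._last_used = time.monotonic()
        self.closed = False

    def execute(self, query):
        if self.closed or time.monotonic() - self._last_used > self.IDLE_LIMIT:
            # Mimics the error from the log above once the session is gone.
            self.closed = True
            raise StaleConnectionError(
                "Connection reset by peer: socket write error")
        self._last_used = time.monotonic()
        return "rows for: " + query


def execute_with_reconnect(conn, query, connect):
    """Run `query`; on a stale-connection error, open a fresh connection
    and retry once. Returns (result, connection actually used)."""
    try:
        return conn.execute(query), conn
    except StaleConnectionError:
        fresh = connect()  # recreate instead of reusing the cached handle
        return fresh.execute(query), fresh
```

The point is only that resetting downstream nodes cannot help: as long as the cached handle is reused, every query after the idle period hits the same dead socket.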

Does somebody have an idea?

Best regards,

Fabien

Hello Fabien,

If I understand you correctly, you can work with the database as long as you are constantly executing KNIME nodes, e.g. sending queries to the db, but as soon as you wait a certain time you encounter a "Connection reset by peer: socket write error"? If that is the case, it seems the connection gets invalidated by the Thrift server after a certain period of idleness. Do you have access to the Thrift server log file to get more information about the problem? Do you connect to a secured or unsecured cluster, e.g. with LDAP/Kerberos-based authentication enabled?

Restarting KNIME solves the problem temporarily since a new connection is created. Re-executing the Connector node does not solve the problem since the db connections are cached to speed up processing. Also, using several connector nodes won't solve the problem since only one connection per user name and database is used, regardless of the number of connectors. Currently the connection is only recreated if KNIME detects that it is no longer valid, which it doesn't seem to do correctly in your case.
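The caching behaviour described above can be sketched roughly like this (plain Python with hypothetical names, not KNIME's actual code): one connection per (user, database) key, validated before each reuse and recreated if dead. The failure you see would correspond to the validity check answering "still valid" for a connection the server has in fact already dropped.

```python
class ConnectionCache:
    """Minimal sketch of the caching behaviour described above: one
    connection per (user, database) key, validated before each reuse
    and recreated when invalid. All names are illustrative."""

    def __init__(self, connect):
        self._connect = connect  # factory: (user, db) -> connection
        self._cache = {}

    def get(self, user, db):
        key = (user, db)
        conn = self._cache.get(key)
        # If is_valid() wrongly reports True for a server-dropped
        # connection, the stale handle keeps being handed out.
        if conn is None or not conn.is_valid():
            conn = self._connect(user, db)
            self._cache[key] = conn
        return conn
```

This also explains why adding more connector nodes changes nothing: they all resolve to the same cache entry for the same user and database.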

Bye,

Tobias

Thanks for your reply, Tobias. I will ask our cluster guys for more details. I hope I can give you news soon.

Happy new year to you!

Hi all,

We haven't found a solution yet. The connection keeps dropping regularly. Has anybody encountered this problem with Cloudera before?

All our cluster timeout parameters are set to 0, so the connection theoretically never expires:

hive.server2.session.check.interval

• Default Value: 0ms
• Added In: Hive 0.14.0 with HIVE-5799

The check interval for session/operation timeout, which can be disabled by setting to zero or negative value.

hive.server2.idle.session.timeout

• Default Value: 0ms
• Added In: Hive 0.14.0 with HIVE-5799

With hive.server2.session.check.interval set to a positive time value, session will be closed when it's not accessed for this duration of time, which can be disabled by setting to zero or negative value.

hive.server2.idle.operation.timeout

• Default Value: 0ms
• Added In: Hive 0.14.0 with HIVE-5799

With hive.server2.session.check.interval set to a positive time value, operation will be closed when it's not accessed for this duration of time, which can be disabled by setting to zero value. With positive value, it's checked for operations in terminal state only (FINISHED, CANCELED, CLOSED, ERROR). With negative value, it's checked for all of the operations regardless of state.
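For reference, this is roughly how these settings appear in hive-site.xml (on a Cloudera cluster this file is normally managed by Cloudera Manager, so treat this as an illustrative fragment rather than something to edit by hand):

```xml
<!-- Idle-session handling in HiveServer2; 0 disables the periodic
     check, so sessions should in theory never be reaped for idleness. -->
<property>
  <name>hive.server2.session.check.interval</name>
  <value>0ms</value>
</property>
<property>
  <name>hive.server2.idle.session.timeout</name>
  <value>0ms</value>
</property>
<property>
  <name>hive.server2.idle.operation.timeout</name>
  <value>0ms</value>
</property>
```

Note that if something else in the path (a firewall, a load balancer, or the OS TCP keepalive settings) drops idle TCP connections, the symptom would look the same even with these HiveServer2 timeouts disabled.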

Best regards

Fabien

Hello Fabien,

thanks for the valuable information. We are currently working on improvements to the Hive integration which, among other things, will change the connection handling. The improved version will check for invalid/closed connections and recreate them if necessary.

Bye,

Tobias

Hi Tobias,

I have another question. We are starting to work with KNIME version 3. In order to avoid the problems we had before, could you recommend the Cloudera, Hive, and Spark versions that work best with the KNIME Big Data and Spark Executor components?

Best regards,

Fabien

Hello Fabien,

for the current KNIME version I would suggest using CDH 5.4 with Spark 1.3.

However, we will release a new version of the Spark Executor in May which will support Spark 1.5 and 1.6. We will also improve the database driver handling with the next KNIME Analytics Platform release in July to ease the use of the JDBC drivers provided by Cloudera with the Big Data Connector extension. So once all of this is in place, I would recommend using a more recent version, e.g. CDH 5.7.

Bye

Tobias

Thanks Tobias!