Parsing "table selector" query on Hive crashed HIVE Distribution

Hello,

I have a Problem with a HIVE table selector.
If I select this table with the table selector and execute the node, HIVE tried to parse a query as:

SELECT * FROM tablexyz

Parsing this query causes a HIVE Crash on our HIVE distribution because tablexyz is too big (at the moment 90 TB).

A workaround is to give a custom query with WHERE criteria and it works but the risiko that someone uses the Table Selector for this table is really big.

Is there a possible to avoid that table seletor on executing, starts a parse-request as
select * FROM tablexyz ?

I saw in this post

that there is a possibility with a “query optimizer” to remove select * FROM tablexyz. Is it also a work around ? Is the “query Optimizer” a node or an extension ? On the web I can’t find anything abut KNIME quer optimizer…

How is your experience with it ?

Thank you,
Roberto

One option that I see would be to encapsulate a Table selector node in a MetaNode or a component and offer the possibility to enter the table one wants to select via String Input or String Configuration. Then you could capture the name “tablexyz” or append a “LIMIT 100” after the table name.

But of course if someone would just use a table selector and manually select you forbidden table there is no obvious way to stop them. With Big Data you always have to be careful; if you try to download and select such a table you might break your system.

Hi,

this is also a workaround we follow. Only for your information a LIMIT condition doesn’t help because Hive tried anyway to parse the query on ALL partitions and crashed.

In order to avoid this Problem you must use a Criteria to reduce the Number of HIVE-Partitions which will be checked.

Thank you anyway for your help,
Roberto

Another idea could be that you would first check the available partitions and bring that as a Flow Variable to the Metanode and use that as a first filter. Typically that should reduce the load.

Also it does make sense to provide Hive with statistics about the table via Impala (yes it sounds strange but Hive can read but not create statistics). You could also just compute INCREMENTAL stats for (new) partitions because the computation of statistics also needs some power (especially if your tabel is as large as you mentioned).

Hello @tebaldir,
did you find a solution for your problem? I wonder why the DB Table Selector node causes HIVE to crash. When executing the node itself it only sends a meta data query a la “SELECT * FROM tablexyz LIMIT 0” to the cluster to get the column names and types but does not request any data rows. So ideally Hive should only look at the metadata and return it without starting any compute job.
To see all the queries that are send from KNIME to the DB you can enable the JDBC logging via the Advanced tab.
Bye
Tobias

1 Like

This topic was automatically closed 182 days after the last reply. New replies are no longer allowed.