I have a Problem with a HIVE table selector.
If I select this table with the table selector and execute the node, HIVE tried to parse a query as:
SELECT * FROM tablexyz
Parsing this query causes a HIVE Crash on our HIVE distribution because tablexyz is too big (at the moment 90 TB).
A workaround is to give a custom query with WHERE criteria and it works but the risiko that someone uses the Table Selector for this table is really big.
Is there a possible to avoid that table seletor on executing, starts a parse-request as
select * FROM tablexyz ?
I saw in this post
that there is a possibility with a “query optimizer” to remove select * FROM tablexyz. Is it also a work around ? Is the “query Optimizer” a node or an extension ? On the web I can’t find anything abut KNIME quer optimizer…
One option that I see would be to encapsulate a Table selector node in a MetaNode or a component and offer the possibility to enter the table one wants to select via String Input or String Configuration. Then you could capture the name “tablexyz” or append a “LIMIT 100” after the table name.
But of course if someone would just use a table selector and manually select you forbidden table there is no obvious way to stop them. With Big Data you always have to be careful; if you try to download and select such a table you might break your system.
Another idea could be that you would first check the available partitions and bring that as a Flow Variable to the Metanode and use that as a first filter. Typically that should reduce the load.
Also it does make sense to provide Hive with statistics about the table via Impala (yes it sounds strange but Hive can read but not create statistics). You could also just compute INCREMENTAL stats for (new) partitions because the computation of statistics also needs some power (especially if your tabel is as large as you mentioned).
did you find a solution for your problem? I wonder why the DB Table Selector node causes HIVE to crash. When executing the node itself it only sends a meta data query a la “SELECT * FROM tablexyz LIMIT 0” to the cluster to get the column names and types but does not request any data rows. So ideally Hive should only look at the metadata and return it without starting any compute job.
To see all the queries that are send from KNIME to the DB you can enable the JDBC logging via the Advanced tab.