Speeding up DB Reader

mlauber71 · August 25, 2022, 3:10pm

@r_jain Big Data systems are a special case, and most of the power must come from the big data system itself.

Does the table have partitions? Can you limit the query to certain partitions early in the process? Since query optimisation in Impala is very limited, this typically has the greatest effect. Put partition-based WHERE conditions first, followed by all other conditions that might apply.
Are the table statistics up to date? Missing table statistics are a major waste of resources for big data tables.
Is the data stored in many small files (maybe spread across several servers)? That might also hinder performance (there is the SHUFFLE option when creating big data tables).
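
To make the three checks above a bit more concrete, here is a minimal sketch in Python that only builds the SQL strings you could paste into a DB Query Reader or the Impala shell. The table name `sales_db.orders`, the partition columns `year` and `month`, and both helper functions are hypothetical and just for illustration. The statements themselves (SHOW PARTITIONS, SHOW TABLE STATS, COMPUTE STATS, SHOW FILES IN) are regular Impala SQL, but please check them against your Impala version.

```python
def build_partition_first_query(table, columns, partition_filters, other_filters=()):
    """Build a SELECT whose WHERE clause lists the partition predicates first.

    All names passed in are placeholders; this helper is hypothetical and
    does no escaping, so only use it with trusted input.
    """
    if not partition_filters:
        raise ValueError("expected at least one partition filter")
    # Partition predicates go first, then every other condition.
    conditions = list(partition_filters) + list(other_filters)
    return (
        f"SELECT {', '.join(columns)} "
        f"FROM {table} "
        f"WHERE {' AND '.join(conditions)}"
    )


def inspection_statements(table):
    """Return Impala statements to inspect partitions, statistics and files."""
    return [
        f"SHOW PARTITIONS {table}",   # which partitions exist
        f"SHOW TABLE STATS {table}",  # are statistics present (row counts, file counts)
        f"COMPUTE STATS {table}",     # refresh statistics if they are missing
        f"SHOW FILES IN {table}",     # spot many small files
    ]


if __name__ == "__main__":
    # Hypothetical table and columns, assumed to be partitioned by year and month.
    print(build_partition_first_query(
        "sales_db.orders",
        ["order_id", "amount"],
        partition_filters=["year = 2022", "month = 8"],
        other_filters=["amount > 100"],
    ))
    for statement in inspection_statements("sales_db.orders"):
        print(statement)
```

The helper keeps partition filters and other filters as separate arguments so it is harder to forget the partition restriction by accident.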

Then … this is not relevant when retrieving the data, but it is when exploring or writing code: you might try deactivating “retrieve in configuration” (Microsoft Access Connector Java Heap Space - #3 by mlauber71) in the Impala connector in order to speed up working with the DB nodes. The downside is that you might not always see the latest columns, so you would have to know the structure or retrieve it in advance.

I am not sure how familiar you are with big data concepts. I have collected a few functions here to play around with (thanks to Create Local Big Data Environment – KNIME Hub, it will also work if someone does not have a big data system at hand).