Speeding up DB Reader

I am trying to read in 10 million rows. However, the process is really slow, and I was wondering whether I could change something in the Impala connector, for example in the advanced settings, or elsewhere?
Any pointers?

Thanks
R

Also, is there a way to see the progress in % terms? I am running a GROUP BY query, so I don't know how many rows are going to be fetched. Is it possible to see how long the entire process is going to take, or how far it has progressed and how much is still pending, in terms of row count and time?

@r_jain Big Data systems are a special case, and most of the power must come from the big data system itself.

  • Does the table have partitions? If so, can you limit the query to certain partitions early in the process? Since the code optimisation in Impala is very limited, this typically has the greatest effect. Put partition-based WHERE conditions first, followed by all other conditions that might apply.
  • Are the table statistics up to date? Missing table statistics are a major waste of resources for big data tables.
  • Is the data stored in many small files (maybe spread across several servers)? That might also hinder performance. (There is the SHUFFLE option when creating big data tables.)
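
To make the first two points a bit more concrete, here is a minimal sketch in Python that only builds the SQL text you could paste into a DB Query Reader or run in impala-shell. The table `sales.events`, the partition column `year` and both helper function names are made up for illustration. `SHOW TABLE STATS` and `COMPUTE STATS` are regular Impala statements, but check with your admin before running `COMPUTE STATS` on a very large table.

```python
def partition_filtered_query(table, columns, partition_filters, other_filters=()):
    """Build a SELECT whose WHERE clause lists partition predicates first.

    Hypothetical helper: it only concatenates strings and does not
    validate or escape its inputs.
    """
    predicates = list(partition_filters) + list(other_filters)
    sql = f"SELECT {', '.join(columns)} FROM {table}"
    if predicates:
        sql += " WHERE " + " AND ".join(predicates)
    return sql


def stats_statements(table):
    """Return Impala statements to inspect and then refresh table statistics."""
    return [
        f"SHOW TABLE STATS {table}",  # inspect the statistics Impala currently has
        f"COMPUTE STATS {table}",     # recompute table and column statistics
    ]


# Example with assumed names: 'year' is taken to be a partition column.
query = partition_filtered_query(
    table="sales.events",
    columns=["id", "amount"],
    partition_filters=["year = 2023"],
    other_filters=["amount > 100"],
)
print(query)
for statement in stats_statements("sales.events"):
    print(statement)
```

The idea is simply that the partition condition restricts which files Impala has to scan before any other filter is applied, and that fresh statistics help the planner.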

Then … this is not relevant when retrieving the data, but it is when exploring or writing code: you might try deactivating “retrieve in configuration” (Microsoft Access Connector Java Heap Space - #3 by mlauber71) in the Impala connector in order to speed up working with the DB nodes. The downside is that you might not always see the latest columns, so you would have to know the structure or retrieve it in advance.

Not sure how familiar you are with big data concepts. I have collected a few functions here to play around with (thanks to Create Local Big Data Environment – KNIME Hub it will also work if someone does not have a big data system at hand)
