Slow Read DB Reader

Hi All,

I have 5.2 million rows in a table that I am trying to read and then write out to a single CSV.
The DB Reader has been running for 20 minutes and still hasn't made any progress. What can I do to speed up the process?

Hi @r_jain

That's more a database management question than a KNIME question, I'm afraid. Questions that come to mind:

  • What kind of db is it?
  • What is your query like? How many columns, are there any joins, etc.?
  • Are you on the same server as the db or connecting through a VPN, IPSec tunnel, etc.?
  • What's the memory allocation of the db?

From my experience with large datasets that join several tables, KNIME can appear to not be moving while it is actually "pre-reading" the data from the different tables and establishing the joins.


After that's done, it actually starts to load all rows, which is indicated by a row count when you hover over the Query Reader.


@r_jain Depending on your database, you could try reading the data in chunks and appending it to a CSV file, as sketched below.
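
Outside of KNIME, the idea looks roughly like this minimal Python sketch, assuming the impyla package and placeholder host/view names; inside KNIME the equivalent would be a loop that reads a slice at a time feeding a CSV Writer set to append.

```python
# Minimal sketch: read an Impala view in chunks and append each chunk to one CSV.
# Assumes the impyla package; host, port, and view name below are placeholders.
import csv
from impala.dbapi import connect

CHUNK_ROWS = 100_000

conn = connect(host="impala-host.example.com", port=21050)
cur = conn.cursor()
cur.execute("SELECT * FROM my_db.my_view")   # hypothetical view

with open("export.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(col[0] for col in cur.description)  # header from cursor metadata
    while True:
        rows = cur.fetchmany(CHUNK_ROWS)     # stream in chunks instead of fetchall()
        if not rows:
            break
        writer.writerows(rows)

cur.close()
conn.close()
```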


Thanks,

I have an Impala database.

I am fetching data from a view. It has 20-odd columns and there are no joins.
I am connecting through a VPN.
I was thinking of hosting the workflow on the KNIME Server instead, as that might speed up the process.
That usually did the trick when I was working with Alteryx.

Thanks!

Running on the KNIME server definitely helps if it has more resources available than your own workstation (and the db itself is not the bottleneck).

I had the same issue with querying views in SAP, which were a living nightmare to use. Based on some documentation, it appears Impala offers similar functionality. With the help of a subject matter expert, I eventually identified the SAP source tables that were feeding the view and queried those directly. It worked like a charm afterwards.

If possible, I would try a similar exercise to bypass the view.
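
If you want to see what the Impala view is built on, the view definition can usually be retrieved directly. A minimal sketch with impyla, reusing the hypothetical connection and view names from above (in Impala, SHOW CREATE TABLE also prints the CREATE VIEW statement when pointed at a view):

```python
# Minimal sketch: inspect an Impala view to find the source tables behind it.
# Connection details and the view name are placeholders.
from impala.dbapi import connect

conn = connect(host="impala-host.example.com", port=21050)
cur = conn.cursor()

# For a view, SHOW CREATE TABLE returns the CREATE VIEW ... AS SELECT ... statement,
# which names the underlying tables you could query directly instead.
cur.execute("SHOW CREATE TABLE my_db.my_view")
print(cur.fetchall()[0][0])

cur.close()
conn.close()
```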


What matters a lot is the location of your computer making the call relative to the database; in other words, latency.
In a LAN with a latency of 0.01 ms, things run really fast. Go to the cloud with 100 ms of latency and the same workload can easily run a thousand times slower.
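
A quick back-of-the-envelope calculation shows why. The numbers below are purely illustrative (5.2 million rows, a 100 ms round trip, two hypothetical fetch sizes) and assume your JDBC driver or connector lets you set a fetch/batch size:

```python
# Rough estimate: how much of the runtime is pure network latency?
# Illustrative numbers only: 5.2 million rows, 100 ms round trip over VPN/cloud.
rows = 5_200_000
latency_s = 0.100

for fetch_size in (100, 10_000):
    round_trips = rows / fetch_size
    waiting = round_trips * latency_s
    print(f"fetch size {fetch_size:>6}: ~{round_trips:,.0f} round trips, "
          f"~{waiting / 60:.0f} min spent on latency alone")

# fetch size    100: ~52,000 round trips, ~87 min spent on latency alone
# fetch size  10000: ~520 round trips, ~1 min spent on latency alone
```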


@r_jain With big data systems, a lot can depend on the structure. You might want to learn about the partitions, maybe iterate over them, and also make sure the statistics of the table are up to date.

Also, if you can filter the data, ideally on the partition columns, you should do so; see the sketch below.
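
As a rough illustration only (not a statement about how your system is set up), assuming an underlying partitioned source table and the same impyla-based setup as above, with hypothetical table and partition-column names:

```python
# Minimal sketch: refresh table statistics, then export one partition at a time.
# Table name, partition column ("load_date"), and connection details are placeholders.
import csv
from impala.dbapi import connect

conn = connect(host="impala-host.example.com", port=21050)
cur = conn.cursor()

# Keep table/column statistics current so the planner has accurate row counts.
cur.execute("COMPUTE STATS my_db.my_source_table")

# Enumerate the partition values (here a hypothetical partition column load_date).
cur.execute("SELECT DISTINCT load_date FROM my_db.my_source_table")
partitions = [row[0] for row in cur.fetchall()]

with open("export.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for i, part in enumerate(partitions):
        # Filtering on the partition column lets Impala prune all other partitions.
        # Plain string formatting is acceptable here only because the values
        # come straight out of the table itself.
        cur.execute(f"SELECT * FROM my_db.my_source_table WHERE load_date = '{part}'")
        if i == 0:
            writer.writerow(col[0] for col in cur.description)
        writer.writerows(cur.fetchall())

cur.close()
conn.close()
```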
