Fetch huge MySQL table

Hi, I was wondering if someone could suggest a smart way to fetch a huge MySQL table with the Database Connection Table Reader node.

Maybe there is a good workaround to fetch a specific chunk of rows at a time?

Thanks in advance.

Please search the forum for “fetch size”

Thanks @quaeler for the reply,

you mean something like -Dknime.database.fetchsize=XXXX?

Ya - if you add that option with a larger value to your knime.ini file and then restart KNIME, it will attempt to pull that many rows in each fetch transaction.
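For illustration, the knime.ini entry could look like this (10000 is just an example value; tune it to your driver and memory budget):

-Dknime.database.fetchsize=10000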

It’d be helpful to know what you’re actually doing with the data. Do you really need all of it in a table at once? Otherwise: Could you just page/stream over portions?

– Philipp

Hi @qqilihq, yeah, paging/streaming would also be fine. The point, more or less, is that I don’t want the workflow to fail by querying all the data at once. Can you show me an example workflow that queries per page or chunk as a workaround?

The problem occurs when I try to process the data (let’s say 5 million rows) flowing from the Database Table Selector node into the Database Connection Table Reader node.

Sorry, I don’t have any workflow I could share publicly, but I did something similar a while ago, and the concept was to build the DB query dynamically within a Flow Variable and pass it to the corresponding Database Query node. Depending on your data, the query could e.g. be as simple as:

SELECT * FROM table 
ORDER BY id 
LIMIT $$OFFSET$$, 100

Use e.g. a loop and update $$OFFSET$$ in each iteration; break the loop when there’s no more data (see the sketch below). Process all the data within the loop and do not return it (otherwise there’ll obviously be no benefit, as you’d end up with a huge table nonetheless).
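To make the loop concrete, here is a sketch of the queries such a setup would issue, assuming a page size of 100 and a loop construct (e.g. Counting Loop Start plus Variable Condition Loop End) that exposes $$OFFSET$$ as a flow variable; the node choice and page size are just examples:

-- iteration 1: $$OFFSET$$ = 0
SELECT * FROM table ORDER BY id LIMIT 0, 100
-- iteration 2: $$OFFSET$$ = 100
SELECT * FROM table ORDER BY id LIMIT 100, 100
-- iteration 3: $$OFFSET$$ = 200
SELECT * FROM table ORDER BY id LIMIT 200, 100
-- ...and so on, until an iteration returns fewer than 100 rows

One caveat: MySQL still scans past all the skipped rows for each LIMIT offset, count query, so later pages get progressively slower. If id is indexed, a keyset variant along the lines of SELECT * FROM table WHERE id > $$LAST_ID$$ ORDER BY id LIMIT 100 avoids this ($$LAST_ID$$ being a hypothetical flow variable holding the largest id of the previous chunk).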

However, you say …

don’t want the workflow to fail by querying all the data at once

I’m not sure I understand correctly, but if you have very strict atomicity constraints, this approach is probably not the right one, as you’ll obviously be doing multiple queries in sequence.

Still, I’m not sure whether this is going in the right direction; to give better advice, it would be necessary to understand what kind of operations you’re actually performing on that data.
