Incremental/online learning?


Well - we do use KNIME on 17 million+ rows of 100+ column data. Since KNIME does not hold all data in memory, this is not a problem.
However, that's not what you are asking, I guess :-)

We have no updating mechanisms just yet. I guess some of the features that the new workflow manager will add should make this easier to do, but I do not see anything in our node pipeline that would pick this up anytime soon...
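Just to make concrete what such an updating mechanism would have to support: a model would need an update step that folds in one row at a time instead of retraining on the whole table. A minimal, purely illustrative sketch in plain Java (not KNIME API, and the class name is made up), using Welford's online algorithm for mean and variance:

// Hypothetical illustration of an incremental ("online") update step.
// This is NOT KNIME API - just the shape of the problem being asked about.
public final class OnlineStats {
    private long n = 0;
    private double mean = 0.0;
    private double m2 = 0.0;   // running sum of squared deviations from the mean

    /** Fold a single new value into the running statistics. */
    public void update(final double x) {
        n++;
        double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean);
    }

    public double mean() { return mean; }

    public double variance() { return n > 1 ? m2 / (n - 1) : 0.0; }
}

A node built around something like this could refresh its model as new rows arrive, rather than re-executing on the full table - which is exactly the part we do not support yet.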

Sorry. Send over a few more PhD students - eh: KNIME developers and we can talk ;-)

Hi Michael,

Thanks for your prompt responses as usual! If anybody else out there has experience with this, I'd like to hear from you as well.

berthold wrote:
Well - we do use KNIME on 17 million+ rows of 100+ column data. Since KNIME does not hold all data in memory, this is not a problem.
However, that's not what you are asking, I guess :-)

When you say that KNIME doesn't hold data in memory, do you mean that no node requires all of the data to be in memory at once?

Best Regards,

Jay

Well, as you can see from the API, you are not allowed random access to your input data - you can only iterate over the data table. We cannot, of course, keep anyone from simply iterating over all the data and storing it in their own internal array, but we heavily discourage this. KNIME tries to keep small data tables in memory and caches them to your HD once a table becomes too large. There are a couple of mechanisms in place to keep the data between two nodes small - for instance, if you add a column, the new cached table will reference the original input table and only store the new data in its own cache.
This also allows us to load a (partially) executed workflow with all the data in place, btw.
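For anyone who wants to see what this looks like from inside a node, here is a rough sketch of an execute() method, assuming the public node API (BufferedDataTable, ColumnRearranger, SingleCellFactory). Treat the exact signatures as my assumptions rather than a compile-checked example; it is meant to show the access pattern, not a specific KNIME version:

// Sketch of the access pattern described above, assuming the public node API.
import org.knime.core.data.DataCell;
import org.knime.core.data.DataColumnSpecCreator;
import org.knime.core.data.DataRow;
import org.knime.core.data.DoubleValue;
import org.knime.core.data.container.ColumnRearranger;
import org.knime.core.data.container.SingleCellFactory;
import org.knime.core.data.def.DoubleCell;
import org.knime.core.node.BufferedDataTable;
import org.knime.core.node.ExecutionContext;

public class SketchNodeModel /* extends NodeModel */ {

    protected BufferedDataTable[] execute(final BufferedDataTable[] inData,
            final ExecutionContext exec) throws Exception {
        final BufferedDataTable in = inData[0];

        // 1) Forward iteration is the only access path - there is no
        //    random access, so the framework can stream rows from its
        //    disk cache instead of holding the whole table in memory.
        double sum = 0;  // stands in for whatever the node actually computes
        for (DataRow row : in) {
            DataCell c = row.getCell(0);
            if (!c.isMissing()) {
                sum += ((DoubleValue) c).getDoubleValue();
            }
            exec.checkCanceled();  // stay responsive to cancel requests
        }

        // 2) Appending a column via a ColumnRearranger: the returned table
        //    references the original input table and caches only the new cells.
        ColumnRearranger r = new ColumnRearranger(in.getDataTableSpec());
        r.append(new SingleCellFactory(new DataColumnSpecCreator(
                "doubled", DoubleCell.TYPE).createSpec()) {
            @Override
            public DataCell getCell(final DataRow row) {
                DataCell c = row.getCell(0);
                return c.isMissing() ? c
                        : new DoubleCell(2 * ((DoubleValue) c).getDoubleValue());
            }
        });
        BufferedDataTable out = exec.createColumnRearrangeTable(in, r, exec);
        return new BufferedDataTable[]{out};
    }
}

The ColumnRearranger route in step 2 is what makes the reference trick possible: the framework knows only one column is new, so only those cells end up in the new table's cache.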

Does this make sense? We are working on a technical report describing these internals in more detail... Nerdy as we are, new features have slightly higher priority, unfortunately ;-)