Is "Distribution" something special in KNIME?

Edlueze · December 5, 2014, 5:10am

I don't know what to make of the latest problem I encountered. I created a simple class called "Distribution" into which I wanted to load column data from the InPort. I then used a RowIterator to iterate over each row and extract the data. Everything looked fine until the performance of the RowIterator suddenly plummeted to a ridiculously slow pace (though it still produced the correct output). Out of pure frustration, I came up with this test code:

//Iterate down InPort[1] -- 10,000 rows takes 2 seconds
RowIterator iterator_01 = inData[1].iterator();
while( iterator_01.hasNext() ) {
    DataRow row = iterator_01.next();
}

m_distributions.put("MyDistribution", new Distribution() );

//Iterate down InPort[1] again -- 10,000 rows now takes 5 minutes!!!
RowIterator iterator_02 = inData[1].iterator();
while( iterator_02.hasNext() ) {
    DataRow row = iterator_02.next();
}

public final class Distribution {
    private String m_name;
    private int m_index;
    ... more Strings, ints, doubles, and a single double[~10000]
}

[Actually it's not quite this simple, as the Distribution was the value in a HashMap.]

After a lot of banging-head-against-wall I finally changed the class name from "Distribution" to "CustomerDistribution". Suddenly everything was back to normal! But I'm baffled: there don't seem to be any points of commonality between my class and KNIME. Is there something happening deep inside KNIME or Java that is saying "hey - I wonder if this 'Distribution' has something to do with that 'Distribution' - I'd better spend some time checking"?

While I'm here, I might as well raise a second question. I am doing a lot of column-wise data crunching, which is why I'm loading all of the data into my own Distribution structures. But I worry that this is not very KNIME-like as it means that I am largely ignoring the power of KNIME until I want to push my results to the OutPort. I tried to find samples of column-wise data manipulation in KNIME, but when I found that a lot of the statistics in KNIME are calculated in the same way (that is, first pull out all the data into a separate array and then work on it) I concluded that KNIME is only designed for row-wise data manipulation.

Is KNIME also good at doing column-wise data manipulation?

wiswedel · December 11, 2014, 11:34am

Hi,

The 2s vs. 5min difference could be due to the data getting swapped to disc in between (probably because your Distribution object pushed memory usage over a threshold at which KNIME tables are cached out to disc). KNIME prints some information to the log file when that happens; here is an example:

2014-12-07 06:39:38,132 DEBUG Service Thread MemoryObjectTracker : Low memory encountered. Used memory: 998 MB; maximum memory: 1 GB.
2014-12-07 06:39:38,146 DEBUG KNIME-Memory-Cleaner MemoryObjectTracker : Trying to release 41 memory objects

An iterator over an in-memory table is significantly faster than one reading from the file cache. This is completely transparent to your implementation: you won't know how the table is backed (unless you check the concrete class of the iterator, which doesn't seem to be good coding practice).

The class name is not relevant - the differences you noticed with "Distribution" vs "CustomerDistribution" were just coincidence, I think.
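
One way to test the memory explanation instead of the class-name one is to print heap usage right before and right after the Distribution object is created, and compare it with the maximum heap. The following is only a minimal sketch in plain Java: the class HeapProbe and its methods are made-up names for illustration and are not part of KNIME, and the threshold at which KNIME starts releasing tables is not something this code knows about.

```java
/** Hypothetical helper (not a KNIME class) that reports Java heap usage. */
public final class HeapProbe {

    private HeapProbe() {
    }

    /** Bytes currently in use on the heap: allocated heap minus its free part. */
    public static long usedBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Used heap as a fraction of the maximum heap the JVM may grow to. */
    public static double usedFraction() {
        return (double) usedBytes() / Runtime.getRuntime().maxMemory();
    }

    /** One-line summary, e.g. for printing before and after an allocation. */
    public static String report(String label) {
        long mb = 1024L * 1024L;
        return String.format("%s: %d MB used of %d MB max (%.0f%%)",
                label,
                usedBytes() / mb,
                Runtime.getRuntime().maxMemory() / mb,
                100.0 * usedFraction());
    }

    public static void main(String[] args) {
        System.out.println(report("before allocation"));
        // Stand-in for the large array held by the Distribution object.
        double[] values = new double[10000];
        System.out.println(report("after allocation"));
        // Use the array so it cannot be optimized away.
        System.out.println("array length: " + values.length);
    }
}
```

If the "after" figure is close to the maximum and the log shows the MemoryObjectTracker lines above at the same time, the slowdown is most likely the table being read back from disc, whatever the class is called.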

Your second question: The KNIME table iterator is row-based so while you iterate it's also reading bytes associated with other columns (though it doesn't mean it's interpreting them).

To my knowledge none of KNIME's statistics nodes require the data to be in memory (it is not read into an array or the like). It would be quite a limitation if, for instance, a regression learner required the data to fit into memory. Many of those nodes make use of org.knime.base.node.viz.statistics2.Statistics3NodeModel (or, with 2.11, org.knime.base.data.statistics.StatisticCalculator).
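
To illustrate the general idea of column statistics without an in-memory copy, here is a minimal sketch of a single-pass accumulator (Welford's running mean and variance, plus min and max). It is plain Java and is not how the KNIME classes above are implemented: RunningColumnStats is a hypothetical name, and rows are modelled as double[] purely so the example runs on its own. In a node you would feed it from the RowIterator instead, for example with ((DoubleValue) row.getCell(columnIndex)).getDoubleValue() after checking for missing cells.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Hypothetical single-pass statistics for one column (not a KNIME class). */
public final class RunningColumnStats {

    private long count;
    private double mean;
    private double m2; // running sum of squared deviations from the mean
    private double min = Double.POSITIVE_INFINITY;
    private double max = Double.NEGATIVE_INFINITY;

    /** Folds one value into the statistics; nothing else is stored. */
    public void add(double x) {
        count++;
        double delta = x - mean;
        mean += delta / count;
        m2 += delta * (x - mean);
        if (x < min) {
            min = x;
        }
        if (x > max) {
            max = x;
        }
    }

    public long count() { return count; }
    public double mean() { return count > 0 ? mean : Double.NaN; }
    public double min() { return min; }
    public double max() { return max; }

    /** Sample variance (divides by n - 1); NaN for fewer than two values. */
    public double variance() {
        return count > 1 ? m2 / (count - 1) : Double.NaN;
    }

    /** Walks the rows once and accumulates the chosen column. */
    public static RunningColumnStats ofColumn(Iterator<double[]> rows, int columnIndex) {
        RunningColumnStats stats = new RunningColumnStats();
        while (rows.hasNext()) {
            stats.add(rows.next()[columnIndex]);
        }
        return stats;
    }

    public static void main(String[] args) {
        // Made-up sample rows with two columns each.
        List<double[]> rows = Arrays.asList(
                new double[] {1.0, 10.0},
                new double[] {2.0, 20.0},
                new double[] {3.0, 30.0},
                new double[] {4.0, 40.0});
        RunningColumnStats stats = ofColumn(rows.iterator(), 1);
        System.out.println("mean=" + stats.mean() + " variance=" + stats.variance());
    }
}
```

Memory use stays constant regardless of the number of rows, so several columns can be handled in the same pass by keeping one accumulator per column. Statistics that need the sorted data, such as an exact median, do not fit this pattern.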

Hope this helps!

Bernd