performance question

Hi,

I'm developing a node which usually deals with large data tables.

An example table has 500,000 rows x 100 columns of double values.

The analysis needs to be executed column-wise. My first implementation iterates over the whole table once for each column to collect and process that column's data. This takes time, and I was wondering whether there is a more elegant way. If I collected the data for processing in one go (iterating over the table only once), it might be faster, but then I'm afraid of running into memory problems, because I would have to duplicate the data in a format that allows it to be processed afterwards. The first solution only duplicates the data of one column at a time.
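To give an idea, here is a simplified sketch of that first, one-column-per-pass approach, assuming the usual KNIME row iteration over a BufferedDataTable; the class name and the processColumn(...) hook are only placeholders for the actual analysis, and missing cells are ignored for brevity:

import org.knime.core.data.DataRow;
import org.knime.core.data.DoubleValue;
import org.knime.core.node.BufferedDataTable;

import java.util.ArrayList;
import java.util.List;

public final class PerColumnPass {

    static void analyse(final BufferedDataTable table) {
        final int nCols = table.getDataTableSpec().getNumColumns();
        for (int col = 0; col < nCols; col++) {
            // one full pass over the table for every single column
            final List<Double> values = new ArrayList<Double>();
            for (final DataRow row : table) {
                values.add(((DoubleValue)row.getCell(col)).getDoubleValue());
            }
            processColumn(col, values);
        }
    }

    private static void processColumn(final int col, final List<Double> values) {
        // placeholder for the actual column-wise analysis
    }
}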

Any suggestion for a rather fast but memory-safe execution?

Unfortunately, no elegant solution comes to my mind. Instead of deciding between the extremes "one column at a time" and "all columns at once", you could allow a variable number of columns to be processed per pass. The number of passes could then either be set by the end user, or you could try to estimate the optimal number from the available memory.
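A rough sketch of that middle ground, assuming a configurable block size; the class name and the processColumn(...) hook are only placeholders for your analysis, and only the columns of the current block are buffered in memory:

import org.knime.core.data.DataRow;
import org.knime.core.data.DoubleValue;
import org.knime.core.node.BufferedDataTable;

import java.util.ArrayList;
import java.util.List;

public final class BlockwiseColumnPass {

    static void analyse(final BufferedDataTable table, final int columnsPerPass) {
        final int nCols = table.getDataTableSpec().getNumColumns();
        for (int first = 0; first < nCols; first += columnsPerPass) {
            final int last = Math.min(first + columnsPerPass, nCols);
            // buffers for the columns of the current block only
            final List<List<Double>> block = new ArrayList<List<Double>>();
            for (int c = first; c < last; c++) {
                block.add(new ArrayList<Double>());
            }
            // one pass over the table per block of columns
            for (final DataRow row : table) {
                for (int c = first; c < last; c++) {
                    block.get(c - first).add(
                            ((DoubleValue)row.getCell(c)).getDoubleValue());
                }
            }
            for (int c = first; c < last; c++) {
                processColumn(c, block.get(c - first));
            }
        }
    }

    private static void processColumn(final int col, final List<Double> values) {
        // placeholder for the actual column-wise analysis
    }
}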

In principle, you could simply transpose your input data (i.e., look at the code of the Transpose node), but I am not sure whether a BufferedDataTable with 500,000 columns would be efficient. What about implementing your own column-oriented data structure in external memory? Just open 100 temporary files (one for each column) and write the input table to these files in one pass. Then you can access all columns/files as you like. It might be a lot of work, but it should certainly be useful for other nodes as well.
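A minimal sketch of that idea using plain java.io streams, one temporary file per double column; class and method names are only placeholders, all cells are assumed to be non-missing DoubleValues, and error handling/cleanup beyond deleteOnExit is omitted:

import org.knime.core.data.DataRow;
import org.knime.core.data.DoubleValue;
import org.knime.core.node.BufferedDataTable;

import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

public final class ColumnFileStore {

    /** Writes every column of the table to its own temporary file in a single pass. */
    static File[] writeColumns(final BufferedDataTable table) throws IOException {
        final int nCols = table.getDataTableSpec().getNumColumns();
        final File[] files = new File[nCols];
        final DataOutputStream[] out = new DataOutputStream[nCols];
        for (int c = 0; c < nCols; c++) {
            files[c] = File.createTempFile("column" + c + "_", ".bin");
            files[c].deleteOnExit();
            out[c] = new DataOutputStream(
                    new BufferedOutputStream(new FileOutputStream(files[c])));
        }
        for (final DataRow row : table) {
            // single pass over the input: fan each row out to the column files
            for (int c = 0; c < nCols; c++) {
                out[c].writeDouble(((DoubleValue)row.getCell(c)).getDoubleValue());
            }
        }
        for (final DataOutputStream o : out) {
            o.close();
        }
        return files;
    }

    /** Reads one column back into memory for processing. */
    static double[] readColumn(final File file, final int rowCount) throws IOException {
        final double[] values = new double[rowCount];
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(file)))) {
            for (int i = 0; i < rowCount; i++) {
                values[i] = in.readDouble();
            }
        }
        return values;
    }
}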