Is there any limit on how many rows a DataTable can hold?

We built a custom reader that reads the stream output of a command-line execution. Each line is stored as a row in the DataTable. But if my reader reads more than 1,000,000 rows, I get an error from the org.knime.core.data.container.Buffer class - I think it came from the getFileName() method - saying IndexOutOfBoundsException.

I also notice that when I run my node, up to 1 million files are created in my Documents and Settings directory (I use Windows XP). All of these are gz files of about 1K in size that contain the content from the stream output.

Is there a limit set for DataTable? Or is there something wrong in my implementation? I would like to be able to run more than one million records.

Thanks for any comments.

Josh

There is no limit on the number of rows - as long as your hard drive has enough space, that is. I am puzzled by your report of one file being created for each row, however. That should not happen. Do these rows hold BLOB objects by any chance (molecules...) or only simple DataCells (doubles, ints...)?

PS: We have run tables of about 20 million rows before, so we know it can work. There must be something specific in your setup that causes this problem.

- Michael

The rows do hold a BLOB object (PffCell, yes, a molecule) that extends BlobDataCell.
Here is a snippet of the code:

// one output column holding the PffCell (BLOB) values
DataColumnSpec[] allColSpecs = new DataColumnSpec[1];
allColSpecs[0] = new DataColumnSpecCreator("PCF or PFF", PffCell.TYPE).createSpec();
DataTableSpec outputSpec = new DataTableSpec(allColSpecs);
BufferedDataContainer container = exec.createDataContainer(outputSpec);

// read the stream output line by line; each line becomes one row
BufferedReader br = new BufferedReader(new InputStreamReader(is));
String line;
int rowCount = 0;
while ((line = br.readLine()) != null) {
    RowKey key = new RowKey("Row " + rowCount);
    DataCell[] cells = new DataCell[]{ new PffCell(line) };
    container.addRowToTable(new DefaultRow(key, cells));
    rowCount++;
}

br.close();
container.close();
DataTable dataT = container.getTable();

Ah, then this could be a problem with the BLOB containers - Bernd, the master of the BLOBs in KNIME, is out of town right now (I know, I know, never let people take off before they have fixed all bugs...), so I'd like to ask you to be patient until Wednesday. Hope that's ok; otherwise I can look into this as well, but I have a feeling it'll take me 100 times as long and I will have my head chopped off when Bernd sees my "fix"...

- Michael

Thanks Michael.

For now, I will just replace the BLOB object (the custom molecule PffCell type) with a String object (StringCell type). When StringCell is used, there is indeed no limit on how many rows the reader can process. Interestingly, the read rate is also about 2 times as fast as with PffCell.

-Josh

Hi Josh!

You can also change (for now) the implementation of your PffCell to extend DataCell instead of BlobDataCell - that way the cells will be treated the same way as other cells such as StringCell, and none of the underlying blob files will be created.
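
Roughly, the change is just the superclass. This is only a sketch - the real PffCell internals and its serializer are not shown here, and implementing StringValue plus the m_content field are assumptions based on the cell being built from a line of text:

import org.knime.core.data.DataCell;
import org.knime.core.data.DataType;
import org.knime.core.data.StringValue;

// was: public class PffCell extends BlobDataCell implements StringValue {
public class PffCell extends DataCell implements StringValue {

    public static final DataType TYPE = DataType.getType(PffCell.class);

    private final String m_content;

    public PffCell(final String content) {
        m_content = content;
    }

    public String getStringValue() {
        return m_content;
    }

    public String toString() {
        return m_content;
    }

    protected boolean equalsDataCell(final DataCell dc) {
        return m_content.equals(((PffCell)dc).m_content);
    }

    public int hashCode() {
        return m_content.hashCode();
    }
}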

However, note that once you change this back, old workflows and their data files won't load correctly anymore, so you will only be able to load the workflow structure but not (partially) executed flows with their data.

A factor of two is not so bad, considering that we need to create an extra storage "container" for each cell. You will notice a massive benefit, however, once you pipe this kind of data through a longer series of nodes - the blob cells will really only be created once, which is not (and cannot be) guaranteed for normal data cells. So if you compare the normal DataCell implementation with a blob-based one on a longer pipe, you should see an improvement both in terms of speed and hard-drive space.
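
To illustrate the point (a sketch only; the names exec and inTable are assumed): a downstream node that simply copies rows into its own container only keeps references to blob cells, while ordinary cells may be written out again into each node's buffer.

BufferedDataContainer out = exec.createDataContainer(inTable.getDataTableSpec());
for (DataRow r : inTable) {
    // with BlobDataCell-based cells only a reference is stored here;
    // the heavy payload is not rewritten for every node in the pipe
    out.addRowToTable(r);
}
out.close();
BufferedDataTable result = out.getTable();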

Cheers, Michael

Thanks again Michael!

Yes, I can even retain the Pff type and the gif associated with it if I extend DataCell. Performance-wise, StringCell and PffCell (extending DataCell) are basically the same. As for DataCell vs. BlobDataCell, I did some more extensive benchmarking today - the rate is actually about 3.5 to 1 instead of the 2-fold I reported earlier.

Good to know that BlobDataCell can be "reused" in longer pipes. I will definitely try that out later.

Cheers,

-Josh

Hi Josh,

there is (eh, was) indeed a bug preventing the user from creating more than 999,999 blobs per column. The problem is caused by the fact that we do not put more than 1,000 files per directory (if you go into your temp directory and look for a folder called 'knime_container_200707ab_xyz', you will find the blob files put into separate sub-folders); the method that assembles the directory path had a bug.
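
For illustration only - this is not the actual Buffer.getFileName() code, just a sketch of the kind of bucketed layout described above - the blob index is split into 1,000-sized buckets, so a mistake in assembling that path is enough to break indices of 1,000,000 and above:

// sketch of a 1,000-files-per-directory layout, e.g. 001/234/567.gz
static String blobPath(final int index) {
    int top = index / 1000000;        // third level, needed beyond 999,999
    int mid = (index / 1000) % 1000;  // which 1,000-file bucket
    int leaf = index % 1000;          // position inside the bucket
    return String.format("%03d/%03d/%03d.gz", top, mid, leaf);
}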

This bug can be fixed relatively easily (it's just one file, with no dependencies on other files). If you wish to continue using blobs right now, I can make that fix available to you (I would then send you the .class files, which you would use to replace the corresponding file within a jar in your KNIME installation). Otherwise, if there is no urgent need right now, I suggest waiting until the next release. The next version of KNIME will also fix some other problems regarding blobs: resetting nodes that produce blob cells will speed up considerably (the deletion of those individual cell files takes quite some time, even though it is a system call).

Thanks for the problem report!

Best regards,
Bernd

Hi Bernd,

Thanks for the answer. No, there is no urgent need for us to use BlobDataCell right now. We can wait for your next release to use blobs for large datasets.

Thanks,

-Josh