XLS Reader uses too much memory

Hi all,

I tried to read a 65MB xls file with the XLS Reader and I got an error (java.lang.OutOfMemoryError: Java heap space).

It seems that the XLS Reader reads the file twice. After the first read, all sheets are available in the config dialog. The XLS Reader then tries to read the selected sheet and reads the whole file again (this is where I get the OutOfMemoryError, even though I use a heap size of 1024m).

Another problem might be that the XLSIterator in KNIME uses "WorkbookFactory.create(InputStream)" instead of "WorkbookFactory.create(File)". According to this discussion, "WorkbookFactory.create(File)" uses less memory (http://stackoverflow.com/questions/6069847/java-lang-outofmemoryerror-java-heap-space-while-reading-excel-with-apache-poi). I have the same impression from using POI.
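A minimal sketch of why a File-based open can need less heap than a stream-based one, using only the JDK. This is not POI's or KNIME's actual code: the class name FileVsStreamSketch, both helper methods, and the 1 MiB temp file are hypothetical stand-ins. The idea it illustrates is that a plain InputStream cannot seek, so a reader that needs random access has to buffer everything, whereas a file can be read region by region.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;

class FileVsStreamSketch {

    // Stream-style access: a plain InputStream cannot seek, so a reader that
    // needs random access has to copy every byte onto the heap first.
    static byte[] bufferWholeStream(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return buffer.toByteArray();
    }

    // File-style access: seek to the region of interest and read only that.
    static byte[] readRegion(Path file, long offset, int length) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            byte[] region = new byte[length];
            raf.seek(offset);
            raf.readFully(region);
            return region;
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical 1 MiB stand-in for a large workbook file.
        Path tmp = Files.createTempFile("workbook-sketch", ".bin");
        try {
            Files.write(tmp, new byte[1 << 20]);

            byte[] all;
            try (InputStream in = Files.newInputStream(tmp)) {
                all = bufferWholeStream(in);
            }
            byte[] part = readRegion(tmp, 4096, 512);

            System.out.println("stream path held " + all.length + " bytes on the heap");
            System.out.println("file path held " + part.length + " bytes on the heap");
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

In POI terms, the corresponding change would be to call WorkbookFactory.create(new File(path)) rather than WorkbookFactory.create(new FileInputStream(path)). Whether that alone would be enough for a file of this size is not established here.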

If anybody wants to reproduce the error, I can send them the xls file.

Thanks,

Chris

Hi Chris, 

Thanks for posting. We are aware of this limitation and are interested in finding a solution, so thanks for the potential lead. We are looking into this now and will let you know what is going on as soon as we have more info.

Best Regards,

Aaron Hart

KNIME.com

I too have experienced this problem, even with smaller files. It's to the point that the XLS Reader is simply not practical to use.

The configuration dialog for xls is terribly slow as well. There seems to be no way to stop the initial 'refreshing preview table...' from starting, and until the preview is created or memory runs out (the latter being the rule rather than the exception), the program is not accessible.

I agree too: the XLS Reader isn't terribly useful due to its sluggishness, severe memory limitations, and lack of multiple-sheet support. It seems even worse with xlsx files, which I find strange, as I thought Microsoft made xlsx an open file format, so one would think it should be easier to implement.

simon.

...well, the xlsx format is more open but also a lot bigger; just take a look at the unzipped content... Anyway, as Aaron said, we have this on our radar and already have a few nodes (almost) ready for the 2.8 release coming out this summer. Reading will be faster and you'll also be able to access/iterate over sheets.

Michael