Parquet Reader hangs when reading very wide file

In KNIME 3.7.2 and 4.1.1 on Windows 10 I am creating a table with >100k columns (10k base features, each repeated multiple times in other rows with a variety of lags - this seems like a great use case for the wide-column vector support I am starting to see nodes for). The Parquet Writer can write this table to a file, but the Parquet Reader hangs when trying to read it back. I haven’t had the patience to let the reader run for more than a few hours, so perhaps it would ultimately succeed, but it is faster to recreate the columns from the source data than to read in the finished data file. Any ideas what could help speed the reader along, aside from splitting the columns out into different files? Is this a Parquet limitation or a KNIME one? Thanks for the thoughts!

@bfrutchey,

I am not sure whether this is a restriction of KNIME or of the Parquet library; I will have to test this with similar data first. How many rows does your table have, and which column types do you have?

Best,
Mareike

110,000 columns; most are doubles, with only 10 string fields. My test was with only 300 rows - I haven’t tried fewer yet.

Hello,

You mentioned using KNIME 3.7.2 and 4.1.1 on Windows 10 with a table that has 100k+ columns, and that the Parquet Reader hangs when trying to read the file back in.

I am looking into whether this long read time is an issue of file size, a Parquet limitation, or a KNIME limitation.

In the meantime, are you able to successfully use the Parquet Reader on smaller test files/tables, to verify whether it is a matter of table size?

Thank you,
Nickolaus


Nickolaus, we are able to load Parquet files with fewer columns, even when they are larger in size. The number of columns seems to be the kicker.


Hey bfrutchey,

Can you provide any other information about your workflow? I created a file with 140k columns and 500 rows, all doubles plus one String column. I can read and write it without any problems, so there doesn’t seem to be a limitation based on the number of columns alone.

Cheers,
Julian
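
For anyone who wants to reproduce a comparable test file outside of KNIME, here is a minimal Python sketch (assuming pyarrow and numpy are installed; the file name and exact column counts are placeholders, and this is not the KNIME workflow used for the test above):

```python
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

n_rows, n_cols = 500, 140_000  # roughly the shape described above

# One string column plus many double columns; building 140k small arrays
# takes a while and a fair amount of memory (the doubles alone are ~0.5 GB).
arrays = [pa.array([f"row_{i}" for i in range(n_rows)])]
names = ["id"]
for c in range(n_cols):
    arrays.append(pa.array(np.random.rand(n_rows)))
    names.append(f"col_{c}")

table = pa.Table.from_arrays(arrays, names=names)
pq.write_table(table, "wide_table.parquet")
print(table.num_columns, "columns,", table.num_rows, "rows written")
```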

You could try using R or Python to import the data and see if that makes any difference.
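
As a starting point, here is a minimal Python sketch (assuming pyarrow is installed; the file path is a placeholder) that first reads only the file metadata, which is cheap even for very wide files, and then attempts the full read. If the full read also stalls outside of KNIME, the bottleneck is more likely in the underlying Parquet libraries than in the Parquet Reader node:

```python
import pyarrow.parquet as pq

path = "wide_table.parquet"  # placeholder path to the file in question

# Reading only the footer/metadata confirms the file is intact and shows
# how it was laid out, without touching the column data.
meta = pq.read_metadata(path)
print(meta.num_columns, "columns,", meta.num_rows, "rows,",
      meta.num_row_groups, "row group(s)")

# Full read outside of KNIME for comparison.
table = pq.read_table(path)
print("Loaded table with shape", table.shape)
```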


Hello bfrutchey,

As this has been tested with a larger number of columns and rows and a similar table makeup, it seems that this may not be a limitation of KNIME or of the libraries used, but rather an issue with the resources available on the system at the time.

Please continue to test with other tables, and if you experience further difficulties, consider expanding the resources available to KNIME (for example, the Java heap size set via -Xmx in knime.ini) to get the large table imported.

Regards,
Nickolaus
