Table Writer very slow compared to Parquet Writer, but it seems irreplaceable.

We are using KNIME 4.2.2 to load some relatively large tables from a database and store them locally on a Mac for later processing. The largest of these files has around 32 million records and ends up at just around 1 GB when written to a table file using the Table Writer.

Working to optimise this job, we noticed that it spends a very long time writing the file to disk using the Table Writer. Testing with the Parquet Writer reduces the time to about a third. The Parquet files also turn out to be smaller, so the compression is better (using SNAPPY).

Is there a bug in the Table Writer that slows it down? Is there any way to speed it up?

It would be easy for us to use the Parquet Writer instead of the Table Writer. However, this ETL job is required to write out files for empty tables, and for some reason the Parquet Writer does not write a file if you send it an empty table. It does not fail, but no file appears.

Is there any way to get the Parquet Writer to create a file containing the structure of an empty table?

Any other alternatives?

We are not able to use .CSV files as we need to ensure that we are preserving the correct data types on all fields, and we have had problems because the CSV Reader infers the data types from the data instead of preserving the KNIME data types. We are reading 50+ tables from about 20 different instances of the same database application.

@havarde if you want to write data to disk and preserve the data formats, one idea could be to use SQLite or H2 local databases. They would allow empty tables, and you could append data in chunks or by streaming and see how that works.

Another option to explore could be ORC, although I believe this might also not support empty tables. There was some debate about introducing that - but I am not sure about the result.

What’s the issue in your use case with not having this empty table? I ask because that way you can create a workaround: Empty Table Switch -> CSV Writer that writes an empty file with the appropriate name (I hope that since it’s empty, encodings etc. shouldn’t matter, but maybe I’m a bit optimistic). If that doesn’t work, use a Python snippet to write an empty file in the appropriate format (see the sketch below).
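If that last idea helps, here is a minimal sketch of such a Python snippet, assuming pyarrow is available; the schema and file name are made up and would need to match your real table:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical schema standing in for the real KNIME table spec
schema = pa.schema([
    ("id", pa.int64()),
    ("name", pa.string()),
    ("amount", pa.float64()),
])

# Build a zero-row table that still carries the full schema
empty_table = pa.Table.from_arrays(
    [pa.array([], type=field.type) for field in schema],
    schema=schema,
)

# Parquet stores the schema in the file footer, so readers can
# recover the column names and types even with zero rows
pq.write_table(empty_table, "staging/customers.parquet", compression="snappy")
```

Since the schema lives in the file footer, a downstream Parquet Reader should see the correct columns and types even though the file has no rows.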

Hello @havarde,

What kind of processing? I mean, the question is why not process it right away?

What does a long time mean? Seconds, minutes, hours? I have tried writing 10 million rows and 5 columns with it and it took less than a minute.

I don’t think there is a bug, and maybe there is a way to speed it up. In the steps before you write data to your local machine, do you by any chance generate longer RowIDs by using the Joiner or Cross Joiner node? If so, this might explain the longer execution time.

Br,
Ivan


Thanks for the good suggestions @mlauber71. I did not know about the ORC file format, but for my purposes it works the same way as the Parquet nodes, with the same performance, and it likewise does not write a file when the table is empty.

The reason for using files instead of a local database is simplicity, and we do not need any functionality from the database. What we want to do is just read the data in one large chunk and handle it in memory as much as possible.


Thank you for your reply @ipazin

What we are doing is storing the files in a sort of staging area for later preparation of the data structures that we use for reporting and analysis. Sort of a low-cost, file-based data warehouse. Locating all the database access in one place and separating it from all the jobs that later work on the same data files is just a convenient way of organising this.

I’ve run some tests using a slightly smaller (26M rows, 115 columns) version of the table we are struggling with. Writing this to disk using the Table Writer takes about 25 minutes on my Mac (brand new MacBook Pro 16-inch with 32 GB RAM). Writing the same table to disk using the Parquet Writer takes about 10 minutes, and using the ORC Writer it takes about 8 minutes. Unfortunately, reading the files back is also slow, at about 20 minutes for all the file formats mentioned. The strange thing here is that the Table Writer and Parquet Writer behave differently, as I thought KNIME used Parquet for its table files (when you select that).

Anyway, the overall fastest write and read performance is obtained by writing a plain old CSV file and reading it with the new CSV Reader node. With CSV, writing takes about 10 minutes and reading about 8 minutes. Not bad, even if it generates a file of about 9 GB. Unfortunately, CSV does not create empty files where the empty columns’ data types are preserved, and it may even mess up data types on files with data.

Right now the best workaround seems to be to build a solution where I write the table structure into one file and then store the data in a separate CSV file, along the lines of the sketch below.
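For what it’s worth, a sketch of that workaround in Python, assuming pandas and using hypothetical file names; a sidecar JSON file carries the dtypes so the CSV reader does not have to infer them:

```python
import json
import pandas as pd

def write_with_schema(df: pd.DataFrame, base: str) -> None:
    # The sidecar JSON keeps the dtypes that a plain CSV loses
    schema = {col: str(dtype) for col, dtype in df.dtypes.items()}
    with open(base + ".schema.json", "w") as f:
        json.dump(schema, f)
    df.to_csv(base + ".csv", index=False)

def read_with_schema(base: str) -> pd.DataFrame:
    with open(base + ".schema.json") as f:
        schema = json.load(f)
    # Apply the stored dtypes instead of letting read_csv guess
    # (datetime columns would need parse_dates instead of dtype)
    return pd.read_csv(base + ".csv", dtype=schema)
```

This also round-trips empty tables, since the schema file exists regardless of the row count and to_csv still writes the header line.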

You could experiment with different compression settings and see if that gives you a good balance of speed and size.

In this scenario an H2 database would work just like a file, but with the additional benefit of a, well, database. You could experiment with chunk sizes and see if that would help. The format is quite stable.

SQLite can actually be used like a file format, and working all in-memory is totally possible as well. Reading what you write, I think it would actually be very suitable for your use case and rather simple.

SQLite is not the perfect application file format for every situation. But in many cases, SQLite is a far better choice than either a custom file format, a pile-of-files, or a wrapped pile-of-files. SQLite is a high-level, stable, reliable, cross-platform, widely-deployed, extensible, performant, accessible, concurrent file format. It deserves your consideration as the standard file format on your next application design.

Yeah, some marketing speech, but reading your next post about CSV files and trying to store the type information: you would essentially be building a simple data table format, so why not use SQLite instead?

I would at least try it. Don’t know about the write speed.
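To illustrate the SQLite idea (a sketch only, with a made-up table and file name): a table created with explicit column types keeps its schema even when it never receives a row, which is exactly what the Parquet and ORC writers would not give you.

```python
import sqlite3
import pandas as pd

con = sqlite3.connect("staging.db")

# The declared types survive even if no rows are ever inserted
con.execute("""
    CREATE TABLE IF NOT EXISTS customers (
        id INTEGER,
        name TEXT,
        amount REAL
    )
""")

# Append data in chunks as it arrives from the source database
chunk = pd.DataFrame({"id": [1], "name": ["a"], "amount": [1.5]})
chunk.to_sql("customers", con, if_exists="append", index=False)

# Reading back works the same way whether the table is empty or not
out = pd.read_sql("SELECT * FROM customers", con)
con.close()
```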

But again, back to my initial question: why is it such a problem not to have the empty files? That sounds like a design flaw in the whole logic. If it’s really important to know about empty files, make a tracking file (or SQLite database?) that contains info about all files with their row counts etc. (see the sketch below). That way you have that info available and can use the Parquet Writer.
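A minimal sketch of such a tracking file, with a hypothetical manifest name; a downstream job can then tell “empty table” apart from “missing file” without the writer having to produce empty outputs:

```python
import csv
import os
from datetime import datetime, timezone

def track(manifest: str, filename: str, row_count: int) -> None:
    # Append one manifest row per written file, creating the
    # header on first use
    is_new = not os.path.exists(manifest)
    with open(manifest, "a", newline="") as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(["file", "rows", "written_at"])
        writer.writerow([filename, row_count,
                         datetime.now(timezone.utc).isoformat()])
```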

