Best File Types for Knime

Pianalyst · August 26, 2019, 7:04pm

Hi I am an Alteryx user looking to switch over to Knime and have some questions about file handling. In Alteryx, the best type of file for storing data in is the YXDB type file, and Alteryx does fairly well reading and writing these types of files. I noticed that in Knime, it seems that a CSV is the most common file type used for read and write. However, since I use databases with sometimes hundreds of millions of rows, storing this in a CSV is very inefficient. Also, I also noticed that the csv read speed is much slower in Knime than in Alteryx (8.6 seconds in Alteryx vs 44 seconds in Knime for a 157MB CSV). Furthermore, a 54MB YXDB in CSV form takes up 157MB. Are there any file types that KNIME handles that would a) speed up the read time and b) take up less space in storage than a CSV?

Thank you for any feedback!

ScottF · August 26, 2019, 7:16pm

Hi @Pianalyst -

Welcome to the forum!

As you’ve noticed, KNIME can read many different types of files, including CSV. However, the most efficient form of flat file storage in KNIME is the native .table format, which is a proprietary format optimized for speed and file size. You can write/read these files using the corresponding Table Writer and Table Reader nodes.

Pianalyst · August 26, 2019, 7:22pm

Thank you, this is exactly what I needed! After bench-marking, the same 157MB CSV is now only 44MB in a .table which is smaller than Alteryx’s 54 MB yxdb. The read speed was 6.9 seconds for .table which is much faster than a CSV in KNIME. Alteryx still seems to be faster for read as their YXDB reads the same data in 0.7 seconds, but this is still a significant performance increase from CSVs and the storage cost is less so this will work for me!

izaychik63 · August 26, 2019, 7:25pm

KNIME has number of databases readers/connectors or uses JDBC to connect. As for speed, KNIME has special mechanism - streaming substantially improving performance.

mlauber71 · August 26, 2019, 8:07pm

you could also try the Parquet file format which is widely used with large data sets. Also KNIME allows to use this format for its internal data storage although it is currently marked ‘experimental’ and I have seen some issues with it it might be worth exploring.

g_oliv · August 26, 2019, 9:21pm

Hi @ScottF,

But @Pianalyst would still have to read his csv file in KNIME through a CSV/File Reader and then transform it into a .table, right? I mean, in this case the Table Writer/Reader helps with storing files that will be used as inputs to other workflows later on, but if the information is coming from a source that doesnt enable him to already extract it as .table (and hed have to decide upon more “traditional” formats such as .xls or .csv) then he’d still have to read it through a CSV reader for instance, and then transform it… or do you see another solution?

Cheers,
Gui

mlauber71 · August 27, 2019, 5:53am

There is a possibility. You could use Hive external table function but there are some quirks as if now in KNIME.

ScottF · August 27, 2019, 1:53pm

Right, but that’s the case with Alteryx as well, if he wanted to get data from a CSV into YXDB format (which is the Alteryx proprietary format).

As far as I can tell, it’s an apples-to-apples comparison.

DemandEngineer · October 30, 2020, 9:52pm

Hi @Pianalyst ,

I also come from using Alteryx Designer, hope this can help you and/or you can help update this tool mapping sheet: