Hi I am an Alteryx user looking to switch over to Knime and have some questions about file handling. In Alteryx, the best type of file for storing data in is the YXDB type file, and Alteryx does fairly well reading and writing these types of files. I noticed that in Knime, it seems that a CSV is the most common file type used for read and write. However, since I use databases with sometimes hundreds of millions of rows, storing this in a CSV is very inefficient. Also, I also noticed that the csv read speed is much slower in Knime than in Alteryx (8.6 seconds in Alteryx vs 44 seconds in Knime for a 157MB CSV). Furthermore, a 54MB YXDB in CSV form takes up 157MB. Are there any file types that KNIME handles that would a) speed up the read time and b) take up less space in storage than a CSV?
As you’ve noticed, KNIME can read many different types of files, including CSV. However, the most efficient form of flat file storage in KNIME is the native .table format, which is a proprietary format optimized for speed and file size. You can write/read these files using the corresponding Table Writer and Table Reader nodes.
Thank you, this is exactly what I needed! After bench-marking, the same 157MB CSV is now only 44MB in a .table which is smaller than Alteryx’s 54 MB yxdb. The read speed was 6.9 seconds for .table which is much faster than a CSV in KNIME. Alteryx still seems to be faster for read as their YXDB reads the same data in 0.7 seconds, but this is still a significant performance increase from CSVs and the storage cost is less so this will work for me!
you could also try the Parquet file format which is widely used with large data sets. Also KNIME allows to use this format for its internal data storage although it is currently marked ‘experimental’ and I have seen some issues with it it might be worth exploring.
But @Pianalyst would still have to read his csv file in KNIME through a CSV/File Reader and then transform it into a .table, right? I mean, in this case the Table Writer/Reader helps with storing files that will be used as inputs to other workflows later on, but if the information is coming from a source that doesnt enable him to already extract it as .table (and hed have to decide upon more “traditional” formats such as .xls or .csv) then he’d still have to read it through a CSV reader for instance, and then transform it… or do you see another solution?