unfortunately I have to read and process a BIG csv File (165 Mio Rows x 12 Cols) with my tiny tiny laptop with (16 GB RAM) and process the data. So far I read the whole file with the “Simple File Reader” (took about 3-4 hours) and did the rest with a chunk loop with 1 Mio rows each iteration.
I tried streaming, but I have several loops within the my process though it wouldn’t work. Do you have an idea how to speed up my workflow?
The load time sounds reasonable without streaming. I have 10mln rows read in 5 min and just 1 min with streaming including row filter. Loops in KNIME are not very fast, so review the processing logic.
Also look here
It looks odd. Why you read file again? You can filter the table you already read. And what the goal of the iteration process? It may make sense to load data to DB and use SQL or use a standard KNIME node?
Well, I have several files with that file, though I just wanted to find a faster way to load these files.
The upload of the Data into a MySQL DB was a idea too. I tried once with KNIME but the upload was freaking slow with the DB writer node. Probably my fault but thats why I skipped it.
I tried this node, but unfortunately after some hours it crashed with ERROR DB Loader 3:423 Execute failed: Java heap space: failed reallocation of scalar replaced objects
The local big data env is stored on a hard drive with more than 200 GB free space.
When I reduce the data table to 1000 rows it works.
No this does not work for me.
The idea behind my bulleted list is, that I want to read the huge csv file chunk by chunk. So I have to check the numbers of rows within the file prior to get the number of iterations for reading it.
My File has 164.9 Mio rows → with a chunk size of 1.000.000 rows it needs 165 iterations…
I have created an example using the Local Big Data Environment and Hive and EXTERNAL table in order to import aCSV file into Hive and later be able to handle it. In this example there is a partition marker (my_partition) in the last column. This can later be used to split the data into parquet partitions so the system will be better able to handle them. Also you might be able to access the files separately if you absolutely must and the system is successful.
One trick is at the point where an Hive EXTERNAL TABLE is fed the structure of your data and then it can access the CSV file(s) as external information and process them within the hive/big data environment. There would be no need to import them at first. But please note: there is no free lunch here. A a certain point the data will have to be processed. One hope might be, that Hive is quite a robust system.
Also: the data might not be able to ‘live’ in this local environment forever. So at a certain point you might want to export the data into Parquet or ORC files (or KNIME’s own table) - or a local SQL database in order to be able to work with it further. Also this is a demonstration on my Mac. So some path nodes might have to be adapted (although it should work on windows also). And also a real big data system would be best to handle that …
You can further try and tweak your performance for example by using the new columnar storage (also based on Apache’s Parquet format).
As already has been said: Don’t use loops. 165 million rows is pushing it anyway and then with loops. Overthink your processing logic.
Use better hardware. Shocking but in the big picture it’s often a very cheap solution compared to alternatives like your time spent on solving it on “mediocre” hardware. For sure one I would prefer before Big Data environments which doesn’t help you anything it just runs on your same laptop. In fact it probably just makes it slower because you then have more stuff running. Looking into big data technology should really only happen if a single machine can’t handle it anymore which takes a lot of data with modern hardware. You can get a 64-core workstation with like 256GB of RAM for around 10k-15k or use AWS. All simpler than an actual big data environment.
Loading it into a database can make sense depending on what you want to perform. Note: You can load it directly into one from a CSV, no need to use KNIME for that.
Having said that for pure number crunching, KNIME is pretty slow compared to say numpy/pandas. Not really a great answer for the KNIME forum but droppign down to pure python based workflow is another option.
Oh and what I forgot, what are the 12 columns? Just numbers? Long strings? or even images? Eg. the size of the data will also greatly matter at this scale.
Thanks a lot for your comments.
The table contains luckily no pictures, but double, some integers and one string column. But you are right, it always depends on the data.
I am not sure, but I thought that Knime has around the same speed as python/pandas. But I didn’t ran any benchmarks…
Of course better hardware is always a easy way to get faster results… But … I work in a huge company with more than 200 k employes worldwide with highly standardized computer hardware. Which has a lot of advantages and some disadvantages, like we seldom have the latest hardware. For example my 600 € laptop at home (Lenovo Ideapad 5 with a Ryzen 5 CPU, 16 GB RAM) is much faster as most of the hardware here at work…
Overthinking the loops in my workflow is probably the best way to improve the performance. We will see.
In the meantime I managed to push the table to a MySQL DB, which helps a lot. Especially when other colleagues want to analyze some parts of it.
Yeah, I expect the bigger the company the worse this bureaucracy and standardization gets. I would still push for better hardware, best include all your colleagues and even other teams that need something better. But yeah will take years, probably, to get there.
Also don’t get fooled by the $600 price tag. Depending on the exact model, these AMD Ryzen mobile CPUs are actually pretty much the fastest mobile CPUs available right now