Read and process HUGE csv File

Hi,

unfortunately I have to read and process a BIG csv file (165 Mio rows x 12 cols) on my tiny little laptop (16 GB RAM). So far I read the whole file with the “Simple File Reader” (which took about 3-4 hours) and did the rest with a chunk loop handling 1 Mio rows per iteration.

I tried streaming, but since I have several loops within my process, it wouldn’t work. Do you have an idea how to speed up my workflow?

Andreas

1 Like

The load time sounds reasonable without streaming. I read 10 million rows in 5 min, and in just 1 min with streaming, including a row filter. Loops in KNIME are not very fast, so review the processing logic.
Also look here

3 Likes

Thanks!
I tried to read the CSV File chunk-wise (1 Mio Rows) like following:

  • Get Number of Rows in File (I use a python snipped)
  • Calculate number of iterations ceil(N_Rows/1e6)
  • feed as variable into Counting loop Node
  • calculate start row as variable for File Reader: RowStart = (iteration*1e6)
  • Feed Variable into File Reader, set as Start Row
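
A minimal Python sketch of that bookkeeping (file path and chunk size are made up; in KNIME the loop itself is of course handled by the counting loop and flow variables):

```python
import math

CSV_PATH = "huge_file.csv"   # hypothetical path
CHUNK_SIZE = 1_000_000       # 1 Mio rows per chunk

# count the rows once (plain line count, no CSV parsing)
with open(CSV_PATH, "r", encoding="utf-8") as f:
    n_rows = sum(1 for _ in f) - 1   # minus the header line

n_iterations = math.ceil(n_rows / CHUNK_SIZE)

# the start row for each pass, fed into the File Reader as a flow variable
for iteration in range(n_iterations):
    row_start = iteration * CHUNK_SIZE
    # ... read CHUNK_SIZE rows starting at row_start and process them ...
```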

This works pretty well until iteration 122… I don’t know why, but from then on the File Reader creates an empty table.
Maybe the node is not able to skip more than 121 Mio rows…

That looks odd. Why do you read the file again? You can filter the table you have already read. And what is the goal of the iteration process? It may make sense to load the data into a DB and use SQL, or to use a standard KNIME node.

1 Like

Well, I have several files like that one, so I just wanted to find a faster way to load them.
Uploading the data into a MySQL DB was an idea too. I tried it once with KNIME, but the upload was really slow with the DB Writer node. Probably my fault, but that’s why I skipped it.

Try

2 Likes

Also,

It could be used instead of the Python snippet.

2 Likes

I tried this node, but unfortunately after some hours it crashed with
ERROR DB Loader 3:423 Execute failed: Java heap space: failed reallocation of scalar replaced objects
The local big data environment is stored on a hard drive with more than 200 GB of free space.
When I reduce the data table to 1,000 rows, it works.

No, this does not work for me.
The idea behind my bulleted list is that I want to read the huge CSV file chunk by chunk. So I have to determine the number of rows in the file first in order to get the number of iterations needed for reading it.
For example:
My file has 164.9 Mio rows → with a chunk size of 1,000,000 rows it needs 165 iterations…

This is a question for @tobias.koetter.

I have created an example using the Local Big Data Environment, Hive and an EXTERNAL table in order to import a CSV file into Hive and later be able to handle it. In this example there is a partition marker (my_partition) in the last column. It can later be used to split the data into Parquet partitions so the system will be better able to handle them. You might also be able to access the files separately if you absolutely must.

One trick is that at one point a Hive EXTERNAL TABLE is fed the structure of your data; it can then access the CSV file(s) as external information and process them within the Hive/big data environment. There would be no need to import them first. But please note: there is no free lunch here. At a certain point the data will have to be processed. One hope is that Hive is quite a robust system.
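
For illustration, the DDL for such an external table could look roughly like this (a sketch only: table name, column names/types, path and delimiter are all assumptions, and the statement could be sent e.g. from a Python snippet or a DB SQL Executor node):

```python
# Hypothetical Hive DDL for an EXTERNAL table that points at the CSV folder.
# Nothing is imported; Hive reads the file(s) in place.
# LOCATION is the folder holding the CSV file(s); skip.header.line.count drops the header row.
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS big_csv (
    value_1      DOUBLE,
    value_2      DOUBLE,
    some_count   INT,
    some_label   STRING,
    my_partition INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/big_csv/'
TBLPROPERTIES ('skip.header.line.count'='1')
"""
print(ddl)
```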

Also: the data might not be able to ‘live’ in this local environment forever. So at a certain point you might want to export the data into Parquet or ORC files (or KNIME’s own table format), or into a local SQL database, in order to be able to work with it further. Also, this is a demonstration on my Mac, so some path nodes might have to be adapted (although it should work on Windows as well). And a real big data system would be best to handle that…
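
If you go the Parquet route outside of KNIME’s own nodes, a chunked conversion could look roughly like this (a sketch assuming pandas and pyarrow are installed; path, chunk size and columns are made up):

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

CSV_PATH = "huge_file.csv"          # hypothetical input
PARQUET_PATH = "huge_file.parquet"  # hypothetical output

writer = None
# stream the CSV in 1 Mio row chunks so it never has to fit into RAM at once
for chunk in pd.read_csv(CSV_PATH, chunksize=1_000_000):
    # dtypes should stay consistent across chunks, otherwise the schema will not match
    table = pa.Table.from_pandas(chunk, preserve_index=False)
    if writer is None:
        # the first chunk defines the Parquet schema
        writer = pq.ParquetWriter(PARQUET_PATH, table.schema)
    writer.write_table(table)

if writer is not None:
    writer.close()
```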

You can further try and tweak your performance for example by using the new columnar storage (also based on Apache’s Parquet format).

I have to go for now. More later…

5 Likes

Wow! That is insane!!!
Thank you very much!

2 Likes

Regardless of @mlauber71’s comprehensive example, I forgot to mention that

optimized for speed and may work for you also.

3 Likes

As has already been said: don’t use loops. 165 million rows is pushing it anyway, and then with loops on top. Rethink your processing logic.

Use better hardware. Shocking, but in the big picture it’s often a very cheap solution compared to alternatives like the time you spend solving the problem on “mediocre” hardware. It is certainly something I would prefer over a Big Data environment, which doesn’t help you at all here, since it just runs on the same laptop. In fact it probably makes things slower, because you then have more stuff running. Looking into big data technology should really only happen when a single machine can’t handle the data anymore, which takes a lot of data with modern hardware. You can get a 64-core workstation with 256 GB of RAM for around 10k-15k, or use AWS. All of that is simpler than an actual big data environment.

Loading it into a database can make sense depending on what you want to do with the data. Note: you can load it into a database directly from the CSV; there is no need to use KNIME for that.
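
For example, with MySQL a server-side bulk load along these lines is usually much faster than row-by-row inserts (a sketch only; host, credentials, file path and table layout are assumptions, the target table must already exist, and the server must permit local_infile):

```python
import mysql.connector  # mysql-connector-python

# all connection details and names below are hypothetical
conn = mysql.connector.connect(
    host="localhost",
    user="andreas",
    password="...",
    database="bigdata",
    allow_local_infile=True,   # needed for LOAD DATA LOCAL INFILE
)
cur = conn.cursor()

# let the MySQL server parse and insert the file in one bulk operation
cur.execute("""
    LOAD DATA LOCAL INFILE '/path/to/huge_file.csv'
    INTO TABLE big_table
    FIELDS TERMINATED BY ','
    IGNORE 1 LINES
""")
conn.commit()

cur.close()
conn.close()
```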

Having said that, for pure number crunching KNIME is pretty slow compared to, say, numpy/pandas. Not really a great answer for the KNIME forum, but dropping down to a pure Python-based workflow is another option.
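
Roughly, an out-of-core aggregation in pandas could look like this (column names and the actual computation are made up, just to show the chunked pattern):

```python
import pandas as pd

CSV_PATH = "huge_file.csv"   # hypothetical path

# aggregate chunk by chunk instead of loading 165 Mio rows at once
partial = []
for chunk in pd.read_csv(CSV_PATH, chunksize=1_000_000):
    partial.append(chunk.groupby("some_label")["value_1"].sum())

# merge the per-chunk results into the final aggregate
result = pd.concat(partial).groupby(level=0).sum()
print(result)
```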

Oh, and what I forgot: what are the 12 columns? Just numbers? Long strings? Or even images? The size of the data will also greatly matter at this scale.

Thanks a lot for your comments.
Luckily the table contains no pictures, just doubles, some integers and one string column. But you are right, it always depends on the data.
I am not sure, but I thought KNIME has roughly the same speed as Python/pandas. But I didn’t run any benchmarks…
Of course better hardware is always an easy way to get faster results… But… I work in a huge company with more than 200k employees worldwide and highly standardized computer hardware, which has a lot of advantages and some disadvantages, like that we seldom have the latest hardware. For example, my 600 € laptop at home (Lenovo IdeaPad 5 with a Ryzen 5 CPU, 16 GB RAM) is much faster than most of the hardware here at work…

Rethinking the loops in my workflow is probably the best way to improve the performance. We will see.
In the meantime I managed to push the table to a MySQL DB, which helps a lot, especially when other colleagues want to analyze parts of it.

2 Likes

Yeah, I expect the bigger the company, the worse this bureaucracy and standardization gets. I would still push for better hardware; ideally include all your colleagues and even other teams that need something better. But yeah, it will probably take years to get there.
Also, don’t be fooled by the $600 price tag. Depending on the exact model, these AMD Ryzen mobile CPUs are actually pretty much the fastest mobile CPUs available right now.

@ActionAndi did you get the chance to try the New Columnar Table Backend? Already with the current implementation of CSV Readers we saw some speed-up when reading large files.

4 Likes

I’ve never heard about that. I’ll give it a try 🙂
Thank you

2 Likes

If money isn’t a huge problem, you can look at a SaaS solution like Snowflake or Acho Studio for processing larger CSV files. I paid for the latter for a couple of months and it worked pretty well.

This topic was automatically closed 182 days after the last reply. New replies are no longer allowed.