I have a data set that has more than a billion rows and about 40 columns. We will be doing analytics on this dataset using KNIME Analytics Platform. Before beginning the analysis, we will be buying suitable hardware for processing this data. I have a few questions regarding this:
1. What hardware configuration do you recommend in terms of RAM, CPU, etc. to handle a data set of this size?
2. Any advice on how to handle a data set of this size using KNIME Analytics Platform?
It depends on what exactly you want to do, so it's fairly difficult to come up with recommendations. I'd recommend starting with a machine at a reasonable price point (maybe a cloud VM) and scaling out as required. Often, offloading parts of the processing into optimised databases (or clusters of databases) can speed things up significantly "under the hood" without compromising KNIME's familiar and user-friendly interface.
Hope this helps a little bit!
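To make the "offload to a database" idea above concrete, here is a minimal Python sketch using sqlite3 as a stand-in for an optimised analytics database (table and column names are made up for illustration): the GROUP BY runs inside the database, so only a tiny summary table, not the raw rows, ever reaches the analytics tool.

```python
import sqlite3

# Illustrative sketch: push an aggregation down to the database so only
# the aggregated result, not the billion raw rows, is pulled into KNIME.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("EU", 10.0), ("EU", 20.0), ("US", 5.0)],
)

# The database does the heavy lifting; we fetch back a tiny summary.
summary = conn.execute(
    "SELECT region, SUM(amount) AS total FROM sales "
    "GROUP BY region ORDER BY region"
).fetchall()
print(summary)  # [('EU', 30.0), ('US', 5.0)]
```

In KNIME itself the same push-down happens when the DB nodes build a query that is executed on the database side before the result is read in.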
When we have to deal with a really huge number of rows, we work with sampled tables first on average machines.
In the case of time series this is difficult, but usually it is possible to find the right time window(s).
The power you need really depends on the workflow you'll create; with sampling you'll get some idea of the resources needed.
The VM/cloud solution is really scalable; however, it is strongly recommended to test the processes first.
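The sampling idea above can be sketched in a few lines of Python (the fraction and data are made-up stand-ins): prototype the workflow on a small random sample, measure what it needs, then extrapolate to the full table.

```python
import random

# Hypothetical sketch: keep each row independently with a small
# probability, so the workflow can be prototyped on a modest machine.
random.seed(42)  # fixed seed so the prototype sample is reproducible

def sample_rows(rows, fraction=0.01):
    """Return a ~`fraction` random sample of `rows`."""
    return [row for row in rows if random.random() < fraction]

full_data = range(1_000_000)       # stand-in for a billion-row table
sample = sample_rows(full_data, 0.01)
print(len(sample))                 # roughly 10,000 rows
```

In KNIME the equivalent would be a Row Sampling node placed right after the reader while the workflow is being developed.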
How did this work out for you? I am looking to do the same with 100M+ Rows.
Hi there @ScottRJohnson1,
welcome to the KNIME Community Forum! Sorry for the delay on this one. It got buried…
Anyway, have you tried out KNIME with your data? In my opinion KNIME should have no problem handling 100M+ rows, but that also depends on the data types you have, the number of columns, and of course the manipulations, operations and analysis you are performing with it.
If you have any additional questions, feel free to ask and someone will assist you. Hopefully sooner rather than later
Depends on what you want to do with it. Can you clarify?
Streaming nodes work well for simple operations and are not that memory intensive. If you have something this big that is streamable, definitely do it, since it won't bottleneck your memory. I believe the way KNIME handles the dataset for anything not streamable is to chunk the file from disk to perform operations; the more memory you have on the machine, the fewer times this swap has to take place.
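As a rough analogy to the chunked processing described above, here is a small Python sketch (column name, chunk size, and data are invented for the example): only one chunk of rows is held in memory at a time, and a running aggregate replaces the full table.

```python
import csv
import io

# Sketch of chunk-at-a-time processing: memory use is bounded by the
# chunk size, no matter how many rows flow through.
def chunked_sum(reader, column, chunk_size=2):
    total, chunk = 0.0, []
    for row in reader:
        chunk.append(row)
        if len(chunk) == chunk_size:       # a full chunk is in memory
            total += sum(float(r[column]) for r in chunk)
            chunk.clear()                  # drop it before the next one
    total += sum(float(r[column]) for r in chunk)  # leftover partial chunk
    return total

data = io.StringIO("value\n1\n2\n3\n4\n5\n")
print(chunked_sum(csv.DictReader(data), "value"))  # 15.0
```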
That being said, if you have the money, get a processor with a decent number of threads so you can use the Parallel Processing nodes to make quick work of any row-wise data manipulation.
If you are doing something like neural nets, just get a decent base platform like an AMD Ryzen 5 2600 (12 threads for $200), 32 or 64 GB of RAM, and then get an NVIDIA GPU that is CUDA-enabled.
Alternatively, if you can push the data to Snowflake, you can use their cloud computing to pre-process the data somewhat before pulling it into KNIME.
Thanks for the responses! Ended up buying a custom-built computer for KNIME with 128 GB RAM, a 16-core (32-thread?) CPU, and an Optane SSD. It is able to process the 100-million-plus lines and then some; still, I have split larger calculations out into 6+ partitions and then re-aggregated, which seems to be significantly faster than performing the same operation on the full dataset.
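The partition-and-re-aggregate pattern described above can be sketched in Python (the data and partition count of 6 mirror the post; the sum is just an example of a decomposable calculation): compute a partial result per partition, then combine the partials, which gives the same answer as one pass over the full data.

```python
# Sketch: split rows into 6 partitions, compute per partition, re-aggregate.
def partitions(rows, n):
    """Split `rows` into `n` roughly equal contiguous partitions."""
    size = (len(rows) + n - 1) // n
    return [rows[i:i + size] for i in range(0, len(rows), size)]

rows = list(range(1, 101))                        # stand-in data set
partials = [sum(p) for p in partitions(rows, 6)]  # one result per partition
total = sum(partials)                             # re-aggregate the partials
print(total)  # 5050, same answer as summing the full data set at once
```

Note this only works directly for calculations that decompose this way (sums, counts, min/max); something like a median needs more care when recombining.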
In short, I am one happy camper, and I'm advancing significantly every day in my KNIME knowledge and love.
Glad to hear that @ScottRJohnson1
For what it’s worth - this is what I’ve done to run 100 million+ lines of data on KNIME Analytics Platform 4.0.0:
I have a lesser machine than your custom build: 64 GB RAM, 8-core/16-thread CPU, a consumer-grade NVMe SSD, running Windows 10.
- Install the latest AdoptOpenJDK 8 (for me it's 8.0.212.04-hotspot)
- Switch the GC to Shenandoah by editing the knime.ini file:
- have the JVM point to the latest AdoptOpenJDK
- switch -XX:+UseG1GC to -XX:+UseShenandoahGC
- add -XX:+UnlockExperimentalVMOptions, -XX:+AlwaysPreTouch, -XX:-UseBiasedLocking, -XX:+ExplicitGCInvokesConcurrent, -XX:+UseNUMA
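For anyone unsure where those flags go, the resulting knime.ini might look roughly like this (the `-vm` path and the `-Xmx` heap size are assumptions for a Windows AdoptOpenJDK install, not from the post; adjust both to your own setup):

```
-vm
C:\Program Files\AdoptOpenJDK\jdk-8.0.212.04-hotspot\bin\server\jvm.dll
-Xmx48g
-XX:+UnlockExperimentalVMOptions
-XX:+UseShenandoahGC
-XX:+AlwaysPreTouch
-XX:-UseBiasedLocking
-XX:+ExplicitGCInvokesConcurrent
-XX:+UseNUMA
```

Note that `-XX:+UnlockExperimentalVMOptions` must appear before `-XX:+UseShenandoahGC`, since Shenandoah was experimental in that JDK line.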
Hope this helps!
EDIT: I should add, this has stopped my workflows from hitting OOMs (out-of-memory errors).
Why does this make any difference? Sorry for the dumb question.