I have a data set that has more than a billion rows and about 40 columns. We will be doing analytics on this dataset using KNIME Analytics Platform. Before beginning the analysis, we will be buying suitable hardware for processing this data. I have a few questions regarding this:
1. What hardware configuration do you recommend in terms of RAM, CPU, etc. to handle a data set of this size?
2. Any advice on how to handle a data set of this size using KNIME Analytics Platform?
It depends on what exactly you want to do, so it's fairly difficult to come up with recommendations. I'd recommend starting with a machine at a reasonable price point (maybe a cloud VM) and scaling out as required. Often, offloading parts of the processing into optimised databases (or clusters of databases) can speed things up significantly "under the hood" without compromising KNIME's familiar and user-friendly interface.
Hope this helps a little bit!
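To make the "offload to a database" idea above concrete, here is a minimal Python sketch using sqlite3 as a stand-in for an optimised analytics database (table and column names are made up for illustration): the GROUP BY runs inside the database, so only a tiny summary table, not the raw rows, ever reaches the analytics tool.

```python
import sqlite3

# Illustrative sketch: push an aggregation down to the database so only
# the aggregated result, not the billion raw rows, is pulled into KNIME.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("EU", 10.0), ("EU", 20.0), ("US", 5.0)],
)

# The database does the heavy lifting; we fetch back a tiny summary.
summary = conn.execute(
    "SELECT region, SUM(amount) AS total FROM sales "
    "GROUP BY region ORDER BY region"
).fetchall()
print(summary)  # [('EU', 30.0), ('US', 5.0)]
```

In KNIME itself the same push-down happens when the DB nodes build a query that is executed on the database side before the result is read in.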
When we have to deal with a really huge number of rows, we work with sampled tables first on average machines.
In the case of time series this is difficult, but usually it is possible to find the right time window(s).
The power you need really depends on the workflow you'll create; with sampling you'll get some idea of the resources needed.
The VM/cloud solution is really scalable; however, it is strongly recommended to test the processes first.
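The sampling idea above can be sketched in a few lines of Python (the fraction and data are made-up stand-ins): prototype the workflow on a small random sample, measure what it needs, then extrapolate to the full table.

```python
import random

# Hypothetical sketch: keep each row independently with a small
# probability, so the workflow can be prototyped on a modest machine.
random.seed(42)  # fixed seed so the prototype sample is reproducible

def sample_rows(rows, fraction=0.01):
    """Return a ~`fraction` random sample of `rows`."""
    return [row for row in rows if random.random() < fraction]

full_data = range(1_000_000)       # stand-in for a billion-row table
sample = sample_rows(full_data, 0.01)
print(len(sample))                 # roughly 10,000 rows
```

In KNIME the equivalent would be a Row Sampling node placed right after the reader while the workflow is being developed.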
How did this work out for you? I am looking to do the same with 100M+ Rows.
Hi there @ScottRJohnson1,
welcome to the KNIME Community Forum! Sorry for the delay on this one. It got buried…
Anyway, have you tried out KNIME with your data? In my opinion KNIME should have no problem handling 100M+ rows, but that also depends on the data types you have, the number of columns, and of course the manipulations, operations and analysis you are performing with it.
If you have any additional questions, feel free to ask and someone will assist you. Hopefully sooner rather than later
Depends on what you want to do with it. Can you clarify?
Streaming nodes work well for simple operations and are not that memory intensive. If you have something this big that is streamable, definitely do it, since it won't bottleneck your memory. I believe the way KNIME handles the dataset for anything not streamable is to chunk the file from disk to perform operations; the more memory you have on the machine, the fewer times this swap has to take place.
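As a rough analogy to the chunked processing described above, here is a small Python sketch (column name, chunk size, and data are invented for the example): only one chunk of rows is held in memory at a time, and a running aggregate replaces the full table.

```python
import csv
import io

# Sketch of chunk-at-a-time processing: memory use is bounded by the
# chunk size, no matter how many rows flow through.
def chunked_sum(reader, column, chunk_size=2):
    total, chunk = 0.0, []
    for row in reader:
        chunk.append(row)
        if len(chunk) == chunk_size:       # a full chunk is in memory
            total += sum(float(r[column]) for r in chunk)
            chunk.clear()                  # drop it before the next one
    total += sum(float(r[column]) for r in chunk)  # leftover partial chunk
    return total

data = io.StringIO("value\n1\n2\n3\n4\n5\n")
print(chunked_sum(csv.DictReader(data), "value"))  # 15.0
```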
That being said, if you have the money, get a processor with a decent number of threads so you can use the Parallel Processing nodes to make quick work of any row-wise data manipulation.
If you are doing something like neural nets, just get a decent base platform like an AMD Ryzen 5 2600 (12 threads for $200), 32 or 64 GB of RAM, and then get an NVIDIA GPU that is CUDA-enabled.
Alternatively, if you can push the data to Snowflake, you can use their cloud computing to pre-process the data somewhat before pulling it into KNIME.
Thanks for the responses! Ended up buying a custom-built computer for KNIME with 128 GB RAM, a 16-core (32-thread?) CPU, and an Optane SSD. It is able to process the 100-million-plus lines and then some; still, I have split larger calculations out into 6+ partitions and then re-aggregated, which seems to be significantly faster than performing the same operation on the full dataset.
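The partition-and-re-aggregate pattern described above can be sketched in Python (the data and partition count of 6 mirror the post; the sum is just an example of a decomposable calculation): compute a partial result per partition, then combine the partials, which gives the same answer as one pass over the full data.

```python
# Sketch: split rows into 6 partitions, compute per partition, re-aggregate.
def partitions(rows, n):
    """Split `rows` into `n` roughly equal contiguous partitions."""
    size = (len(rows) + n - 1) // n
    return [rows[i:i + size] for i in range(0, len(rows), size)]

rows = list(range(1, 101))                        # stand-in data set
partials = [sum(p) for p in partitions(rows, 6)]  # one result per partition
total = sum(partials)                             # re-aggregate the partials
print(total)  # 5050, same answer as summing the full data set at once
```

Note this only works directly for calculations that decompose this way (sums, counts, min/max); something like a median needs more care when recombining.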
In short, I am one happy camper, and I'm advancing significantly every day in my KNIME knowledge and love.
Glad to hear that @ScottRJohnson1
For what it’s worth - this is what I’ve done to run 100 million+ lines of data on KNIME Analytics Platform 4.0.0:
I have a lesser machine than your custom build: 64 GB RAM, 8-core/16-thread CPU, a consumer-grade NVMe SSD, running Windows 10.
- Install the latest AdoptOpenJDK 8 (for me it's 8.0.212.04-hotspot)
- Switch the GC to Shenandoah by editing the knime.ini file:
- have the JVM point to the latest AdoptOpenJDK
- switch -XX:+UseG1GC to -XX:+UseShenandoahGC
- add -XX:+UnlockExperimentalVMOptions, -XX:+AlwaysPreTouch, -XX:-UseBiasedLocking, -XX:+ExplicitGCInvokesConcurrent, -XX:+UseNUMA
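For anyone unsure where those flags go, the resulting knime.ini might look roughly like this (the `-vm` path and the `-Xmx` heap size are assumptions for a Windows AdoptOpenJDK install, not from the post; adjust both to your own setup):

```
-vm
C:\Program Files\AdoptOpenJDK\jdk-8.0.212.04-hotspot\bin\server\jvm.dll
-Xmx48g
-XX:+UnlockExperimentalVMOptions
-XX:+UseShenandoahGC
-XX:+AlwaysPreTouch
-XX:-UseBiasedLocking
-XX:+ExplicitGCInvokesConcurrent
-XX:+UseNUMA
```

Note that `-XX:+UnlockExperimentalVMOptions` must appear before `-XX:+UseShenandoahGC`, since Shenandoah was experimental in that JDK line.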
Hope this helps!
EDIT: I should add, this has stopped my workflows from hitting OOMs (out-of-memory errors).
Why does this make any difference? Sorry for the dumb question.