One of our clients is setting up an on-premises environment for us for a big data (2 TB) analytics project. The project is a proof of concept. I need to provide the specifications for the machine. We are planning to work with KNIME Analytics Platform, Anaconda and Spark, and to access this dedicated space via remote desktop.
Could you please advise?
Yes, I am aware that this isn’t much of a specification, sorry. We figured we’d use a machine with 50 GB of memory, put a couple of VMs on it and set up the environment that way. I have no idea what tasks will come with this dataset – from ETL to ML – as I haven’t seen the data yet. I just want to be sure there is enough memory to do whatever needs to be done.
Thanks,
Ribizli
Thanks for following up. In my experience, 50 GB of memory will get you quite far, though I have also seen setups with 100+ GB.
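To make the sizing a bit more concrete, here is a minimal sketch of how Spark’s share of such a box might be configured when running in local mode. The figures, app name and core count are purely illustrative assumptions, not a recommendation for your workload:

```python
from pyspark.sql import SparkSession

# Illustrative sizing sketch only -- the figures below are assumptions, not a recommendation.
# In local mode the driver JVM does all the work, so its memory setting is the one that matters;
# leave headroom for the OS, KNIME and Anaconda running on the same VM.
spark = (
    SparkSession.builder
    .appName("poc-sizing-sketch")              # hypothetical app name
    .master("local[*]")                        # use all cores of the VM
    .config("spark.driver.memory", "32g")      # e.g. ~32 GB of a 50 GB box given to Spark
    .getOrCreate()
)

print(spark.sparkContext.getConf().get("spark.driver.memory"))
```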
In any case, memory demand is something that can partly be addressed in workflow design. For example, you can use chunked loops to process very large datasets in pieces, which reduces the memory load at any single point in time (a minimal sketch of the idea follows below). Additionally, demanding workflows can be scheduled at quiet or staggered times.
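To illustrate the chunking idea outside of KNIME, here is a minimal pandas sketch that processes a file in fixed-size pieces; the file name, column names and chunk size are all hypothetical:

```python
import pandas as pd

# Hypothetical input file, chunk size and column names -- adjust to your data and available RAM.
INPUT_CSV = "transactions.csv"
CHUNK_ROWS = 1_000_000

running_totals = {}

# Read and process the file in fixed-size pieces so that only one chunk
# is held in memory at any point in time.
for chunk in pd.read_csv(INPUT_CSV, chunksize=CHUNK_ROWS):
    # Example per-chunk work: aggregate a numeric column by a key column.
    grouped = chunk.groupby("customer_id")["amount"].sum()
    for key, value in grouped.items():
        running_totals[key] = running_totals.get(key, 0.0) + value

print(f"Aggregated {len(running_totals)} keys without loading the full file at once.")
```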
Workflow designers should also be informed about the server’s capacity so that they can test with a subset of the data on their own machines and see how costly their workflow is.
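As a rough sketch of how such a test subset could be pulled without loading the full file, again with purely hypothetical file names and fractions:

```python
import pandas as pd

# Hypothetical file names and sampling fraction -- adjust as needed.
FULL_DATA = "full_dataset.csv"
SAMPLE_OUT = "sample_for_local_testing.csv"

pieces = []
# Sample ~1% of each chunk so the full file never has to fit in memory at once.
for chunk in pd.read_csv(FULL_DATA, chunksize=500_000):
    pieces.append(chunk.sample(frac=0.01, random_state=42))

sample = pd.concat(pieces)
sample.to_csv(SAMPLE_OUT, index=False)
print(f"Wrote {len(sample)} sampled rows to {SAMPLE_OUT} for local testing.")
```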
Should you ever run into memory issues, please also check the hard disk – if the partition being written to is full, jobs and data may be kept in memory instead.
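A quick way to keep an eye on that, sketched here with Python’s standard library and a hypothetical mount point:

```python
import shutil

# Hypothetical mount point of the partition KNIME/Spark write to -- adjust to your setup.
PARTITION = "/data"

usage = shutil.disk_usage(PARTITION)
free_gb = usage.free / 1024**3
print(f"{PARTITION}: {free_gb:.1f} GB free of {usage.total / 1024**3:.1f} GB")

# A nearly full partition is a warning sign: jobs and intermediate data
# may end up being held in memory instead of being written out to disk.
if free_gb < 50:
    print("Warning: less than 50 GB free -- consider freeing space before running large workflows.")
```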