Big Data join

mlauber71 · May 14, 2021, 2:38pm

@Daniel_Weikert the local Big Data environment in KNIME runs on a single machine and is therefore limited to the resources of that machine. I do not know of any way to run it on more than one instance; typically, that would be the job of a ‘real’ big data system such as Cloudera or something else built on the Apache stack.

Also, Big Data is not magic. It exists so that you can scale operations to a potentially unlimited amount of data: the individual nodes within a Big Data system execute their jobs independently, and the results are collected afterwards. So you should consider Big Data when your data is so large that ‘normal’ databases can no longer handle it.
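To make the "independent jobs, results collected later" idea concrete for a join, here is a minimal sketch in plain Python. It is not how KNIME, Hive or Spark implement joins internally; the function names (`partition`, `join_partition`, `distributed_join`) and the sample tables are made up for illustration. The point is only that once both tables are split by the hash of the join key, each partition can be joined without looking at any other partition.

```python
from collections import defaultdict


def partition(rows, key, n_parts):
    """Hash-partition rows so that equal keys land in the same partition."""
    parts = [[] for _ in range(n_parts)]
    for row in rows:
        parts[hash(row[key]) % n_parts].append(row)
    return parts


def join_partition(left_rows, right_rows, key):
    """Inner hash join of one partition; needs no data from other partitions."""
    index = defaultdict(list)
    for r in right_rows:
        index[r[key]].append(r)
    return [{**l, **r} for l in left_rows for r in index.get(l[key], [])]


def distributed_join(left, right, key, n_parts=4):
    """Partition both sides, join each pair of partitions, collect the results."""
    left_parts = partition(left, key, n_parts)
    right_parts = partition(right, key, n_parts)
    result = []
    # In a real cluster every iteration of this loop could run on a different node.
    for lp, rp in zip(left_parts, right_parts):
        result.extend(join_partition(lp, rp, key))
    return result


# Hypothetical sample data
customers = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}, {"id": 3, "name": "Cy"}]
orders = [{"id": 1, "amount": 10}, {"id": 1, "amount": 5}, {"id": 3, "amount": 7}]

joined = distributed_join(customers, orders, "id")
print(sorted((r["id"], r["name"], r["amount"]) for r in joined))
```

On a single machine all partitions still share the same CPU and memory, which is why the local environment does not give you the scaling benefit even though the logic is the same.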

So you can use Hive technology with the local environment, but you would still be limited to the resources of your machine.

I would use the local Big Data environment to develop use cases on my machine and then deploy them to a ‘real’ big data environment. An example is given here:

Also, the local environment does not provide PySpark. You would have to install and set that up separately.

If you are interested in further information about these subjects, you could explore this collection:

Here is a collection of methods you could execute from KNIME on a Cloudera Big Data system: