Using PMML Predictor over Hadoop

Hello! :)

I’d like to know whether it is possible to use Hadoop’s distributed capabilities and cluster resources to run predictions in parallel when using the PMML Predictor.

For instance, suppose I connect to Hadoop in some way and also load a PMML model. What I want to do is send a request for Hadoop to score the data (in a Hive table, for example) using the cluster resources, taking advantage of the map phase to work in parallel.

Is it clear?

Thank you!

Hi,

You can use the "Spark PMML Model Predictor" node that is part of the (commercial) KNIME Spark Executor extension:

https://www.knime.org/knime-spark-executor#install

You use this node to run predictions on large volumes of data using a PMML model that you have trained with other KNIME nodes. Spark then automatically parallelizes the prediction over the partitions of the Spark RDD that holds the data.
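
To make the partition idea a bit more concrete, here is a rough plain-Python sketch. It is not Spark or KNIME code, and it does not read a real PMML file: `WEIGHTS`, `INTERCEPT`, `score_row` and the other names are hypothetical stand-ins, and a thread pool plays the role of the cluster. It only illustrates why scoring parallelizes well: each partition can be scored without looking at any other partition.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a model loaded from a PMML file: a fixed linear
# score. A real setup would evaluate the PMML document instead.
WEIGHTS = {"x1": 0.5, "x2": 0.25}
INTERCEPT = 1.0


def score_row(row):
    """Score one record with the stand-in model."""
    return INTERCEPT + sum(WEIGHTS[name] * row[name] for name in WEIGHTS)


def split_into_partitions(rows, num_partitions):
    """Split rows into contiguous chunks, similar in spirit to RDD partitions."""
    size = max(1, -(-len(rows) // num_partitions))  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def score_partition(partition):
    """Score every row of one partition; needs no data from other partitions."""
    return [score_row(row) for row in partition]


def predict_in_parallel(rows, num_partitions=4):
    """Score partitions concurrently and return the scores in input order."""
    partitions = split_into_partitions(rows, num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        scored = pool.map(score_partition, partitions)  # map keeps input order
        return [score for part in scored for score in part]


rows = [{"x1": 2, "x2": 4}, {"x1": 4, "x2": 8}, {"x1": 6, "x2": 12}]
scores = predict_in_parallel(rows, num_partitions=2)
```

Because the model is only read and every row is scored independently, the result is the same as scoring the rows one after another; the partitioning only changes how the work is spread out.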

Best,

Björn