Random Forest on AWS EMR

Hello all,

When I run a workflow on about 614k rows, using Table to Spark, Spark Category To Number, Spark Partitioning and then the forest learner, the forest learner gets stuck around the 180th-190th job and never moves on. The EMR cluster status is alive and the other jobs take 8-15 minutes, so maybe the problem is in the KNIME client (4.0.1)? The job has been frozen for 16 hours or more, as the Duration column of the YARN application shows. The EMR cluster runs in YARN mode (cluster mode).

All suggestions are welcome.

Thanks,

Hi @sirev

Using the Table to Spark node for larger datasets is problematic because it gives you a DataFrame with only one partition, which is stored on the Spark driver. Hence any computation on that DataFrame is not parallel (only one partition) and happens on the Spark driver, not on the Spark executors. I suggest using the Spark Repartition node to increase the number of partitions (see the node description) and then writing the data to a Parquet file in HDFS. Then you can read it back in again, which ensures that you can do proper parallel computation.
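
For anyone scripting the same idea outside KNIME, here is a rough PySpark sketch of the repartition, write to Parquet, read back pattern. It is only an illustration and not what the KNIME nodes do internally. The helper names `suggest_partitions` and `repartition_via_parquet`, the HDFS path and the executor numbers are all made up for the example, and the "a few tasks per core" factor is a general rule of thumb, not a KNIME setting.

```python
def suggest_partitions(num_executors, cores_per_executor, tasks_per_core=3):
    """Rule-of-thumb partition count: a few tasks per available core.

    Hypothetical helper; the default factor of 3 is an assumption to tune.
    """
    return max(1, num_executors * cores_per_executor * tasks_per_core)


def repartition_via_parquet(spark, df, path, num_partitions):
    """Spread df over num_partitions, persist it as Parquet, and read it back.

    `spark` is a SparkSession and `df` a Spark DataFrame; `path` is any
    location the cluster can write to (for example an HDFS directory).
    """
    # repartition() shuffles the rows into num_partitions partitions
    df.repartition(num_partitions).write.mode("overwrite").parquet(path)
    # Reading the files back gives a DataFrame backed by the stored Parquet files
    return spark.read.parquet(path)


# Example usage on a cluster (assumed values, adjust to your setup):
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.getOrCreate()
#   n = suggest_partitions(num_executors=4, cores_per_executor=4)
#   df = repartition_via_parquet(spark, df, "hdfs:///tmp/forest_input", n)
#   print(df.rdd.getNumPartitions())
```

Checking `df.rdd.getNumPartitions()` before and after is a quick way to confirm that the data is no longer sitting in a single partition before training starts.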

Björn
