Taxi Demand Prediction with Spark Random Forest

This workflow uses a subset of the popular NYC taxi dataset and the Spark Random Forest node to train a simple time series model that predicts taxi demand in the next hour from the demand in the preceding hours. The input data is the number of NYC taxi trips per hour for each day of 2017.

To predict taxi demand at a given hour, we need the taxi demand in the previous N hours. The Spark Lag Column metanode creates these N lagged columns. The Find lag metanode builds a correlation matrix between the lagged columns, which can be inspected visually, and automatically finds the value of N whose lagged column has the highest correlation with the original column of total trips (taxi demand) per hour. A Random Forest model is then trained on those N lagged columns plus two additional temporal features: hour of day and day of week.

We also experimented with first-order differencing and seasonality removal, both common practice in time series prediction, to see whether they would improve our simple model. The results suggest that, for regular time series, a highly flexible algorithm like Random Forest often produces good results even when trained on the full time series, without seasonality removal.
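To make the lag step concrete, here is a minimal sketch in plain Python of the idea behind the Spark Lag Column and Find lag metanodes. It is not the workflow's implementation: the function names (`pearson`, `lag_correlations`, `best_lag`, `lagged_rows`) are hypothetical, the hourly counts are made up for illustration, and it assumes a small in-memory list rather than a Spark DataFrame.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def lag_correlations(series, max_lag):
    """Correlation between the series and each lagged copy (lags 1..max_lag)."""
    return {
        lag: pearson(series[lag:], series[:-lag])
        for lag in range(1, max_lag + 1)
    }


def best_lag(series, max_lag):
    """Lag whose lagged column correlates most strongly with the original."""
    corrs = lag_correlations(series, max_lag)
    return max(corrs, key=corrs.get)


def lagged_rows(series, n_lags):
    """Build (features, target) rows: the previous n_lags values and the current value."""
    return [
        (series[t - n_lags:t], series[t])
        for t in range(n_lags, len(series))
    ]


# Made-up hourly trip counts that repeat every 4 hours (illustration only).
trips = [10, 20, 30, 20] * 6

n = best_lag(trips, max_lag=6)   # lag with the highest correlation
rows = lagged_rows(trips, n)     # training rows built from the previous n hours
```

Each row in `rows` pairs the previous `n` hourly counts with the count to predict. In the workflow, features like these, together with hour of day and day of week, are what the Random Forest model is trained on.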
This is a companion discussion topic for the original entry at https://kni.me/w/xD3HT1dyXH6jnmCb