Cleaning the NYC taxi dataset on Spark

This workflow handles the preprocessing of the NYC taxi dataset (loading, cleaning, filtering, etc.). The NYC taxi dataset contains over 1 billion taxi trips in New York City between January 2009 and December 2017 and is provided by the NYC Taxi and Limousine Commission (TLC) [1]. It covers not only the regular yellow cabs, but also green taxis, which started in August 2013, and For-Hire Vehicles (e.g., Uber) starting from January 2015. Each taxi trip is recorded with information such as the pickup and dropoff locations, datetime, number of passengers, trip distance, fare amount, tip amount, etc.

Since the dataset was first published, the TLC has made several changes to it, e.g., renaming, adding, and removing some columns. Therefore, some preprocessing is needed before loading the data into the database. The goal of this workflow is to download the dataset from [1] and load it onto Spark for preprocessing. The preprocessing includes unifying the columns (names, values, datatypes), reverse geocoding (assigning GPS coordinates or location IDs to their corresponding taxi zones), and filtering out negative values that don't make sense. At the end, the cleaned data are stored on an Amazon S3 bucket in Parquet format, ready for further analysis.

[1] http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml
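As a rough illustration of the "unify columns" and "filter negative values" steps, here is a minimal plain-Python sketch of the kind of per-row logic the Spark job applies (on Spark itself this would typically be a chain of `withColumnRenamed` and `filter` calls). The column names and the rename mapping below are assumptions for illustration only; the actual TLC schema changed several times over the years.

```python
# Hypothetical rename map from one legacy schema to a unified schema.
# The real TLC column names vary by year and taxi type.
COLUMN_ALIASES = {
    "Fare_Amt": "fare_amount",
    "Trip_Distance": "trip_distance",
    "Passenger_Count": "passenger_count",
}

def unify_columns(row):
    """Map legacy column names onto one unified, lowercase schema."""
    return {COLUMN_ALIASES.get(k, k.lower()): v for k, v in row.items()}

def is_valid_trip(row):
    """Drop rows with negative values that don't make sense."""
    return (
        row.get("fare_amount", 0) >= 0
        and row.get("trip_distance", 0) >= 0
        and row.get("passenger_count", 0) >= 0
    )

raw = [
    {"Fare_Amt": 12.5, "Trip_Distance": 3.1, "Passenger_Count": 1},
    {"Fare_Amt": -4.0, "Trip_Distance": 1.0, "Passenger_Count": 1},
]
cleaned = [r for r in map(unify_columns, raw) if is_valid_trip(r)]
print(len(cleaned))  # → 1 (the negative-fare trip is filtered out)
```

The same predicate carries over to Spark essentially unchanged, e.g. `df.filter("fare_amount >= 0 AND trip_distance >= 0")` after the renames.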


This is a companion discussion topic for the original entry at https://kni.me/w/yZI74OtdOBVajpsT