Suggestions for applying the right approach to a taxi prediction problem

gujodm · September 22, 2017, 10:54am

Hi colleagues,

I have imagined a real-world situation in which a taxi company needs predictive answers in order to improve its business. The idea is to create a KNIME workflow in which, given a specific date and time, the model suggests the best area to go to in order to pick up more clients and improve profit: in general, the area with the highest demand in terms of pickup requests for that time window.

I took this example data from a past Kaggle competition whose goal was completely different; if I remember correctly, the goal was to predict the trip duration:

https://www.kaggle.com/c/nyc-taxi-trip-duration/data

Anyway, I chose this data because its structure is a useful starting point for thinking about taxi data:

id - a unique identifier for each trip
vendor_id - a code indicating the provider associated with the trip record
pickup_datetime - date and time when the meter was engaged
dropoff_datetime - date and time when the meter was disengaged
passenger_count - the number of passengers in the vehicle (driver entered value)
pickup_longitude - the longitude where the meter was engaged
pickup_latitude - the latitude where the meter was engaged
dropoff_longitude - the longitude where the meter was disengaged
dropoff_latitude - the latitude where the meter was disengaged
store_and_fwd_flag - This flag indicates whether the trip record was held in vehicle memory before sending to the vendor because the vehicle did not have a connection to the server - Y=store and forward; N=not a store and forward trip
trip_duration - duration of the trip in seconds

So, considering the main goal of predicting which area is the best choice for catching more clients depending on date and time parameters, I converted the latitude and longitude data into coordinates with the Palladian nodes, and then used k-medoids with the Haversine distance to group the pickup locations into areas. Then I reduced the dimensionality of the data by binning time intervals and by splitting the timestamp into hour, day, month and year.
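
As a rough sketch of the two steps above (not the actual KNIME nodes), the Haversine distance and the date/time split could look like this in Python. The function names, the Earth radius constant and the "YYYY-MM-DD HH:MM:SS" timestamp format are assumptions made for illustration.

```python
import math
from datetime import datetime

# Mean Earth radius in kilometres (assumed value for this sketch).
EARTH_RADIUS_KM = 6371.0088


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def time_features(pickup_datetime):
    """Split a timestamp string into year, month, weekday (Monday = 0) and hour."""
    ts = datetime.strptime(pickup_datetime, "%Y-%m-%d %H:%M:%S")
    return {"year": ts.year, "month": ts.month,
            "weekday": ts.weekday(), "hour": ts.hour}
```

A distance function like haversine_km is what a k-medoids implementation would call for each pair of pickup points, and time_features gives the coarse time columns that replace the raw timestamp.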

Then I applied some transformations and column filtering to remove useless columns and to derive some additional information, such as the number of pickups for each hour of the day.
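
To make the "pickups per hour" idea concrete, here is a minimal Python sketch of the aggregation. The sample rows, the cluster labels and both function names are hypothetical; in the workflow the same result would come from a grouping step on the cluster and time columns.

```python
from collections import Counter
from datetime import datetime

# Hypothetical sample rows: (cluster label, pickup_datetime). Not real data.
trips = [
    ("cluster_0", "2016-03-14 17:24:55"),
    ("cluster_0", "2016-03-14 17:40:10"),
    ("cluster_1", "2016-03-14 17:05:00"),
    ("cluster_1", "2016-03-14 08:15:30"),
]


def pickups_per_hour(rows):
    """Count pickups for each (cluster, weekday, hour) combination."""
    counts = Counter()
    for cluster, stamp in rows:
        ts = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        counts[(cluster, ts.weekday(), ts.hour)] += 1
    return counts


def busiest_cluster(counts, weekday, hour):
    """Return the cluster with the most pickups in a time slot, or None if unseen."""
    candidates = {c: n for (c, d, h), n in counts.items()
                  if d == weekday and h == hour}
    return max(candidates, key=candidates.get) if candidates else None


counts = pickups_per_hour(trips)
best = busiest_cluster(counts, weekday=0, hour=17)
```

With counts like these, the question "which area has the highest demand at this time" becomes a lookup or a ranking over the aggregated table, which may be a useful baseline before training any learner.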

Now my problem is to understand which is the smartest approach for the classification part, in order to answer my predictive question.

I have thought about predicting the Cluster column as the class label for my learner. But of course I'm not sure whether this is a good idea, also because with my current data I don't know exactly which columns I should pass to the learner. Probably I should integrate them with historical weather information, or maybe with information about events on a particular date, or with a column that identifies holidays and working days.

So, the first attempt with the Random Forest Learner node was not so good, because the tree representation looked very strange. Every split of the tree was either 0% or 100%; I never encountered a split with a different percentage. This makes me think that I'm probably doing something wrong.

Can someone help me understand what the best approach is for solving this kind of problem?

Or maybe tell me whether the data I chose is totally wrong for this prediction process? And if so, what kind of columns would I need in order to build a correct model with a classifier, ideally with an acceptable accuracy score?

Thanks in advance.

-Giulio

Vincenzo · September 29, 2017, 3:17am

Hi giulio89,

This sounds like an interesting project! It looks like a geospatial analysis problem.

A couple of suggestions here:

- I guess that adding historical weather data and information about other events would reasonably help you get better predictions;

- Regarding the location features: a discretization of the location data would probably help you use this feature as a target variable in your predictive model. Maybe this link could be helpful: https://en.wikipedia.org/wiki/Geohash
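
To illustrate the discretization idea, here is a small Python sketch of a geohash encoder following the scheme described on that page. The function name and the default precision are my own choices for illustration, not part of any KNIME node.

```python
# Geohash base-32 alphabet (digits plus letters, without a, i, l, o).
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat, lon, precision=6):
    """Encode a (lat, lon) pair in degrees as a geohash string of `precision` characters."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits, bit_count = 0, 0
    use_lon = True  # bits are interleaved, starting with longitude
    while len(chars) < precision:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid  # keep the upper half of the interval
        else:
            bits <<= 1
            rng[1] = mid  # keep the lower half of the interval
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)


cell = geohash_encode(42.6, -5.6, precision=5)  # "ezs42"
```

Pickups that fall in the same grid cell get the same string, so the geohash (at a chosen precision) can serve as a categorical area label in place of, or next to, the cluster column.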

Hope that helps,

Best,

Vincenzo