Suggestions for the right approach to a taxi prediction problem

Hi colleagues,

I have imagined a real-world situation in which a taxi company needs some kind of predictive answers in order to improve its business. The idea is to create a KNIME workflow in which, given specific datetime data, the model can suggest the best area to go to in order to pick up more clients and improve profit; in general, the area with the highest pickup demand, depending on time constraints.

I have taken this example data from a past Kaggle competition whose goal was completely different; if I remember correctly, the goal was to predict the trip duration:

Anyway, I have taken this data because it has the most useful structure to start thinking about taxi data:

  • id - a unique identifier for each trip

  • vendor_id - a code indicating the provider associated with the trip record

  • pickup_datetime - date and time when the meter was engaged

  • dropoff_datetime - date and time when the meter was disengaged

  • passenger_count - the number of passengers in the vehicle (driver entered value)

  • pickup_longitude - the longitude where the meter was engaged

  • pickup_latitude - the latitude where the meter was engaged

  • dropoff_longitude - the longitude where the meter was disengaged

  • dropoff_latitude - the latitude where the meter was disengaged

  • store_and_fwd_flag - This flag indicates whether the trip record was held in vehicle memory before sending to the vendor because the vehicle did not have a connection to the server - Y=store and forward; N=not a store and forward trip

  • trip_duration - duration of the trip in seconds


So, considering the main goal of predicting which area is the best place to catch more clients depending on datetime parameters, I converted the latitude and longitude data into coordinates with the Palladian nodes, and then with k-medoids I divided the addresses into area groups using the Haversine distance. Then I reduced the dimensionality of the data by binning the time into intervals and by splitting the datetime into hour, day, month, and year.
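To make the clustering step concrete, here is a minimal, library-free sketch of the kind of grouping I mean: k-medoids over (lat, lon) pairs under the Haversine distance. All names are illustrative, not the internals of the KNIME/Palladian nodes; it is just the technique in miniature.

```python
import math
import random

def haversine(p, q):
    """Great-circle distance in km between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def k_medoids(points, k, iters=20, seed=0):
    """Naive PAM-style k-medoids; returns (medoids, labels)."""
    rng = random.Random(seed)
    medoids = rng.sample(points, k)
    labels = []
    for _ in range(iters):
        # Assign each point to its nearest medoid
        labels = [min(range(k), key=lambda i: haversine(p, medoids[i]))
                  for p in points]
        new_medoids = []
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if not members:
                new_medoids.append(medoids[i])
                continue
            # New medoid = member minimizing total distance to the others
            new_medoids.append(min(members,
                                   key=lambda m: sum(haversine(m, q)
                                                     for q in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, labels
```

The resulting integer label per trip is what becomes the "Cluster" column I mention below.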

Then I applied some transformations and column filtering to remove useless columns and to derive some additional information, such as the number of pickup trips for each hour of the day.
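As a sketch of the kind of aggregation I mean (pandas assumed, and the three-row sample is invented purely for illustration), counting pickups per cluster and hour of the day:

```python
import pandas as pd

# Hypothetical mini-sample with the same shape as the Kaggle columns,
# plus the cluster label produced by the k-medoids step
trips = pd.DataFrame({
    "id": ["id0", "id1", "id2"],
    "pickup_datetime": pd.to_datetime([
        "2016-03-14 17:24:55",
        "2016-03-14 17:43:01",
        "2016-06-12 00:43:35",
    ]),
    "cluster": [0, 0, 1],
})

# Split the datetime into the time features used for binning
trips["hour"] = trips["pickup_datetime"].dt.hour
trips["weekday"] = trips["pickup_datetime"].dt.dayofweek

# Demand per (cluster, hour): the quantity the model should learn to rank
demand = (trips.groupby(["cluster", "hour"])
               .size()
               .reset_index(name="pickup_count"))
```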

Now my problem is understanding the smartest approach to the classification part, in order to answer my predictive question.

I have thought about predicting the Cluster column as the class label for my learner. But of course I'm not sure whether that is a good idea, also because, considering my current data, I don't know exactly which columns I should pass to the learner as inputs. Probably I should integrate them with historical weather information, or maybe with information about events on a particular date, or a column that distinguishes holidays from working days.
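For the holiday/working-day idea, a tiny sketch of the feature I have in mind (the holiday set is invented for illustration; a real workflow would join a proper calendar table instead):

```python
from datetime import date

# Hypothetical holiday list, for illustration only
HOLIDAYS = {date(2016, 1, 1), date(2016, 7, 4), date(2016, 12, 25)}

def day_type(d: date) -> str:
    """Classify a date as 'holiday', 'weekend', or 'workday'."""
    if d in HOLIDAYS:
        return "holiday"
    if d.weekday() >= 5:  # Saturday=5, Sunday=6
        return "weekend"
    return "workday"
```

This categorical column could then be fed to the learner alongside the hour/day/month features.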

So, the first attempt with the Random Forest Learner node was not very good, because the tree representation I got was very strange. Every split of the tree was either 0% or 100%; I never saw a split with an intermediate percentage score. This makes me think I'm probably doing something wrong.
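To illustrate what I suspect is happening, here is a small synthetic sketch (scikit-learn assumed, all data invented): if a column that deterministically encodes the label, such as the pickup latitude that defined the cluster in the first place, is left among the inputs, every split becomes pure, and the scores look perfect in a meaningless way.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 400
hour = rng.integers(0, 24, n)
# Toy target: cluster loosely driven by hour, with 20% noise
cluster = ((hour // 8) + (rng.random(n) < 0.2)) % 3
# Latitude that essentially re-encodes the cluster label (the leak)
lat = 40.6 + 0.05 * cluster + rng.normal(0, 0.005, n)

X_leaky = np.column_stack([hour, lat])   # coordinates left in
X_clean = hour.reshape(-1, 1)            # time features only

leaky = cross_val_score(RandomForestClassifier(random_state=0),
                        X_leaky, cluster, cv=5).mean()
clean = cross_val_score(RandomForestClassifier(random_state=0),
                        X_clean, cluster, cv=5).mean()
```

With the leaking column, accuracy is near-perfect and the splits are all pure; without it, the score drops to what the time features can genuinely predict.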

Can someone help me understand the best approach to solve this kind of problem?

Or maybe tell me whether the data I chose are totally wrong for this prediction process? And if so, what kinds of columns would I need in order to build a correct model with a classifier, perhaps with an acceptable accuracy score?

Thanks in advance.



Hi giulio89,

This sounds like an interesting project! It looks like a geospatial analysis problem.

A couple of suggestions here:

- I guess that adding historical weather data and other event information would reasonably help you get better predictions;

- Regarding the location features: a discretization of the location data would probably help you use this feature as a target variable in your predictive model. Maybe this link could be helpful:

Hope that helps,