Forecasting Pickup Densities For Yellow Taxi Services In New York City:


In [1]:
#Use dask to read only necessary columns at a time; speeds up EDA
import dask.dataframe as dd  #similar to pandas
import pandas as pd  #pandas to create small dataframes
import folium  #open street map
import datetime  #convert to unix time
import time  #convert to unix time
import numpy as np  #arithmetic operations on arrays
import seaborn as sns  #plots
import matplotlib.pylab as plt  #plots
import matplotlib  #plots
from matplotlib import rcParams  #size of plots
import gpxpy.geo  #haversine distance
from sklearn.cluster import MiniBatchKMeans, KMeans  #clustering
import math
import pickle
import os
mingw_path = 'C:\\Program Files\\mingw-w64\\x86_64-5.3.0-posix-seh-rt_v4-rev0\\mingw64\\bin'
os.environ['PATH'] = mingw_path + ';' + os.environ['PATH']
import xgboost as xgb
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
In [2]:
#There are 19 columns
month = dd.read_csv('yellow_tripdata_2015-01.csv')
print(month.columns)
Index(['VendorID', 'tpep_pickup_datetime', 'tpep_dropoff_datetime',
       'passenger_count', 'trip_distance', 'pickup_longitude',
       'pickup_latitude', 'RateCodeID', 'store_and_fwd_flag',
       'dropoff_longitude', 'dropoff_latitude', 'payment_type', 'fare_amount',
       'extra', 'mta_tax', 'tip_amount', 'tolls_amount',
       'improvement_surcharge', 'total_amount'],
      dtype='object')

Data Information

    The yellow and green taxi trip records include fields capturing
  • pick-up and drop-off dates/times,
  • pick-up and drop-off locations,
  • trip distances,
  • itemized fares,
  • rate types,
  • payment types,
  • driver-reported passenger counts.

The data in the attached datasets were collected and provided to the NYC Taxi and Limousine Commission (TLC).

Information on taxis:

Yellow Taxi: Yellow Medallion Taxicabs

These are the famous NYC yellow taxis that provide transportation exclusively through street-hails. The number of taxicabs is limited by a finite number of medallions issued by the TLC. You access this mode of transportation by standing in the street and hailing an available taxi with your hand. The pickups are not pre-arranged.

For Hire Vehicles (FHVs)

FHV transportation is accessed by a pre-arrangement with a dispatcher or limo company. These FHVs are not permitted to pick up passengers via street hails, as those rides are not considered pre-arranged.

Green Taxi: Street Hail Livery (SHL)

The SHL program will allow livery vehicle owners to license and outfit their vehicles with green borough taxi branding, meters, credit card machines, and ultimately the right to accept street hails in addition to pre-arranged rides.

Credits: Quora

Footnote:
In this notebook we consider only the yellow taxi trips for the year 2015.

Data Collection

We have collected all yellow taxi trip data from Jan 2015 to Dec 2016 (only the 2015 data will be used).

file name                  file size  number of records  number of features
yellow_tripdata_2016-01    1.59 GB    10906858           19
yellow_tripdata_2016-02    1.66 GB    11382049           19
yellow_tripdata_2016-03    1.78 GB    12210952           19
yellow_tripdata_2016-04    1.74 GB    11934338           19
yellow_tripdata_2016-05    1.73 GB    11836853           19
yellow_tripdata_2016-06    1.62 GB    11135470           19
yellow_tripdata_2016-07    884 MB     10294080           17
yellow_tripdata_2016-08    854 MB     9942263            17
yellow_tripdata_2016-09    870 MB     10116018           17
yellow_tripdata_2016-10    933 MB     10854626           17
yellow_tripdata_2016-11    868 MB     10102128           17
yellow_tripdata_2016-12    897 MB     10449408           17
yellow_tripdata_2015-01    1.84 GB    12748986           19
yellow_tripdata_2015-02    1.81 GB    12450521           19
yellow_tripdata_2015-03    1.94 GB    13351609           19
yellow_tripdata_2015-04    1.90 GB    13071789           19
yellow_tripdata_2015-05    1.91 GB    13158262           19
yellow_tripdata_2015-06    1.79 GB    12324935           19
yellow_tripdata_2015-07    1.68 GB    11562783           19
yellow_tripdata_2015-08    1.62 GB    11130304           19
yellow_tripdata_2015-09    1.63 GB    11225063           19
yellow_tripdata_2015-10    1.79 GB    12315488           19
yellow_tripdata_2015-11    1.65 GB    11312676           19
yellow_tripdata_2015-12    1.67 GB    11460573           19

Features in the dataset:

Field Name Description
VendorID A code indicating the TPEP provider that provided the record.
  1. Creative Mobile Technologies
  2. VeriFone Inc.
tpep_pickup_datetime The date and time when the meter was engaged.
tpep_dropoff_datetime The date and time when the meter was disengaged.
Passenger_count The number of passengers in the vehicle. This is a driver-entered value.
Trip_distance The elapsed trip distance in miles reported by the taximeter.
Pickup_longitude Longitude where the meter was engaged.
Pickup_latitude Latitude where the meter was engaged.
RateCodeID The final rate code in effect at the end of the trip.
  1. Standard rate
  2. JFK
  3. Newark
  4. Nassau or Westchester
  5. Negotiated fare
  6. Group ride
Store_and_fwd_flag This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka “store and forward,” because the vehicle did not have a connection to the server. Y= store and forward trip N= not a store and forward trip
Dropoff_longitude Longitude where the meter was disengaged.
Dropoff_latitude Latitude where the meter was disengaged.
Payment_type A numeric code signifying how the passenger paid for the trip.
  1. Credit card
  2. Cash
  3. No charge
  4. Dispute
  5. Unknown
  6. Voided trip
Fare_amount The time-and-distance fare calculated by the meter.
Extra Miscellaneous extras and surcharges. Currently this only includes the $0.50 and $1 rush hour and overnight charges.
MTA_tax $0.50 MTA tax that is automatically triggered based on the metered rate in use.
Improvement_surcharge $0.30 improvement surcharge assessed on trips at the flag drop. The improvement surcharge began being levied in 2015.
Tip_amount Tip amount. This field is automatically populated for credit card tips. Cash tips are not included.
Tolls_amount Total amount of all tolls paid in the trip.
Total_amount The total amount charged to passengers. Does not include cash tips.
In [3]:
samples = 12748986+12450521+13351609+13071789+13158262+12324935+11562783+11130304+11225063+12315488+11312676+11460573
print(samples)
146112989

Total number of samples for the year of 2015:

  • 146 million

Problem statement

Find the pickup density given a time and a region.

We will take the 2015 data (12 months) as training data and test our model on the 2016 data.

EDA for the months JAN - DEC

Eventful days in January:

  1. Thursday :January 01 New Years Day
  2. Monday :January 19 Martin Luther King Jr. Day Third Monday in January
  3. Sunday-Tuesday :January 25-27 Blizzard shuts down city

Eventful days in February:

  1. Thursday :February 12 Lincoln's Birthday
  2. Monday :February 16 Presidents' Day

Eventful days in March:

  1. No repeating events
In [4]:
month.head(5)
Out[4]:
VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance pickup_longitude pickup_latitude RateCodeID store_and_fwd_flag dropoff_longitude dropoff_latitude payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount
0 2 2015-01-15 19:05:39 2015-01-15 19:23:42 1 1.59 -73.993896 40.750111 1 N -73.974785 40.750618 1 12.0 1.0 0.5 3.25 0.0 0.3 17.05
1 1 2015-01-10 20:33:38 2015-01-10 20:53:28 1 3.30 -74.001648 40.724243 1 N -73.994415 40.759109 1 14.5 0.5 0.5 2.00 0.0 0.3 17.80
2 1 2015-01-10 20:33:38 2015-01-10 20:43:41 1 1.80 -73.963341 40.802788 1 N -73.951820 40.824413 2 9.5 0.5 0.5 0.00 0.0 0.3 10.80
3 1 2015-01-10 20:33:39 2015-01-10 20:35:31 1 0.50 -74.009087 40.713818 1 N -74.004326 40.719986 2 3.5 0.5 0.5 0.00 0.0 0.3 4.80
4 1 2015-01-10 20:33:39 2015-01-10 20:52:58 1 3.00 -73.971176 40.762428 1 N -74.004181 40.742653 2 15.0 0.5 0.5 0.00 0.0 0.3 16.30

Data Cleaning:

1. Removing the points outside of New York:

New York bounding box:

  1. min_lat <- 40.5774
  2. max_lat <- 40.9176
  3. min_long <- -74.15
  4. max_long <- -73.7004
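The bounding-box check can be sketched as a small standalone helper (the names `NYC_BBOX` and `in_nyc_bbox` are mine, not the notebook's):

```python
#Hypothetical helper: the NYC bounding box from the list above
NYC_BBOX = {'min_lat': 40.5774, 'max_lat': 40.9176,
            'min_long': -74.15, 'max_long': -73.7004}

def in_nyc_bbox(lat, lon, bbox=NYC_BBOX):
    #True only when the point lies inside the bounding box
    return (bbox['min_lat'] <= lat <= bbox['max_lat'] and
            bbox['min_long'] <= lon <= bbox['max_long'])
```

A (0, 0) point, a common GPS glitch, falls outside the box and is treated as an outlier.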
In [5]:
outlier_locations = month[((month.pickup_longitude != 0) & (month.pickup_latitude != 0)) &
                          ((month.pickup_longitude <= -74.15) | (month.pickup_latitude <= 40.5774) |
                           (month.pickup_longitude >= -73.7004) | (month.pickup_latitude >= 40.9176))]
#Optional
#Comment out if necessary
map_osm = folium.Map(location=[40.734695, -73.990372], tiles='Stamen Toner')
sample_locations = outlier_locations.head(100)
for i,j in sample_locations.iterrows():
    folium.Marker(list((j['pickup_latitude'],j['pickup_longitude']))).add_to(map_osm)
map_osm
C:\Users\JackSparrow\AppData\Local\Continuum\Anaconda3\lib\site-packages\dask\dataframe\core.py:3804: UserWarning: Insufficient elements for `head`. 100 elements requested, only 69 elements available. Try passing larger `npartitions` to `head`.
  warnings.warn(msg.format(n, len(r)))
Out[5]:

As you can see above, there are some points just outside the boundary, but many of them point to South America, Mexico, or Canada.

2. Removing the outlier trips based on trip durations:

TLC Regulations:

  • According to TLC regulations, the maximum allowed trip duration in a 24-hour interval is 12 hours.
  • This means the maximum allowed work hours for a driver is 12 hours.
  • There are 24k trips which either violate this rule or have trip durations of zero or less (weird).
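The 12-hour rule can be sketched as a NumPy filter on toy durations in minutes (720 minutes = 12 hours); the real filter would run on the dataframe:

```python
import numpy as np

#Toy trip durations in minutes, including impossible and outlier values
durations = np.array([-5.0, 0.0, 9.95, 23.3, 719.9, 720.0, 23800.0])
#Keep strictly positive trips shorter than 12 hours (720 minutes)
valid = durations[(durations > 0) & (durations < 720)]
```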
In [6]:
def convert_to_unix(s):
    return time.mktime(datetime.datetime.strptime(s, "%Y-%m-%d %H:%M:%S").timetuple())

#return a dataframe with many parameters along with trip duration and trip pickup times(with outliers)
def return_with_trip_times():
    duration = month[['tpep_pickup_datetime','tpep_dropoff_datetime']].compute()
    #pickups and dropoffs to unix time
    duration_pickup = [convert_to_unix(x) for x in duration['tpep_pickup_datetime'].values]
    duration_drop = [convert_to_unix(x) for x in duration['tpep_dropoff_datetime'].values]
    #calculate duration of trips
    durations = (np.array(duration_drop) - np.array(duration_pickup))/float(60)

    #append durations of trips and pickup times in unix to a new dataframe
    new_frame = month[['VendorID','payment_type','passenger_count','trip_distance','pickup_longitude','pickup_latitude','dropoff_longitude','dropoff_latitude','RateCodeID','store_and_fwd_flag']].compute()
    
    new_frame['trip_times'] = durations
    new_frame['pickup_times'] = duration_pickup
    new_frame['Speed'] = 60*(new_frame['trip_distance']/new_frame['trip_times'])
    
    return new_frame
#duration_dropoff

#outliers aren't still removed
#dataframe with durations and pickup unix time appended
frame_with_durations = return_with_trip_times()
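Note that `convert_to_unix` goes through `time.mktime`, so the resulting timestamps depend on the machine's local timezone. A quick round-trip check (my sketch, not part of the notebook) makes that behavior explicit:

```python
import datetime
import time

def convert_to_unix(s):
    #Same conversion as the notebook: local-timezone seconds since the epoch
    return time.mktime(datetime.datetime.strptime(s, "%Y-%m-%d %H:%M:%S").timetuple())

ts = convert_to_unix("2015-01-15 19:05:39")
#Converting back through the local timezone recovers the wall-clock time
recovered = datetime.datetime.fromtimestamp(ts)
```

Because pickup and dropoff are converted the same way, the timezone offset cancels when durations are computed as differences.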
In [7]:
def plot_boxplot(new_frame, string1, string2):
    var = new_frame[string1].values
    var = np.sort(var,axis = None)
    if string2!="":
        plt.ylim(0, var[int(float(95)*len(var)/100)])
        sns.boxplot(y= string1, data = new_frame)
        plt.title(string1+' in minutes without outliers')
        plt.show()
    return var
    
def remove_outliers(new_frame):
    print ("Original:",new_frame.shape[0])
    a = new_frame.shape[0]
    new_frame = new_frame[((new_frame.dropoff_longitude >= -74.15) & (new_frame.dropoff_longitude <= -73.7004) &\
                       (new_frame.dropoff_latitude >= 40.5774) & (new_frame.dropoff_latitude <= 40.9176)) & \
                       ((new_frame.pickup_longitude >= -74.15) & (new_frame.pickup_latitude >= 40.5774)& \
                       (new_frame.pickup_longitude <= -73.7004) & (new_frame.pickup_latitude <= 40.9176))]
    print ("Removing outliers from NY boundaries:",new_frame.shape[0],(a - new_frame.shape[0]))
    b = new_frame.shape[0]

    print ("\ntrip times percentiles...")
    var = plot_boxplot(new_frame, "trip_times", "")
    for i in range(0,len(var),int(float(10)*len(var)/100)):
        print ("{}th percentile {}".format(round(float(i)/len(var),3)*100,round(var[i],3)))

    print(new_frame.shape, "before first")
#     new_frame = new_frame[(new_frame.trip_times > 0) & (new_frame.trip_times < 720)]
    print(new_frame.shape, "first", new_frame.head())
    print ("Removing outliers from trip times:",new_frame.shape[0],(b - new_frame.shape[0]))
    var = plot_boxplot(new_frame, "trip_times", "out")
    print(new_frame.shape, "second")
#--------------------------------------------------------------
    print ("\ntrip distance percentiles...")
    var = plot_boxplot(new_frame, "trip_distance", "")
    for i in range(0,len(var),int(float(10)*len(var)/100)):
        print ("{}th percentile {}".format(round(float(i)/len(var),3)*100,round(var[i],3)))
                                   
    c = new_frame.shape[0]
    new_frame = new_frame[(new_frame.trip_distance > 0) & (new_frame.trip_distance < 23)]
    print ("Removing outliers from trip distance:",new_frame.shape[0],(c - new_frame.shape[0]))
    var = plot_boxplot(new_frame,"trip_distance", "out")
    
#--------------------------------------------------------------

    print ("\nSpeed percentiles...")
    var = plot_boxplot(new_frame, "Speed", "")  
    for i in range(0,len(var),int(float(10)*len(var)/100)):
        print ("{}th percentile {}".format(round(float(i)/len(var),3)*100,round(var[i],3)))
                                   
    d = new_frame.shape[0]
    new_frame = new_frame[(new_frame.Speed <= 65) & (new_frame.Speed >= 0)]
    print ("Removing outliers from trip speed:",new_frame.shape[0],(d - new_frame.shape[0]))
    var = plot_boxplot(new_frame, "Speed", "out")
    
#--------------------------------------------------------------   
    print( "Total outliers removed",a - new_frame.shape[0])
    return new_frame

frame_with_durations_outliers_removed = remove_outliers(frame_with_durations)
Original: 12748986
Removing outliers from NY boundaries: 12455067 293919

trip times percentiles...
0.0th percentile -1211.017
10.0th percentile 3.867
20.0th percentile 5.4
30.0th percentile 6.833
40.0th percentile 8.317
50.0th percentile 9.95
60.0th percentile 11.867
70.0th percentile 14.267
80.0th percentile 17.583
90.0th percentile 23.333
100.0th percentile 23800.017
(12455067, 13) before first
(12455067, 13) first    VendorID  payment_type  passenger_count  trip_distance  pickup_longitude  \
0         2             1                1           1.59        -73.993896   
1         1             1                1           3.30        -74.001648   
2         1             2                1           1.80        -73.963341   
3         1             2                1           0.50        -74.009087   
4         1             2                1           3.00        -73.971176   

   pickup_latitude  dropoff_longitude  dropoff_latitude  RateCodeID  \
0        40.750111         -73.974785         40.750618           1   
1        40.724243         -73.994415         40.759109           1   
2        40.802788         -73.951820         40.824413           1   
3        40.713818         -74.004326         40.719986           1   
4        40.762428         -74.004181         40.742653           1   

  store_and_fwd_flag  trip_times  pickup_times      Speed  
0                  N   18.050000  1.421329e+09   5.285319  
1                  N   19.833333  1.420902e+09   9.983193  
2                  N   10.050000  1.420902e+09  10.746269  
3                  N    1.866667  1.420902e+09  16.071429  
4                  N   19.316667  1.420902e+09   9.318378  
Removing outliers from trip times: 12455067 0
(12455067, 13) second

trip distance percentiles...
0.0th percentile 0.0
10.0th percentile 0.65
20.0th percentile 0.9
30.0th percentile 1.1
40.0th percentile 1.37
50.0th percentile 1.68
60.0th percentile 2.06
70.0th percentile 2.6
80.0th percentile 3.57
90.0th percentile 5.9
100.0th percentile 8000016.5
Removing outliers from trip distance: 12399797 55270
Speed percentiles...
0.0th percentile -0.228
10.0th percentile 6.413
20.0th percentile 7.811
30.0th percentile 8.929
40.0th percentile 9.98
50.0th percentile 11.066
60.0th percentile 12.281
70.0th percentile 13.785
80.0th percentile 15.939
90.0th percentile 20.104
100.0th percentile inf
Removing outliers from trip speed: 12391863 7934
Total outliers removed 357123
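The cutoffs above (720 minutes, 23 miles, 65 mph) are hand-picked from the percentile printouts; an alternative sketch (mine, on synthetic heavy-tailed data) derives a fence from the 99th percentile instead:

```python
import numpy as np

#Synthetic heavy-tailed sample: 9900 plausible values plus 100 huge outliers
rng = np.random.default_rng(0)
dist = np.concatenate([rng.exponential(2.0, 9900), np.full(100, 8_000_000.0)])
#Fence at the 99th percentile instead of a hand-picked constant
p50, p99 = np.percentile(dist, [50, 99])
kept = dist[dist <= p99]
```

With 1% of points being extreme, the 99th-percentile fence drops exactly the injected outliers here; on the real data a fence like this would still need a sanity check against the printed percentiles.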
In [8]:
frame_with_durations_outliers_removed.head()
Out[8]:
VendorID payment_type passenger_count trip_distance pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude RateCodeID store_and_fwd_flag trip_times pickup_times Speed
0 2 1 1 1.59 -73.993896 40.750111 -73.974785 40.750618 1 N 18.050000 1.421329e+09 5.285319
1 1 1 1 3.30 -74.001648 40.724243 -73.994415 40.759109 1 N 19.833333 1.420902e+09 9.983193
2 1 2 1 1.80 -73.963341 40.802788 -73.951820 40.824413 1 N 10.050000 1.420902e+09 10.746269
3 1 2 1 0.50 -74.009087 40.713818 -74.004326 40.719986 1 N 1.866667 1.420902e+09 16.071429
4 1 2 1 3.00 -73.971176 40.762428 -74.004181 40.742653 1 N 19.316667 1.420902e+09 9.318378
In [9]:
import seaborn as sns
#distribution plots
def dist_of_params(frame,variable,title):
    sns.FacetGrid(frame,size=6) \
      .map(sns.kdeplot, variable) \
      .add_legend();
    plt.title(title)
    plt.show();

#trip times
dist_of_params(frame_with_durations_outliers_removed,'trip_times','time for cab trips distribution(in minutes)')

#log trip times
log_trip_times = frame_with_durations_outliers_removed.trip_times.values
frame_with_durations_outliers_removed['log_times'] = np.log(log_trip_times)
dist_of_params(frame_with_durations_outliers_removed,'log_times','log of time for cab trips distribution')

#trip distances
dist_of_params(frame_with_durations_outliers_removed,'trip_distance','distance for cab trips distribution')

#log trip distances
log_trip_distance = frame_with_durations_outliers_removed.trip_distance.values
frame_with_durations_outliers_removed['log_distance'] = np.log(log_trip_distance)
dist_of_params(frame_with_durations_outliers_removed,'log_distance','log of distance for cab trips distribution')

#trip speed
dist_of_params(frame_with_durations_outliers_removed,'Speed','average Speed of cab trips distribution')

#log speed
log_trip_speed = frame_with_durations_outliers_removed.Speed.values
frame_with_durations_outliers_removed['log_speed'] = np.log(log_trip_speed)
dist_of_params(frame_with_durations_outliers_removed,'log_speed','log of speed for cab trips distribution')
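A short sketch of why the log transform is applied: trip times span several orders of magnitude, and the log compresses the heavy right tail (toy values, not the real data):

```python
import numpy as np

#Toy trip times (minutes) spanning several orders of magnitude
trip_times = np.array([1.8, 9.95, 23.3, 700.0])
log_times = np.log(trip_times)
#The multiplicative spread becomes an additive one under the log
spread_raw = trip_times.max() / trip_times.min()
spread_log = log_times.max() - log_times.min()
```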

Univariate Analysis of following features :

  1. Vendor ID
  2. Passenger counts
  3. Rate Code ID
  4. store and forward flags
  5. Payment types

Outlier trip distances:

Outlier dropoff and pickup points:

To simplify things, New York has been divided into a 37km x 37km bounding box, and any pickup or dropoff outside this box is treated as an outlier.

  1. There are 264.4k dropoffs happening outside of New York
  2. There are 247.7k pickups happening outside of New York
  3. How many trips have both dropoffs and pickups outside of New York?
Inspecting number of pickups shared among each of the vendors

In [10]:
#Get unique vendors
vendor_unique = frame_with_durations_outliers_removed.VendorID.unique()
Vendors = []
#Get the number of trips for each vendor
Vendors.append(frame_with_durations_outliers_removed[frame_with_durations_outliers_removed['VendorID'] == 1].shape[0])
Vendors.append(frame_with_durations_outliers_removed[frame_with_durations_outliers_removed['VendorID'] == 2].shape[0])
#Names of vendors
print("Unique vendors\n {}: Creative Mobile Technologies\n {}: VeriFone Inc\n".format(vendor_unique[1],vendor_unique[0]))
print ("Vendor 1 and Vendor 2 count:\n",Vendors)
Unique vendors
 1: Creative Mobile Technologies
 2: VeriFone Inc

Vendor 1 and Vendor 2 count:
 [5841962, 6549901]
In [11]:
sns.set_style("whitegrid")

fig, ax = plt.subplots()
fig.set_size_inches(12, 5)

ax = sns.barplot(x = ['Creative Mobile Technologies', 'VeriFone Inc'] , y = np.array(Vendors))
ax.set(ylabel='Pickups')
plt.title('Pickups per vendor for the month of JAN')
plt.show()
In [12]:
print ("Creative Mobile Technologies share :{} % VeriFone Inc share :{} % ".format(100*float(Vendors[0])/sum(Vendors),100*float(Vendors[1])/sum(Vendors)))
Creative Mobile Technologies share :47.14353281665557 % VeriFone Inc share :52.85646718334443 % 

Both vendors have almost the same share of pickups in the month of JAN

In [13]:
#Get the number of pickups per each passenger count
passenger_count = []
passenger_uniq = set(frame_with_durations_outliers_removed['passenger_count'].values)
for i in passenger_uniq:
    passenger_count.append(frame_with_durations_outliers_removed[frame_with_durations_outliers_removed['passenger_count'] == i].shape[0])

#sort the passenger pickups based on the passenger count
pass_count_pickups = [x for _,x in sorted(zip(passenger_uniq,passenger_count))]
pass_count = [int(y) for y,_ in sorted(zip(passenger_uniq,passenger_count))]

print (pass_count)
print (pass_count_pickups)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[5841, 8717237, 1767371, 516166, 247081, 689157, 448998, 6, 2, 4]
In [14]:
sns.set_style("whitegrid")

fig, ax = plt.subplots()
fig.set_size_inches(12, 5)

ax = sns.barplot(x = np.array(pass_count) , y = np.array(pass_count_pickups))
ax.set(ylabel='Pickups',xlabel = 'number of passengers')
plt.title('Number of pickups for every type of passenger count')
plt.show()
In [15]:
for a,b in zip(pass_count,pass_count_pickups):
    print ("Pickups for passenger count {} : {}% of total pickups".format(int(a),round(float(b)*100/sum(pass_count_pickups),4)))
Pickups for passenger count 0 : 0.0471% of total pickups
Pickups for passenger count 1 : 70.3465% of total pickups
Pickups for passenger count 2 : 14.2624% of total pickups
Pickups for passenger count 3 : 4.1654% of total pickups
Pickups for passenger count 4 : 1.9939% of total pickups
Pickups for passenger count 5 : 5.5614% of total pickups
Pickups for passenger count 6 : 3.6233% of total pickups
Pickups for passenger count 7 : 0.0% of total pickups
Pickups for passenger count 8 : 0.0% of total pickups
Pickups for passenger count 9 : 0.0% of total pickups

Inference from the passenger count:

  1. The majority of trips have 1 passenger, with 2 a close second
  2. What does a passenger count of 0 mean?
    • Human error?
    • Maybe some form of goods delivery?
  3. According to TLC regulations, 5 is the maximum number of passengers allowed in a yellow taxi
    • A good number of trips have more than 5 passengers - almost 450k of them
    • Toddlers and children below age 7 are allowed to sit on laps
    • Maybe the drivers count the toddlers too (but 7, 8 and 9 are still implausible passenger counts)

Rate Code:

  1. Standard rate
  2. JFK
  3. Newark
  4. Nassau or Westchester
  5. Negotiated fare
  6. Group ride
In [16]:
#Unique rate codes
rate_codes_uniq = set(frame_with_durations_outliers_removed['RateCodeID'].values)
rate_count = []
for i in rate_codes_uniq:
    rate_count.append(frame_with_durations_outliers_removed[frame_with_durations_outliers_removed['RateCodeID'] == i].shape[0])
#sort the pickups based on the rate codes
rate_code_pickups = [x for _,x in sorted(zip(rate_codes_uniq,rate_count))]
rate_code = [int(y) for y,_ in sorted(zip(rate_codes_uniq,rate_count))]
Trips = ['Standard rates','JFK trips','Newark trips','Nassau/Westchester trips','Negotiated fare','Group rides','unknown rate code']
In [17]:
sns.set_style("whitegrid")

fig, ax = plt.subplots()
fig.set_size_inches(12, 5)

ax = sns.barplot(x = np.array(Trips) , y = np.array(rate_code_pickups))
ax.set(ylabel='Pickups',xlabel = 'Rate Code IDs')
plt.title('Number of pickups for different types of rate codes')
plt.show()
In [18]:
for a,b in zip(Trips,rate_code_pickups):
    print ("Pickups for/with {} accounts {}% of total pickups".format(a,round(float(b)*100/sum(rate_code_pickups),4)))
Pickups for/with Standard rates accounts 98.2089% of total pickups
Pickups for/with JFK trips accounts 1.6787% of total pickups
Pickups for/with Newark trips accounts 0.0072% of total pickups
Pickups for/with Nassau/Westchester trips accounts 0.0133% of total pickups
Pickups for/with Negotiated fare accounts 0.0894% of total pickups
Pickups for/with Group rides accounts 0.0003% of total pickups
Pickups for/with unknown rate code accounts 0.0023% of total pickups

Rate Codes(Findings from the data):

  1. Standard rate - 12.4 Million trips
  2. JFK - 224.2k trips
  3. Newark - 17.7k trips
  4. Nassau or Westchester - 4.1k trips
  5. Negotiated fare - 36.8k trips
  6. Group ride - 134 trips
  7. rate code 99(mistake??) - 507 trips

Store and forward flags

In [19]:
SWflag_uniq = set(frame_with_durations_outliers_removed['store_and_fwd_flag'].values)
SWflag_count = []
for i in SWflag_uniq :
    SWflag_count.append(frame_with_durations_outliers_removed[frame_with_durations_outliers_removed['store_and_fwd_flag'] == i].shape[0])
print (SWflag_count)
SWflag_uniq = list(SWflag_uniq)
print (SWflag_uniq)
[12286027, 105836]
['N', 'Y']
In [20]:
for a,b in zip(SWflag_uniq,SWflag_count):
    print ("Pickups for/with store and forward = {} accounts {}% of total pickups".format(a,round(float(b)*100/sum(SWflag_count),4)))
Pickups for/with store and forward = N accounts 99.1459% of total pickups
Pickups for/with store and forward = Y accounts 0.8541% of total pickups
In [21]:
sns.set_style("whitegrid")

fig, ax = plt.subplots()
fig.set_size_inches(12, 5)

ax = sns.barplot(x = np.array(SWflag_uniq) , y = np.array(SWflag_count))
ax.set(ylabel='Pickups',xlabel = 'store_and_fwd_flag')
plt.title('Number of pickups for store and forward flags')
plt.show()

Payment types:

  1. Credit card
  2. Cash
  3. No charge
  4. Dispute
  5. Unknown
  6. Voided trip
In [22]:
payment_uniq = set(frame_with_durations_outliers_removed['payment_type'].values)
payment_count = []
for i in payment_uniq :
    payment_count.append(frame_with_durations_outliers_removed[frame_with_durations_outliers_removed['payment_type'] == i].shape[0])
print (payment_count)
payment_uniq = list(payment_uniq)
print (payment_uniq)
[7685155, 4670834, 26634, 9238, 2]
[1, 2, 3, 4, 5]
In [23]:
pay_pickups = [x for _,x in sorted(zip(payment_uniq,payment_count))]
pay = [int(y) for y,_ in sorted(zip(payment_uniq,payment_count))]
pay_type = ['Credit card','Cash','No charge','Dispute','Unknown']
In [24]:
sns.set_style("whitegrid")

fig, ax = plt.subplots()
fig.set_size_inches(12, 5)

ax = sns.barplot(x = np.array(pay_type) , y = np.array(pay_pickups))
ax.set(ylabel='Pickups',xlabel = 'payment types')
plt.title('Number of pickups for each payment type')
plt.show()
In [25]:
for a,b in zip(pay_type,pay_pickups):
    print ("Pickups with payment types {} accounts to :{}% of total pickups".format(a,round(float(b)*100/sum(pay_pickups),4)))
Pickups with payment types Credit card accounts to :62.0178% of total pickups
Pickups with payment types Cash accounts to :37.6928% of total pickups
Pickups with payment types No charge accounts to :0.2149% of total pickups
Pickups with payment types Dispute accounts to :0.0745% of total pickups
Pickups with payment types Unknown accounts to :0.0% of total pickups

Pickup and dropoff locations on maps:

In [26]:
frame_with_durations_outliers_removed['lat_bin'] = np.round(frame_with_durations_outliers_removed['pickup_latitude'],4)
frame_with_durations_outliers_removed['lon_bin'] = np.round(frame_with_durations_outliers_removed['pickup_longitude'],4)

coords = frame_with_durations_outliers_removed[['pickup_latitude', 'pickup_longitude']].values

#Getting 40 clusters using the kmeans 
kmeans = MiniBatchKMeans(n_clusters=40, batch_size=10000).fit(coords)
frame_with_durations_outliers_removed['pickup_cluster'] = kmeans.predict(frame_with_durations_outliers_removed[['pickup_latitude', 'pickup_longitude']])
frame_with_durations_outliers_removed['dropoff_cluster'] = kmeans.predict(frame_with_durations_outliers_removed[['dropoff_latitude', 'dropoff_longitude']])
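`MiniBatchKMeans.predict` assigns each point to its nearest fitted centroid; the assignment step can be sketched in plain NumPy with toy centroids (hypothetical coordinates, not the 40 fitted ones):

```python
import numpy as np

#Toy centroids (hypothetical lat/lon pairs, not the 40 fitted ones)
centroids = np.array([[40.75, -73.99],   #midtown-ish
                      [40.64, -73.78]])  #JFK-ish

def predict_cluster(points, centroids):
    #Index of the nearest centroid for each (lat, lon) row
    d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return d.argmin(axis=1)

labels = predict_cluster(np.array([[40.7501, -73.9939],
                                   [40.6413, -73.7781]]), centroids)
```

Euclidean distance on raw lat/lon is what k-means uses here; for small regions like NYC it is a reasonable proxy for the haversine distance.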
In [27]:
#Get regions
def Convert_Clusters(frame,cluster):
    region=[]
    colors = []
    Queens=[4,16,22,23,26,35,36]
    Brooklyn=[11,19,29,37]
    for i in frame[cluster].values:
        if i==2:
            region.append("JFK")
            #Green - JFK
            colors.append('#7CFC00')
        elif i==13:
            region.append("Bronx")
            #Red - Bronx
            colors.append('#DC143C')
        elif i in Queens:
            region.append("Queens")
            #Blue - Queens
            colors.append('#00FFFF')
        elif i in Brooklyn:
            region.append("Brooklyn")
            #Brooklyn - yellow orange
            colors.append('#FFD700')
        else:
            region.append("Manhattan")
            #White - manhattan
            colors.append('#FFFFFF')
    frame['Regions '+cluster]=region
    return frame,colors

frame_with_durations_outliers_removed,colors_pickup=Convert_Clusters(frame_with_durations_outliers_removed,'pickup_cluster')
frame_with_durations_outliers_removed,colors_dropoff=Convert_Clusters(frame_with_durations_outliers_removed,'dropoff_cluster')
frame_with_durations_outliers_removed.head()
Out[27]:
VendorID payment_type passenger_count trip_distance pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude RateCodeID store_and_fwd_flag ... Speed log_times log_distance log_speed lat_bin lon_bin pickup_cluster dropoff_cluster Regions pickup_cluster Regions dropoff_cluster
0 2 1 1 1.59 -73.993896 40.750111 -73.974785 40.750618 1 N ... 5.285319 2.893146 0.463734 1.664933 40.7501 -73.9939 32 27 Manhattan Manhattan
1 1 1 1 3.30 -74.001648 40.724243 -73.994415 40.759109 1 N ... 9.983193 2.987364 1.193922 2.300903 40.7242 -74.0016 6 8 Manhattan Manhattan
2 1 2 1 1.80 -73.963341 40.802788 -73.951820 40.824413 1 N ... 10.746269 2.307573 0.587787 2.374559 40.8028 -73.9633 28 12 Manhattan Manhattan
3 1 2 1 0.50 -74.009087 40.713818 -74.004326 40.719986 1 N ... 16.071429 0.624154 -0.693147 2.777043 40.7138 -74.0091 9 6 Manhattan Manhattan
4 1 2 1 3.00 -73.971176 40.762428 -74.004181 40.742653 1 N ... 9.318378 2.960968 1.098612 2.231989 40.7624 -73.9712 15 21 Manhattan Manhattan

5 rows × 22 columns

In [28]:
def plot_scatter_locations(frame,colors,choose):
    %matplotlib inline 

    plt.style.use('default')  #better styling (pd.options.display.mpl_style is deprecated)
    new_style = {'grid': False}  #remove grid
    matplotlib.rc('axes', **new_style)  
    rcParams['figure.figsize'] = (17.5, 17)  #size of figure
    rcParams['figure.dpi'] = 250
    
    var1 = choose+'_longitude'
    var2 = choose+'_latitude'

    P=frame.plot(kind='scatter', x=var1, y=var2,color='white',xlim=(-74.06,-73.77),ylim=(40.61, 40.91),s=.02,alpha=.6)
    P.set_facecolor('black')  #background color (set_axis_bgcolor is deprecated)
    plt.savefig('plot_'+choose+'.png')
    plt.show()
    

plot_scatter_locations(frame_with_durations_outliers_removed,colors_pickup,'pickup')
plot_scatter_locations(frame_with_durations_outliers_removed,colors_dropoff,'dropoff')

INFERENCE:

  1. The pickup locations are concentrated in the Manhattan area
  2. The dropoff locations are spread over Brooklyn and Queens as well as the Manhattan area
In [29]:
def compute_pickups_per_day():
    time = month[['tpep_pickup_datetime','tpep_dropoff_datetime']].compute()
    time['tpep_pickup_datetime'] = pd.to_datetime(time['tpep_pickup_datetime'])
    time["pickup_day"] = time['tpep_pickup_datetime'].dt.strftime('%u').astype(int)  # 1=Monday .. 7=Sunday
    time["pickup_hour"] = time['tpep_pickup_datetime'].dt.strftime('%H').astype(int)
    day_pickups = []
    night_pickups = []
    mid_night_pickups = []
    afternoon_pickups = []
    for i in range(1,8):
        day_pickups.append(time[(time.pickup_day == i) & (time.pickup_hour >= 6) & (time.pickup_hour < 12)].shape[0])
        night_pickups.append(time[(time.pickup_day == i) & (time.pickup_hour >= 18) & (time.pickup_hour <= 23)].shape[0])
        mid_night_pickups.append(time[(time.pickup_day == i) & (time.pickup_hour >= 0) & (time.pickup_hour < 6)].shape[0])
        afternoon_pickups.append(time[(time.pickup_day == i) & (time.pickup_hour >= 12) & (time.pickup_hour < 18)].shape[0])

    days = pd.DataFrame({'morning':day_pickups, 'afternoon':afternoon_pickups, 'night':night_pickups, 'midnight':mid_night_pickups})
    days.plot(kind='bar', stacked=True, figsize=(4.5, 4.5))
    plt.xlabel('Days of the week')
    plt.ylabel('Number of pickups')
    plt.tick_params(axis='both', which='minor')
    plt.show()
    days = ['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday']
    for i in range(0,len(days)):
        # Labels match the lists: mid_night_pickups holds 00-06, day_pickups holds 06-12
        print ("{} -  midnight:{} , morning:{} , afternoon:{} , night:{}".format(days[i],mid_night_pickups[i],day_pickups[i],afternoon_pickups[i],night_pickups[i]))

    return time
In [30]:
times_of_day_dframe = compute_pickups_per_day()
Monday -  midnight:108576 , morning:408179 , afternoon:434850 , night:381266
Tuesday -  midnight:76339 , morning:372232 , afternoon:415968 , night:509383
Wednesday -  midnight:104280 , morning:481818 , afternoon:488264 , night:609625
Thursday -  midnight:260968 , morning:545611 , afternoon:606775 , night:754859
Friday -  midnight:200685 , morning:547113 , afternoon:629661 , night:849728
Saturday -  midnight:433710 , morning:399576 , afternoon:724563 , night:821859
Sunday -  midnight:377923 , morning:276211 , afternoon:525948 , night:403016

Inference:

  1. More trips happen during the day [06:00 to 17:59 hrs] than at night [18:00 to 23:59 hrs] or after midnight [00:00 to 05:59 hrs]
  2. Saturday is the busiest day for taxi drivers overall
  3. Saturday has the most after-midnight pickups, consistent with Friday-night outings running past midnight
  4. Friday has the most night-time pickups, with Saturday close behind (Friday night parties)
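As an aside, the four boolean-mask filters per weekday above can be collapsed into a single pass with `pd.cut` and `pd.crosstab`. A minimal sketch on hypothetical data (the `pickup_day`/`pickup_hour` column names mirror the ones created above):

```python
import pandas as pd

def bucket_pickups(time_df):
    """Count pickups per weekday and time-of-day bucket in one pass."""
    bins = [0, 6, 12, 18, 24]
    labels = ['midnight', 'morning', 'afternoon', 'night']
    # right=False gives intervals [0,6), [6,12), [12,18), [18,24)
    buckets = pd.cut(time_df['pickup_hour'], bins=bins, labels=labels, right=False)
    # Rows: weekday (1=Monday .. 7=Sunday); columns: the four buckets
    return pd.crosstab(time_df['pickup_day'], buckets)

# Tiny synthetic example (hypothetical values, not the taxi data)
demo = pd.DataFrame({'pickup_day':  [1, 1, 6, 7],
                     'pickup_hour': [3, 9, 23, 13]})
print(bucket_pickups(demo))
```

This avoids scanning the frame 28 times and keeps the bucket boundaries in one place.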
In [31]:
#Number of pickups at each hour of the day, per weekday and for the month overall
def pickup_in_day(frame):
    hour_pickups = []
    temp = []
    for i in range(1,8):
        for j in range(0,24):
            temp.append(frame[(frame.pickup_day == i) & (frame.pickup_hour == j)].shape[0])
        hour_pickups.append(temp)
        temp = []
    colors = ['xkcd:blue','xkcd:orange','xkcd:brown','xkcd:coral','xkcd:magenta','xkcd:green','xkcd:fuchsia']
    days = ['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday']

    plt.figure(figsize=(8,4))
    hours_lis = [s for s in range(0,24)]
    for k in range(0,7):
        plt.plot(hours_lis,hour_pickups[k],colors[k],label = days[k])
        plt.plot(hours_lis,hour_pickups[k], 'ro',  markersize=2)

    plt.xticks([s for s in range(0,24)])
    plt.xlabel('Hours of a day')
    plt.ylabel('Number of pickups')
    plt.title('Pickups for every hour')
    plt.legend()
    plt.grid(True)
    plt.show()
    
    hour_pickup_month = []
    for j in range(0,24):
        hour_pickup_month.append(frame[frame.pickup_hour == j].shape[0])

    plt.figure(figsize=(8,4))
    hours_lis = [s for s in range(0,24)]
    plt.plot(hours_lis,hour_pickup_month,'xkcd:magenta',label = 'average pickups per hour')
    plt.plot(hours_lis,hour_pickup_month, 'ro',  markersize=2)

    plt.xticks([s for s in range(0,24)])
    plt.xlabel('Hours of a day')
    plt.ylabel('Number of pickups')
    plt.title('Pickups for every hour for whole of month')
    plt.legend()
    plt.grid(True)
    plt.show()

pickup_in_day(times_of_day_dframe)

Clustering of regions:

In [32]:
frame_with_durations_outliers_removed['lat_bin'] = np.round(frame_with_durations_outliers_removed['pickup_latitude'],4)
frame_with_durations_outliers_removed['lon_bin'] = np.round(frame_with_durations_outliers_removed['pickup_longitude'],4)
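For intuition on the bin size: rounding coordinates to 4 decimal places groups points into cells of about 1e-4 degrees, which at roughly New York's latitude (40.75°N, used here only as a reference point) works out to about 11 m north-south and 8 m east-west. A quick back-of-the-envelope check, assuming a spherical Earth of radius 6371 km:

```python
import math

earth_radius_m = 6371000.0
lat_bin_m = math.radians(1e-4) * earth_radius_m        # north-south extent of one bin
lon_bin_m = lat_bin_m * math.cos(math.radians(40.75))  # east-west extent shrinks with latitude
print(round(lat_bin_m, 1), round(lon_bin_m, 1))        # ~11.1 ~8.4
```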
In [33]:
#Choose the number of clusters: the largest count for which cluster centers stay at least ~1 km apart (roughly the cluster radius)
coords = frame_with_durations_outliers_removed[['pickup_latitude', 'pickup_longitude']].values

def find_min_distance(cluster_centers, cluster_len):
    nice_points = 0
    wrong_points = 0
    for i in range(0, cluster_len-1):
        min_value = 1000000
        for j in range(i+1, cluster_len):
            min_value = min(min_value, gpxpy.geo.haversine_distance(cluster_centers[i][0], cluster_centers[i][1], cluster_centers[j][0], cluster_centers[j][1]))
        if min_value != 1000000 and (min_value/1000) >= 1:
            nice_points += 1
        else:
            wrong_points += 1
    print (cluster_len, "Centers with min distance < 1:", wrong_points, "Centers with min distance > 1:", nice_points)

    if nice_points == cluster_len-1:
        return 1
    else:
        return 0

def find_clusters(increment):
    kmeans = MiniBatchKMeans(n_clusters=increment, batch_size=10000).fit(coords)
    frame_with_durations_outliers_removed['pickup_cluster'] = kmeans.predict(frame_with_durations_outliers_removed[['pickup_latitude', 'pickup_longitude']])
    cluster_centers = kmeans.cluster_centers_
    cluster_len = len(cluster_centers)
    return cluster_centers, cluster_len

for increment in range(10, 100, 10):
    cluster_centers, cluster_len = find_clusters(increment)
    all_separated = find_min_distance(cluster_centers, cluster_len)  # 1 if every center is >= 1 km from its nearest neighbour
10 Centers with min distance < 1: 0 Centers with min distance > 1: 9
20 Centers with min distance < 1: 1 Centers with min distance > 1: 18
30 Centers with min distance < 1: 6 Centers with min distance > 1: 23
40 Centers with min distance < 1: 11 Centers with min distance > 1: 28
50 Centers with min distance < 1: 21 Centers with min distance > 1: 28
60 Centers with min distance < 1: 29 Centers with min distance > 1: 30
70 Centers with min distance < 1: 37 Centers with min distance > 1: 32
80 Centers with min distance < 1: 45 Centers with min distance > 1: 34
90 Centers with min distance < 1: 53 Centers with min distance > 1: 36

Inference:

  • The objective was a minimum inter-cluster distance of about 1 km (which roughly corresponds to a cluster's radius); 40 clusters gave the best trade-off between region count and separation
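The pairwise check above calls gpxpy.geo.haversine_distance in a double loop; the same minimum inter-center distance can be computed vectorised in NumPy as a sanity check. A sketch under the usual spherical-Earth assumption (mean radius 6371 km):

```python
import numpy as np

def min_center_distance_km(centers):
    """Smallest pairwise haversine distance (km) between cluster centers.

    `centers` is an (n, 2) array of [latitude, longitude] in degrees.
    """
    lat = np.radians(centers[:, 0])[:, None]
    lon = np.radians(centers[:, 1])[:, None]
    dlat = lat - lat.T
    dlon = lon - lon.T
    # Haversine formula, evaluated for all pairs at once via broadcasting
    a = np.sin(dlat / 2) ** 2 + np.cos(lat) * np.cos(lat.T) * np.sin(dlon / 2) ** 2
    d = 2 * 6371.0 * np.arcsin(np.sqrt(a))
    np.fill_diagonal(d, np.inf)  # ignore zero self-distances
    return d.min()

# Hypothetical centers 1 degree of latitude apart (about 111 km)
centers = np.array([[40.0, -74.0], [41.0, -74.0], [42.0, -74.0]])
print(round(min_center_distance_km(centers)))  # 111
```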
In [34]:
#Getting 40 clusters using the kmeans 
kmeans = MiniBatchKMeans(n_clusters=40, batch_size=10000).fit(coords)
frame_with_durations_outliers_removed['pickup_cluster'] = kmeans.predict(frame_with_durations_outliers_removed[['pickup_latitude', 'pickup_longitude']])
In [35]:
#Adding a new column describing what cluster the pickup point belongs to
frame_with_durations_outliers_removed.head(5)
Out[35]:
VendorID payment_type passenger_count trip_distance pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude RateCodeID store_and_fwd_flag ... Speed log_times log_distance log_speed lat_bin lon_bin pickup_cluster dropoff_cluster Regions pickup_cluster Regions dropoff_cluster
0 2 1 1 1.59 -73.993896 40.750111 -73.974785 40.750618 1 N ... 5.285319 2.893146 0.463734 1.664933 40.7501 -73.9939 31 27 Manhattan Manhattan
1 1 1 1 3.30 -74.001648 40.724243 -73.994415 40.759109 1 N ... 9.983193 2.987364 1.193922 2.300903 40.7242 -74.0016 2 8 Manhattan Manhattan
2 1 2 1 1.80 -73.963341 40.802788 -73.951820 40.824413 1 N ... 10.746269 2.307573 0.587787 2.374559 40.8028 -73.9633 5 12 Manhattan Manhattan
3 1 2 1 0.50 -74.009087 40.713818 -74.004326 40.719986 1 N ... 16.071429 0.624154 -0.693147 2.777043 40.7138 -74.0091 12 6 Manhattan Manhattan
4 1 2 1 3.00 -73.971176 40.762428 -74.004181 40.742653 1 N ... 9.318378 2.960968 1.098612 2.231989 40.7624 -73.9712 28 21 Manhattan Manhattan

5 rows × 22 columns

In [36]:
#Plotting the cluster centers on OSM
cluster_centers = kmeans.cluster_centers_
cluster_len = len(cluster_centers)
map_osm = folium.Map(location=[40.734695, -73.990372], tiles='Stamen Toner')
for i in range(cluster_len):
    folium.Marker(location=[cluster_centers[i][0], cluster_centers[i][1]],
                  popup=str(cluster_centers[i][0]) + ', ' + str(cluster_centers[i][1])).add_to(map_osm)
map_osm
Out[36]:

Plotting the clusters:

In [37]:
def plot_clusters(frame):
    city_long_border = (-74.03, -73.75)
    city_lat_border = (40.63, 40.85)
    fig, ax = plt.subplots(ncols=1, nrows=1)
    ax.scatter(frame.pickup_longitude.values[:100000], frame.pickup_latitude.values[:100000], s=10, lw=0,
               c=frame.pickup_cluster.values[:100000], cmap='tab20', alpha=0.2)
    ax.set_xlim(city_long_border)
    ax.set_ylim(city_lat_border)
    ax.set_xlabel('Longitude')
    ax.set_ylabel('Latitude')
    plt.show()

plot_clusters(frame_with_durations_outliers_removed)
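Once fitted, the same `kmeans.predict` call used above can assign any new pickup coordinate to a region. A self-contained sketch on synthetic points (the two center coordinates are made up for illustration; scikit-learn is already imported at the top of the notebook):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.RandomState(0)
# Two synthetic 'regions' of pickup points around fixed (lat, lon) centers
pts = np.vstack([rng.normal([40.75, -73.99], 0.001, size=(500, 2)),
                 rng.normal([40.65, -73.80], 0.001, size=(500, 2))])
km = MiniBatchKMeans(n_clusters=2, batch_size=100, random_state=0).fit(pts)

# A new pickup near the first region is mapped to that region's cluster id
new_pickup = np.array([[40.7501, -73.9899]])
print(km.predict(new_pickup))
```

This is the lookup the later forecasting stage relies on: every pickup is reduced to a cluster id before counting pickups per region and time bin.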