• Problem​ ​ statement​ ​ :

Source

Airbnb​​ ​ is​ ​ an​ ​ online​ ​ marketplace​ ​ and​ ​ hospitality​ ​ service,​ ​ enabling​ ​ people​ ​ to​ ​ lease​ ​ or​ ​ rent​ ​ short-term lodging​ ​ including​ ​ vacation​ ​ rentals​ ​ , ​ ​ apartment​ ​ rentals,​ ​ homestays​ ​ , ​ ​ hostels​ ​ beds,​ ​ or​ ​ hotel​ ​ roo​ ms. New​ ​ users​ ​ on​ ​ Airbnb​ ​ can​ ​ book​ ​ a ​ ​ place​ ​ to​ ​ stay​ ​ in​ ​ 34,000+​ ​ cities​ ​ across​ ​ 190+​ ​ countries.​ ​ By​ ​ accurately predicting​ ​ where​ ​ a ​ ​ new​ ​ user​ ​ will​ ​ book​ ​ their​ ​ first​ ​ travel​ ​ experience,​ ​ Airbnb​ ​ can​ ​ share​ ​ more personalized​ ​ content​ ​ with​ ​ their​ ​ community,​ ​ decrease​ ​ the​ ​ average​ ​ time​ ​ to​ ​ first​ ​ booking,​ ​ and​ ​ better forecast​ ​ demand.​ ​ We​ ​ need​ ​ to​ ​ predict​ ​ the​ ​ first​ ​ travel​ ​ destination​ ​ of​ ​ a ​ ​ new​ ​ user​ ​ based​ ​ on​ ​ his personalized​ ​ content​

  • Objective :

    To predict top 5 travel destinations in decreasing order of relevance

  • Dataset -- Train_users.csv

    There are 16 features used to describe each user in the dataset:

    • user id
    • the date of account creation
    • timestamp of the first activity, note that it can be earlier than
    • date of first booking
    • gender
    • age
    • signup_method
    • the page a user came to signup up from
    • international language preference
    • what kind of paid marketing
    • where the marketing is e.g. google, craigslist, other
    • whats the first marketing the user interacted with before the signing up
    • signup_app
    • first_device_type
    • first_browser

Phases

  • 1.Data preprocessing
  • 2.Exploratory Data Analysis
In [1]:
import numpy as np
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib as mpl
%matplotlib inline
from sklearn.model_selection import train_test_split

from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))
age_gender_bkts.csv
countries.csv
sample_submission.csv
sessions.csv
test_users.csv
train_users_2.csv

  1. Basic Data pre-processing :
    • 1(a) Loading the csv files into pandas data frame
    • 1(b) Dividing data into Train and Test data
    • 1(c) Data cleaning

1(a). Loading the csv files into pandas data frame

In [2]:
#  data from csv files is imported to  pandas data frames
data_train_org = pd.read_csv("../input/train_users_2.csv")
print(data_train_org.columns)
data_train_org=data_train_org.sort_values(by='timestamp_first_active')
Index(['id', 'date_account_created', 'timestamp_first_active',
       'date_first_booking', 'gender', 'age', 'signup_method', 'signup_flow',
       'language', 'affiliate_channel', 'affiliate_provider',
       'first_affiliate_tracked', 'signup_app', 'first_device_type',
       'first_browser', 'country_destination'],
      dtype='object')
In [3]:
print(data_train_org.shape)
(213451, 16)

1(b).Dividing the data into train data and test data

In [4]:
data_train, data_test = train_test_split(data_train_org, test_size=0.2)
data_train_copy = data_train
print("%d items in training data, %d in test data" % (len(data_train), len(data_test)))
170760 items in training data, 42691 in test data
In [5]:
# Removing the data_first_booking column from data_train , data_test
print(data_train.columns)
data_train.drop('date_first_booking',1)
data_test.drop('date_first_booking',1)
data_train=data_train.sort_values(by='timestamp_first_active')
data_test=data_train.sort_values(by='timestamp_first_active')
Index(['id', 'date_account_created', 'timestamp_first_active',
       'date_first_booking', 'gender', 'age', 'signup_method', 'signup_flow',
       'language', 'affiliate_channel', 'affiliate_provider',
       'first_affiliate_tracked', 'signup_app', 'first_device_type',
       'first_browser', 'country_destination'],
      dtype='object')

1(c).Data Cleaning :

  • Replace the gender , age values which are not present to 'NaN'
In [6]:
#replacing gender and age values which are not present to Nan
data_train.gender.replace('-unknown-',np.nan, inplace=True)
data_test.gender.replace('-unknown-',np.nan, inplace=True)
data_train.age.replace('NaN', np.nan, inplace=True)
data_test.age.replace('NaN',np.nan, inplace=True)
print(data_train.head())
           id date_account_created  timestamp_first_active date_first_booking  \
0  gxn3p5htnn           2010-06-28          20090319043255                NaN   
1  820tgsjxq7           2011-05-25          20090523174809                NaN   
2  4ft3gnwmtx           2010-09-28          20090609231247         2010-08-02   
3  bjjt8pjhuk           2011-12-05          20091031060129         2012-09-08   
5  osr2jwljor           2010-01-01          20100101215619         2010-01-02   

   gender   age signup_method  signup_flow language affiliate_channel  \
0     NaN   NaN      facebook            0       en            direct   
1    MALE  38.0      facebook            0       en               seo   
2  FEMALE  56.0         basic            3       en            direct   
3  FEMALE  42.0      facebook            0       en            direct   
5     NaN   NaN         basic            0       en             other   

  affiliate_provider first_affiliate_tracked signup_app first_device_type  \
0             direct               untracked        Web       Mac Desktop   
1             google               untracked        Web       Mac Desktop   
2             direct               untracked        Web   Windows Desktop   
3             direct               untracked        Web       Mac Desktop   
5              other                     omg        Web       Mac Desktop   

  first_browser country_destination  
0        Chrome                 NDF  
1        Chrome                 NDF  
2            IE                  US  
3       Firefox               other  
5        Chrome                  US  
In [7]:
import missingno as msno
msno.matrix(data_train)

Above plot shows lot of missing values in gender , age .

2.Exploratory Data Analysis

  • 2(a) .Univariate analysis
  • 2(b). Bivariate analysis
  • 2(c). Multivariate analysis

2(a) . Univariate analysis

In [8]:
sns.set_style('ticks')
fig, ax = plt.subplots()
fig.set_size_inches(11.7, 8.27)
destination_percentage = data_train.country_destination.value_counts() / data_train.shape[0] * 100
destination_percentage.plot(kind='bar',color='#3498DB')
plt.xlabel('Destination Country')
plt.ylabel('Percentage')
sns.despine()
  • Observations :
    1 .  57% of users in Train data set did not travel anywhere .
    2 .  28 % of users travelled in their home country i.e ..,U.S .
In [9]:
sns.set_style('ticks')
fig, ax = plt.subplots()
fig.set_size_inches(11.7, 8.27)
gender_percentage = data_train.gender.value_counts() / data_train.shape[0] * 100
gender_percentage.plot(kind='bar',color='#D35400')
plt.xlabel('Gender of users')
plt.ylabel('Percentage')
sns.despine()
  • Observations :
    1 .  45 % of user's gender information is not present . 
    2 .  There is less difference between Female and Male users.
In [10]:
sns.set_style('ticks')
fig, ax = plt.subplots()
fig.set_size_inches(11.7, 8.27)
device_percentage = data_train.first_device_type.value_counts() / data_train.shape[0] * 100
device_percentage.plot(kind='bar',color='#196F3D')
plt.xlabel('Device used by user')
plt.ylabel('Percentage')
sns.despine()
  • Observations :
    1 .  58% users are using Apple products .
    2 . Out of 71,719 users who travelled atleast once,31660 users are apple users [ 44.15% ] which implies Mac users are booking more frequently .
In [11]:
sns.set_style('ticks')
fig, ax = plt.subplots()
fig.set_size_inches(11.7, 8.27)
sns.distplot(data_train.age.dropna(), color='#16A085')
plt.xlabel('PDF of Age')
sns.despine()
  • Observations :
    1 .  Some age values are incorrect, like close to 2000 , so cleaning such data[ 0.0035% ]
In [12]:
sns.set_style('ticks')
fig, ax = plt.subplots()
fig.set_size_inches(11.7, 8.27)
data_train['age']=data_train['age'].apply(lambda x : 36 if x>100 else x)
sns.distplot(data_train.age.dropna(), color='#16A085')
plt.xlabel('PDF of Age')
sns.despine()
  • Observations :
    1 .  Majority of the users are between age 25 and 40 years . [ 72% ]
    2 .  There are some age values which are less than 18 years [ 0.006% ](not allowed)
In [13]:
data_train['date_account_created_new'] = pd.to_datetime(data_train['date_account_created'])
data_train['date_first_active_new'] = pd.to_datetime((data_train.timestamp_first_active // 1000000), format='%Y%m%d')
data_train['date_account_created_day'] = data_train.date_account_created_new.dt.weekday_name
data_train['date_account_created_month'] = data_train.date_account_created_new.dt.month
data_train['date_account_created_year'] = data_train.date_account_created_new.dt.year
sns.set_style('ticks')
fig, ax = plt.subplots()
fig.set_size_inches(11.7, 8.27)
data_without_NDF = data_train[data_train['country_destination']!='US']
data_without_NDF1= data_without_NDF[data_without_NDF['country_destination']!='NDF']
sns.countplot(x='date_account_created_day',data=data_train)
plt.xlabel('Day wise')
plt.ylabel('Number of users')
sns.despine()
  • Observations :
    1 . User activity is low on saturday and sunday . So chance of booking on saturdays , sundays is pretty low .
In [14]:
sns.set_style('ticks')
fig, ax = plt.subplots()
fig.set_size_inches(11.7, 8.27)
destination_percentage = data_train.language.value_counts() / data_train.shape[0] * 100
destination_percentage.plot(kind='bar',color='#3498DB')
plt.xlabel('Destination Country')
plt.ylabel('Percentage')
sns.despine()
  • Observations :
    1  .  Majority of the user's language preference is English (96.67%) . But it is still qu-estionable because most of users are from US 
    2 .   Predicting geo location of users based on language preference may be useful .

2(b) . Bivariate analysis

In [ ]:
sns.set_style('ticks')
fig, ax = plt.subplots()
fig.set_size_inches(11.7, 8.27)
data_without_NDF = data_train[data_train['country_destination']!='US']
data_without_NDF1= data_without_NDF[data_without_NDF['country_destination']!='NDF']
data_train['booked'] = data_train.country_destination.apply(lambda x:1 if x!='NDF' else 0 )
destination_percentage = data_train.groupby(['date_account_created_year','date_account_created_month']).booked.sum() / data_train.shape[0] * 100
destination_percentage.plot(kind='bar',color="#F4D03F")
plt.xlabel('Year wise - each month Travel count')
plt.ylabel('Percentage')
sns.despine()
  • Observations :
    1  .  Every year it is following almost same trend like in 7,8,9 months the chance of    booking is high .
In [16]:
sns.set_style('ticks')
fig, ax = plt.subplots()
fig.set_size_inches(11.7, 8.27)
data_without_NDF = data_train[data_train['country_destination']!='US']
data_without_NDF1= data_without_NDF[data_without_NDF['country_destination']!='NDF']
sns.countplot(x='country_destination', hue='signup_app',data=data_without_NDF1)
plt.xlabel('Destination Country based on signup app')
plt.ylabel('Number of users')
sns.despine()
  • Observations :
    1 .  Users with signup app 'Web' outnumbered other signup app like Moweb , iOS , Android in every country . [ 85% ]
    2 . Android is the least used signup app [ 0.02% ]
In [17]:
sns.set_style('ticks')
fig, ax = plt.subplots()
fig.set_size_inches(11.7, 8.27)
data_without_NDF = data_train[data_train['country_destination']!='US']
data_without_NDF1= data_without_NDF[data_without_NDF['country_destination']!='NDF']
sns.countplot(x='country_destination', hue='signup_method',data=data_without_NDF1)
plt.xlabel('Destination Country based on signup method ( removed NDF,US )')
plt.ylabel('Number of Users')
sns.despine()
  • Observations :
    1 .  There are less signups (almost negligible) happening by google signup compared to facebook,basic signup(0.03%)
    2 .  Basic signup count is almost 2.5 times facebook signup count .
In [18]:
sns.set_style('ticks')
fig, ax = plt.subplots()
fig.set_size_inches(11.7, 8.27)
affiliate_provider_percentage = data_train.affiliate_provider.value_counts() / data_train.shape[0] * 100
affiliate_provider_percentage.plot(kind='bar',color='#CB4335')
plt.xlabel('Percentage of users based on affiliate providers ')
plt.ylabel('Percentage')
sns.despine()
  • Observations :
    1 . In previous plot we have observed that percentage of users signing up with google are 0.03% , but in this plot we  observe that most of users are coming  from google (22%)
In [19]:
sns.set_style('ticks')
fig, ax = plt.subplots()
fig.set_size_inches(11.7, 8.27)
data_without_NDF = data_train[data_train['country_destination']!='US']
data_without_NDF1= data_without_NDF[data_without_NDF['country_destination']!='NDF']
sns.boxplot(y='age' , x='country_destination',data=data_without_NDF1)
plt.xlabel('Destination Country box plot ( removed NDF,US )')
plt.ylabel('Age of Users')
sns.despine()
  • Observations :
    1 .   Users booking for countries Spain , Portugal and Netherlands tend to be younger     where as Users booking for Great Britain tend to be older .
In [20]:
sns.set_style('ticks')
fig, ax = plt.subplots()
fig.set_size_inches(18.7, 12.27)
data_train.date_account_created_new.value_counts().plot(kind='line', linewidth=1.2, color='#1F618D')
plt.xlabel('Date account created line plot ')
sns.despine()

Source

  • Observations :
    1 . Every year  beween September and October there is increase in Activity of users on Airbnb .
    2 . Basic study on this lead to interesting phenomena that users are trying to book for  Superbowl , Labor day.
In [ ]:
 

2(c).Multi variate analysis :

In [21]:
from sklearn.preprocessing import LabelEncoder
df_all = data_train_copy
print(df_all.columns)
df_all = df_all.drop(['id', 'date_first_booking'], axis=1)
df_all = df_all.fillna(-1)
dac = np.vstack(df_all.date_account_created.astype(str).apply(lambda x: list(map(int, x.split('-')))).values)
df_all['dac_year'] = dac[:,0]
df_all['dac_month'] = dac[:,1]
df_all['dac_day'] = dac[:,2]
df_all = df_all.drop(['date_account_created'], axis=1)
tfa = np.vstack(df_all.timestamp_first_active.astype(str).apply(lambda x: list(map(int, [x[:4],x[4:6],x[6:8],x[8:10],x[10:12],x[12:14]]))).values)
df_all['tfa_year'] = tfa[:,0]
df_all['tfa_month'] = tfa[:,1]
df_all['tfa_day'] = tfa[:,2]
df_all = df_all.drop(['timestamp_first_active'], axis=1)
av = df_all.age.values
df_all['age'] = np.where(np.logical_or(av<14, av>100), -1, av)
ohe_feats = ['gender', 'signup_method', 'signup_flow', 'language', 'affiliate_channel', 'affiliate_provider', 'first_affiliate_tracked', 'signup_app', 'first_device_type', 'first_browser']
Index(['id', 'date_account_created', 'timestamp_first_active',
       'date_first_booking', 'gender', 'age', 'signup_method', 'signup_flow',
       'language', 'affiliate_channel', 'affiliate_provider',
       'first_affiliate_tracked', 'signup_app', 'first_device_type',
       'first_browser', 'country_destination'],
      dtype='object')
In [22]:
df_all  = data_train_copy
piv_train = data_train.shape[0]
df_all = df_all.drop(['id', 'date_first_booking'], axis=1)
df_all = df_all.fillna(-1)
dac = np.vstack(df_all.date_account_created.astype(str).apply(lambda x: list(map(int, x.split('-')))).values)
df_all['dac_year'] = dac[:,0]
df_all['dac_month'] = dac[:,1]
df_all['dac_day'] = dac[:,2]
df_all = df_all.drop(['date_account_created'], axis=1)

tfa = np.vstack(df_all.timestamp_first_active.astype(str).apply(lambda x: list(map(int, [x[:4],x[4:6],x[6:8],x[8:10],x[10:12],x[12:14]]))).values)
df_all['tfa_year'] = tfa[:,0]
df_all['tfa_month'] = tfa[:,1]
df_all['tfa_day'] = tfa[:,2]
df_all = df_all.drop(['timestamp_first_active'], axis=1)
av = df_all.age.values
df_all['age'] = np.where(np.logical_or(av<14, av>100), -1, av)

Apply One Hot encoding

In [23]:
for f in ohe_feats:
    df_all_dummy = pd.get_dummies(df_all[f], prefix=f)
    df_all = df_all.drop([f], axis=1)
    df_all = pd.concat((df_all, df_all_dummy), axis=1)
vals = df_all.values
piv_train = df_all.shape[0]
X = vals[:piv_train]
le = LabelEncoder()
labels = df_all['country_destination'].values
df_all = df_all.drop(['country_destination'], axis=1)
y = le.fit_transform(labels)   
X_test = vals[piv_train:]
df_all
Out[23]:
age dac_year dac_month dac_day tfa_year tfa_month tfa_day gender_-unknown- gender_FEMALE gender_MALE ... first_browser_SeaMonkey first_browser_Silk first_browser_SiteKiosk first_browser_SlimBrowser first_browser_Sogou Explorer first_browser_Stainless first_browser_TenFourFox first_browser_TheWorld Browser first_browser_Yandex.Browser first_browser_wOSBrowser
202218 33.0 2014 6 10 2014 6 10 0 0 1 ... 0 0 0 0 0 0 0 0 0 0
62183 21.0 2013 2 25 2013 2 25 1 0 0 ... 0 0 0 0 0 0 0 0 0 0
126540 -1.0 2013 11 21 2013 11 21 1 0 0 ... 0 0 0 0 0 0 0 0 0 0
53005 -1.0 2012 12 20 2012 12 20 1 0 0 ... 0 0 0 0 0 0 0 0 0 0
184833 37.0 2014 5 5 2014 5 5 0 0 1 ... 0 0 0 0 0 0 0 0 0 0
75359 30.0 2013 5 7 2013 5 7 0 0 1 ... 0 0 0 0 0 0 0 0 0 0
76819 33.0 2013 5 13 2013 5 13 0 0 1 ... 0 0 0 0 0 0 0 0 0 0
70305 -1.0 2013 4 11 2013 4 11 1 0 0 ... 0 0 0 0 0 0 0 0 0 0
174448 -1.0 2014 4 11 2014 4 11 1 0 0 ... 0 0 0 0 0 0 0 0 0 0
6226 -1.0 2011 7 6 2011 7 6 1 0 0 ... 0 0 0 0 0 0 0 0 0 0
40843 30.0 2012 9 13 2012 9 13 0 1 0 ... 0 0 0 0 0 0 0 0 0 0
213105 -1.0 2014 6 30 2014 6 30 1 0 0 ... 0 0 0 0 0 0 0 0 0 0
11926 -1.0 2011 11 1 2011 11 1 1 0 0 ... 0 0 0 0 0 0 0 0 0 0
9845 28.0 2011 9 20 2011 9 20 0 0 1 ... 0 0 0 0 0 0 0 0 0 0
188405 26.0 2014 5 13 2014 5 13 0 0 1 ... 0 0 0 0 0 0 0 0 0 0
104469 23.0 2013 9 3 2013 9 3 0 1 0 ... 0 0 0 0 0 0 0 0 0 0
92825 27.0 2013 7 21 2013 7 21 0 1 0 ... 0 0 0 0 0 0 0 0 0 0
106936 27.0 2013 9 12 2013 9 12 0 1 0 ... 0 0 0 0 0 0 0 0 0 0
38814 36.0 2012 8 30 2012 8 30 0 0 1 ... 0 0 0 0 0 0 0 0 0 0
88145 -1.0 2013 7 2 2013 7 2 1 0 0 ... 0 0 0 0 0 0 0 0 0 0
95891 32.0 2013 8 2 2013 8 2 0 0 1 ... 0 0 0 0 0 0 0 0 0 0
52396 44.0 2012 12 15 2012 12 15 0 0 1 ... 0 0 0 0 0 0 0 0 0 0
150634 -1.0 2014 2 8 2014 2 8 1 0 0 ... 0 0 0 0 0 0 0 0 0 0
88239 -1.0 2013 7 3 2013 7 3 0 1 0 ... 0 0 0 0 0 0 0 0 0 0
56828 -1.0 2013 1 21 2013 1 21 1 0 0 ... 0 0 0 0 0 0 0 0 0 0
1776 -1.0 2010 9 17 2010 9 17 1 0 0 ... 0 0 0 0 0 0 0 0 0 0
148267 -1.0 2014 2 1 2014 2 1 1 0 0 ... 0 0 0 0 0 0 0 0 0 0
139568 60.0 2014 1 9 2014 1 9 1 0 0 ... 0 0 0 0 0 0 0 0 0 0
103099 27.0 2013 8 28 2013 8 28 0 1 0 ... 0 0 0 0 0 0 0 0 0 0
182219 60.0 2014 4 29 2014 4 29 0 1 0 ... 0 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
180979 22.0 2014 4 27 2014 4 27 0 1 0 ... 0 0 0 0 0 0 0 0 0 0
105397 -1.0 2013 9 6 2013 9 6 1 0 0 ... 0 0 0 0 0 0 0 0 0 0
80977 38.0 2013 6 1 2013 6 1 0 1 0 ... 0 0 0 0 0 0 0 0 0 0
128929 -1.0 2013 12 2 2013 12 2 1 0 0 ... 0 0 0 0 0 0 0 0 0 0
180060 -1.0 2014 4 24 2014 4 24 1 0 0 ... 0 0 0 0 0 0 0 0 0 0
96698 37.0 2013 8 5 2013 8 5 0 0 1 ... 0 0 0 0 0 0 0 0 0 0
205755 27.0 2014 6 17 2014 6 17 0 1 0 ... 0 0 0 0 0 0 0 0 0 0
5822 -1.0 2011 6 20 2011 6 20 1 0 0 ... 0 0 0 0 0 0 0 0 0 0
118158 -1.0 2013 10 20 2013 10 20 1 0 0 ... 0 0 0 0 0 0 0 0 0 0
137997 -1.0 2014 1 4 2014 1 4 1 0 0 ... 0 0 0 0 0 0 0 0 0 0
501 -1.0 2010 5 1 2010 5 1 1 0 0 ... 0 0 0 0 0 0 0 0 0 0
208780 -1.0 2014 6 22 2014 6 22 1 0 0 ... 0 0 0 0 0 0 0 0 0 0
194020 -1.0 2014 5 23 2014 5 23 1 0 0 ... 0 0 0 0 0 0 0 0 0 0
37796 -1.0 2012 8 23 2012 8 23 0 0 1 ... 0 0 0 0 0 0 0 0 0 0
108950 46.0 2013 9 19 2013 9 19 0 0 1 ... 0 0 0 0 0 0 0 0 0 0
99578 -1.0 2013 8 15 2013 8 15 1 0 0 ... 0 0 0 0 0 0 0 0 0 0
179878 27.0 2014 4 24 2014 4 24 0 1 0 ... 0 0 0 0 0 0 0 0 0 0
190479 -1.0 2014 5 16 2014 5 16 1 0 0 ... 0 0 0 0 0 0 0 0 0 0
209876 -1.0 2014 6 24 2014 6 24 1 0 0 ... 0 0 0 0 0 0 0 0 0 0
132760 33.0 2013 12 16 2013 12 16 1 0 0 ... 0 0 0 0 0 0 0 0 0 0
120275 37.0 2013 10 29 2013 10 29 1 0 0 ... 0 0 0 0 0 0 0 0 0 0
183 -1.0 2010 3 4 2010 3 4 1 0 0 ... 0 0 0 0 0 0 0 0 0 0
140770 -1.0 2014 1 12 2014 1 12 0 0 1 ... 0 0 0 0 0 0 0 0 0 0
210070 25.0 2014 6 24 2014 6 24 0 1 0 ... 0 0 0 0 0 0 0 0 0 0
99181 26.0 2013 8 14 2013 8 14 1 0 0 ... 0 0 0 0 0 0 0 0 0 0
24338 23.0 2012 5 16 2012 5 16 0 0 1 ... 0 0 0 0 0 0 0 0 0 0
51300 35.0 2012 12 6 2012 12 6 0 1 0 ... 0 0 0 0 0 0 0 0 0 0
194248 -1.0 2014 5 24 2014 5 24 0 0 1 ... 0 0 0 0 0 0 0 0 0 0
88936 -1.0 2013 7 6 2013 7 6 1 0 0 ... 0 0 0 0 0 0 0 0 0 0
7039 -1.0 2011 7 29 2011 7 29 1 0 0 ... 0 0 0 0 0 0 0 0 0 0

170760 rows × 153 columns

In [24]:
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA

PCA Implementation

In [25]:
%pylab inline
#scatter(X_pca[:, 0], X_pca[:, 1], c)
pylab.rcParams['figure.figsize'] = (15, 11)
X_pca = PCA().fit_transform(df_all)
scatter(X_pca[:, 0], X_pca[:, 1], c=y,cmap=plt.cm.spectral,alpha=.4,
        edgecolor='k')
plt.show()
Populating the interactive namespace from numpy and matplotlib
/opt/conda/lib/python3.6/site-packages/IPython/core/magics/pylab.py:161: UserWarning: pylab import has clobbered these variables: ['f']
`%matplotlib` prevents importing * from pylab and numpy
  "\n`%matplotlib` prevents importing * from pylab and numpy"

T-sne Implementation

In [ ]:
%pylab inline
pylab.rcParams['figure.figsize'] = (15, 11)
X_tsne = TSNE(n_iter=251,learning_rate=100,verbose=2).fit_transform(df_all)
scatter(X_tsne[:, 0], X_tsne[:, 1], c=y,cmap=plt.cm.spectral,alpha=.4,
        edgecolor='k')
plt.show()
Populating the interactive namespace from numpy and matplotlib
[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Indexed 170760 samples in 3.845s...
[t-SNE] Computed neighbors for 170760 samples in 671.686s...
[t-SNE] Computed conditional probabilities for sample 1000 / 170760
[t-SNE] Computed conditional probabilities for sample 2000 / 170760
[t-SNE] Computed conditional probabilities for sample 3000 / 170760
[t-SNE] Computed conditional probabilities for sample 4000 / 170760
[t-SNE] Computed conditional probabilities for sample 5000 / 170760
[t-SNE] Computed conditional probabilities for sample 6000 / 170760
[t-SNE] Computed conditional probabilities for sample 7000 / 170760
[t-SNE] Computed conditional probabilities for sample 8000 / 170760
[t-SNE] Computed conditional probabilities for sample 9000 / 170760
[t-SNE] Computed conditional probabilities for sample 10000 / 170760
[t-SNE] Computed conditional probabilities for sample 11000 / 170760
[t-SNE] Computed conditional probabilities for sample 12000 / 170760
[t-SNE] Computed conditional probabilities for sample 13000 / 170760
[t-SNE] Computed conditional probabilities for sample 14000 / 170760
[t-SNE] Computed conditional probabilities for sample 15000 / 170760
[t-SNE] Computed conditional probabilities for sample 16000 / 170760
[t-SNE] Computed conditional probabilities for sample 17000 / 170760
[t-SNE] Computed conditional probabilities for sample 18000 / 170760
[t-SNE] Computed conditional probabilities for sample 19000 / 170760
[t-SNE] Computed conditional probabilities for sample 20000 / 170760
[t-SNE] Computed conditional probabilities for sample 21000 / 170760
[t-SNE] Computed conditional probabilities for sample 22000 / 170760
[t-SNE] Computed conditional probabilities for sample 23000 / 170760
[t-SNE] Computed conditional probabilities for sample 24000 / 170760
[t-SNE] Computed conditional probabilities for sample 25000 / 170760
[t-SNE] Computed conditional probabilities for sample 26000 / 170760
[t-SNE] Computed conditional probabilities for sample 27000 / 170760
[t-SNE] Computed conditional probabilities for sample 28000 / 170760
[t-SNE] Computed conditional probabilities for sample 29000 / 170760
[t-SNE] Computed conditional probabilities for sample 30000 / 170760
[t-SNE] Computed conditional probabilities for sample 31000 / 170760
[t-SNE] Computed conditional probabilities for sample 32000 / 170760
[t-SNE] Computed conditional probabilities for sample 33000 / 170760
[t-SNE] Computed conditional probabilities for sample 34000 / 170760
[t-SNE] Computed conditional probabilities for sample 35000 / 170760
[t-SNE] Computed conditional probabilities for sample 36000 / 170760
[t-SNE] Computed conditional probabilities for sample 37000 / 170760
[t-SNE] Computed conditional probabilities for sample 38000 / 170760
[t-SNE] Computed conditional probabilities for sample 39000 / 170760
[t-SNE] Computed conditional probabilities for sample 40000 / 170760
[t-SNE] Computed conditional probabilities for sample 41000 / 170760
[t-SNE] Computed conditional probabilities for sample 42000 / 170760
[t-SNE] Computed conditional probabilities for sample 43000 / 170760
[t-SNE] Computed conditional probabilities for sample 44000 / 170760
[t-SNE] Computed conditional probabilities for sample 45000 / 170760
[t-SNE] Computed conditional probabilities for sample 46000 / 170760
[t-SNE] Computed conditional probabilities for sample 47000 / 170760
[t-SNE] Computed conditional probabilities for sample 48000 / 170760
[t-SNE] Computed conditional probabilities for sample 49000 / 170760
[t-SNE] Computed conditional probabilities for sample 50000 / 170760
[t-SNE] Computed conditional probabilities for sample 51000 / 170760
[t-SNE] Computed conditional probabilities for sample 52000 / 170760
[t-SNE] Computed conditional probabilities for sample 53000 / 170760
[t-SNE] Computed conditional probabilities for sample 54000 / 170760
[t-SNE] Computed conditional probabilities for sample 55000 / 170760
[t-SNE] Computed conditional probabilities for sample 56000 / 170760
[t-SNE] Computed conditional probabilities for sample 57000 / 170760
[t-SNE] Computed conditional probabilities for sample 58000 / 170760
[t-SNE] Computed conditional probabilities for sample 59000 / 170760
[t-SNE] Computed conditional probabilities for sample 60000 / 170760
[t-SNE] Computed conditional probabilities for sample 61000 / 170760
[t-SNE] Computed conditional probabilities for sample 62000 / 170760
[t-SNE] Computed conditional probabilities for sample 63000 / 170760
[t-SNE] Computed conditional probabilities for sample 64000 / 170760
[t-SNE] Computed conditional probabilities for sample 65000 / 170760
[t-SNE] Computed conditional probabilities for sample 66000 / 170760
[t-SNE] Computed conditional probabilities for sample 67000 / 170760
[t-SNE] Computed conditional probabilities for sample 68000 / 170760
[t-SNE] Computed conditional probabilities for sample 69000 / 170760
[t-SNE] Computed conditional probabilities for sample 70000 / 170760
[t-SNE] Computed conditional probabilities for sample 71000 / 170760
[t-SNE] Computed conditional probabilities for sample 72000 / 170760
[t-SNE] Computed conditional probabilities for sample 73000 / 170760
[t-SNE] Computed conditional probabilities for sample 74000 / 170760
[t-SNE] Computed conditional probabilities for sample 75000 / 170760
[t-SNE] Computed conditional probabilities for sample 76000 / 170760
[t-SNE] Computed conditional probabilities for sample 77000 / 170760
[t-SNE] Computed conditional probabilities for sample 78000 / 170760
[t-SNE] Computed conditional probabilities for sample 79000 / 170760
[t-SNE] Computed conditional probabilities for sample 80000 / 170760
[t-SNE] Computed conditional probabilities for sample 81000 / 170760
[t-SNE] Computed conditional probabilities for sample 82000 / 170760
[t-SNE] Computed conditional probabilities for sample 83000 / 170760
[t-SNE] Computed conditional probabilities for sample 84000 / 170760
[t-SNE] Computed conditional probabilities for sample 85000 / 170760
[t-SNE] Computed conditional probabilities for sample 86000 / 170760
[t-SNE] Computed conditional probabilities for sample 87000 / 170760
[t-SNE] Computed conditional probabilities for sample 88000 / 170760
[t-SNE] Computed conditional probabilities for sample 89000 / 170760
[t-SNE] Computed conditional probabilities for sample 90000 / 170760
[t-SNE] Computed conditional probabilities for sample 91000 / 170760
[t-SNE] Computed conditional probabilities for sample 92000 / 170760
[t-SNE] Computed conditional probabilities for sample 93000 / 170760
[t-SNE] Computed conditional probabilities for sample 94000 / 170760
[t-SNE] Computed conditional probabilities for sample 95000 / 170760
[t-SNE] Computed conditional probabilities for sample 96000 / 170760
[t-SNE] Computed conditional probabilities for sample 97000 / 170760
[t-SNE] Computed conditional probabilities for sample 98000 / 170760
[t-SNE] Computed conditional probabilities for sample 99000 / 170760
[t-SNE] Computed conditional probabilities for sample 100000 / 170760
[t-SNE] Computed conditional probabilities for sample 101000 / 170760
[t-SNE] Computed conditional probabilities for sample 102000 / 170760
[t-SNE] Computed conditional probabilities for sample 103000 / 170760
[t-SNE] Computed conditional probabilities for sample 104000 / 170760
[t-SNE] Computed conditional probabilities for sample 105000 / 170760
[t-SNE] Computed conditional probabilities for sample 106000 / 170760
[t-SNE] Computed conditional probabilities for sample 107000 / 170760
[t-SNE] Computed conditional probabilities for sample 108000 / 170760
[t-SNE] Computed conditional probabilities for sample 109000 / 170760
[t-SNE] Computed conditional probabilities for sample 110000 / 170760
[t-SNE] Computed conditional probabilities for sample 111000 / 170760
[t-SNE] Computed conditional probabilities for sample 112000 / 170760
[t-SNE] Computed conditional probabilities for sample 113000 / 170760
[t-SNE] Computed conditional probabilities for sample 114000 / 170760
[t-SNE] Computed conditional probabilities for sample 115000 / 170760
[t-SNE] Computed conditional probabilities for sample 116000 / 170760
[t-SNE] Computed conditional probabilities for sample 117000 / 170760
[t-SNE] Computed conditional probabilities for sample 118000 / 170760
[t-SNE] Computed conditional probabilities for sample 119000 / 170760
[t-SNE] Computed conditional probabilities for sample 120000 / 170760
[t-SNE] Computed conditional probabilities for sample 121000 / 170760
[t-SNE] Computed conditional probabilities for sample 122000 / 170760
[t-SNE] Computed conditional probabilities for sample 123000 / 170760
[t-SNE] Computed conditional probabilities for sample 124000 / 170760
[t-SNE] Computed conditional probabilities for sample 125000 / 170760
[t-SNE] Computed conditional probabilities for sample 126000 / 170760
[t-SNE] Computed conditional probabilities for sample 127000 / 170760
[t-SNE] Computed conditional probabilities for sample 128000 / 170760
[t-SNE] Computed conditional probabilities for sample 129000 / 170760
[t-SNE] Computed conditional probabilities for sample 130000 / 170760
[t-SNE] Computed conditional probabilities for sample 131000 / 170760
[t-SNE] Computed conditional probabilities for sample 132000 / 170760
[t-SNE] Computed conditional probabilities for sample 133000 / 170760
[t-SNE] Computed conditional probabilities for sample 134000 / 170760
[t-SNE] Computed conditional probabilities for sample 135000 / 170760
[t-SNE] Computed conditional probabilities for sample 136000 / 170760
[t-SNE] Computed conditional probabilities for sample 137000 / 170760
[t-SNE] Computed conditional probabilities for sample 138000 / 170760
[t-SNE] Computed conditional probabilities for sample 139000 / 170760
[t-SNE] Computed conditional probabilities for sample 140000 / 170760
[t-SNE] Computed conditional probabilities for sample 141000 / 170760
[t-SNE] Computed conditional probabilities for sample 142000 / 170760
[t-SNE] Computed conditional probabilities for sample 143000 / 170760
[t-SNE] Computed conditional probabilities for sample 144000 / 170760
[t-SNE] Computed conditional probabilities for sample 145000 / 170760
[t-SNE] Computed conditional probabilities for sample 146000 / 170760
[t-SNE] Computed conditional probabilities for sample 147000 / 170760
[t-SNE] Computed conditional probabilities for sample 148000 / 170760
[t-SNE] Computed conditional probabilities for sample 149000 / 170760
[t-SNE] Computed conditional probabilities for sample 150000 / 170760
[t-SNE] Computed conditional probabilities for sample 151000 / 170760
[t-SNE] Computed conditional probabilities for sample 152000 / 170760
[t-SNE] Computed conditional probabilities for sample 153000 / 170760
[t-SNE] Computed conditional probabilities for sample 154000 / 170760
[t-SNE] Computed conditional probabilities for sample 155000 / 170760
[t-SNE] Computed conditional probabilities for sample 156000 / 170760
[t-SNE] Computed conditional probabilities for sample 157000 / 170760
[t-SNE] Computed conditional probabilities for sample 158000 / 170760
[t-SNE] Computed conditional probabilities for sample 159000 / 170760
[t-SNE] Computed conditional probabilities for sample 160000 / 170760
[t-SNE] Computed conditional probabilities for sample 161000 / 170760
[t-SNE] Computed conditional probabilities for sample 162000 / 170760
[t-SNE] Computed conditional probabilities for sample 163000 / 170760
[t-SNE] Computed conditional probabilities for sample 164000 / 170760
[t-SNE] Computed conditional probabilities for sample 165000 / 170760
[t-SNE] Computed conditional probabilities for sample 166000 / 170760
[t-SNE] Computed conditional probabilities for sample 167000 / 170760
[t-SNE] Computed conditional probabilities for sample 168000 / 170760
[t-SNE] Computed conditional probabilities for sample 169000 / 170760
[t-SNE] Computed conditional probabilities for sample 170000 / 170760
[t-SNE] Computed conditional probabilities for sample 170760 / 170760
[t-SNE] Mean sigma: 0.000000
[t-SNE] Computed conditional probabilities in 12.639s
[t-SNE] Iteration 50: error = 123.7767487, gradient norm = 0.0000001 (50 iterations in 302.532s)
[t-SNE] Iteration 50: gradient norm 0.000000. Finished.
[t-SNE] KL divergence after 50 iterations with early exaggeration: 123.776749