Crisp overview of the dataset

  • Accelerometer and gyroscope readings are taken from 30 volunteers (referred to as subjects) while performing the following 6 activities: 1) Walking 2) WalkingUpstairs 3) WalkingDownstairs 4) Sitting 5) Standing 6) Laying.
  • Readings are divided into windows of 2.56 seconds with 50% overlap (see the windowing sketch after this list).
  • Accelerometer readings are split into gravity acceleration and body acceleration signals, each of which has x, y and z components.
  • Gyroscope readings measure angular velocity and also have x, y and z components.
  • Jerk signals are derived from the body acceleration readings.
  • Fourier transforms are applied to the above time-domain signals to obtain frequency-domain signals.
  • For each window, statistics such as mean, max, mad, sma, AR coefficients, energy bands, entropy, etc. are computed on all the base signals.
  • This yields a feature vector of 561 features, which are provided in the dataset.
  • Each window of readings is thus a datapoint with 561 features.
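To make the windowing concrete, below is a minimal sketch of cutting 2.56-second windows with 50% overlap from a raw signal. The 50 Hz sampling rate (so 128 samples per window) comes from the dataset documentation; the signal here is synthetic:
In [ ]:
import numpy as np

# UCI HAR samples at 50 Hz: a 2.56 s window = 128 samples; 50% overlap -> step of 64
signal = np.random.randn(1000)          # synthetic stand-in for one raw axis
window, step = 128, 64
windows = np.array([signal[i:i + window]
                    for i in range(0, len(signal) - window + 1, step)])
print(windows.shape)                    # (n_windows, 128): one row per datapoint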

Problem Framework

  • The data from the 30 subjects (volunteers) is randomly split by subject into 70% (21 subjects) train and 30% (9 subjects) test data (see the sketch below).
  • Each datapoint corresponds to one of the 6 activities.
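The dataset ships pre-split this way on disk, so nothing below needs re-splitting; but for illustration, a subject-wise split (no subject in both train and test) could be done with scikit-learn's GroupShuffleSplit. A sketch on toy data:
In [ ]:
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# toy stand-ins: 100 datapoints from 10 subjects
X_toy = np.random.randn(100, 5)
y_toy = np.random.randint(1, 7, 100)
subjects = np.repeat(np.arange(10), 10)

gss = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=42)
train_idx, test_idx = next(gss.split(X_toy, y_toy, groups=subjects))
assert not set(subjects[train_idx]) & set(subjects[test_idx])  # no subject leaks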

Problem Statement

  • Given a new datapoint, we have to predict the activity it belongs to.
In [240]:
# All the necessary imports
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
In [241]:
# feature names are loaded from the features.txt file provided with the dataset
with open('/home/codename_sai/Desktop/1_HumanActivityRecognition/UCI HAR Dataset/features.txt') as f:
    features = [line.split()[1] for line in f]
In [242]:
print("No of features : {}".format(features.__len__()))
print(features[:10])
No of features : 561
['tBodyAcc-mean()-X', 'tBodyAcc-mean()-Y', 'tBodyAcc-mean()-Z', 'tBodyAcc-std()-X', 'tBodyAcc-std()-Y', 'tBodyAcc-std()-Z', 'tBodyAcc-mad()-X', 'tBodyAcc-mad()-Y', 'tBodyAcc-mad()-Z', 'tBodyAcc-max()-X']
In [243]:
datapath = '/home/codename_sai/Desktop/1_HumanActivityRecognition/UCI HAR Dataset/'
In [244]:
# Train data is loaded from the X_train and y_train files given
X_train = pd.read_csv(datapath + 'train/X_train.txt', sep='\s+', names=features)

# add subject column to the dataframe
# note: read without header=None, so pandas consumes the first subject id as the
# column header; this shifts the series and leaves a NaN at the last index (found below)
X_train['Subject'] = pd.read_csv(datapath + 'train/subject_train.txt')

y_train = pd.read_csv(datapath + 'train/y_train.txt', sep='\s+', names=['Activity'])
In [245]:
# Test data is loaded from the X_test and y_test files given
X_test = pd.read_csv(datapath + 'test/X_test.txt', sep='\s+', names=features)

# add subject column to the dataframe (same header caveat as above)
X_test['Subject'] = pd.read_csv(datapath + 'test/subject_test.txt')

y_test = pd.read_csv(datapath + 'test/y_test.txt', sep='\s+', names=['Activity'])
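As an aside, a header-safe way to read the subject files would avoid the NaN entirely (a sketch, not used here, so the cleaning walkthrough below still applies):
In [ ]:
# header=None keeps the first subject id as data instead of consuming it as a header
subj = pd.read_csv(datapath + 'train/subject_train.txt', header=None, names=['Subject'])
# X_train['Subject'] = subj['Subject'].values   # would leave no NaN to clean up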
In [246]:
# train data has 7352 datapoints and 562 columns (561 features + the Subject column)
X_train.shape
Out[246]:
(7352, 562)
In [247]:
# 1 - Walking
# 2 - WalkingUpstairs
# 3 - WalkingDownstairs
# 4 - Sitting
# 5 - Standing
# 6 - Laying

y_train['Activity'].value_counts()
Out[247]:
6    1407
5    1374
4    1286
1    1226
2    1073
3     986
Name: Activity, dtype: int64
  • y_train has 6 activities, so this is a multiclass classification problem.
  • The dataset is reasonably balanced across classes.
In [248]:
type(X_train), X_train.shape, type(y_train), y_train.shape
Out[248]:
(pandas.core.frame.DataFrame,
 (7352, 562),
 pandas.core.frame.DataFrame,
 (7352, 1))
In [249]:
# Add class label to the X_train
X_train['Activity'] = y_train['Activity']
In [250]:
X_train.head()
Out[250]:
tBodyAcc-mean()-X tBodyAcc-mean()-Y tBodyAcc-mean()-Z tBodyAcc-std()-X tBodyAcc-std()-Y tBodyAcc-std()-Z tBodyAcc-mad()-X tBodyAcc-mad()-Y tBodyAcc-mad()-Z tBodyAcc-max()-X ... fBodyBodyGyroJerkMag-kurtosis() angle(tBodyAccMean,gravity) angle(tBodyAccJerkMean),gravityMean) angle(tBodyGyroMean,gravityMean) angle(tBodyGyroJerkMean,gravityMean) angle(X,gravityMean) angle(Y,gravityMean) angle(Z,gravityMean) Subject Activity
0 0.288585 -0.020294 -0.132905 -0.995279 -0.983111 -0.913526 -0.995112 -0.983185 -0.923527 -0.934724 ... -0.710304 -0.112754 0.030400 -0.464761 -0.018446 -0.841247 0.179941 -0.058627 1.0 5
1 0.278419 -0.016411 -0.123520 -0.998245 -0.975300 -0.960322 -0.998807 -0.974914 -0.957686 -0.943068 ... -0.861499 0.053477 -0.007435 -0.732626 0.703511 -0.844788 0.180289 -0.054317 1.0 5
2 0.279653 -0.019467 -0.113462 -0.995380 -0.967187 -0.978944 -0.996520 -0.963668 -0.977469 -0.938692 ... -0.760104 -0.118559 0.177899 0.100699 0.808529 -0.848933 0.180637 -0.049118 1.0 5
3 0.279174 -0.026201 -0.123283 -0.996091 -0.983403 -0.990675 -0.997099 -0.982750 -0.989302 -0.938692 ... -0.482845 -0.036788 -0.012892 0.640011 -0.485366 -0.848649 0.181935 -0.047663 1.0 5
4 0.276629 -0.016570 -0.115362 -0.998139 -0.980817 -0.990482 -0.998321 -0.979672 -0.990441 -0.942469 ... -0.699205 0.123320 0.122542 0.693578 -0.615971 -0.847865 0.185151 -0.043892 1.0 5

5 rows × 563 columns

In [251]:
y_train.head()
Out[251]:
Activity
0 5
1 5
2 5
3 5
4 5

Data Cleaning - Dirtying hands

In [252]:
# Checking for duplicates
sum(X_train.duplicated()), sum(X_test.duplicated())
Out[252]:
(0, 0)
  • As seen above, there are no duplicate rows.
In [253]:
# checking for NaN/null values
X_train.isnull().values.any(), X_test.isnull().values.any()
Out[253]:
(True, True)
  • There are NaNs; let's locate them.
In [254]:
# Taking the rows which have NaN
nan_rows_train = X_train[X_train.isnull().any(axis=1)]
nan_rows_test = X_test[X_test.isnull().any(axis=1)]
print('NaN rows in train')
nan_rows_train
NaN rows in train
Out[254]:
tBodyAcc-mean()-X tBodyAcc-mean()-Y tBodyAcc-mean()-Z tBodyAcc-std()-X tBodyAcc-std()-Y tBodyAcc-std()-Z tBodyAcc-mad()-X tBodyAcc-mad()-Y tBodyAcc-mad()-Z tBodyAcc-max()-X ... fBodyBodyGyroJerkMag-kurtosis() angle(tBodyAccMean,gravity) angle(tBodyAccJerkMean),gravityMean) angle(tBodyGyroMean,gravityMean) angle(tBodyGyroJerkMean,gravityMean) angle(X,gravityMean) angle(Y,gravityMean) angle(Z,gravityMean) Subject Activity
7351 0.351503 -0.012423 -0.203867 -0.26927 -0.087212 0.177404 -0.377404 -0.038678 0.22943 0.269013 ... -0.740738 -0.280088 -0.007739 -0.056088 -0.616956 -0.783267 0.246809 0.036695 NaN 2

1 rows × 563 columns

In [255]:
print('NaN rows in test')
nan_rows_test
NaN rows in test
Out[255]:
tBodyAcc-mean()-X tBodyAcc-mean()-Y tBodyAcc-mean()-Z tBodyAcc-std()-X tBodyAcc-std()-Y tBodyAcc-std()-Z tBodyAcc-mad()-X tBodyAcc-mad()-Y tBodyAcc-mad()-Z tBodyAcc-max()-X ... fBodyBodyGyroJerkMag-skewness() fBodyBodyGyroJerkMag-kurtosis() angle(tBodyAccMean,gravity) angle(tBodyAccJerkMean),gravityMean) angle(tBodyGyroMean,gravityMean) angle(tBodyGyroJerkMean,gravityMean) angle(X,gravityMean) angle(Y,gravityMean) angle(Z,gravityMean) Subject
2946 0.153627 -0.018437 -0.137018 -0.330046 -0.195253 -0.164339 -0.430974 -0.218295 -0.229933 -0.111527 ... -0.072237 -0.43694 0.598808 -0.287951 0.87603 -0.024965 -0.66008 0.263936 0.188103 NaN

1 rows × 562 columns

  • The Subject value of the datapoint at index 7351 is missing (in test, at index 2946): a side effect of the header-consuming read noted above. As it is just 1 row out of 7352, we discard the whole row.
In [256]:
# X_train has one extra column compared to X_test since we added the class label to it
X_train.shape, X_test.shape
Out[256]:
((7352, 563), (2947, 562))
In [257]:
# dropping the NaN/Null rows
X_train = X_train.dropna(how = 'any')
X_test = X_test.dropna(how = 'any')
In [258]:
# Drop the corresponding y_train and y_test rows
y_train = y_train.drop([7351])
y_test = y_test.drop([2946])
In [259]:
# Checking the shape after dropping Nan's
X_train.shape, X_test.shape
Out[259]:
((7351, 563), (2946, 562))
In [260]:
# Cross-checking for the successful drop of NaNs
X_train.isnull().values.any(), X_test.isnull().values.any()
Out[260]:
(False, False)
In [261]:
X_train.columns
Out[261]:
Index(['tBodyAcc-mean()-X', 'tBodyAcc-mean()-Y', 'tBodyAcc-mean()-Z',
       'tBodyAcc-std()-X', 'tBodyAcc-std()-Y', 'tBodyAcc-std()-Z',
       'tBodyAcc-mad()-X', 'tBodyAcc-mad()-Y', 'tBodyAcc-mad()-Z',
       'tBodyAcc-max()-X',
       ...
       'fBodyBodyGyroJerkMag-kurtosis()', 'angle(tBodyAccMean,gravity)',
       'angle(tBodyAccJerkMean),gravityMean)',
       'angle(tBodyGyroMean,gravityMean)',
       'angle(tBodyGyroJerkMean,gravityMean)', 'angle(X,gravityMean)',
       'angle(Y,gravityMean)', 'angle(Z,gravityMean)', 'Subject', 'Activity'],
      dtype='object', length=563)

From above we can see that,

  • There are '()' and '-' characters in the column names, making them less readable. Let's drop them.
  • 'BodyBody' occurs in several names; this is a known typo in the original feature names. Let's collapse it to a single 'Body'.
In [262]:
# Removing '()' and '-' from column names for readability
# (regex=True makes the character-class pattern explicit; newer pandas defaults to regex=False)
X_train.columns = X_train.columns.str.replace('[()]', '', regex=True)
X_train.columns = X_train.columns.str.replace('-', '', regex=False)
X_test.columns = X_test.columns.str.replace('[()]', '', regex=True)
X_test.columns = X_test.columns.str.replace('-', '', regex=False)
In [263]:
# 'BodyBody' in some feature names is a typo in the original dataset; collapse it to 'Body'
X_train = X_train.rename(columns = lambda x : str(x).replace('BodyBody','Body'))
X_test = X_test.rename(columns = lambda x : str(x).replace('BodyBody','Body'))
In [264]:
# Adding 'ActivityNames' column for interpretability.
X_train['ActivityNames'] = X_train['Activity']
In [265]:
# Replacing activity numbers with activity names for interpretability
activity_map = {1: 'Walking', 2: 'WalkingUpstairs', 3: 'WalkingDownstairs',
                4: 'Sitting', 5: 'Standing', 6: 'Laying'}
X_train.ActivityNames = X_train.ActivityNames.replace(activity_map)
In [266]:
# Copying X_train to df so the labour done above is preserved
# in case df is changed later (note: .copy() is needed; plain assignment
# would only create a second reference to the same dataframe)
df = X_train.copy()

Feature Engineering from Domain Knowledge - Dirtying mind

As we carefully study the accelerometer and gyroscope signals and the derived features, we arrive at the following:

  • 1) In static activities (sit, stand, lie down) motion information will not be very useful.
  • 2) In the dynamic activities (Walking, WalkingUpstairs, WalkingDownstairs) motion information will be significant.
  • 3) Angle variables will be useful both in differentiating 'lie vs stand' and 'walk up vs walk down'.
  • 4) Acceleration and jerk variables are important in distinguishing various kinds of motion.
  • 5) Magnitude and angle variables carry much the same information as (are strongly correlated with) the XYZ component variables, therefore we remove all the x, y, z component variables and retain the magnitude and angle variables (see the quick check after this list).
  • 6) We ignore the band variables as we have no simple way to interpret them and relate them to physical activities.
  • 7) Mean and std are important; skewness and kurtosis may also be, hence we include all of these.
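A quick sanity check of claim 5 on the cleaned training frame (df here still holds all 561 cleaned features; the column names assume the '()' and '-' stripping done above):
In [ ]:
# claim 5 check: correlation of one magnitude feature with the per-axis means it summarizes
cols = ['tBodyAccmeanX', 'tBodyAccmeanY', 'tBodyAccmeanZ', 'tBodyAccMagmean']
print(df[cols].corr()['tBodyAccMagmean'])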

After all the feature engineering and analysis we end up with the following 35 features (38 columns once 'Subject', 'Activity' and 'ActivityNames' are added).

In [267]:
# final list of features
f = ['tBodyAccMagmean','tBodyAccMagstd','tBodyAccJerkMagmean','tBodyAccJerkMagstd','tBodyGyroMagmean',
     'tBodyGyroMagstd','tBodyGyroJerkMagmean','tBodyGyroJerkMagstd','fBodyAccMagmean','fBodyAccMagstd',
     'fBodyAccJerkMagmean','fBodyAccJerkMagstd','fBodyGyroMagmean','fBodyGyroMagstd','fBodyGyroJerkMagmean',
     'fBodyGyroJerkMagstd','fBodyGyroMagmeanFreq','fBodyGyroJerkMagmeanFreq','fBodyAccMagmeanFreq',
     'fBodyAccJerkMagmeanFreq','fBodyAccMagskewness','fBodyAccMagkurtosis','fBodyAccJerkMagskewness',
     'fBodyAccJerkMagkurtosis','fBodyGyroMagskewness','fBodyGyroMagkurtosis','fBodyGyroJerkMagskewness',
     'fBodyGyroJerkMagkurtosis','angletBodyAccJerkMean,gravityMean','angletBodyAccMean,gravity',
     'angletBodyGyroJerkMean,gravityMean','angletBodyGyroMean,gravityMean','angleX,gravityMean',
     'angleY,gravityMean','angleZ,gravityMean']
In [268]:
# we add 'Subject', 'Activity' and 'ActivityNames' to the features
f1 = f + ['Activity','ActivityNames','Subject']
In [269]:
dfR = df[f]
In [272]:
dfR_test = X_test[f]
In [206]:
# Taking a new dataframe with reduced features
df = df[f1]
In [207]:
df.shape
Out[207]:
(7351, 38)
In [208]:
# Just for readability of feature names we remove 'Body' and 'Mag'
# and replace 'mean' with 'Mean' and 'std' with 'SD', e.g.
# tAccMean refers to tBodyAccMagmean
# fAccMean refers to fBodyAccMagmean
df = df.rename(columns = lambda x : str(x).replace('Body',''))
df = df.rename(columns = lambda x : str(x).replace('Mag',''))
df = df.rename(columns = lambda x : str(x).replace('mean','Mean'))
df = df.rename(columns = lambda x : str(x).replace('std','SD'))
In [209]:
# for plotting purposes taking datapoints of each activity to a different dataframe
df1 = df[df['Activity']==1]
df2 = df[df['Activity']==2]
df3 = df[df['Activity']==3]
df4 = df[df['Activity']==4]
df5 = df[df['Activity']==5]
df6 = df[df['Activity']==6]
In [210]:
sns.distplot(df1['tAccMean'],color = 'red',hist = True, kde = False,label = 'Walking')
sns.distplot(df2['tAccMean'],color = 'blue',hist = True, kde = False,label = 'Walking Up')
sns.distplot(df3['tAccMean'],color = 'green',hist = True, kde = False,label = 'Walking down')
sns.distplot(df4['tAccMean'],color = 'yellow',hist = True, kde = False,label = 'Sitting')
sns.distplot(df5['tAccMean'],color = 'm',hist = True, kde = False,label = 'Standing')
sns.distplot(df6['tAccMean'],color = 'orange',hist = True, kde = False,label = 'Laying')
sns.set_style("whitegrid")
plt.legend()
plt.tight_layout()
plt.show()
  • The histogram of body acceleration magnitude mean above separates static from dynamic activities well.
  • This is an example of data exploration in support of our heuristic variable selection using domain knowledge.
In [211]:
plt.subplot(2,2,1)
sns.distplot(df1['angletAccMean,gravity'],color = 'red',hist = True, kde = True,label = 'Walking')
sns.distplot(df2['angletAccMean,gravity'],color = 'blue',hist = True, kde = True,label = 'Walking Up')
sns.distplot(df3['angletAccMean,gravity'],color = 'green',hist = True, kde = True,label = 'Walking down')
plt.legend()
plt.subplot(2,2,2)
sns.distplot(df4['angletAccMean,gravity'],color = 'yellow',hist = True, kde = True,label = 'Sitting')
sns.distplot(df5['angletAccMean,gravity'],color = 'm',hist = True, kde = True,label = 'Standing')
sns.distplot(df6['angletAccMean,gravity'],color = 'orange',hist = True, kde = True,label = 'Laying')
sns.set_style("whitegrid")
plt.legend()
plt.tight_layout()
plt.show()
  • The variance of the angle between the mean acceleration and gravity is very small for static activities.
  • For dynamic activities the variance of this angle is much larger.
  • This, too, can be used to distinguish static from dynamic activities.
In [212]:
sns.boxplot(x='ActivityNames', y='tAccMean',data=df, showfliers=False, saturation=1)
# plt.savefig('pair.png')
plt.xticks(rotation=90)
plt.show()
  • If tAccMean is < -0.8 then the activity is Standing, Sitting or Laying.
  • If tAccMean is > -0.6 then the activity is Walking, WalkingDownstairs or WalkingUpstairs.
  • If tAccMean > 0.0 then the activity is WalkingDownstairs.
  • With such thresholds we can classify about 75% of the activity labels, with some errors (see the sketch below).
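As a rough check of these thresholds, a one-line rule on the training frame (the -0.7 cut-off is a hypothetical midpoint between the two bands above, not a tuned value):
In [ ]:
# toy rule from the boxplot: tAccMean alone separates static from dynamic windows
static = df['ActivityNames'].isin(['Sitting', 'Standing', 'Laying'])
pred_static = df['tAccMean'] < -0.7
print('static-vs-dynamic rule accuracy: {:.3f}'.format((pred_static == static).mean()))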
In [213]:
sns.boxplot(x='ActivityNames', y='angleX,gravityMean', data=df, showfliers=False)
plt.xticks(rotation = 40)
plt.show()
  • If angleX,gravityMean > 0 then the activity is Laying.
  • We can classify nearly all datapoints belonging to the Laying activity with a single if-else statement (see below).
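The same style of check for the Laying rule (the 0 threshold is read directly off the boxplot):
In [ ]:
pred_lay = df['angleX,gravityMean'] > 0
is_lay = df['ActivityNames'] == 'Laying'
print('Laying rule accuracy: {:.3f}'.format((pred_lay == is_lay).mean()))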
In [215]:
sns.boxplot(x='ActivityNames', y='angleY,gravityMean', data = df, showfliers=False)
plt.xticks(rotation = 40)
plt.show()
  • If angleY,gravityMean > 0.25 then the activity is WalkingUpstairs.
  • We can classify about 75% of the datapoints belonging to the WalkingUpstairs activity, with some errors.

Let's see whether the features we hand-picked heuristically using domain knowledge make sense

  • Let's build a simple model with the reduced features and check its performance

Logistic Regression on reduced features

In [237]:
# import
from sklearn.linear_model import LogisticRegression

# instantiate LogisticRegression model
lgr = LogisticRegression()
In [270]:
# .values.ravel() flattens the (n,1) label frame into the 1-d array sklearn expects
lgr.fit(dfR, y_train.values.ravel())
Out[270]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
In [273]:
y_pred_lgr = lgr.predict(dfR_test)
In [275]:
from sklearn.metrics import accuracy_score
In [276]:
print(accuracy_score(y_test,y_pred_lgr))
0.888662593347
  • Using logistic regression we get an accuracy of about 0.89 with the reduced feature set selected using domain knowledge.
  • This suggests our domain-knowledge-driven feature engineering was on the right track (see the per-class breakdown below).
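Accuracy alone can hide per-class behaviour; a sketch of a per-class breakdown on the same predictions (not run here):
In [ ]:
from sklearn.metrics import confusion_matrix, classification_report

# rows = true activity (1..6), columns = predicted activity
print(confusion_matrix(y_test.values.ravel(), y_pred_lgr))
print(classification_report(y_test.values.ravel(), y_pred_lgr))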

Random forests on reduced features

In [277]:
from sklearn.ensemble import RandomForestClassifier
In [279]:
clf = RandomForestClassifier(n_estimators=200)
# again pass the labels as a 1-d array to avoid the DataConversionWarning
clf.fit(dfR, y_train.values.ravel())
pred = clf.predict(dfR_test)
In [280]:
print(accuracy_score(y_test,pred))
0.890020366599
  • Using random forests we get an accuracy of about 0.89 with the same reduced feature set.
  • This again supports the feature engineering done using domain knowledge (see the importance check below).
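A quick look at which of the 35 hand-picked features the forest actually leans on (a sketch using the fitted clf above):
In [ ]:
# rank the reduced features by the forest's impurity-based importance
imp = pd.Series(clf.feature_importances_, index=dfR.columns).sort_values(ascending=False)
print(imp.head(10))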

t-SNE on the original dataset

  • Let's run t-SNE on the original dataset to visualize our data in lower dimensions
In [216]:
X_train.shape
Out[216]:
(7351, 564)
In [ ]:
dft = X_train.copy()   # .copy() so deleting columns below does not mutate X_train
In [218]:
del dft['Subject']
del dft['Activity']
del dft['ActivityNames']
In [219]:
dft.shape
Out[219]:
(7351, 561)
In [282]:
# import
from sklearn.manifold import TSNE
In [283]:
# Taking [2,5,10,30,50] as perplexities to see which one converges better
l = [2,5,10,30,50]
In [233]:
# loop to compute and display t-SNE for different perplexities
X_tsne_p = []
for i in l:
    tsne1 = TSNE(perplexity=i)
    X1_tsne = tsne1.fit_transform(dft)          # (n_samples, 2) embedding
    d_1 = pd.DataFrame({'x': X1_tsne[:, 0], 'y': X1_tsne[:, 1]})
    d_1['ActivityNames'] = df['ActivityNames']  # row indices align (0..7350)
    X_tsne_p.append(d_1)
    sns.FacetGrid(d_1, hue='ActivityNames', size=8).map(plt.scatter, 'x', 'y').add_legend()
    print("t-SNE for perplexity = ", i)
    plt.show()
t-SNE for perplexity =  2
t-SNE for perplexity =  5
t-SNE for perplexity =  10
t-SNE for perplexity =  30
t-SNE for perplexity =  50
  • It seems that the embedding converges best at perplexity 50.
  • Let's individually plot the t-SNE plot for perplexity = 50.
In [236]:
# t-SNE plot for perplexity = 50
sns.FacetGrid(X_tsne_p[4], hue='ActivityNames', size=8).map(plt.scatter, 'x', 'y').add_legend()
print("t-SNE for perplexity = 50")
plt.show()
t-SNE for perplexity = 50

Conclusions

  • The Laying activity is clustered together; it can be separated by a hyperplane.
  • Sitting and Standing overlap completely here, but it seems that in higher dimensions they have a separating hyperplane.
  • WalkingDownstairs points are almost all clustered in one place, except a very few.
  • Walking is also mostly clustered together.
  • WalkingUpstairs is distributed more randomly.
  • This gives us the insight that a single linear decision boundary will not classify the datapoints effectively.
  • Non-linear decision boundaries might work well, as the boundaries between the classes look partly linear and partly non-linear.
  • We can try logistic regression: despite being geared to binary classification (extended one-vs-rest here), it is worth trying as a baseline model since it is very fast.
  • We can also try a linear SVM (a sketch follows below).
  • Random forests and GBDT should do a fair job on overlapping data and the multiclass setting, so let's try them as well.
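A sketch of the linear SVM baseline proposed above, reusing the same reduced features (LinearSVC fits one-vs-rest by default; output not shown as this was not run):
In [ ]:
from sklearn.svm import LinearSVC

svm = LinearSVC(C=1.0)
svm.fit(dfR, y_train.values.ravel())
print(accuracy_score(y_test, svm.predict(dfR_test)))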
In [ ]: