Dataset from the Netflix Prize competition, held by Netflix to improve its recommendation system.
Original contest site: http://netflixprize.com/index.html
Kaggle dataset page: https://www.kaggle.com/netflix-inc/netflix-prize-data
The Kaggle download contains 8 files; the ones used here are described below.
Filename: Movie_titles.csv
Format : (MovieID, Year, Title)
Description of features:
MovieID is an integer ranging from 1 to 17,770, assigned sequentially. It corresponds to neither the IMDb movie ID nor the Netflix movie ID.
Year is the year of release of the corresponding DVD, which may differ from the theatrical release year. It is an integer ranging from 1890 to 2005.
Title is the Netflix movie title in English, stored as a string.
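As a quick illustration (not part of the original notebook), these lines can be parsed with a two-comma split, which matters because titles themselves may contain commas. The sample lines below are assumed for illustration.

```python
import io
import pandas as pd

# Hedged sketch: parsing movie_titles.csv-style lines. Titles may contain
# commas, so we split on at most two commas. The sample lines are assumed.
sample = "1,2003,Dinosaur Planet\n2,2004,Isle of Man TT 2004 Review\n"
rows = []
for line in io.StringIO(sample):
    movie_id, year, title = line.rstrip("\n").split(",", 2)
    rows.append((int(movie_id), year, title))
movies = pd.DataFrame(rows, columns=["MovieID", "Year", "Title"])
```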
Filename: Combined_data_1.txt
Format: MovieID:
CustomerID,Score,Date
Description of features:
MovieID is the same as in Movie_titles.csv.
CustomerID is an integer ranging from 1 to 2,649,429. These are not the actual Netflix customer IDs; they were anonymized for privacy.
Score is the integer number of stars (1 to 5) the user gave that movie.
Date is the date of rating, in the format YYYY-MM-DD.
Filename: Qualifying.txt
Format: MovieID:
CustomerID,Date
Description of features:
MovieID is the same as in Movie_titles.csv.
CustomerID is an integer ranging from 1 to 2,649,429. These are not the actual Netflix customer IDs; they were anonymized for privacy.
Date is the date of rating, in the format YYYY-MM-DD.
Probe data (Probe.txt) is a subset of the training dataset. Format: a "MovieID:" header line followed by CustomerID lines.
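A minimal parser for this header-based layout (sample IDs assumed, not taken from the real file) could look like:

```python
# Hedged sketch: parsing the "MovieID:" / CustomerID layout of Probe.txt.
# The sample lines below are assumed for illustration.
sample = "1:\n30878\n2647871\n10:\n1952305"
probe_pairs = []
for line in sample.splitlines():
    if line.endswith(":"):
        movie_id = int(line[:-1])   # header line: start of a movie block
    else:
        probe_pairs.append((movie_id, int(line)))
```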
The data comes in a compressed, per-movie format and had to be converted to a tabular format. After conversion, the size exceeded 3.5 GB. Due to limited hardware it was impossible to work on the complete dataset, so only the first file (combined_data_1.txt) was chosen.
Note: Throughout the notebook you'll find variables being deleted. This was done due to the limitations of the hardware.
I wrote a small script to change the file format so that it's easier to work with
'''
fopen = open("combined_data_1.txt", 'r')
fwrite = open("Whole_dataset1.txt", 'a+')
fopen.seek(0); fwrite.seek(0)
fwrite.writelines('CustomerID,Score,Date,MovieID\n')
for line in fopen.readlines():
    line = line.strip()
    if line.endswith(':'):
        # A "MovieID:" header line starts each movie's block
        movie_id = line[:-1]
    else:
        # Append the current movie's ID to every rating row
        fwrite.writelines(line + ',' + movie_id + '\n')
fopen.close(); fwrite.close()
'''
print()
from sys import getsizeof
import pandas as pd
'''This data analysis is done on a part of the data and not the whole,
due to the limitations of the hardware'''
df = pd.read_csv('Whole_dataset1.txt')
df.columns
'''Showing 5 random samples'''
df.sample(n=5)
print ('Number of ratings: {}\nUnique Customers: {}\nNumber of movies: {}'\
.format(df.shape[0],len(set(df.CustomerID)),len(set(df.MovieID))))
So we can see there are 24,053,764 ratings in a sparse matrix of size 470,758 × 4,499.
Therefore only about 1.1% of the matrix is filled.
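As a sanity check, the fill ratio follows directly from these counts:

```python
# Sanity check on the sparsity figure quoted above
n_ratings = 24_053_764
n_users, n_movies = 470_758, 4_499
density = n_ratings / (n_users * n_movies)  # roughly 0.011, i.e. about 1.1%
```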
import numpy as np
np.where(df.isnull())
No missing values
df.duplicated().sum()
No duplicates in the dataset
import seaborn as sns
import matplotlib.pyplot as plt
'''Plotting the histogram of scores'''
plt.figure(figsize=(12,6))
plt.hist(x=list(df.Score),bins=[1,2,3,4,5,6],rwidth=0.9,density=True,color='#0097D9');
plt.xticks([1.5,2.5,3.5,4.5,5.5],[1,2,3,4,5])
plt.title('Distribution of Scores',fontdict={'fontsize':30})
plt.xlabel('Score'); plt.ylabel('Percentage')
plt.savefig('Distribution_of_scores.png')
plt.show()
So we can see that 4 stars is given most often (~33%), then 3 stars (~28%), then 5 (~22.5%), then 2 (~10%) and finally 1 (~5%).
'''Loading the year of movies'''
year_of_movies = pd.read_csv('movie_info_formatted.txt').Year.tolist()
year_of_movies = year_of_movies[:4499]
While trying to plot the distribution of movie release years, we found missing values in the Year column of movie_titles.csv; the movie titles also had some encoding errors, so the Title column was dropped entirely during analysis.
'''Dropping the Null values in year'''
import math
# Filter out NaN years in one pass instead of index-juggling with pop
year_of_movies = [y for y in year_of_movies if not math.isnan(y)]
'''Plotting the distribution of number of movie released in the years'''
plt.figure(figsize=(12,6))
sns.distplot(year_of_movies,color='#8E44AD');
plt.title('Distribution of dates of movie release',fontdict={'fontsize':30})
plt.savefig('Distribution_of_movie_release.png')
plt.show()
We can conclude that most of the movies in our dataset were released during the 2000s.
'''Retrieving the date of rating'''
date = df.Date.tolist()
'''Extracting the year from the full date'''
date = [i[:4] for i in date]
'''Creating a new feature called year'''
df['Year'] = date;
df.head(3)
min(date), max(date)
'''Calculating the percentage of ratings in a year'''
l = len(date); percentage = []
flag = list(set(date))
flag.sort()
for i in flag:
    percentage.append(date.count(i)/l)
print("Year Percentage")
for year, p in zip(flag, percentage):
    print('{}: {}'.format(year, p*100))
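As an aside, calling list.count inside the loop scans the whole list once per year; a Counter does the same job in a single pass. A sketch on an assumed toy sample of year strings:

```python
from collections import Counter

# One-pass alternative to calling date.count(year) once per year.
# The toy list of year strings below is assumed for illustration.
dates = ["1999", "2000", "2000", "2001"]
counts = Counter(dates)
total = sum(counts.values())
shares = {year: counts[year] / total for year in sorted(counts)}
```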
plt.figure(figsize=(15,6))
sns.barplot(x=flag,y=percentage)
plt.title('Year wise distribution of total ratings ',fontdict={'fontsize':30})
plt.xlabel('Year'); plt.ylabel('Percentage of ratings')
plt.savefig('Year_wise_distribution_of_total_ratings.png')
plt.show()
import pickle
In the cell below we calculate every user's first rating year and save it to a file.
Note: there is no need to run this code again.
'''
list_of_customer = sorted(set(df.CustomerID))
year_1st_rating = []
for i in list_of_customer:
    # Earliest rating date for this customer; keep only the year part
    first_date = min(df[df.CustomerID == i].Date)
    year_1st_rating.append(first_date[:4])
with open("year_1st_rating.txt", "wb") as fp:
    pickle.dump(year_1st_rating, fp)
'''
print()
In a pickle file we stored 470,758 values, each representing the year of the first rating a user gave. Computing this list took a lot of time, so we computed it once and stored the result.
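The same result could also be obtained with a single groupby instead of scanning the DataFrame once per customer; a hedged sketch on assumed toy data:

```python
import pandas as pd

# groupby alternative to the per-customer loop above (toy data assumed)
ratings = pd.DataFrame({
    "CustomerID": [1, 1, 2],
    "Date": ["2004-05-01", "2003-01-15", "2005-07-09"],
})
# Earliest date per customer, then the first four characters (the year)
first_year = ratings.groupby("CustomerID").Date.min().str[:4]
```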
with open('year_1st_rating.txt','rb') as fp:
    year_1st_rating = pickle.load(fp)
del fp
'''The values were originally strings,
so convert them to integers'''
year_1st_rating = list(map(int, year_1st_rating))
plt.figure(figsize=(14,7.5))
plt.hist(year_1st_rating,bins=[1999,2000,2001,2002,2003,2004,2005,2006]\
         ,rwidth=0.9,color='#650000',density=True)
plt.xticks([1999.5,2000.5,2001.5,2002.5,2003.5,2004.5,2005.5],\
[1999,2000,2001,2002,2003,2004,2005])
plt.xlabel('Year');plt.ylabel('Percentage')
plt.title('Distribution of year of 1st ratings of Customers',fontdict={'fontsize':30})
plt.savefig('Distribution_year_1st_ratings_Customers.png')
plt.show()
Since Netflix doesn't disclose users' joining dates, we can treat the year in which a user first rated a movie as the joining year.
Under that assumption, we can conclude that Netflix gained the most users in 2004 and 2005.
del year_1st_rating
del date;
from collections import Counter
'''Finding the unique customers and the frequency of their rating'''
fre = dict(Counter(df.CustomerID))
import numpy as np
flag = np.array(list(fre.values()))
plt.figure(figsize=(12,6))
sns.distplot(flag,hist_kws={'cumulative': True},\
kde_kws={'cumulative': True},color='black')
plt.title('CDF of the number of movies rated per customer',fontdict={'fontsize':30})
plt.savefig('Cdf_Number_of_movie_watched.png')
plt.show()
print ('50th percentile: {}\n75th percentile: {}\nMaximum: {}'\
.format(np.percentile(flag,50),np.percentile(flag,75),max(flag)))
We had an array of the number of movies rated by each user; we took its 75th percentile and used it as a threshold to decide whether a customer is frequent or not.
'''If a user has rated more than 64 movies in this set of 4,499
movies, they are considered frequent; otherwise not'''
fvnf = []; tmp = np.percentile(flag,75)
for i in list(df.CustomerID):
    if fre[i] > tmp:
        fvnf.append('F')
    else:
        fvnf.append('NF')
del flag
del fre
'''Creating a new column in the dataframe to store Frequent/Non-frequent'''
df['Freq_nFreq'] = fvnf
del fvnf
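The label column above can also be built without a Python loop, using value_counts and map; a hedged sketch on assumed toy data (the notebook's actual threshold is the 75th percentile, 64):

```python
import numpy as np
import pandas as pd

# Vectorized alternative to the F/NF loop (toy data and threshold assumed)
ratings = pd.DataFrame({"CustomerID": [1, 1, 1, 2]})
counts = ratings.CustomerID.value_counts()   # ratings per customer
threshold = 2                                # toy stand-in for the 75th percentile
ratings["Freq_nFreq"] = np.where(
    ratings.CustomerID.map(counts) > threshold, "F", "NF")
```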
'''Retrieving the distribution of scores given by frequent and non-frequent users'''
score_f = df[df.Freq_nFreq == 'F'].Score.tolist()
score_nf = df[df.Freq_nFreq == 'NF'].Score.tolist()
import matplotlib.patches as mpatches
'''Plotting the histogram of scores'''
plt.figure(figsize=(12,6))
plt.hist(x=[score_f,score_nf],bins=[1,2,3,4,5,6],rwidth=0.9,density=True,color=['#900c3f','#ffc300']);
plt.xticks([1.5,2.5,3.5,4.5,5.5],[1,2,3,4,5])
plt.title('Comparison of Distribution of scores between Freq & Non-Freq users',fontdict={'fontsize':20})
plt.xlabel('Score'); plt.ylabel('Percentage')
red_patch = mpatches.Patch(color='#900c3f', label='Frequent')
black_patch = mpatches.Patch(color='#ffc300', label='Non-Frequent')
plt.legend(handles=[red_patch,black_patch])
plt.savefig('Comparison_of_Distribution_of_scores_freq_nfreq.png')
plt.show()
From this we can see that frequent reviewers are more critical in their ratings: they give a higher percentage of 1, 2 and 3 stars and a lower percentage of 4 and 5 stars.
del score_f; del score_nf
'''Retrieving the scores of 2005'''
score_2005 = df[df.Year == str(2005)].Score.tolist()
'''Plotting the histogram of scores in the year 2005,
when Netflix was at its most popular'''
plt.figure(figsize=(12,6))
plt.hist(x=score_2005,bins=[1,2,3,4,5,6],rwidth=0.9,density=True,color='#15582A');
plt.xticks([1.5,2.5,3.5,4.5,5.5],[1,2,3,4,5])
plt.title('Distribution of Scores when year is 2005',fontdict={'fontsize':30})
plt.xlabel('Score'); plt.ylabel('Percentage')
plt.savefig('Distribution_of_scores_when_year2005.png')
plt.show()
del score_2005
So we can see that in 2005 the distribution of scores was almost the same as the overall distribution.
'''Retrieving the scores of 2004'''
score_2004 = df[df.Year == str(2004)].Score.tolist()
'''Plotting the histogram of scores in the year 2004'''
plt.figure(figsize=(12,6))
plt.hist(x=score_2004,bins=[1,2,3,4,5,6],rwidth=0.9,density=True,color='#461B60');
plt.xticks([1.5,2.5,3.5,4.5,5.5],[1,2,3,4,5])
plt.title('Distribution of Scores when year is 2004',fontdict={'fontsize':30})
plt.xlabel('Score'); plt.ylabel('Percentage')
plt.savefig('Distribution_of_scores_when_year2004.png')
plt.show()
del score_2004
'''Retrieving the scores of 2003'''
score_2003 = df[df.Year == str(2003)].Score.tolist()
'''Plotting the histogram of scores in the year 2003'''
plt.figure(figsize=(12,6))
plt.hist(x=score_2003,bins=[1,2,3,4,5,6],rwidth=0.9,density=True,color='#5D1515');
plt.xticks([1.5,2.5,3.5,4.5,5.5],[1,2,3,4,5])
plt.title('Distribution of Scores when year is 2003',fontdict={'fontsize':30})
plt.xlabel('Score'); plt.ylabel('Percentage')
plt.savefig('Distribution_of_scores_when_year2003.png')
plt.show()
del score_2003
'''Retrieving the scores of 2002'''
score_2002 = df[df.Year == str(2002)].Score.tolist()
'''Plotting the histogram of scores in the year 2002'''
plt.figure(figsize=(12,6))
plt.hist(x=score_2002,bins=[1,2,3,4,5,6],rwidth=0.9,density=True,color='#17202A');
plt.xticks([1.5,2.5,3.5,4.5,5.5],[1,2,3,4,5])
plt.title('Distribution of Scores when year is 2002',fontdict={'fontsize':30})
plt.xlabel('Score'); plt.ylabel('Percentage')
plt.savefig('Distribution_of_scores_when_year2002.png')
plt.show()
del score_2002
We can see that in 2002 the share of 3-star ratings surpassed the share of 4-star ratings.
Now let's compare all four years together.
'''Retrieving the scores of the last 4 years'''
score_2005 = df[df.Year == str(2005)].Score.tolist()
score_2004 = df[df.Year == str(2004)].Score.tolist()
score_2003 = df[df.Year == str(2003)].Score.tolist()
score_2002 = df[df.Year == str(2002)].Score.tolist()
'''Plotting the histograms of scores for the years 2002-2005
side by side'''
plt.figure(figsize=(20,10))
clr = ['#A569BD','#A9DFBF','#EDBB99','#CD6155']
plt.hist(x=[score_2002,score_2003,score_2004,score_2005],bins=[1,2,3,4,5,6],\
         rwidth=0.7,density=True,color=clr);
plt.xticks([1.5,2.5,3.5,4.5,5.5],[1,2,3,4,5])
plt.title('Distribution of Scores during the last 4 years',fontdict={'fontsize':30})
plt.xlabel('Score'); plt.ylabel('Percentage')
patch_2002 = mpatches.Patch(color=clr[0], label='2002')
patch_2003 = mpatches.Patch(color=clr[1], label='2003')
patch_2004 = mpatches.Patch(color=clr[2], label='2004')
patch_2005 = mpatches.Patch(color=clr[3], label='2005')
plt.legend(handles=[patch_2002,patch_2003,patch_2004,patch_2005])
plt.savefig('Distribution_of_scores_during_last_4_years.png')
plt.show()
So we can see that lower ratings were more common in 2002 and 2003, while higher ratings became more common in the later years.
Hypothesis: