This Notebook is presented by Abhishek Srivastava
Project Lead: Srinivas Singam Reddy
Search advertising has been one of the major revenue sources of the Internet industry for years. A key technology behind search advertising is to predict the click-through rate (pCTR) of ads, as the economic model behind search advertising requires pCTR values to rank ads and to price clicks. In this task, given the training instances derived from session logs of the Tencent proprietary search engine, soso.com, participants are expected to accurately predict the pCTR of ads in the testing instances.
The training data file is a text file, where each line is a training instance derived from search session log messages. To understand the training data, let us begin with a description of search sessions.
A search session refers to an interaction between a user and the search engine. It contains the following ingredients: the user, the query issued by the user, some ads returned by the search engine and thus impressed (displayed) to the user, and zero or more ads that were clicked by the user. For clarity, we introduce some terminology here. The number of ads impressed in a session is known as the 'depth'. The order of an ad in the impression list is known as the 'position' of that ad. An ad, when impressed, is displayed as a short text known as the 'title', followed by a slightly longer text known as the 'description', and a URL (usually shortened to save screen space) known as the 'display URL'.
We divide each session into multiple instances, where each instance describes an impressed ad under a certain setting (i.e., with certain depth and position values). We aggregate instances with the same user id, ad id, query, and setting in order to reduce the dataset size. Therefore, schematically, each instance contains at least the following information:
UserID
AdID
Query
Depth
Position
Impression: the number of search sessions in which the ad (AdID) was impressed to the user (UserID) who issued the query (Query).
Click: the number of times, among the above impressions, the user (UserID) clicked the ad (AdID).
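The aggregation described above can be sketched as follows. This is only an illustration, not the organizers' pipeline: the `sessions` records and the `aggregate` helper are hypothetical, and each record is assumed to describe one impression of an ad in one session together with a flag saying whether it was clicked.

```python
from collections import defaultdict

# Hypothetical session-level records, one per impressed ad:
# (user_id, ad_id, query, depth, position, clicked)
sessions = [
    (7, 100, "q1", 2, 1, True),
    (7, 100, "q1", 2, 1, False),
    (7, 100, "q1", 2, 1, False),
    (7, 100, "q1", 3, 2, False),  # same user/ad/query, different setting
    (8, 100, "q1", 2, 1, True),
]


def aggregate(records):
    """Group records by (user, ad, query, depth, position) and count
    the impressions and clicks in each group."""
    counts = defaultdict(lambda: [0, 0])  # key -> [impressions, clicks]
    for user_id, ad_id, query, depth, position, clicked in records:
        key = (user_id, ad_id, query, depth, position)
        counts[key][0] += 1
        counts[key][1] += int(clicked)
    return {key: tuple(value) for key, value in counts.items()}


instances = aggregate(sessions)
for key, (impressions, clicks) in sorted(instances.items()):
    print(key, "Impression =", impressions, "Click =", clicks)
```

Five raw records collapse into three instances here, which is the size reduction the aggregation is meant to achieve.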
Moreover, the training, validation and testing data contain more information than the above list, because each ad and each user have some additional properties. We include some of these properties into the training, validation and the testing instances, and put other properties in separate data files that can be indexed using ids in the instances. For more information about these data files, please refer to the section ADDITIONAL DATA FILES.
Finally, after including additional features, each training instance is a line consisting of fields delimited by the TAB character:
Click: as described in the above list.
Impression: as described in the above list.
DisplayURL: a property of the ad.
The URL is shown together with the title and description of an ad. It is usually the shortened landing page URL of the ad, but not always. In the data file, this URL is hashed for anonymity.
AdID: as described in the above list.
AdvertiserID: a property of the ad.
Some advertisers consistently optimize their ads, so the title and description of their ads are more attractive than those of others’ ads.
Depth: a property of the session, as described above.
Position: a property of an ad in a session, as described above.
QueryID: id of the query.
This id is a zero‐based integer value. It is the key of the data file 'queryid_tokensid.txt'.
KeywordID: a property of ads.
This is the key of 'purchasedkeywordid_tokensid.txt'.
TitleID: a property of ads.
This is the key of 'titleid_tokensid.txt'.
DescriptionID: a property of ads.
This is the key of 'descriptionid_tokensid.txt'.
UserID
This is the key of 'userid_profile.txt'. When we cannot identify the user, this field has a special value of 0.
ADDITIONAL DATA FILES
There are five additional data files, as mentioned in the above section:
queryid_tokensid.txt
purchasedkeywordid_tokensid.txt
titleid_tokensid.txt
descriptionid_tokensid.txt
userid_profile.txt
Each line of the first four files maps an id to a list of tokens, corresponding to the query, keyword, ad title, and ad description, respectively. In each line, a TAB character separates the id and the token set. A token can basically be a word in a natural language. For anonymity, each token is represented by its hash value. Tokens are delimited by the character ‘|’.
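As a small sketch of the line format just described, the helper below splits one line of a token file into its id and its list of tokens. The function name `parse_token_line` and the example line are made up for illustration; tokens are kept as strings because the text only says they are hash values.

```python
def parse_token_line(line):
    """Split 'id<TAB>tok1|tok2|...' into (int id, list of token strings)."""
    item_id, token_field = line.rstrip("\n").split("\t", 1)
    tokens = token_field.split("|")
    return int(item_id), tokens


# Made-up example line; the hash values are illustrative only.
example = "0\t12731|190|5152\n"
item_id, tokens = parse_token_line(example)
print(item_id, tokens, len(tokens))
```

The length of the token list is the word count that the notebook later derives with its `count` function.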
Each line of ‘userid_profile.txt’ is composed of UserID, Gender, and Age, delimited by the TAB character. Note that not every UserID in the training and the testing set will be present in ‘userid_profile.txt’. Each field is described below:
Gender: '1' for male, '2' for female, and '0' for unknown.
Age: '1' for (0, 12], '2' for (12, 18], '3' for (18, 24], '4' for (24, 30], '5' for (30, 40], and '6' for greater than 40.
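The coding above can be turned into readable labels with two lookup tables. This is a minimal sketch: the names `GENDER_LABELS`, `AGE_LABELS` and `parse_profile_line` are hypothetical, and the example line is invented.

```python
# Lookup tables built directly from the field descriptions above.
GENDER_LABELS = {0: "unknown", 1: "male", 2: "female"}
AGE_LABELS = {
    1: "(0, 12]",
    2: "(12, 18]",
    3: "(18, 24]",
    4: "(24, 30]",
    5: "(30, 40]",
    6: "> 40",
}


def parse_profile_line(line):
    """Decode one 'UserID<TAB>Gender<TAB>Age' line into a dict."""
    user_id, gender, age = line.rstrip("\n").split("\t")
    return {
        "UserID": int(user_id),
        "Gender": GENDER_LABELS[int(gender)],
        "Age": AGE_LABELS[int(age)],
    }


profile = parse_profile_line("42\t2\t3\n")  # invented example line
print(profile)
```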
# Loading libraries...
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
# Load Training Data..
column = ['Click', 'Impression', 'AdURL', 'AdId', 'AdvId', 'Depth', 'Pos', 'QId', 'KeyId', 'TitleId', 'DescId', 'UId']
orignal = pd.read_csv('training.txt', sep='\t', header=None, nrows = 5000000, names=column)
orignal.head()
# Load User Data..
user_col = ['UId', 'Gender', 'Age']
user = pd.read_csv('userid_profile.txt', sep='\t', header=None, names=user_col)
user.head()
# Load Query Data..
query_col = ['QId', 'Query']
query = pd.read_csv('queryid_tokensid.txt', sep='\t', header=None, names=query_col)
query.head(5)
# Load Ad Description Data..
desc_col = ['DescId', 'Description']
desc = pd.read_csv('descriptionid_tokensid.txt', sep='\t', header=None, names=desc_col)
desc.head(5)
# Load Ad Title Data..
title_col = ['TitleId', 'Title']
title = pd.read_csv('titleid_tokensid.txt', sep='\t', header=None, names=title_col)
title.head(5)
def count(sentence):
    '''
    (str) -> (int)
    Returns no. of words in a sentence.
    '''
    return len(str(sentence).split('|'))
# Count no. of words in a query issued by a user.
query['QCount'] = query['Query'].apply(count)
query.head(5)
# Query isn't required now, get rid of it.
del query['Query']
query.head()
# Count no. of words in title of an advertisement.
title['TCount'] = title['Title'].apply(count)
title.head()
# Advertisement Title isn't required now, get rid of it.
del title['Title']
title.head()
# Count no. of words in description of an advertisement.
desc['DCount'] = desc['Description'].apply(count)
desc.head()
# Advertisement Description isn't required now, get rid of it.
del desc['Description']
desc.head()
# Merging orignal with user, query, title & desc on appropriate keys to get data..
data = pd.merge(orignal, user, on='UId')
data = pd.merge(data, query, on='QId')
data = pd.merge(data, title, on='TitleId')
data = pd.merge(data, desc, on='DescId')
data.head()
# Add target variable CTR to the dataset...
data['CTR'] = data['Click'] * 1.0 / data['Impression'] * 100
data.head()
# Basic Information about the data...
data.shape
Note: We loaded 5M data points initially; after the merges we have around 4.95M. What does this indicate? Profile data is missing for many user ids, and because pd.merge performs an inner join by default, instances with those user ids are dropped.
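One way to check how many rows a merge drops is to repeat it as a left join with `indicator=True`. The sketch below uses two small made-up frames as stand-ins for `orignal` and `user`; the values are not taken from the dataset.

```python
import pandas as pd

# Toy stand-ins for the notebook's `orignal` and `user` frames.
instances = pd.DataFrame({"UId": [1, 2, 3, 0], "Click": [0, 1, 0, 0]})
profiles = pd.DataFrame({"UId": [1, 2], "Gender": [1, 2], "Age": [3, 5]})

# Default merge is an inner join: rows without a matching profile vanish.
inner = pd.merge(instances, profiles, on="UId")

# A left join keeps every instance; `_merge` records where each row matched.
audit = pd.merge(instances, profiles, on="UId", how="left", indicator=True)
dropped = int((audit["_merge"] == "left_only").sum())

print("rows before merge:", len(instances))
print("rows after inner merge:", len(inner))
print("rows without a profile:", dropped)
```

The same audit could be applied to the query, title and description merges to confirm that the user profile is the only source of lost rows.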
# CTR(ad) = #Clicks(ad)/#Impressions(ad)
# Calculating net CTR for our dataset...
total_impressions = data['Impression'].sum()
total_clicks = data['Click'].sum()
net_CTR = total_clicks * 1.0 / total_impressions
print('Net CTR: {0}%'.format(round(net_CTR * 100, 2)))
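Note that the net CTR pools clicks and impressions over all rows, which is generally not the same as the mean of the per-row `CTR` column, because rows with few impressions count as much as rows with many in a plain mean. A small worked example with two invented rows:

```python
# Hypothetical (Click, Impression) pairs for two instances.
rows = [(1, 1), (0, 9)]

# Mean of per-row CTRs: every row has the same weight.
row_ctrs = [float(clicks) / impressions for clicks, impressions in rows]
mean_row_ctr = sum(row_ctrs) / len(row_ctrs)

# Pooled (net) CTR: total clicks over total impressions.
total_clicks = sum(clicks for clicks, _ in rows)
total_impressions = sum(impressions for _, impressions in rows)
pooled_ctr = float(total_clicks) / total_impressions

print("mean of row CTRs: {0:.1f}%".format(mean_row_ctr * 100))
print("pooled CTR: {0:.1f}%".format(pooled_ctr * 100))
```

The group means plotted later in the notebook are of the first kind, so they are worth reading with this difference in mind.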
total = data.shape[0]
# No. of unique users as a percentage of the no. of rows...
print('{0}%'.format(round(data['UId'].nunique() * 100.0 / total, 2)))
# No. of unique queries as a percentage of the no. of rows...
print('{0}%'.format(round(data['QId'].nunique() * 100.0 / total, 2)))
# No. of unique advertisements as a percentage of the no. of rows...
print('{0}%'.format(round(data['AdId'].nunique() * 100.0 / total, 2)))
# No. of unique advertisers as a percentage of the no. of rows...
print('{0}%'.format(round(data['AdvId'].nunique() * 100.0 / total, 2)))
# Distribution of word count in a search query...
temp = data[['QCount']].copy()
print('Maximum Length of a Query: {0}'.format(temp['QCount'].max()))
print('Average Length of a Query: {0}'.format(temp['QCount'].mean()))
sns.boxplot(x=None,y='QCount',data=temp)
Clearly, the data contains outliers. We remove them before drawing the box plots; otherwise the distribution is hard to read. This is shown below:
print('Median No. of words in a Search query: {0}'.format(temp['QCount'].quantile(0.5)))
print('3rd Quartile No. of words in a Search query: {0}'.format(temp['QCount'].quantile(0.75)))
# Remove outliers (queries with 10 or more words)
temp = temp[temp['QCount'] < 10.0]
print('Maximum Length of a Query: {0}'.format(temp['QCount'].max()))
print('Average Length of a Query: {0}'.format(temp['QCount'].mean()))
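The cutoff of 10 words used above is a fixed threshold. As an alternative (not what this notebook does), the cutoff can be derived from the data with the common 1.5 x IQR rule. The sketch below applies it to a small invented series standing in for `QCount`.

```python
import pandas as pd

# Invented word counts with one obvious outlier.
counts = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 40], name="QCount")

q1 = counts.quantile(0.25)
q3 = counts.quantile(0.75)
# Upper fence of the 1.5 x IQR rule; values above it are treated as outliers.
upper_fence = q3 + 1.5 * (q3 - q1)

trimmed = counts[counts <= upper_fence]
print("upper fence:", upper_fence)
print("rows kept:", len(trimmed), "of", len(counts))
```

This is the same rule a box plot uses to draw its whiskers, so the trimmed data and the plot stay consistent with each other.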
# Distribution of word count in a query...
plt.figure(figsize=(10,4))
plt.subplot(1, 2, 1)
plt.hist(temp['QCount'],
         color='green',
         bins=25)
plt.xlabel('No. of words in a Query')
plt.subplot(1, 2, 2)
plt.boxplot(temp['QCount'],
            labels=['No. of words in a Query'])
plt.tight_layout()
Conclusion:
# Distribution of word count in an ad description
temp = data[['DCount']].copy()
print('Maximum Length of an Ad Description: {0}'.format(temp['DCount'].max()))
print('Average Length of an Ad Description: {0}'.format(temp['DCount'].mean()))
# Distribution of word count in description of an ad...
plt.figure(figsize=(10,4))
plt.subplot(1, 2, 1)
plt.hist(temp['DCount'],
         bins=100,
         color='red')
plt.xlabel('No. of words in an Ad Description')
plt.subplot(1, 2, 2)
plt.boxplot(temp['DCount'],
            labels=['No. of words in an Ad Description'])
plt.tight_layout()
Conclusion:
# Distribution of word count in an ad title
temp = data[['TCount']].copy()
print('Maximum Length of an Ad Title: {0}'.format(temp['TCount'].max()))
print('Average Length of an Ad Title: {0}'.format(temp['TCount'].mean()))
# Distribution of word count in an ad title...
plt.figure(figsize=(10,4))
plt.subplot(1, 2, 1)
plt.hist(temp['TCount'],
         color='red',
         bins=100)
plt.xlabel('No. of words in an Ad Title')
plt.subplot(1, 2, 2)
plt.boxplot(temp['TCount'],
            labels=['No. of words in an Ad Title'])
plt.tight_layout()
Conclusion:
# Do queries with fewer words have a higher CTR...
temp = data[['QCount', 'CTR']].copy()
temp.head()
temp = temp[temp['QCount'] <= 20]
result = temp.groupby('QCount').agg(['mean'])
result.head()
plt.figure(figsize=(5,5))
plt.bar(result.index, result[('CTR', 'mean')],
color='red',
width=0.4)
plt.xlabel('No. of words in a Query')
plt.ylabel('Avg. CTR')
plt.tight_layout()
Conclusion: As the no. of words in a query increases, the avg. CTR decreases.
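Before reading a trend from the bars, it can help to check how many rows stand behind each mean, since long queries are rare and their averages rest on few rows. The sketch below uses invented numbers; the `min_rows` cutoff is a hypothetical choice, not something the notebook defines.

```python
import pandas as pd

# Invented data: the 15-word query group has a single row.
toy = pd.DataFrame({
    "QCount": [1, 1, 1, 2, 2, 15],
    "CTR": [0.0, 10.0, 20.0, 0.0, 5.0, 100.0],
})

# Mean CTR and number of rows for each query length.
summary = toy.groupby("QCount")["CTR"].agg(["mean", "count"])

min_rows = 2  # hypothetical minimum group size
reliable = summary[summary["count"] >= min_rows]
print(summary)
print(reliable)
```

The position and depth analyses further down use the same `agg(['mean', 'count'])` pattern to show the frequency next to the average.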
# How is the word count of Ad Title related to CTR of an Ad...
temp = data[['TCount', 'CTR']].copy()
result = temp.groupby('TCount').agg(['mean'])
result.head()
plt.figure(figsize=(5,5))
plt.bar(result.index, result[('CTR', 'mean')],
color='red',
width=0.4)
plt.xlabel('No. of words in an Ad Title')
plt.ylabel('Avg. CTR')
plt.tight_layout()
Conclusion: The avg. CTR is more or less uniform across the no. of words in the ad title.
# Is there a relation between the no. of times an ad appears (impressions) & the no. of times it is clicked?
temp = data[['AdId', 'Impression', 'Click']].copy()
temp.head()
result = temp.groupby('AdId').agg(['mean'])
result.head(6)
x = result[('Impression', 'mean')]
y = result[('Click', 'mean')]
plt.scatter(x,
y,
c='green',
s=100,
marker='o',
edgecolor=None)
plt.xlabel('No. of Impressions')
plt.ylabel('No. of Clicks')
plt.title('Relationship between Ad Impressions & Clicks')
Conclusion: As the no. of impressions of an advertisement increases, the no. of clicks mostly stays close to 0.
This suggests an important aspect of human behaviour: as users see the same ad again & again, they become less likely to click it.
# Let us see how Gender of a user has an impact on CTR
temp = data[['Gender', 'CTR']].copy()
temp.head()
result = temp.groupby('Gender').agg(['mean'])
result.head()
plt.figure(figsize=(5,5))
plt.bar(result.index, result[('CTR', 'mean')],
color='red',
width=0.3)
plt.xlabel('Gender')
plt.ylabel('Avg. CTR')
plt.tight_layout()
Conclusion: The gender of a user does not appear to affect the CTR of an advertisement.
# Let us see how the Age of a user affects CTR:
temp = data[['Age', 'CTR']].copy()
temp.head()
result = temp.groupby('Age').agg(['mean'])
result.head(6)
plt.figure(figsize=(5,5))
plt.bar(result.index, result[('CTR', 'mean')],
color='red',
width=0.3)
plt.xlabel('Age')
plt.ylabel('Avg. CTR')
plt.tight_layout()
Conclusion: Older users have a higher avg. CTR than younger users.
# How does the gender of an older user affect the CTR of an ad...
temp = data[['Gender', 'Age', 'CTR']].copy()
temp = temp[(temp['Age'] == 5) | (temp['Age'] == 6)] # keep older users (age groups 5 and 6, i.e. over 30).
temp.head()
temp = temp[['Gender', 'CTR']].copy()
result = temp.groupby('Gender').agg(['mean'])
result.head()
plt.figure(figsize=(5,5))
plt.bar(result.index, result[('CTR', 'mean')],
color='green',
width=0.5)
plt.xlabel('Gender')
plt.ylabel('Avg. CTR')
plt.tight_layout()
Conclusion: The CTR for older male and female users is almost the same and, as before, higher than the avg. CTR of male and female users across all age groups.
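The age and gender comparisons above can also be read from a single table. The following sketch builds an Age x Gender table of mean CTR with `pivot_table`; the frame `toy` and its values are invented and only mimic the columns of `data`.

```python
import pandas as pd

# Invented rows using the same column names and codes as `data`.
toy = pd.DataFrame({
    "Gender": [1, 2, 1, 2, 1, 2],
    "Age": [1, 1, 5, 5, 6, 6],
    "CTR": [2.0, 4.0, 6.0, 6.0, 8.0, 10.0],
})

# Rows are age groups, columns are gender codes, cells are mean CTR.
table = toy.pivot_table(index="Age", columns="Gender",
                        values="CTR", aggfunc="mean")
print(table)
```

Laying the groups out this way makes it easier to see whether a gender difference exists only within certain age groups.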
# Let us try to see if the position of an ad has an effect on CTR
temp = data[['Pos', 'CTR']].copy()
temp.head()
result = temp.groupby('Pos').agg(['mean', 'count'])
result.head()
plt.figure(figsize=(10,4))
plt.subplot(1, 2, 1)
plt.bar(result.index, result[('CTR', 'mean')],
color='red')
plt.xlabel('Position')
plt.ylabel('Avg. CTR')
plt.subplot(1, 2, 2)
plt.bar(result.index, result[('CTR', 'count')],
color='green')
plt.xlabel('Position')
plt.ylabel('Frequency of Ads')
plt.tight_layout()
Conclusion: Clearly, the CTR of an advertisement at a low position (more visible to the user) is higher than the CTR of an advertisement at a higher position (not directly visible).
Most advertisements appear at the lower positions (1 and 2).
# Let us try to see if the depth of the session in which an ad is shown has an effect on CTR
temp = data[['Depth', 'CTR']].copy()
temp.head()
result = temp.groupby('Depth').agg(['mean', 'count'])
result.head()
plt.figure(figsize=(10,4))
plt.subplot(1, 2, 1)
plt.bar(result.index, result[('CTR', 'mean')],
color='red')
plt.xlabel('Depth')
plt.ylabel('Avg. CTR')
plt.subplot(1, 2, 2)
plt.bar(result.index, result[('CTR', 'count')],
color='green')
plt.xlabel('Depth')
plt.ylabel('Frequency of Ads')
plt.tight_layout()
Conclusion:
# Studying the role of Advertiser...
# Goal: find the advertisers whose ads have a high CTR, so that we can investigate their ads.
# This helps us understand what they do differently from other advertisers.
# Once we have the advertisers with high CTR, we can see how their ad properties vary.
# For example, do advertisers with high-CTR ads use more words in the ad description or in the ad title?
# Preparing data...
temp = data[['AdvId', 'CTR', 'DCount', 'TCount']].copy()
temp.head()
result = temp.groupby('AdvId').agg(['mean'])
result.head()
temp = pd.DataFrame()
temp['AdvId'] = result.index
temp['CTR'] = result[('CTR', 'mean')].values
temp['DCount'] = result[('DCount', 'mean')].values
temp['TCount'] = result[('TCount', 'mean')].values
temp.head()
print('No. of unique advertisers: {0}'.format(temp.shape[0]))
# Deciding how an advertiser qualifies as an advertiser with high CTR ads...
# Let us study distribution of avg. CTR...
f, (ax1, ax2) = plt.subplots(2)
sns.kdeplot(temp['CTR'], ax=ax1)
sns.boxplot(x=None,y='CTR',data=temp, ax=ax2)
mean_advertiser_ctr = temp['CTR'].mean()
print('Average CTR of Ads given by an advertiser: {0}'.format(round(mean_advertiser_ctr, 2)))
median_advertiser_ctr = temp['CTR'].median()
print('Median CTR of Ads given by an advertiser: {0}'.format(round(median_advertiser_ctr, 2)))
third_quantile_advertiser_ctr = temp['CTR'].quantile(0.75)
print('3rd Quartile CTR of Ads given by an advertiser: {0}'.format(round(third_quantile_advertiser_ctr, 2)))
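One possible way to continue from these statistics is to label an advertiser as high-CTR when its average CTR lies above the third quartile, and then compare the ad properties of the two groups. The threshold choice is an assumption made for this sketch, not a rule stated in the notebook, and the frame `adv` with its values is invented to mimic the per-advertiser `temp` frame.

```python
import pandas as pd

# Invented per-advertiser averages with the same columns as `temp`.
adv = pd.DataFrame({
    "AdvId": [10, 11, 12, 13, 14, 15, 16, 17],
    "CTR": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 20.0],
    "DCount": [20, 22, 21, 19, 23, 20, 25, 27],
    "TCount": [8, 9, 8, 7, 9, 8, 10, 11],
})

# Assumed rule: high-CTR means strictly above the 75th percentile.
threshold = adv["CTR"].quantile(0.75)
adv["HighCTR"] = adv["CTR"] > threshold

# Compare the average description and title lengths of the two groups.
comparison = adv.groupby("HighCTR")[["DCount", "TCount"]].mean()
print("threshold:", threshold)
print(comparison)
```

A different cutoff, such as the median or a fixed CTR value, would change which advertisers are flagged, so any conclusion drawn from the comparison should be checked against more than one threshold.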