This Notebook is presented by Abhishek Srivastava

## Introduction

Search advertising has been one of the major revenue sources of the Internet industry for years. A key technology behind search advertising is to predict the click-through rate (pCTR) of ads, as the economic model behind search advertising requires pCTR values to rank ads and to price clicks. In this task, given the training instances derived from session logs of the Tencent proprietary search engine, soso.com, participants are expected to accurately predict the pCTR of ads in the testing instances.

The training data file is a text file, where each line is a training instance derived from search session log messages. To understand the training data, let us begin with a description of search sessions.

A search session refers to an interaction between a user and the search engine. It contains the following ingredients: the user, the query issued by the user, some ads returned by the search engine and thus impressed (displayed) to the user, and zero or more ads that were clicked by the user. For clarity, we introduce some terminology here. The number of ads impressed in a session is known as the 'depth'. The order of an ad in the impression list is known as the 'position' of that ad. An ad, when impressed, is displayed as a short text known as the 'title', followed by a slightly longer text known as the 'description', and a URL (usually shortened to save screen space) known as the 'display URL'.

We divide each session into multiple instances, where each instance describes an impressed ad under a certain setting (i.e., with certain depth and position values). We aggregate instances with the same user id, ad id, query, and setting in order to reduce the dataset size. Therefore, schematically, each instance contains at least the following information:

1. UserID
2. AdID
3. Query
4. Depth
5. Position
6. Impression: the number of search sessions in which the ad (AdID) was impressed to the user (UserID) who issued the query (Query).
7. Click: the number of times, among the above impressions, the user (UserID) clicked the ad (AdID).
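
The aggregation described above can be sketched with pandas. This is a minimal illustration on toy data, not the organizers' actual pipeline; the `sessions` frame and its `Clicked` column are hypothetical.

```python
import pandas as pd

# Hypothetical raw log: one row per (session, impressed ad); Clicked is 0 or 1.
sessions = pd.DataFrame({
    'UserID':   [7, 7, 7, 9],
    'AdID':     [100, 100, 100, 100],
    'Query':    [42, 42, 42, 42],
    'Depth':    [2, 2, 3, 2],
    'Position': [1, 1, 1, 1],
    'Clicked':  [0, 1, 0, 0],
})

keys = ['UserID', 'AdID', 'Query', 'Depth', 'Position']

# Rows sharing user, ad, query and setting collapse into one instance:
# Impression counts the rows, Click sums the clicks.
instances = (sessions.groupby(keys)['Clicked']
             .agg(['count', 'sum'])
             .rename(columns={'count': 'Impression', 'sum': 'Click'})
             .reset_index())
print(instances)
```

The first two toy rows share every key, so they merge into a single instance with two impressions and one click; the row with a different depth stays separate.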

Moreover, the training, validation and testing data contain more information than the above list, because each ad and each user have some additional properties. We include some of these properties into the training, validation and the testing instances, and put other properties in separate data files that can be indexed using ids in the instances. For more information about these data files, please refer to the section ADDITIONAL DATA FILES.

Finally, after including additional features, each training instance is a line consisting of fields delimited by the TAB character:

1. Click: as described in the above list.

2. Impression: as described in the above list.

3. DisplayURL: a property of the ad.

   The URL is shown together with the title and description of an ad. It is usually the shortened landing page URL of the ad, but not always. In the data file, this URL is hashed for anonymity.

4. AdID: as described in the above list.

5. AdvertiserID: a property of the ad (the AdvId column in the tables below).

6. Depth: a property of the session, as described above.

7. Position: a property of an ad in a session, as described above.

8. QueryID: id of the query.

   This id is a zero-based integer value. It is the key of the data file 'queryid_tokensid.txt'.

9. KeywordID: a property of ads.

   This is the key of 'purchasedkeywordid_tokensid.txt'.

10. TitleID: a property of ads.

    This is the key of 'titleid_tokensid.txt'.

11. DescriptionID: a property of ads.

    This is the key of 'descriptionid_tokensid.txt'.

12. UserID

    This is the key of 'userid_profile.txt'. When we cannot identify the user, this field has a special value of 0.

There are five additional data files, as mentioned in the above section:

1. queryid_tokensid.txt

2. purchasedkeywordid_tokensid.txt

3. titleid_tokensid.txt

4. descriptionid_tokensid.txt

5. userid_profile.txt

Each line of the first four files maps an id to a list of tokens, corresponding to the query, keyword, ad title, and ad description, respectively. In each line, a TAB character separates the id and the token set. A token can basically be a word in a natural language. For anonymity, each token is represented by its hash value. Tokens are delimited by the character ‘|’.

Each line of ‘userid_profile.txt’ is composed of UserID, Gender, and Age, delimited by the TAB character. Note that not every UserID in the training and the testing set will be present in ‘userid_profile.txt’. Each field is described below:

1. Gender:

   '1' for male, '2' for female, and '0' for unknown.

2. Age:

   '1' for (0, 12], '2' for (12, 18], '3' for (18, 24], '4' for (24, 30], '5' for (30, 40], and '6' for greater than 40.
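
The file layouts described above can be made concrete with two small parsers. This is a sketch only: the helper names `parse_profile_line` and `parse_token_line` are hypothetical, and tokens are kept as strings because nothing is assumed about the hash format.

```python
# Code-to-label tables taken from the field descriptions above.
GENDER_LABELS = {0: 'unknown', 1: 'male', 2: 'female'}
AGE_LABELS = {1: '(0, 12]', 2: '(12, 18]', 3: '(18, 24]',
              4: '(24, 30]', 5: '(30, 40]', 6: '> 40'}


def parse_profile_line(line):
    """Split one TAB-delimited line of userid_profile.txt into labelled fields."""
    user_id, gender, age = (int(x) for x in line.rstrip('\n').split('\t'))
    return {'UserID': user_id,
            'Gender': GENDER_LABELS[gender],
            'Age': AGE_LABELS[age]}


def parse_token_line(line):
    """Split one line of a *_tokensid.txt file into (id, list of token hashes)."""
    key, tokens = line.rstrip('\n').split('\t')
    return int(key), tokens.split('|')


print(parse_profile_line('1\t1\t5'))
print(parse_token_line('1\t1545|75|31'))
```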

In [1]:
# Loading libraries...
import pandas as pd
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns


## Preparing Data

In [2]:
# Load Training Data..


Out[2]:
0 0 1 4298118681424644510 7686695 385 3 3 1601 5521 7709 576 490234
1 0 1 4860571499428580850 21560664 37484 2 2 2255103 317 48989 44771 490234
2 0 1 9704320783495875564 21748480 36759 3 3 4532751 60721 685038 29681 490234
3 0 1 13677630321509009335 3517124 23778 3 1 1601 2155 1207 1422 490234
4 0 1 3284760244799604489 20758093 34535 1 1 4532751 77819 266618 222223 490234
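
The cell above shows only its output. A minimal sketch of how such a table could be loaded is given here, assuming a TAB-delimited file without a header row. The helper `load_training` and the list `train_col` are hypothetical; the column names are borrowed from the merged table shown later in this notebook, and the variable name `orignal` follows the merge cell.

```python
import io
import pandas as pd

# Column names follow the merged table shown later in this notebook.
train_col = ['Click', 'Impression', 'AdURL', 'AdId', 'AdvId', 'Depth',
             'Pos', 'QId', 'KeyId', 'TitleId', 'DescId', 'UId']


def load_training(source, nrows=None):
    """Read TAB-delimited training instances; `source` is a path or file-like object."""
    return pd.read_csv(source, sep='\t', header=None, names=train_col,
                       nrows=nrows,
                       dtype={'AdURL': str})  # URL hashes can exceed the int64 range


# Two sample rows copied from the output above stand in for the real file.
sample = io.StringIO(
    '0\t1\t4298118681424644510\t7686695\t385\t3\t3\t1601\t5521\t7709\t576\t490234\n'
    '0\t1\t13677630321509009335\t3517124\t23778\t3\t1\t1601\t2155\t1207\t1422\t490234\n'
)
orignal = load_training(sample)
print(orignal.head())
```

With the real data, `source` would be the path to the training file and `nrows` would cap how many instances are read; both depend on the local setup and are not specified here.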
In [3]:
# Load User Data..

user_col  = ['UId', 'Gender', 'Age']

Out[3]:
UId Gender Age
0 1 1 5
1 2 2 3
2 3 1 5
3 4 1 3
4 5 2 1
In [4]:
# Load Query Data..

query_col = ['QId', 'Query']

Out[4]:
QId Query
0 0 12731
1 1 1545|75|31
2 2 383
3 3 518|1996
4 4 4189|75|31
In [5]:
# Load Ad Description Data..

desc_col  = ['DescId', 'Description']

Out[5]:
DescId Description
0 0 1545|31|40|615|1|272|18889|1|220|511|20|5270|1...
1 1 172|46|467|170|5634|5112|40|155|1965|834|21|41...
2 2 2672|6|1159|109662|123|49933|160|848|248|207|1...
3 3 13280|35|1299|26|282|477|606|1|4016|1671|771|1...
4 4 13327|99|128|494|2928|21|26500|10|11733|10|318...
In [6]:
# Load Ad Title Data..

title_col = ['TitleId', 'Title']

Out[6]:
TitleId Title
0 0 615|1545|75|31|1|138|1270|615|131
1 1 466|582|685|1|42|45|477|314
2 2 12731|190|513|12731|677|183
3 3 2371|3970|1|2805|4340|3|2914|10640|3688|11|834|3
4 4 165|134|460|2887|50|2|17527|1|1540|592|2181|3|...
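
The four cells above define column names but their loading calls are not shown. One way to read such files is sketched below, assuming TAB-delimited, header-less text files as described in the introduction. The helper `load_table` is hypothetical, and in-memory samples replace the real files.

```python
import io
import pandas as pd


def load_table(source, columns):
    """Read a TAB-delimited, header-less file into a DataFrame with the given column names."""
    return pd.read_csv(source, sep='\t', header=None, names=columns)


user_col = ['UId', 'Gender', 'Age']
query_col = ['QId', 'Query']

# In-memory samples copied from the outputs above stand in for the real files.
user = load_table(io.StringIO('1\t1\t5\n2\t2\t3\n'), user_col)
query = load_table(io.StringIO('0\t12731\n1\t1545|75|31\n'), query_col)

print(user)
print(query)
```

The same helper would apply to the title and description files with `title_col` and `desc_col`; the file paths are left to the reader's setup.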
In [7]:
def count(sentence):
    '''
    (str) -> (int)
    Returns no. of words in a sentence.
    '''
    return len(str(sentence).split('|'))

In [8]:
# Count no. of words in a query issued by a user.

query['QCount'] = query['Query'].apply(count)

In [9]:
query.head(5)

Out[9]:
QId Query QCount
0 0 12731 1
1 1 1545|75|31 3
2 2 383 1
3 3 518|1996 2
4 4 4189|75|31 3
In [10]:
# Query isn't required now, get rid of it.

del query['Query']

Out[10]:
QId QCount
0 0 1
1 1 3
2 2 1
3 3 2
4 4 3
In [11]:
# Count no. of words in title of an advertisement.

title['TCount'] = title['Title'].apply(count)

In [12]:
title.head()

Out[12]:
TitleId Title TCount
0 0 615|1545|75|31|1|138|1270|615|131 9
1 1 466|582|685|1|42|45|477|314 8
2 2 12731|190|513|12731|677|183 6
3 3 2371|3970|1|2805|4340|3|2914|10640|3688|11|834|3 12
4 4 165|134|460|2887|50|2|17527|1|1540|592|2181|3|... 16
In [13]:
# Advertisement Title isn't required now, get rid of it.

del title['Title']

Out[13]:
TitleId TCount
0 0 9
1 1 8
2 2 6
3 3 12
4 4 16
In [14]:
# Count no. of words in description of an advertisement.

desc['DCount'] = desc['Description'].apply(count)

In [15]:
desc.head()

Out[15]:
DescId Description DCount
0 0 1545|31|40|615|1|272|18889|1|220|511|20|5270|1... 20
1 1 172|46|467|170|5634|5112|40|155|1965|834|21|41... 28
2 2 2672|6|1159|109662|123|49933|160|848|248|207|1... 21
3 3 13280|35|1299|26|282|477|606|1|4016|1671|771|1... 25
4 4 13327|99|128|494|2928|21|26500|10|11733|10|318... 17
In [16]:
# Advertisement Description isn't required now, get rid of it.

del desc['Description']

Out[16]:
DescId DCount
0 0 20
1 1 28
2 2 21
3 3 25
4 4 17
In [17]:
# Merging orignal with user, query, title & desc on appropriate keys to get data..

data = pd.merge(orignal, user,  on='UId')
data = pd.merge(data,    query, on='QId')
data = pd.merge(data,    title, on='TitleId')
data = pd.merge(data,    desc,  on='DescId')

In [18]:
data.head()

Out[18]:
Click Impression AdURL AdId AdvId Depth Pos QId KeyId TitleId DescId UId Gender Age QCount TCount DCount
0 0 1 4298118681424644510 7686695 385 3 3 1601 5521 7709 576 490234 1 3 1 8 21
1 0 2 4298118681424644510 7686695 385 2 2 1601 5521 7709 576 30161 1 3 1 8 21
2 0 1 4298118681424644510 7686695 385 2 2 1601 5521 7709 576 1873171 2 5 1 8 21
3 0 1 4298118681424644510 7686695 385 3 3 1601 5521 7709 576 6558374 0 2 1 8 21
4 0 2 4298118681424644510 7686695 385 3 3 1601 5521 7709 576 1566180 2 5 1 8 21
In [19]:
# Add target variable CTR to the dataset...

data['CTR'] = data['Click'] * 1.0 / data['Impression'] * 100

Out[19]:
Click Impression AdURL AdId AdvId Depth Pos QId KeyId TitleId DescId UId Gender Age QCount TCount DCount CTR
0 0 1 4298118681424644510 7686695 385 3 3 1601 5521 7709 576 490234 1 3 1 8 21 0.0
1 0 2 4298118681424644510 7686695 385 2 2 1601 5521 7709 576 30161 1 3 1 8 21 0.0
2 0 1 4298118681424644510 7686695 385 2 2 1601 5521 7709 576 1873171 2 5 1 8 21 0.0
3 0 1 4298118681424644510 7686695 385 3 3 1601 5521 7709 576 6558374 0 2 1 8 21 0.0
4 0 2 4298118681424644510 7686695 385 3 3 1601 5521 7709 576 1566180 2 5 1 8 21 0.0
In [20]:
# Basic Information about the data...

data.shape

Out[20]:
(4952274, 18)

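Note: We initially loaded 5M datapoints; after the merges, about 4.95M remain. What does this indicate? `pd.merge` performs an inner join by default, so any datapoint whose key has no match in the joined table is dropped. Profile data is missing for many user ids, so the merge with the user table discards those datapoints.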
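
The effect of the merge on row counts can be checked directly. The sketch below uses toy frames (`instances` and `profiles` are hypothetical stand-ins) to contrast the default inner join with a left join that keeps unmatched rows and flags them.

```python
import pandas as pd

# Toy stand-ins for the instance table and the user profile table.
instances = pd.DataFrame({'UId': [1, 2, 3, 0], 'Click': [0, 1, 0, 0]})
profiles = pd.DataFrame({'UId': [1, 2], 'Gender': [1, 2], 'Age': [5, 3]})

# pd.merge defaults to an inner join: rows with no matching profile vanish.
inner = pd.merge(instances, profiles, on='UId')

# A left join keeps every instance; indicator=True adds a `_merge` column
# that marks rows without a profile as 'left_only'.
left = pd.merge(instances, profiles, on='UId', how='left', indicator=True)
n_unmatched = (left['_merge'] == 'left_only').sum()

print(len(instances), len(inner), n_unmatched)
```

Whether to drop or keep instances without a profile is a modelling choice; the left join simply makes the loss visible before committing to it.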

## Data Analysis

In [89]:
# CTR(ad) = #Clicks(ad)/#Impressions(ad)

# Calculating net CTR for our dataset...

total_impressions = data['Impression'].sum()
total_clicks      = data['Click'].sum()
net_CTR           = total_clicks * 1.0 / total_impressions

print('Net CTR: {0} %'.format(round(net_CTR * 100, 2)))

Net CTR: 4.2 %
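
The net CTR above pools all clicks and impressions, whereas the group-wise analyses that follow average the per-row `CTR` column. The two quantities generally differ, as this toy sketch shows (the `toy` frame is made up for illustration).

```python
import pandas as pd

# Toy instances: one heavily impressed row and two single-impression rows.
toy = pd.DataFrame({'Impression': [8, 1, 1], 'Click': [0, 1, 0]})

# Net CTR weights every impression equally.
net_ctr = toy['Click'].sum() / float(toy['Impression'].sum())

# The mean of per-row CTR weights every row equally, whatever its impression count.
row_ctr = toy['Click'] / toy['Impression'].astype(float)
mean_row_ctr = row_ctr.mean()

print(round(net_ctr * 100, 2), round(mean_row_ctr * 100, 2))
```

Here the pooled value is 10% while the row average is about 33%, so averages of the `CTR` column should not be read as impression-weighted click-through rates.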

In [22]:
total = data.shape[0]

In [23]:
# Percentage of unique users in the dataset...

print('{0} %'.format(round(data['UId'].nunique() * 1.0 / total * 100, 2)))

19.62 %

In [24]:
# Percentage of unique queries in the dataset...

print('{0} %'.format(round(data['QId'].nunique() * 1.0 / total * 100, 2)))

23.34 %

In [25]:
# Percentage of unique advertisements in the dataset...

print('{0} %'.format(round(data['AdId'].nunique() * 1.0 / total * 100, 2)))

4.29 %

In [26]:
# Percentage of unique advertisers in the dataset...

print('{0} %'.format(round(data['AdvId'].nunique() * 1.0 / total * 100, 2)))

0.28 %

In [63]:
# Distribution of word count in a search query...

temp = data[['QCount']].copy()

print('Maximum Length of a Query:  {0}'.format(temp['QCount'].max()))
print('Average Length of a Query:  {0}'.format(temp['QCount'].mean()))

Maximum Length of a Query:  127
Average Length of a Query:  2.98855354126

In [64]:
sns.boxplot(x=None,y='QCount',data=temp)

Out[64]:
<matplotlib.axes._subplots.AxesSubplot at 0x1dd07da50>

Clearly, the data contains outliers. We remove them before constructing the box plots; otherwise the plots are compressed and hard to read. This is shown as follows:

In [68]:
print('Median No. of words in a Search query: {0}'.format(temp['QCount'].quantile(0.5)))
print('3rd Quartile No. of words in a Search query: {0}'.format(temp['QCount'].quantile(0.75)))

Median No. of words in a Search query: 3.0
3rd Quartile No. of words in a Search query: 4.0

In [66]:
# Remove outliers

temp = temp[temp['QCount'] < 10.0]
print('Maximum Length of a Query:  {0}'.format(temp['QCount'].max()))
print('Average Length of a Query:  {0}'.format(temp['QCount'].mean()))

Maximum Length of a Query:  9
Average Length of a Query:  2.94981899189

In [67]:
# Distribution of word count in a query...

plt.figure(figsize=(10,4))

plt.subplot(1, 2, 1)
plt.hist(temp['QCount'],
         color='green',
         bins=25,
         density=False)  # `normed` was removed from newer matplotlib releases
plt.xlabel('No. of words in a Query')

plt.subplot(1, 2, 2)
plt.boxplot(temp['QCount'],
            labels=['No. of words in a Query'])

plt.tight_layout()


Conclusion:

1. 75% of the search queries issued by users contain at most 4 words (the 3rd quartile is 4.0).
2. The median search query contains 3 words.
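
The cutoff of 10 words used above is a manual choice. A data-driven alternative, sketched here as an assumption rather than the method used in this notebook, is the 1.5 x IQR rule that box plots use for their whiskers. The helper `iqr_upper_fence` and the toy series are hypothetical.

```python
import pandas as pd


def iqr_upper_fence(values, k=1.5):
    """Return Q3 + k * IQR, the usual upper whisker limit of a box plot."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    return q3 + k * (q3 - q1)


# Toy word counts with one extreme value.
qcount = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 127])
fence = iqr_upper_fence(qcount)
trimmed = qcount[qcount <= fence]

print(fence, trimmed.max())
```

On the toy series only the value 127 falls above the fence, so nine of the ten counts survive the trim.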
In [21]:
# Distribution of word count in an ad description

temp = data[['DCount']].copy()

print('Maximum Length of an Ad Description:  {0}'.format(temp['DCount'].max()))
print('Average Length of an Ad Description:  {0}'.format(temp['DCount'].mean()))

Maximum Length of an Ad Description:  47
Average Length of an Ad Description:  21.311864812

In [23]:
# Distribution of word count in description of an ad...

plt.figure(figsize=(10,4))

plt.subplot(1, 2, 1)
plt.hist(temp['DCount'],
         bins=100,
         color='red',
         density=False)  # `normed` was removed from newer matplotlib releases
plt.xlabel('No. of words in an Ad Description')

plt.subplot(1, 2, 2)
plt.boxplot(temp['DCount'],
            labels=['No. of words in an Ad Description'])

plt.tight_layout()


Conclusion:

1. For most advertisements (75%), the no. of words in the description lies in the range [15, 25].
In [350]:
# Distribution of word count in an ad title

temp = data[['TCount']].copy()

print('Maximum Length of an Ad Title:  {0}'.format(temp['TCount'].max()))
print('Average Length of an Ad Title:  {0}'.format(temp['TCount'].mean()))

Maximum Length of an Ad Title:  32
Average Length of an Ad Title:  8.76529893136

In [351]:
# Distribution of word count in an ad title...

plt.figure(figsize=(10,4))

plt.subplot(1, 2, 1)
plt.hist(temp['TCount'],
         color='red',
         bins=100,
         density=False)  # `normed` was removed from newer matplotlib releases
plt.xlabel('No. of words in an Ad Title')

plt.subplot(1, 2, 2)
plt.boxplot(temp['TCount'],
            labels=['No. of words in an Ad Title'])

plt.tight_layout()


Conclusion:

1. For most advertisements (75%), the no. of words in the title lies in the range [7, 12].
In [42]:
# Do fewer words in a query indicate a higher CTR?

temp = data[['QCount', 'CTR']].copy()

Out[42]:
QCount CTR
0 1 0.0
1 1 0.0
2 1 0.0
3 1 0.0
4 1 0.0
In [43]:
temp = temp[temp['QCount'] <= 20]

In [44]:
result = temp.groupby('QCount').agg(['mean'])

Out[44]:
CTR
mean
QCount
1 3.707213
2 4.868142
3 4.902491
4 4.597467
5 4.271822
In [355]:
plt.figure(figsize=(5,5))

plt.bar(result.index, result[('CTR', 'mean')],
color='red',
width=0.4)
plt.xlabel('No. of words in a Query')
plt.ylabel('Avg. CTR')

plt.tight_layout()


Conclusion: Avg. CTR is lowest for one-word queries (3.71), peaks at 2 to 3 words (about 4.9), and then decreases as the no. of words in the query increases.

In [356]:
# How is the word count of Ad Title related to CTR of an Ad...

temp = data[['TCount', 'CTR']].copy()

In [357]:
result = temp.groupby('TCount').agg(['mean'])

Out[357]:
CTR
mean
TCount
1 5.188903
2 5.468239
3 4.382002
4 4.484546
5 4.361038
In [358]:
plt.figure(figsize=(5,5))

plt.bar(result.index, result[('CTR', 'mean')],
color='red',
width=0.4)
plt.xlabel('No. of words in an Ad Title')
plt.ylabel('Avg. CTR')

plt.tight_layout()


Conclusion: Avg. ad CTR is more or less uniform across the no. of words in the ad title; there is no clear trend.

In [21]:
# Is there a relation between the no. of times an ad appears (impressions) & the no. of times it is clicked?

temp = data[['AdId', 'Impression', 'Click']].copy()

Out[21]:
AdId Impression Click
0 7686695 1 0
1 7686695 2 0
2 7686695 1 0
3 7686695 1 0
4 7686695 2 0
In [22]:
result = temp.groupby('AdId').agg(['mean'])

Out[22]:
Impression Click
mean mean
1000031 1.000000 0.000000
1000467 2.315789 0.105263
1000468 1.000000 0.000000
1000469 1.000000 0.000000
1000471 1.000000 0.000000
1000473 2.000000 0.000000
In [23]:
x = result[('Impression', 'mean')]
y = result[('Click', 'mean')]
plt.scatter(x,
y,
c='green',
s=100,
marker='o',
edgecolor=None)
plt.xlabel('No. of Impressions')
plt.ylabel('No. of Clicks')
plt.title('Relationship between Ad Impressions & Clicks')

Out[23]:
<matplotlib.text.Text at 0x11c6f7290>

Conclusion: As the no. of impressions of an advertisement increases, the no. of clicks stays mostly close to 0.

This may point to an important aspect of human behaviour: as a user sees the same ad again and again, they become less likely to click it.
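
One way to probe this reading further is to pool clicks and impressions over all rows that share the same impression count and compare the resulting CTRs. The sketch below runs on toy data; the `toy` and `buckets` names are hypothetical and this is not part of the original analysis.

```python
import pandas as pd

# Toy instances; Impression is how often the ad was shown for that row.
toy = pd.DataFrame({
    'AdId':       [1, 1, 2, 2, 3],
    'Impression': [1, 1, 2, 4, 4],
    'Click':      [1, 0, 1, 0, 1],
})

# Per impression count: total clicks and number of rows.
buckets = toy.groupby('Impression')['Click'].agg(['sum', 'count'])
buckets.columns = ['Clicks', 'Rows']

# Total impressions in a bucket = impression count * number of rows.
buckets['Impressions'] = buckets.index.values * buckets['Rows']
buckets['CTR'] = buckets['Clicks'] / buckets['Impressions'].astype(float)

print(buckets)
```

If the pooled CTR falls steadily as the impression count grows on the real data, that would lend more support to the repeated-exposure interpretation than the scatter plot alone.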

In [363]:
# Let us see how Gender of a user has an impact on CTR

temp = data[['Gender', 'CTR']].copy()

Out[363]:
Gender CTR
0 1 0.0
1 1 0.0
2 2 0.0
3 0 0.0
4 2 0.0
In [364]:
result = temp.groupby('Gender').agg(['mean'])

Out[364]:
CTR
mean
Gender
0 4.527559
1 4.409308
2 4.585996
In [365]:
plt.figure(figsize=(5,5))

plt.bar(result.index, result[('CTR', 'mean')],
color='red',
width=0.3)
plt.xlabel('Gender')
plt.ylabel('Avg. CTR')

plt.tight_layout()


In [21]:
# Let us see how the Age of a user affects CTR:

temp = data[['Age', 'CTR']].copy()

Out[21]:
Age CTR
0 3 0.0
1 3 0.0
2 5 0.0
3 2 0.0
4 5 0.0
In [22]:
result = temp.groupby('Age').agg(['mean'])

Out[22]:
CTR
mean
Age
1 4.457391
2 4.525747
3 4.384555
4 4.232053
5 4.746470
6 5.256566
In [24]:
plt.figure(figsize=(5,5))

plt.bar(result.index, result[('CTR', 'mean')],
color='red',
width=0.3)
plt.xlabel('Age')
plt.ylabel('Avg. CTR')

plt.tight_layout()


Conclusion: Older users (age groups 5 and 6, i.e. above 30) have a higher avg. CTR than younger users.

In [368]:
# How does the gender of an older user affect the CTR of an ad...

temp = data[['Gender', 'Age', 'CTR']].copy()
temp = temp[(temp['Age'] == 5) | (temp['Age'] == 6)] # filter aged users.

Out[368]:
Gender Age CTR
2 2 5 0.0
4 2 5 0.0
6 1 6 0.0
14 2 5 0.0
49 2 5 0.0
In [369]:
temp = temp[['Gender', 'CTR']].copy()
result = temp.groupby('Gender').agg(['mean'])

Out[369]:
CTR
mean
Gender
0 4.277570
1 4.808067
2 5.025482
In [371]:
plt.figure(figsize=(5,5))

plt.bar(result.index, result[('CTR', 'mean')],
color='green',
width=0.5)
plt.xlabel('Gender')
plt.ylabel('Avg. CTR')

plt.tight_layout()


Conclusion: Avg. CTR for older male users (4.81) and older female users (5.03) is similar, and both are higher than the avg. CTR of male (4.41) and female (4.59) users across all age groups.
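
The age and gender comparisons above can also be produced in a single table instead of filtering first. A minimal sketch with `pivot_table` on toy data follows; the `toy` frame is made up and the values are not from the dataset.

```python
import pandas as pd

# Toy rows with the same columns as the analysis above; CTR is in percent.
toy = pd.DataFrame({
    'Gender': [1, 1, 2, 2, 1, 2],
    'Age':    [3, 6, 3, 6, 6, 5],
    'CTR':    [0.0, 100.0, 0.0, 50.0, 0.0, 0.0],
})

# One cell per (Age, Gender) pair; each value is the mean per-row CTR.
table = toy.pivot_table(index='Age', columns='Gender', values='CTR', aggfunc='mean')
print(table)
```

Combinations that never occur in the data show up as NaN, which makes sparse cells easy to spot before drawing conclusions from them.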

In [372]:
# Let us try to see if the position of an ad has an effect on CTR

temp = data[['Pos', 'CTR']].copy()

Out[372]:
Pos CTR
0 3 0.0
1 2 0.0
2 2 0.0
3 3 0.0
4 3 0.0
In [373]:
result = temp.groupby('Pos').agg(['mean', 'count'])

Out[373]:
CTR
mean count
Pos
1 5.518009 3022771
2 3.077325 1597657
3 1.894403 331846
In [374]:
plt.figure(figsize=(10,4))

plt.subplot(1, 2, 1)
plt.bar(result.index, result[('CTR', 'mean')],
color='red')
plt.xlabel('Position')
plt.ylabel('Avg. CTR')

plt.subplot(1, 2, 2)
plt.bar(result.index, result[('CTR', 'count')],
color='green')
plt.xlabel('Position')

plt.tight_layout()


Conclusion: Clearly, the avg. CTR of an advertisement at a low position (more visible to the user) is higher than that of an advertisement at a higher position (not directly visible): it falls from 5.52 at position 1 to 1.89 at position 3.

In [375]:
# Let us try to see if the depth of a session has an effect on CTR

temp = data[['Depth', 'CTR']].copy()

Out[375]:
Depth CTR
0 3 0.0
1 2 0.0
2 2 0.0
3 3 0.0
4 3 0.0
In [376]:
result = temp.groupby('Depth').agg(['mean', 'count'])

Out[376]:
CTR
mean count
Depth
1 4.577864 1450766
2 4.799309 2514678
3 3.561618 986830
In [377]:
plt.figure(figsize=(10,4))

plt.subplot(1, 2, 1)
plt.bar(result.index, result[('CTR', 'mean')],
color='red')
plt.xlabel('Depth')
plt.ylabel('Avg. CTR')

plt.subplot(1, 2, 2)
plt.bar(result.index, result[('CTR', 'count')],
color='green')
plt.xlabel('Depth')

plt.tight_layout()


Conclusion:

1. The most common depth of a search session is 2.
2. If the depth is high (3), avg. CTR falls: it is 4.58 at depth 1, 4.80 at depth 2, and 3.56 at depth 3. This suggests that when the no. of ads in a search session reaches 3, avg. CTR decreases.
In [114]:
# Studying the role of Advertiser...

# What is my goal? I am trying to find advertisers who have a high CTR. Why?

# This will help me understand what they are doing differently from other advertisers.

# Once we have the advertisers with high CTR, we can see how their ad properties vary.

# We can find out whether advertisers with high-CTR ads use more words in the ad description, more words in the ad title, etc.

In [24]:
# Preparing data...

temp = data[['AdvId', 'CTR', 'DCount', 'TCount']].copy()

Out[24]:
AdvId CTR DCount TCount
0 385 0.0 21 8
1 385 0.0 21 8
2 385 0.0 21 8
3 385 0.0 21 8
4 385 0.0 21 8
In [25]:
result = temp.groupby('AdvId').agg(['mean'])

Out[25]:
CTR DCount TCount
mean mean mean
78 0.000000 2.000000 3.000000
80 0.000000 2.000000 3.000000
81 7.317073 26.000000 11.000000
82 3.693460 22.000000 5.000000
83 3.378378 17.783784 5.081081
In [26]:
temp = pd.DataFrame()

# `.values` replaces `.get_values()`, which newer pandas releases no longer provide.
temp['CTR']      = result[('CTR', 'mean')].values
temp['DCount']   = result[('DCount', 'mean')].values
temp['TCount']   = result[('TCount', 'mean')].values


Out[26]:
0 78 0.000000 2.000000 3.000000
1 80 0.000000 2.000000 3.000000
2 81 7.317073 26.000000 11.000000
3 82 3.693460 22.000000 5.000000
4 83 3.378378 17.783784 5.081081
In [27]:
print('No. of unique advertisers:  {0}'.format(temp.shape[0]))

No. of unique advertisers:  13921

In [ ]:
# Deciding how an advertiser qualifies as an advertiser with high CTR ads...

# Let us study distribution of avg. CTR...

In [38]:
f, (ax1, ax2) = plt.subplots(2)
sns.kdeplot(temp['CTR'], ax=ax1)
sns.boxplot(x=None,y='CTR',data=temp, ax=ax2)

Out[38]:
<matplotlib.axes._subplots.AxesSubplot at 0x1d3346b50>
In [42]:
mean_advertiser_ctr = temp['CTR'].mean()
print('Average CTR of Ads given by an advertiser:  {0}'.format(round(mean_advertiser_ctr, 2)))

Average CTR of Ads given by an advertiser:  3.94
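
How to decide which advertisers count as "high CTR" is left open above. One possible rule, offered only as an assumption, is to take advertisers whose average CTR lies above a chosen percentile and compare their ad text lengths with the rest. The `adv` frame below is toy data shaped like `temp`, and the 0.9 cutoff is arbitrary.

```python
import numpy as np
import pandas as pd

# Toy per-advertiser averages, shaped like `temp` above.
adv = pd.DataFrame({
    'CTR':    [0.0, 0.0, 7.3, 3.7, 3.4, 12.0, 1.0, 0.5, 2.0, 4.0],
    'DCount': [2.0, 2.0, 26.0, 22.0, 17.8, 20.0, 19.0, 23.0, 21.0, 18.0],
    'TCount': [3.0, 3.0, 11.0, 5.0, 5.1, 9.0, 8.0, 7.0, 10.0, 6.0],
})

# Assumed rule: "high" means average CTR above the 90th percentile.
threshold = adv['CTR'].quantile(0.9)
adv['Group'] = np.where(adv['CTR'] > threshold, 'high', 'rest')

# Compare average description and title length between the two groups.
summary = adv.groupby('Group')[['DCount', 'TCount']].mean()

print(threshold)
print(summary)
```

On the real data it may also be worth requiring a minimum number of impressions per advertiser, since an average CTR computed from a handful of rows is noisy.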