This Notebook is presented by Abhishek Srivastava

Project Lead: Srinivas Singam Reddy

Introduction

Search advertising has been one of the major revenue sources of the Internet industry for years. A key technology behind search advertising is to predict the click-through rate (pCTR) of ads, as the economic model behind search advertising requires pCTR values to rank ads and to price clicks. In this task, given the training instances derived from session logs of the Tencent proprietary search engine, soso.com, participants are expected to accurately predict the pCTR of ads in the testing instances.

About Training Datafile

The training data file is a text file, where each line is a training instance derived from search session log messages. To understand the training data, let us begin with a description of search sessions.

A search session refers to an interaction between a user and the search engine. It contains the following ingredients: the user, the query issued by the user, some ads returned by the search engine and thus impressed (displayed) to the user, and zero or more ads that were clicked by the user. For clarity, we introduce some terminology here. The number of ads impressed in a session is known as the ‘depth’. The order of an ad in the impression list is known as the ‘position’ of that ad. An ad, when impressed, is displayed as a short text known as the ‘title’, followed by a slightly longer text known as the ‘description’, and a URL (usually shortened to save screen space) known as the ‘display URL’.

We divide each session into multiple instances, where each instance describes an impressed ad under a certain setting (i.e., with certain depth and position values). We aggregate instances with the same user id, ad id, query, and setting in order to reduce the dataset size. Therefore, schematically, each instance contains at least the following information:

UserID
AdID
Query
Depth
Position
Impression: the number of search sessions in which the ad (AdID) was impressed to the user (UserID) who issued the query (Query).
Click: the number of times, among the above impressions, the user (UserID) clicked the ad (AdID).

Moreover, the training, validation, and testing data contain more information than the above list, because each ad and each user has some additional properties. We include some of these properties in the training, validation, and testing instances, and put the other properties in separate data files that can be indexed using the ids in the instances. For more information about these data files, please refer to the section ADDITIONAL DATA FILES.

Finally, after including additional features, each training instance is a line consisting of fields delimited by the TAB character:

Click: as described in the above list.
Impression: as described in the above list.
DisplayURL: a property of the ad.

The URL is shown together with the title and description of an ad. It is usually the shortened landing page URL of the ad, but not always. In the data file, this URL is hashed for anonymity.

AdID: as described in the above list.
AdvertiserID: a property of the ad.

Some advertisers consistently optimize their ads, so the title and description of their ads are more attractive than those of others’ ads.

Depth: a property of the session, as described above.
Position: a property of an ad in a session, as described above.
QueryID: id of the query.

This id is a zero‐based integer value. It is the key of the data file 'queryid_tokensid.txt'.

KeywordID: a property of ads.

This is the key of 'purchasedkeywordid_tokensid.txt'.

TitleID: a property of ads.

This is the key of 'titleid_tokensid.txt'.

DescriptionID: a property of ads.

This is the key of 'descriptionid_tokensid.txt'.

UserID

This is the key of 'userid_profile.txt'. When we cannot identify the user, this field has a special value of 0.
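The field layout described above can be sketched as a small parser. This is only an illustration under stated assumptions: the helper name `parse_training_line` and the `TRAINING_FIELDS` list are hypothetical, the sample line is made up, and the field order simply follows the list given above.

```python
# Field order as listed in the dataset description (assumed, not verified against the file).
TRAINING_FIELDS = [
    "Click", "Impression", "DisplayURL", "AdID", "AdvertiserID", "Depth",
    "Position", "QueryID", "KeywordID", "TitleID", "DescriptionID", "UserID",
]


def parse_training_line(line):
    """Split one TAB-delimited training line into a dict of named fields."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(TRAINING_FIELDS):
        raise ValueError(
            "expected %d fields, got %d" % (len(TRAINING_FIELDS), len(values))
        )
    record = dict(zip(TRAINING_FIELDS, values))
    # DisplayURL is a large hash, so it stays a string; all other fields are integers.
    for name in TRAINING_FIELDS:
        if name != "DisplayURL":
            record[name] = int(record[name])
    return record


# Made-up example line, for illustration only.
sample = "0\t1\t4298118681424644510\t7686695\t385\t3\t3\t1601\t5521\t7709\t576\t490234\n"
record = parse_training_line(sample)
print(record["Click"], record["Impression"], record["UserID"])
```

Keeping DisplayURL as a string avoids any question of integer width for the hashed value; that choice is a convenience of this sketch, not something the dataset description requires.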

ADDITIONAL DATA FILES

There are five additional data files, as mentioned in the above section:

queryid_tokensid.txt
purchasedkeywordid_tokensid.txt
titleid_tokensid.txt
descriptionid_tokensid.txt
userid_profile.txt

Each line of the first four files maps an id to a list of tokens, corresponding to the query, keyword, ad title, and ad description, respectively. In each line, a TAB character separates the id and the token set. A token can basically be a word in a natural language. For anonymity, each token is represented by its hash value. Tokens are delimited by the character ‘|’.
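As a rough sketch of the line format just described (an id, a TAB, then '|'-delimited token hashes), one line could be parsed as follows. The function name `parse_token_line` is hypothetical, and the sample line is only an illustration.

```python
def parse_token_line(line):
    """Return (id, [token hashes]) for one line of an id-to-tokens file."""
    # Split only on the first TAB: the id comes first, the token set second.
    raw_id, raw_tokens = line.rstrip("\n").split("\t", 1)
    # Tokens are hash values delimited by '|'.
    tokens = [int(tok) for tok in raw_tokens.split("|") if tok]
    return int(raw_id), tokens


query_id, tokens = parse_token_line("1\t1545|75|31\n")
print(query_id, tokens)
```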

Each line of ‘userid_profile.txt’ is composed of UserID, Gender, and Age, delimited by the TAB character. Note that not every UserID in the training and the testing set will be present in ‘userid_profile.txt’. Each field is described below:

Gender:

'1' for male, '2' for female, and '0' for unknown.

Age:

'1' for (0, 12], '2' for (12, 18], '3' for (18, 24], '4' for (24, 30], '5' for (30, 40], and '6' for greater than 40.
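The Gender and Age codes above can be turned into readable labels with two lookup tables. This is a minimal sketch: the names `GENDER_LABELS`, `AGE_LABELS`, and `parse_profile_line` are hypothetical, and the sample profile line is made up.

```python
# Code-to-label tables taken from the field descriptions above.
GENDER_LABELS = {0: "unknown", 1: "male", 2: "female"}
AGE_LABELS = {
    1: "(0, 12]",
    2: "(12, 18]",
    3: "(18, 24]",
    4: "(24, 30]",
    5: "(30, 40]",
    6: "> 40",
}


def parse_profile_line(line):
    """Decode one TAB-delimited line of UserID, Gender, Age."""
    user_id, gender, age = (int(v) for v in line.rstrip("\n").split("\t"))
    return {
        "UserID": user_id,
        "Gender": GENDER_LABELS.get(gender, "unknown"),
        "Age": AGE_LABELS.get(age, "unknown"),
    }


# Made-up profile line, for illustration only.
profile = parse_profile_line("490234\t1\t5\n")
print(profile)
```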

# Loading libraries...
import pandas as pd    
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plt 
import seaborn as sns

Preparing Data

# Load Training Data..

column  = ['Click', 'Impression', 'AdURL', 'AdId', 'AdvId', 'Depth', 'Pos', 'QId', 'KeyId', 'TitleId', 'DescId', 'UId']
orignal = pd.read_csv('training.txt', sep='\t', header=None, nrows = 5000000, names=column)
orignal.head()

# Load User Data..

user_col  = ['UId', 'Gender', 'Age']
user      = pd.read_csv('userid_profile.txt', sep='\t', header=None, names=user_col)
user.head()

# Load Query Data..

query_col = ['QId', 'Query']
query     = pd.read_csv('queryid_tokensid.txt', sep='\t', header=None, names=query_col)
query.head(5)

# Load Ad Description Data..

desc_col  = ['DescId', 'Description']
desc      = pd.read_csv('descriptionid_tokensid.txt', sep='\t', header=None, names=desc_col)
desc.head(5)

# Load Ad Title Data..

title_col = ['TitleId', 'Title']
title     = pd.read_csv('titleid_tokensid.txt', sep='\t', header=None, names=title_col)
title.head(5)

def count(sentence):
    '''
        (str) -> (int)
        Returns the no. of '|'-delimited tokens (hashed words) in a sentence.
    '''
    return len(str(sentence).split('|'))
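A quick check of how this helper behaves on a few inputs may be useful. The sketch below repeats the function so that it runs on its own; the inputs are made up. Note the edge cases: because the value is passed through `str()`, an empty string and a missing value (NaN) are both counted as one word.

```python
def count(sentence):
    """Return the number of '|'-delimited tokens in a string."""
    return len(str(sentence).split('|'))


print(count('12731'))         # a single token
print(count('1545|75|31'))    # three tokens
print(count(''))              # empty string still yields one (empty) token
print(count(float('nan')))    # str(nan) is 'nan', which is counted as one token
```

If missing token lists were possible in a file, they would need to be handled before applying the function; whether that happens in this dataset is not established here.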

# Count no. of words in a query issued by a user.

query['QCount'] = query['Query'].apply(count)

query.head(5)

# Query isn't required now, get rid of it.

del query['Query']
query.head()

# Count no. of words in title of an advertisement.

title['TCount'] = title['Title'].apply(count)

title.head()

# Advertisement Title isn't required now, get rid of it.

del title['Title']
title.head()

# Count no. of words in description of an advertisement.

desc['DCount'] = desc['Description'].apply(count)

desc.head()

# Advertisement Description isn't required now, get rid of it.

del desc['Description']
desc.head()

# Merging orignal with user, query, title & desc on appropriate keys to get data..

data = pd.merge(orignal, user,  on='UId')
data = pd.merge(data,    query, on='QId')
data = pd.merge(data,    title, on='TitleId')
data = pd.merge(data,    desc,  on='DescId')

data.head()

# Add target variable CTR to the dataset...

data['CTR'] = data['Click'] * 1.0 / data['Impression'] * 100
data.head()

# Basic Information about the data...

data.shape

(4952274, 18)

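Note: We initially loaded 5M datapoints, but only about 4.95M remain after the merges. What does this indicate? pd.merge performs an inner join by default, so a row is dropped whenever its key has no match in the other table. Profile data is missing for many user ids (not every UserID is present in 'userid_profile.txt'), hence the merge gets rid of those datapoints.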
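One way to confirm where rows are lost is a left merge with pandas' `indicator=True` option, which marks rows that found no match. The sketch below uses tiny made-up frames (`instances`, `profiles`) as stand-ins for the notebook's training and user tables; it is an illustration, not a measurement on the real data.

```python
import pandas as pd

# Toy stand-ins for the training instances and the user profiles (values are made up).
instances = pd.DataFrame({"UId": [0, 11, 11, 42], "Click": [0, 1, 0, 0]})
profiles = pd.DataFrame({"UId": [11, 42], "Gender": [1, 2], "Age": [3, 5]})

# how="left" keeps every instance; indicator=True adds a `_merge` column
# telling whether a matching profile row was found.
merged = pd.merge(instances, profiles, on="UId", how="left", indicator=True)
dropped_by_inner = int((merged["_merge"] == "left_only").sum())

# The default merge is an inner join, which silently drops unmatched rows.
inner_rows = len(pd.merge(instances, profiles, on="UId"))

print(len(instances), inner_rows, dropped_by_inner)
```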

Data Analysis

# CTR(ad) = #Clicks(ad)/#Impressions(ad)

# Calculating net CTR for our dataset...

total_impressions = data['Impression'].sum()
total_clicks      = data['Click'].sum()
net_CTR           = total_clicks * 1.0 / total_impressions

print('Net CTR: {0} %'.format(round(net_CTR * 100, 2)))

Net CTR: 4.2 %
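The net CTR above divides total clicks by total impressions, whereas the CTR column added earlier is a per-row ratio. Averaging that column gives every row the same weight regardless of its impression count, so the two numbers generally differ. A minimal sketch on made-up numbers (the frame `toy` is hypothetical):

```python
import pandas as pd

# Three made-up aggregated instances.
toy = pd.DataFrame({"Click": [1, 0, 2], "Impression": [1, 1, 8]})

# Impression-weighted ("net") CTR: total clicks over total impressions, in percent.
net_ctr = 100.0 * toy["Click"].sum() / toy["Impression"].sum()

# Unweighted mean of per-row CTRs: every row counts equally.
mean_row_ctr = (100.0 * toy["Click"] / toy["Impression"]).mean()

print(round(net_ctr, 2), round(mean_row_ctr, 2))
```

This is worth keeping in mind when comparing the net CTR with the group averages of the CTR column computed later in the analysis.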

total = data.shape[0]

# No. of unique users as a percentage of the no. of rows in the dataset...

print('{0} %'.format(round(data['UId'].nunique() * 100.0 / total, 2)))

19.62 %

# No. of unique queries as a percentage of the no. of rows in the dataset...

print('{0} %'.format(round(data['QId'].nunique() * 100.0 / total, 2)))

23.34 %

# No. of unique advertisements as a percentage of the no. of rows in the dataset...

print('{0} %'.format(round(data['AdId'].nunique() * 100.0 / total, 2)))

4.29 %

# No. of unique advertisers as a percentage of the no. of rows in the dataset...

print('{0} %'.format(round(data['AdvId'].nunique() * 100.0 / total, 2)))

0.28 %

# Distribution of word count in a search query...

temp = data[['QCount']].copy()

print('Maximum Length of a Query:  {0}'.format(temp['QCount'].max()))
print('Average Length of a Query:  {0}'.format(temp['QCount'].mean()))

Maximum Length of a Query:  127
Average Length of a Query:  2.98855354126

sns.boxplot(x=None,y='QCount',data=temp)


Clearly, the data contains outliers: the longest query has 127 words, while the average is about 3. We will remove them before constructing the box plots, otherwise the plots are difficult to read. First, the median and the third quartile:

print('Median No. of words in a Search query: {0}'.format(temp['QCount'].quantile(0.5)))
print('3rd Quantile No. of words in a Search query: {0}'.format(temp['QCount'].quantile(0.75)))

Median No. of words in a Search query: 3.0
3rd Quantile No. of words in a Search query: 4.0

# Remove outliers

temp = temp[temp['QCount'] < 10.0]
print('Maximum Length of a Query:  {0}'.format(temp['QCount'].max()))
print('Average Length of a Query:  {0}'.format(temp['QCount'].mean()))

Maximum Length of a Query:  9
Average Length of a Query:  2.94981899189

# Distribution of word count in a query...

plt.figure(figsize=(10,4))

plt.subplot(1, 2, 1)
plt.hist(temp['QCount'],
         color='green',
         bins=25)
plt.xlabel('No. of words in a Query')

plt.subplot(1, 2, 2)
plt.boxplot(temp['QCount'],
            labels=['No. of words in a Query'],
            )

plt.tight_layout()

Conclusion:

75% of search queries contain at most 4 words (the third quartile is 4.0).
The median no. of words in a search query is 3.

# Distribution of word count in an ad description

temp = data[['DCount']].copy()

print('Maximum Length of an Ad Description:  {0}'.format(temp['DCount'].max()))
print('Average Length of an Ad Description:  {0}'.format(temp['DCount'].mean()))

Maximum Length of an Ad Description:  47
Average Length of an Ad Description:  21.311864812

# Distribution of word count in the description of an ad...

plt.figure(figsize=(10,4))

plt.subplot(1, 2, 1)
plt.hist(temp['DCount'],
         bins=100,
         color='red')
plt.xlabel('No. of words in an Ad Description')

plt.subplot(1, 2, 2)
plt.boxplot(temp['DCount'],
            labels=['No. of words in an Ad Description'],
            )

plt.tight_layout()

Conclusion:

For most of the advertisements (about 75%), the description contains between 15 and 25 words.

# Distribution of word count in an ad title

temp = data[['TCount']].copy()

print('Maximum Length of an Ad Title:  {0}'.format(temp['TCount'].max()))
print('Average Length of an Ad Title:  {0}'.format(temp['TCount'].mean()))

Maximum Length of an Ad Title:  32
Average Length of an Ad Title:  8.76529893136

# Distribution of word count in an ad title...

plt.figure(figsize=(10,4))

plt.subplot(1, 2, 1)
plt.hist(temp['TCount'],
         color='red',
         bins=100)
plt.xlabel('No. of words in an Ad Title')

plt.subplot(1, 2, 2)
plt.boxplot(temp['TCount'],
            labels=['No. of words in an Ad Title'],
            )

plt.tight_layout()

Conclusion:

For most of the advertisements (about 75%), the title contains between 7 and 12 words.

# Do fewer words in a query indicate a higher CTR...

temp = data[['QCount', 'CTR']].copy()
temp.head()

temp = temp[temp['QCount'] <= 20]

result = temp.groupby('QCount').agg(['mean'])
result.head()

plt.figure(figsize=(5,5))

plt.bar(result.index, result[('CTR', 'mean')],
        color='red',
        width=0.4)
plt.xlabel('No. of words in a Query')
plt.ylabel('Avg. CTR')

plt.tight_layout()

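Conclusion: Avg. CTR is lowest for one-word queries (3.71%), peaks at two to three words (about 4.9%), and then decreases as the no. of words in a query increases.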

# How is the word count of Ad Title related to CTR of an Ad...

temp = data[['TCount', 'CTR']].copy()

result = temp.groupby('TCount').agg(['mean'])
result.head()

plt.figure(figsize=(5,5))

plt.bar(result.index, result[('CTR', 'mean')],
        color='red',
        width=0.4)
plt.xlabel('No. of words in an Ad Title')
plt.ylabel('Avg. CTR')

plt.tight_layout()

Conclusion: Avg. Ad CTR is more or less uniform across the no. of words in the Ad title.

# Is there a relation between the no. of times an ad appears (impressions) & the no. of times it is clicked?

temp = data[['AdId', 'Impression', 'Click']].copy()
temp.head()

result = temp.groupby('AdId').agg(['mean'])
result.head(6)

x = result[('Impression', 'mean')]
y = result[('Click', 'mean')]
plt.scatter(x,
            y,
            c='green',
            s=100,
            marker='o',
            edgecolor=None)
plt.xlabel('No. of Impressions')
plt.ylabel('No. of Clicks')
plt.title('Relationship between Ad Impressions & Clicks')


Conclusion: As the no. of impressions of an advertisement increases, its no. of clicks mostly stays close to 0.

This indicates a crucial aspect of human behaviour: as a user sees the same ad again & again, they are less likely to click it.

# Let us see how Gender of a user has an impact on CTR

temp = data[['Gender', 'CTR']].copy()
temp.head()

result = temp.groupby('Gender').agg(['mean'])
result.head()

plt.figure(figsize=(5,5))

plt.bar(result.index, result[('CTR', 'mean')],
        color='red',
        width=0.3)
plt.xlabel('Gender')
plt.ylabel('Avg. CTR')

plt.tight_layout()

Conclusion: Avg. CTR is similar across genders, so the gender of a user alone doesn't appear to impact the CTR of an advertisement.

# Let us see how the Age of a user affects CTR:

temp = data[['Age', 'CTR']].copy()
temp.head()

result = temp.groupby('Age').agg(['mean'])
result.head(6)

plt.figure(figsize=(5,5))

plt.bar(result.index, result[('CTR', 'mean')],
        color='red',
        width=0.3)
plt.xlabel('Age')
plt.ylabel('Avg. CTR')

plt.tight_layout()

Conclusion: Older users (age groups 5 and 6, i.e. above 30) have a higher avg. CTR (4.75% and 5.26%) than younger users (between 4.23% and 4.53%).

# How does the gender of an older user affect the CTR of an ad...

temp = data[['Gender', 'Age', 'CTR']].copy()
temp = temp[(temp['Age'] == 5) | (temp['Age'] == 6)] # filter aged users.
temp.head()

temp = temp[['Gender', 'CTR']].copy()
result = temp.groupby('Gender').agg(['mean'])
result.head()

plt.figure(figsize=(5,5))

plt.bar(result.index, result[('CTR', 'mean')],
        color='green',
        width=0.5)
plt.xlabel('Gender')
plt.ylabel('Avg. CTR')

plt.tight_layout()

Conclusion: Among older users (age groups 5 and 6), the avg. CTR of male & female users is almost the same, and it is higher than the avg. CTR of male & female users across all age groups.

# Let us try to see if the position of an ad has an effect on CTR

temp = data[['Pos', 'CTR']].copy()
temp.head()

result = temp.groupby('Pos').agg(['mean', 'count'])
result.head()

plt.figure(figsize=(10,4))

plt.subplot(1, 2, 1)
plt.bar(result.index, result[('CTR', 'mean')],
        color='red')
plt.xlabel('Position')
plt.ylabel('Avg. CTR')

plt.subplot(1, 2, 2)
plt.bar(result.index, result[('CTR', 'count')],
        color='green')
plt.xlabel('Position')
plt.ylabel('Frequency of Ads')

plt.tight_layout()

Conclusion: Clearly, the CTR of an advertisement at a low position (more visible to the user) is higher than the CTR of an advertisement at a higher position (not directly visible).

Most advertisements appear at the lower positions (1 and 2).
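The bars above average the per-row CTR within each position, so a row with one impression weighs as much as a row with many. A sketch of an impression-weighted alternative, which sums clicks and impressions per group before dividing, is shown below on made-up data (the frame `toy` is hypothetical, and this is not the computation used in the notebook).

```python
import pandas as pd

# Made-up instances: position, clicks, impressions.
toy = pd.DataFrame({
    "Pos": [1, 1, 2, 2],
    "Click": [1, 0, 0, 1],
    "Impression": [2, 2, 1, 9],
})

# Impression-weighted CTR per position: sum first, then divide.
totals = toy.groupby("Pos")[["Click", "Impression"]].sum()
weighted_ctr = 100.0 * totals["Click"] / totals["Impression"]

# Unweighted mean of per-row CTRs, as in the grouped means above.
row_ctr = 100.0 * toy["Click"] / toy["Impression"]
unweighted_ctr = row_ctr.groupby(toy["Pos"]).mean()

print(weighted_ctr)
print(unweighted_ctr)
```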

# Let us try to see if the depth of a search session has an effect on CTR

temp = data[['Depth', 'CTR']].copy()
temp.head()

result = temp.groupby('Depth').agg(['mean', 'count'])
result.head()

plt.figure(figsize=(10,4))

plt.subplot(1, 2, 1)
plt.bar(result.index, result[('CTR', 'mean')],
        color='red')
plt.xlabel('Depth')
plt.ylabel('Avg. CTR')

plt.subplot(1, 2, 2)
plt.bar(result.index, result[('CTR', 'count')],
        color='green')
plt.xlabel('Depth')
plt.ylabel('Frequency of Ads')

plt.tight_layout()

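Conclusion:

The most common depth of a Search Session is 2 (2,514,678 of the 4,952,274 rows).
If the depth is high (3), avg. CTR falls to 3.56%, compared with 4.58% at depth 1 and 4.80% at depth 2. This means that avg. CTR decreases when the maximum no. of ads is shown in a Search Session.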

# Studying the role of Advertiser...

# What is my goal? I am trying to find advertisers who have a high CTR. Why?

# So that I can investigate the ads given by these advertisers.

# This will help me understand what they are doing differently from other advertisers.

# Once we have the advertisers with a high CTR, we can see how ad properties vary.

# We can check whether an advertiser with high-CTR ads uses more words in the ad description, more words in the ad title, etc...

# Preparing data...

temp = data[['AdvId', 'CTR', 'DCount', 'TCount']].copy()
temp.head()

result = temp.groupby('AdvId').agg(['mean'])
result.head()

temp = pd.DataFrame()

temp['AdvId']    = result.index
temp['CTR']      = result[('CTR', 'mean')].values
temp['DCount']   = result[('DCount', 'mean')].values
temp['TCount']   = result[('TCount', 'mean')].values

temp.head()

print('No. of unique advertisers:  {0}'.format(temp.shape[0]))

No. of unique advertisers:  13921

# Deciding how an advertiser qualifies as an advertiser with high CTR ads...

# Let us study distribution of avg. CTR...

f, (ax1, ax2) = plt.subplots(2)
sns.kdeplot(temp['CTR'], ax=ax1)
sns.boxplot(x=None,y='CTR',data=temp, ax=ax2)


mean_advertiser_ctr = temp['CTR'].mean()
print('Average CTR of Ads given by an advertiser:  {0}'.format(round(mean_advertiser_ctr, 2)))

median_advertiser_ctr = temp['CTR'].median()
print('Median CTR of Ads given by an advertiser:  {0}'.format(round(median_advertiser_ctr, 2)))

third_quantile_advertiser_ctr = temp['CTR'].quantile(0.75)
print('3rd Quantile CTR of Ads given by an advertiser:  {0}'.format(round(third_quantile_advertiser_ctr, 2)))

Average CTR of Ads given by an advertiser:  3.94
Median CTR of Ads given by an advertiser:  2.41
3rd Quantile CTR of Ads given by an advertiser:  5.13
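The notebook does not state which cutoff is finally used to call an advertiser "high CTR". As one possible sketch, assuming the third quartile computed above is taken as the threshold, the selection could look like this. The frame `adv` and its values are made up, and the choice of the third quartile is an assumption of this example.

```python
import pandas as pd

# Made-up per-advertiser average CTRs (in percent).
adv = pd.DataFrame({
    "AdvId": [78, 81, 82, 83, 90],
    "CTR": [0.0, 7.3, 3.7, 3.4, 5.5],
})

# Assumed rule: an advertiser is "high CTR" if its average CTR
# is strictly above the third quartile of all advertisers.
threshold = adv["CTR"].quantile(0.75)
high_ctr = adv[adv["CTR"] > threshold]

print(threshold)
print(high_ctr)
```

With a strict comparison, an advertiser sitting exactly on the quartile is not selected; using `>=` instead would include it.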

	Impression	AdURL	AdId	AdvId	Depth	Pos	QId	KeyId	TitleId	DescId	UId
0	1	4298118681424644510	7686695	385	3	3	1601	5521	7709	576	490234
1	1	4860571499428580850	21560664	37484	2	2	2255103	317	48989	44771	490234
2	1	9704320783495875564	21748480	36759	3	3	4532751	60721	685038	29681	490234
3	1	13677630321509009335	3517124	23778	3	1	1601	2155	1207	1422	490234
4	1	3284760244799604489	20758093	34535	1	1	4532751	77819	266618	222223	490234

	QId	Query
0	0	12731
1	1	1545|75|31
2	2	383
3	3	518|1996
4	4	4189|75|31

	DescId	Description
0	0	1545|31|40|615|1|272|18889|1|220|511|20|5270|1...
1	1	172|46|467|170|5634|5112|40|155|1965|834|21|41...
2	2	2672|6|1159|109662|123|49933|160|848|248|207|1...
3	3	13280|35|1299|26|282|477|606|1|4016|1671|771|1...
4	4	13327|99|128|494|2928|21|26500|10|11733|10|318...

	TitleId	Title
0	0	615|1545|75|31|1|138|1270|615|131
1	1	466|582|685|1|42|45|477|314
2	2	12731|190|513|12731|677|183
3	3	2371|3970|1|2805|4340|3|2914|10640|3688|11|834|3
4	4	165|134|460|2887|50|2|17527|1|1540|592|2181|3|...

	QId	Query	QCount
0	0	12731	1
1	1	1545|75|31	3
2	2	383	1
3	3	518|1996	2
4	4	4189|75|31	3

	Impression	Click
	mean	mean
AdId
1000031	1.000000	0.000000
1000467	2.315789	0.105263
1000468	1.000000	0.000000
1000469	1.000000	0.000000
1000471	1.000000	0.000000
1000473	2.000000	0.000000

	CTR
	mean
Age
1	4.457391
2	4.525747
3	4.384555
4	4.232053
5	4.746470
6	5.256566

	CTR
	mean	count
Depth
1	4.577864	1450766
2	4.799309	2514678
3	3.561618	986830

	AdvId	CTR	DCount	TCount
0	78	0.000000	2.000000	3.000000
1	80	0.000000	2.000000	3.000000
2	81	7.317073	26.000000	11.000000
3	82	3.693460	22.000000	5.000000
4	83	3.378378	17.783784	5.081081

	QCount	CTR
0	1	0.0
1	1	0.0
2	1	0.0
3	1	0.0
4	1	0.0

	CTR
	mean
QCount
1	3.707213
2	4.868142
3	4.902491
4	4.597467
5	4.271822

	CTR
	mean
TCount
1	5.188903
2	5.468239
3	4.382002
4	4.484546
5	4.361038

	Gender	CTR
0	1	0.0
1	1	0.0
2	2	0.0
3	0	0.0
4	2	0.0