Introduction

Stack Overflow is the largest, most trusted online community for developers to learn, share their programming knowledge, and build their careers.

Stack Overflow is something every programmer uses in one way or another. Each month, over 50 million developers come to Stack Overflow to learn, share their knowledge, and build their careers. It features questions and answers on a wide range of topics in computer programming. The website serves as a platform for users to ask and answer questions and, through membership and active participation, to vote questions and answers up or down and to edit them in a fashion similar to a wiki or Digg. As of April 2014, Stack Overflow had over 4,000,000 registered users, and it exceeded 10,000,000 questions in late August 2015. Based on the tags assigned to questions, the top eight most discussed topics on the site are: Java, JavaScript, C#, PHP, Android, jQuery, Python and HTML.

Dataset Statistics

All of the data is in 2 files: Train and Test.

Train.csv contains 4 columns: Id,Title,Body,Tags.
Test.csv contains the same columns but without the Tags, which you are to predict.
Size of Train.csv - 6.75GB
Size of Test.csv - 2GB
Number of rows in Train.csv = 6034195
The questions are randomized and contain a mix from verbose text sites as well as sites related to math and programming. The number of questions from each site may vary, and no filtering (such as removing closed questions) has been performed.

Data Field Explanation

The dataset contains 6034195 rows. The columns in the table are:

Id - Unique identifier for each question
Title - The question's title
Body - The body of the question
Tags - The tags associated with the question (all lowercase, should not contain tabs '\t' or ampersands '&')

<img src = "SO.png">

Objective

Analyse the data and plot the required graphs to show that the following conclusions are true:

a. Presence of redundant information.
b. Questions consist of text and code.
c. More than 200 tags were used in at least 10,000 questions.
d. More than 33M unique words are present in the data.
e. The majority of the most frequent tags are programming languages.
f. More than 50% of questions contain a <code> tag.

We are also interested in finding:

a. The total number of questions
b. The average question size
c. The minimum and maximum number of tags per question
d. The average number of tags per question
e. The number of unique tags
f. The top 20 tags
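As a rough illustration of the second set of questions, here is a minimal sketch on a tiny hand-made sample with the same columns as Train.csv (the sample rows are invented for illustration only):

```python
# A minimal sketch, on invented sample rows with the same columns as
# Train.csv, of the quantities we want: question count, average question
# size, tags per question, and unique tags.
import pandas as pd

sample = pd.DataFrame({
    'Id': [1, 2, 3],
    'Title': ['q1', 'q2', 'q3'],
    'Body': ['<p>short body</p>', '<p>a slightly longer body</p>', '<p>x</p>'],
    'Tags': ['php upload', 'firefox', 'r matlab machine-learning'],
})

total_questions = len(sample)
avg_question_size = sample['Body'].str.len().mean()      # characters per Body
tags_per_question = sample['Tags'].str.split().str.len()
unique_tags = {tag for tags in sample['Tags'] for tag in tags.split()}

print(total_questions)                                             # 3
print(int(tags_per_question.min()), int(tags_per_question.max()))  # 1 3
print(len(unique_tags))                                            # 6
```

The same operations, applied chunk by chunk, scale to the full 6M-row dataset.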

Loading the Data

In [1]:
#Let's import pandas to load the csv file.
import pandas as pd
#Since the file is too large, we will break it into chunks.
df = pd.read_csv("Train.csv", chunksize=500000)
In [2]:
#Observe that 'df' is a 'TextFileReader' object.
df
Out[2]:
<pandas.io.parsers.TextFileReader at 0x64a72e8>
In [3]:
#In order to get the data frame, we have to iterate through 'df' chunk by chunk
df.get_chunk(5)
Out[3]:
Id Title Body Tags
0 1 How to check if an uploaded file is an image w... <p>I'd like to check if an uploaded file is an... php image-processing file-upload upload mime-t...
1 2 How can I prevent firefox from closing when I ... <p>In my favorite editor (vim), I regularly us... firefox
2 3 R Error Invalid type (list) for variable <p>I am import matlab file and construct a dat... r matlab machine-learning
3 4 How do I replace special characters in a URL? <p>This is probably very simple, but I simply ... c# url encoding
4 5 How to modify whois contact details? <pre><code>function modify(.......)\n{\n $mco... php api file-get-contents

We won't be using the above method to load the data; using SQLite seems more appropriate.

Using Pandas with SQLite to Load the data

In [6]:
#Creating db file from csv
'''
from sqlalchemy import create_engine # database connection
import datetime as dt
disk_engine = create_engine('sqlite:///train.db')
start = dt.datetime.now()
chunksize = 180000
j = 0
index_start = 1

for df in pd.read_csv('train.csv', names=['Id', 'Title', 'Body', 'Tags'], \
                      chunksize=chunksize, iterator=True, encoding='utf-8', ):

    df.index += index_start

    j+=1
    print '{} rows'.format(j*chunksize)

    df.to_sql('data', disk_engine, if_exists='append')
    index_start = df.index[-1] + 1
'''
Out[6]:
"\nfrom sqlalchemy import create_engine # database connection\nimport datetime as dt\ndisk_engine = create_engine('sqlite:///train.db')\nstart = dt.datetime.now()\nchunksize = 180000\nj = 0\nindex_start = 1\n\nfor df in pd.read_csv('train.csv', names=['Id', 'Title', 'Body', 'Tags'],                       chunksize=chunksize, iterator=True, encoding='utf-8', ):\n\n    df.index += index_start\n\n    j+=1\n    print '{} rows'.format(j*chunksize)\n\n    df.to_sql('data', disk_engine, if_exists='append')\n    index_start = df.index[-1] + 1\n"
In [7]:
#This method seems more appropriate to work with this much data.
#importing sqlite3
import sqlite3
#creating the connection with database file.
con = sqlite3.connect('train.db')
data = pd.read_sql_query("""SELECT Tags FROM data""", con)
#Always remember to close the database
con.close()
In [8]:
#Printing the first 5 rows of our data frame
data.head()
Out[8]:
Tags
0 Tags
1 php image-processing file-upload upload mime-t...
2 firefox
3 r matlab machine-learning
4 c# url encoding
In [9]:
#Observing the shape of our data frame.
data.shape
# Note: We have 6034195 data points.
Out[9]:
(6034196, 1)
In [10]:
# Let's now drop the first row: it is the CSV header, read in as data.
data.drop(data.index[0], inplace=True)
#Printing the first 5 rows of our data frame
data.head()
Out[10]:
Tags
1 php image-processing file-upload upload mime-t...
2 firefox
3 r matlab machine-learning
4 c# url encoding
5 php api file-get-contents
In [11]:
#Observing the labels of each column.
data.keys()
Out[11]:
Index([u'Tags'], dtype='object')
In [12]:
#Let's check for missing values.
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6034195 entries, 1 to 6034195
Data columns (total 1 columns):
Tags    object
dtypes: object(1)
memory usage: 92.1+ MB
In [13]:
# Note: the 'unique' count below is over full tag strings
# (tag combinations), not individual tags. Even so, it is clear that
# the number of unique tags is much smaller than the total number of datapoints.
data.describe()
Out[13]:
Tags
count 6034195
unique 2110907
top android
freq 49783

Observation:

  1. We have 6034195 datapoints.
  2. There are no missing values in the 'Tags' column.
  3. Unique tags are much smaller compared to total datapoints.

Tags

Total number of unique tags

In [15]:
# Importing & Initializing the "CountVectorizer" object, which 
#is scikit-learn's bag of words tool.
from sklearn.feature_extraction.text import CountVectorizer
#'x.split()' tokenizes each tag string on whitespace.
vectorizer = CountVectorizer(tokenizer = lambda x: x.split())
# fit_transform() does two functions: First, it fits the model
# and learns the vocabulary; second, it transforms our training data
# into feature vectors. The input to fit_transform should be a list of 
# strings.
tag_dtm = vectorizer.fit_transform(data['Tags'])
In [16]:
#'get_feature_names()' gives us the vocabulary.
tags = vectorizer.get_feature_names()
#Printing unique tags.
print "Number of unique tags = %d"%len(tags)
#Lets look at the tags we have.
print tags[:10]
Number of unique tags = 42048
[u'.a', u'.app', u'.asp.net-mvc', u'.aspxauth', u'.bash-profile', u'.class-file', u'.cs-file', u'.doc', u'.drv', u'.ds-store']

Observation:
We have 42048 unique tags.

Tag Count

In [17]:
#Let's now store the document-term matrix counts in a dictionary.
terms = vectorizer.get_feature_names()
freqs = tag_dtm.sum(axis=0).A1
result = dict(zip(terms, freqs))
In [18]:
#Saving this dictionary to a csv file.
import csv
with open('tag_counts_dict_dtm.csv', 'wb') as csv_file:
    writer = csv.writer(csv_file)
    for key, value in result.items():
        writer.writerow([key, value])
In [19]:
#Reading the tag_counts_dict_dtm.csv to pandas df
tag_df = pd.read_csv("tag_counts_dict_dtm.csv", names=['Tags', 'Counts'])
tag_df.head()
Out[19]:
Tags Counts
0 mdbg 26
1 fouc 35
2 mdraid 6
3 screen-resolution 661
4 mms-streaming 18
In [26]:
tag_df[tag_df.Counts>10000].head()
#More than 200 tags were used in at least 10,000 questions
Out[26]:
Tags Counts
12776 c# 463526
30473 java 412189
12831 php 392451
22541 javascript 365623
23327 android 320622
In [27]:
# Store tags greater than 10K in one list
lst_tags_gt_10k = tag_df[tag_df.Counts>10000].Tags
#Print the length of the list
print '{} Tags are used more than 10000 times'.format(len(lst_tags_gt_10k))
# Store tags greater than 100K in one list
lst_tags_gt_100k = tag_df[tag_df.Counts>100000].Tags
#Print the length of the list.
print '{} Tags are used more than 100000 times'.format(len(lst_tags_gt_100k))
234 Tags are used more than 10000 times
19 Tags are used more than 100000 times

Observations:

  1. There are 234 tags in total that are used more than 10000 times.
  2. 19 tags are used more than 100000 times.
  3. The most frequent tag (c#) is used 463526 times.

Tags Per Question

In [20]:
#Storing the count of each unique tag in the list 'tag_count'
tag_count = tag_dtm.sum(axis=0).tolist()[0]
#Converting each value in the 'tag_count' to integer.
tag_count = [int(j) for j in tag_count]
print 'We have {} unique tags'.format(len(tag_count))
#Storing the number of tags in each question in the list 'tag_quest_count'
tag_quest_count = tag_dtm.sum(axis=1).tolist()
#Converting each value in the 'tag_quest_count' to integer.
tag_quest_count=[int(j) for i in tag_quest_count for j in i]
print 'We have total {} datapoints.'.format(len(tag_quest_count))

print(tag_quest_count[:5])
We have 42048 unique tags
We have total 6034195 datapoints.
[5, 1, 3, 3, 3]
In [21]:
print "Maximum number of tags per question: %d"%max(tag_quest_count)
print "Minimum number of tags per question: %d"%min(tag_quest_count)
print "Avg. number of tags per question: %f"% ((sum(tag_quest_count)*1.0)/len(tag_quest_count))
Maximum number of tags per question: 5
Minimum number of tags per question: 1
Avg. number of tags per question: 2.885222
In [22]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
sns.countplot(tag_quest_count, palette='RdBu')
plt.title("Number of tags in the questions ")
plt.xlabel("Number of Tags")
plt.ylabel("Number of questions")
plt.show()

Observations:

  1. Maximum number of tags per question: 5
  2. Minimum number of tags per question: 1
  3. Avg. number of tags per question: 2.885222
  4. Most questions have 2 or 3 tags.

Most Frequent Tags

In [23]:
# Plotting the word cloud
import matplotlib.pyplot as plt
from wordcloud import WordCloud
#'generate_from_frequencies' accepts the tag-to-count dictionary directly
#(older wordcloud versions expected a list of (word, count) tuples instead).
wordcloud = WordCloud(    background_color='black',
                          width=1600,
                          height=800,
                    ).generate_from_frequencies(result)

fig = plt.figure(figsize=(30,20))
plt.imshow(wordcloud)
plt.axis('off')
plt.tight_layout(pad=0)
fig.savefig("tag.png")
plt.show()

Observations:
A look at the word cloud shows that "c#", "java", "php", "asp.net", "javascript", "c++" are some of the most frequent tags.

A look at the top 20 tags

In [24]:
#Sorting the tags based on counts.
tag_df.sort_values(['Counts'], ascending=False, inplace=True)
tag_df.head(20) 
#o/p: 20 most frequent tags
Out[24]:
Tags Counts
12776 c# 463526
30473 java 412189
12831 php 392451
22541 javascript 365623
23327 android 320622
4153 jquery 305614
3573 c++ 199280
12552 python 184928
29481 iphone 183573
2216 asp.net 177334
12221 mysql 172182
29852 html 165507
30917 .net 162359
39208 ios 136080
32694 objective-c 133932
8943 sql 132465
38206 css 129107
35750 linux 127606
1121 ruby-on-rails 116883
29129 windows 98100
In [25]:
i=np.arange(20)
tag_df.head(20).plot(kind='bar')
plt.title('Frequency of top 20 tags')
plt.xticks(i, tag_df['Tags'])
plt.xlabel('Tags')
plt.ylabel('Counts')
plt.show()

Observations:

  1. The majority of the most frequent tags are programming languages.
  2. C# is the most frequent tag.
  3. Android, iOS, Linux and Windows are among the most frequent operating-system tags.

Checking for Duplicates

In [32]:
con = sqlite3.connect('train.db')
df = pd.read_sql_query('SELECT Title, Body, Tags, COUNT(*) as cnt_dup FROM data GROUP BY Title, Body, Tags Having COUNT(*)>1', con)
con.close()
In [33]:
df.head()
Out[33]:
Title Body Tags cnt_dup
0 java.sql.SQLException:[Microsoft][ODBC Dri... <p>I use the following code</p>\n\n<pre><code>... java jdbc 2
1 btnAdd click event opens two window after r... <p>i m opening window(search.aspx)using below ... javascript asp.net web 3
2 Countable subadditivity of the Lebesgue measure <p>Let $\lbrace F_n \rbrace$ be a sequence of ... real-analysis measure-theory 3
3 HQL equivalent to this Sql Query <pre><code>select part.PaId,part.PaName,part.P... hibernate hql 2
4 "Uncaught TypeError: Property 'addListGroup' ... <p>Asked this the other day but didn't get a r... javascript listbox 2
In [34]:
df.tail()
Out[34]:
Title Body Tags cnt_dup
1550026 ♥ character displayed incorrectly in Glyphs <p>I am using the Glyphs class to display text... c# silverlight windows-phone-7 2
1550027 打包时怎么忽略掉不需要打包的文件,tar命令 <p>打包时怎么忽略掉不需要打包的文件,如果用tar命令怎么打包?</p>\n\n<p><s... linux 2
1550028 (wordpress)Plugin Error:You do not have suffic... <p>My plugin which adds a menu in admin page h... wordpress wordpress-plugin 2
1550029 � in base64 encoded emails <p>I have a problem with Swedish language + MS... php email outlook mime 2
1550030 � odd character <p>� Odd Character when request file via ajax ... html 2
In [35]:
df.shape
Out[35]:
(1550031, 4)
In [36]:
df.cnt_dup.value_counts()
Out[36]:
2    1272336
3     277575
4         90
5         25
6          5
Name: cnt_dup, dtype: int64
In [37]:
df["tag_count"] = df["Tags"].apply(lambda text: len(text.split(" ")))
In [41]:
df.tag_count.value_counts()
Out[41]:
3    443599
2    419311
4    291553
1    219854
5    175714
Name: tag_count, dtype: int64
In [45]:
sns.countplot(x='tag_count', data=df, palette='RdBu')
plt.title("Number of tags in the questions ")
plt.xlabel("Number of Tags")
plt.ylabel("Number of questions")
plt.show()

Observations:

  1. More than 1.5M distinct questions appear more than once; removing the duplicate rows will significantly reduce the size of the data.
  2. The data contains many unusual characters, including mis-encoded symbols and Chinese text.
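The cnt_dup distribution above also lets us sanity-check the size reduction with a quick back-of-the-envelope computation:

```python
# Sanity check: each duplicate group of size k collapses to 1 row on
# deduplication, removing k - 1 rows. Using the cnt_dup counts above:
dup_counts = {2: 1272336, 3: 277575, 4: 90, 5: 25, 6: 5}

rows_removed = sum((k - 1) * v for k, v in dup_counts.items())
print(rows_removed)             # 1827881
print(6034195 - rows_removed)   # 4206314 rows should remain
```

This agrees with the ~4.2M rows observed after removing duplicates later in the notebook.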

Removing Duplicates

In [153]:
#Creating a new database (with an empty table, for now) to hold the deduplicated data
'''
disk_dup = create_engine("sqlite:///train_no_dup.db")
question_tag = []
no_dup = pd.DataFrame(question_tag, columns=['Title', 'Body', 'Tags'])
no_dup.to_sql('no_dup_train',disk_dup, if_exists="append")
'''
Out[153]:
'\ndisk_dup = create_engine("sqlite:///train_no_dup.db")\nquestion_tag = []\nno_dup = pd.DataFrame(question_tag, columns=[\'Title\', \'Body\', \'Tags\'])\nno_dup.to_sql(\'no_dup_train\',disk_dup, if_exists="append")\n'
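The deduplication query itself is elided above. A minimal sketch of one way it could be done, reusing the GROUP BY idea from the duplicate-detection query, demonstrated on an in-memory SQLite database (the toy rows are invented; only the table and column names mirror the notebook's):

```python
# A toy, in-memory demonstration: GROUP BY (Title, Body, Tags) keeps one
# row per distinct question. Table/column names mirror the notebook's;
# the inserted rows are invented.
import sqlite3

con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE data (Title TEXT, Body TEXT, Tags TEXT)')
con.executemany('INSERT INTO data VALUES (?, ?, ?)', [
    ('q1', 'body1', 'php'),
    ('q1', 'body1', 'php'),      # exact duplicate
    ('q2', 'body2', 'java'),
])

con.execute('''CREATE TABLE no_dup_train AS
               SELECT Title, Body, Tags, COUNT(*) AS cnt_dup
               FROM data GROUP BY Title, Body, Tags''')
rows = con.execute('SELECT COUNT(*) FROM no_dup_train').fetchone()[0]
print(rows)  # 2: the duplicate pair has collapsed to one row
con.close()
```

On the real data the same statement, run against train.db, would produce the ~4.2M-row no_dup_train table used below.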

Analysis after removing duplicates

Tags

In [1]:
import sqlite3
import pandas as pd
#creating the connection with database file.
con = sqlite3.connect('train_no_dup.db')
no_dup_df = pd.read_sql_query("""SELECT Tags FROM no_dup_train""", con)
#Always remember to close the database
con.close()
In [2]:
#Observing the shape of our data frame.
no_dup_df.shape
# Note: We now have 4206315 data points.
Out[2]:
(4206315, 1)
In [3]:
# Let's again drop the first row: it is the CSV header, read in as data.
no_dup_df.drop(no_dup_df.index[0], inplace=True)
#Printing the first 5 rows of our data frame
no_dup_df.head()
Out[3]:
Tags
1 php image-processing file-upload upload mime-t...
2 firefox
3 r matlab machine-learning
4 c# url encoding
5 php api file-get-contents