Stack Overflow is the largest, most trusted online community for developers to learn, share their programming knowledge, and build their careers.
Stack Overflow is something that nearly every programmer uses in one way or another. Each month, over 50 million developers come to Stack Overflow to learn, share their knowledge, and build their careers. It features questions and answers on a wide range of topics in computer programming. The website serves as a platform for users to ask and answer questions and, through membership and active participation, to vote questions and answers up or down and to edit them in a fashion similar to a wiki or Digg. As of April 2014, Stack Overflow had over 4,000,000 registered users, and it exceeded 10,000,000 questions in late August 2015. Based on the tags assigned to questions, the top eight most discussed topics on the site are: Java, JavaScript, C#, PHP, Android, jQuery, Python and HTML.
All of the data is in two files: Train.csv and Test.csv.
Train.csv contains 4 columns: Id, Title, Body, Tags. The questions are randomized and contain a mix of verbose text sites as well as sites related to math and programming. The number of questions from each site may vary, and no filtering (such as removing closed questions) has been performed.
Test.csv contains the same columns but without the Tags, which you are to predict.
Size of Train.csv - 6.75GB
Size of Test.csv - 2GB
Number of rows in Train.csv = 6034195
The columns in the table are:
Id - Unique identifier for each question
Title - The question's title
Body - The body of the question
Tags - The tags associated with the question (all lowercase; they should not contain tabs '\t' or ampersands '&')
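The stated tag format can be sanity-checked with a short helper. This is a minimal sketch: 'sample_tags' is hypothetical illustrative data, not rows read from Train.csv.

```python
# A minimal sketch for checking the stated tag format; 'sample_tags' is
# hypothetical illustrative data, not taken from Train.csv.
sample_tags = ["php image-processing", "c# winforms type-conversion"]

def tags_are_clean(tag_string):
    """True if a space-separated tag string is lowercase with no tabs or ampersands."""
    return (tag_string == tag_string.lower()
            and '\t' not in tag_string
            and '&' not in tag_string)

print(all(tags_are_clean(t) for t in sample_tags))  # → True
```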
We will analyse the data and plot the required graphs to show that these conclusions are true:
a. Presence of redundant information.
b. Questions consist of text and code.
c. More than 200 tags were used in at least 10,000 questions.
d. More than 33M unique words are present in the data.
e. The majority of the most frequent tags are programming languages.
f. More than 50% of questions contain a <code> tag.
We are also interested in finding:
a. Total number of questions
b. Average question size
c. Minimum and maximum number of tags per question
d. Average number of tags per question
e. Number of unique tags
f. Top 20 tags
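Claim (f) above can be measured by scanning the Body column for a `<code>` tag. The snippet below is a toy sketch over hypothetical stand-in bodies rather than the real 6M-row column:

```python
# Toy sketch of claim (f): what fraction of question bodies contain '<code>'.
# 'bodies' is hypothetical stand-in data for the Body column of Train.csv.
bodies = [
    "<p>How do I sort a list?</p><code>sorted(x)</code>",
    "<p>What is a monad?</p>",
    "<code>int main() {}</code><p>Why does this crash?</p>",
]
with_code = sum('<code>' in body for body in bodies)
print("{:.1%} of questions contain <code>".format(with_code / len(bodies)))  # → 66.7%
```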
#Let's import pandas to load the csv file.
import pandas as pd
#Since the file is too large, we will break it into chunks.
df = pd.read_csv("Train.csv", chunksize=500000)
#Observe that 'df' is a 'TextFileReader' object.
df
#In order to get the data frame, we have to iterate through 'df' chunk by chunk
df.get_chunk(5)
We won't be using the above method to load the data; using SQLite seems more appropriate.
#Creating db file from csv
'''
from sqlalchemy import create_engine  # database connection
import datetime as dt

disk_engine = create_engine('sqlite:///train.db')
start = dt.datetime.now()
chunksize = 180000
j = 0
index_start = 1
for df in pd.read_csv('train.csv', names=['Id', 'Title', 'Body', 'Tags'],
                      chunksize=chunksize, iterator=True, encoding='utf-8'):
    df.index += index_start
    j += 1
    print('{} rows'.format(j * chunksize))
    df.to_sql('data', disk_engine, if_exists='append')
    index_start = df.index[-1] + 1
'''
#This method seems more appropriate to work with this much data.
#importing sqlite3
import sqlite3
#creating the connection with database file.
con = sqlite3.connect('train.db')
data = pd.read_sql_query("""SELECT Tags FROM data""", con)
#Always remember to close the database connection.
con.close()
#Printing the first 5 rows of our data frame.
data.head()
#Observing the shape of our data frame.
data.shape
# Note: We have 6034195 data points.
# Let's now drop the first row (it holds the CSV header, not actual data).
data.drop(data.index[0], inplace=True)
#Printing the first 5 rows of our data frame.
data.head()
#Observing the labels of each column.
data.keys()
#Let's check for missing values.
data.info()
# Note: this is not an exact observation,
# but it is clear that the number of unique tags
# would be much smaller than the total number of data points.
data.describe()
Observation:
# Importing & Initializing the "CountVectorizer" object, which
#is scikit-learn's bag of words tool.
from sklearn.feature_extraction.text import CountVectorizer
#The custom tokenizer splits each tag string on spaces.
vectorizer = CountVectorizer(tokenizer = lambda x: x.split())
# fit_transform() does two functions: First, it fits the model
# and learns the vocabulary; second, it transforms our training data
# into feature vectors. The input to fit_transform should be a list of
# strings.
tag_dtm = vectorizer.fit_transform(data['Tags'])
#'get_feature_names()' gives us the vocabulary.
tags = vectorizer.get_feature_names()
#Printing the number of unique tags.
print("Number of unique tags = %d" % len(tags))
#Let's look at the first few tags.
print(tags[:10])
Observation:
We have 42084 unique tags.
#Let's now store the document term matrix in a dictionary.
terms = vectorizer.get_feature_names()
freqs = tag_dtm.sum(axis=0).A1
result = dict(zip(terms, freqs))
#Saving this dictionary to csv files.
import csv
with open('tag_counts_dict_dtm.csv', 'w', newline='') as csv_file:
writer = csv.writer(csv_file)
for key, value in result.items():
writer.writerow([key, value])
#Reading the tag_counts_dict_dtm.csv to pandas df
tag_df = pd.read_csv("tag_counts_dict_dtm.csv", names=['Tags', 'Counts'])
tag_df.head()
tag_df[tag_df.Counts>10000].head()
#More than 200 tags were used in at least 10,000 questions
# Store tags greater than 10K in one list
lst_tags_gt_10k = tag_df[tag_df.Counts>10000].Tags
#Print the length of the list
print('{} tags are used more than 10,000 times'.format(len(lst_tags_gt_10k)))
# Store tags greater than 100K in one list
lst_tags_gt_100k = tag_df[tag_df.Counts>100000].Tags
#Print the length of the list.
print('{} tags are used more than 100,000 times'.format(len(lst_tags_gt_100k)))
Observations: More than 200 tags are used in at least 10,000 questions.
#Storing the count of unique tag in list 'tag_count'
tag_count = tag_dtm.sum(axis=0).tolist()[0]
#Converting each value in the 'tag_count' to integer.
tag_count = [int(j) for j in tag_count]
print('We have {} unique tags'.format(len(tag_count)))
#Storing the count of tag in each question in list 'tag_count'
tag_quest_count = tag_dtm.sum(axis=1).tolist()
#Converting each value in the 'tag_quest_count' to integer.
tag_quest_count=[int(j) for i in tag_quest_count for j in i]
print('We have a total of {} datapoints.'.format(len(tag_quest_count)))
print(tag_quest_count[:5])
print("Maximum number of tags per question: %d" % max(tag_quest_count))
print("Minimum number of tags per question: %d" % min(tag_quest_count))
print("Avg. number of tags per question: %f" % (sum(tag_quest_count) * 1.0 / len(tag_quest_count)))
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
sns.countplot(x=tag_quest_count, palette='RdBu')
plt.title("Number of tags in the questions ")
plt.xlabel("Number of Tags")
plt.ylabel("Number of questions")
plt.show()
Observations:
# Plotting the word cloud
import matplotlib.pyplot as plt
from wordcloud import WordCloud
# 'generate_from_frequencies' expects a dictionary mapping tag -> count,
# so we can pass the 'result' dictionary directly.
#Initializing WordCloud using frequencies of tags.
wordcloud = WordCloud(background_color='black',
                      width=1600,
                      height=800,
                      ).generate_from_frequencies(result)
fig = plt.figure(figsize=(30,20))
plt.imshow(wordcloud)
plt.axis('off')
plt.tight_layout(pad=0)
fig.savefig("tag.png")
plt.show()
Observations:
A look at the word cloud shows that "c#", "java", "php", "asp.net", "javascript", "c++" are some of the most frequent tags.
#Sorting the tags based on counts.
tag_df.sort_values(['Counts'], ascending=False, inplace=True)
tag_df.head(20)
# Output: the 20 most frequent tags
i=np.arange(20)
tag_df.head(20).plot(kind='bar')
plt.title('Frequency of top 20 tags')
plt.xticks(i, tag_df['Tags'][:20])
plt.xlabel('Tags')
plt.ylabel('Counts')
plt.show()
Observations:
con = sqlite3.connect('train.db')
df = pd.read_sql_query('SELECT Title, Body, Tags, COUNT(*) as cnt_dup FROM data GROUP BY Title, Body, Tags Having COUNT(*)>1', con)
con.close()
df.head()
df.tail()
df.shape
df.cnt_dup.value_counts()
df["tag_count"] = df["Tags"].apply(lambda text: len(text.split(" ")))
df.tag_count.value_counts()
sns.countplot(x='tag_count', data=df, palette='RdBu')
plt.title("Number of tags in the questions ")
plt.xlabel("Number of Tags")
plt.ylabel("Number of questions")
plt.show()
Observations:
#Creating a new database with no duplicates
'''
disk_dup = create_engine("sqlite:///train_no_dup.db")
question_tag = []
no_dup = pd.DataFrame(question_tag, columns=['Title', 'Body', 'Tags'])
no_dup.to_sql('no_dup_train',disk_dup, if_exists="append")
'''
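The step that actually fills 'no_dup_train' is elided in the commented block above. One plausible sketch (an assumption, not the original code) reuses the GROUP BY query from the duplicate analysis, so only one copy of each (Title, Body, Tags) triple is written:

```python
# Hedged sketch of the elided population step: keep one row per
# (Title, Body, Tags) group from 'data' and append it to 'no_dup_train'.
# 'write_deduplicated' is a helper name introduced here, not original code.
import sqlite3

import pandas as pd


def write_deduplicated(src_db, dst_db):
    """Copy one row per (Title, Body, Tags) group from src_db into dst_db."""
    con = sqlite3.connect(src_db)
    no_dup = pd.read_sql_query(
        "SELECT Title, Body, Tags, COUNT(*) AS cnt_dup "
        "FROM data GROUP BY Title, Body, Tags", con)
    con.close()
    con = sqlite3.connect(dst_db)
    no_dup.to_sql('no_dup_train', con, if_exists='append', index=False)
    con.close()
```

Keeping the cnt_dup column alongside Title, Body and Tags preserves how many times each question appeared, while the later `SELECT Tags FROM no_dup_train` query is unaffected.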
import sqlite3
import pandas as pd
#creating the connection with database file.
con = sqlite3.connect('train_no_dup.db')
no_dup_df = pd.read_sql_query("""SELECT Tags FROM no_dup_train""", con)
#Always remember to close the database connection.
con.close()
#Observing the shape of our data frame.
no_dup_df.shape
# Note: We now have 4206315 data points.
# Let's now drop the first row (it holds the header, not actual data).
no_dup_df.drop(no_dup_df.index[0], inplace=True)
#Printing the first 5 rows of our data frame.
no_dup_df.head()