Stack Overflow Tag Predictor
Python for Data Science Introduction
- Python, Anaconda and relevant packages installations
- Why learn Python?
- Keywords and identifiers
- Comments, indentation and statements
- Variables and data types in Python
- Standard Input and Output
- Control flow: if else
- Control flow: while loop
- Control flow: for loop
- Control flow: break and continue
Python for Data Science: Data Structures
Python for Data Science: Functions
Python for Data Science: Numpy
Python for Data Science: Matplotlib
Python for Data Science: Pandas
Python for Data Science: Computational Complexity
Plotting for exploratory data analysis (EDA)
Exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. A statistical model may or may not be used; the primary purpose of EDA is to see what the data can tell us beyond formal modeling or hypothesis testing.
- Introduction to IRIS dataset and 2D scatter plot
- 3D scatter plot
- Pair plots
- Limitations of pair plots
- Histogram and Introduction to PDF (Probability Density Function)
- Univariate Analysis using PDF
- CDF (Cumulative Distribution Function)
- Mean, Variance and Standard Deviation
- Percentiles and Quantiles
- Box-plot with Whiskers
- Violin Plots
- Summarizing Plots, Univariate, Bivariate and Multivariate analysis
- Multivariate Probability Density, Contour Plot
- Exercise: Perform EDA on Haberman dataset
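Several of the topics listed above (mean, variance, standard deviation, percentiles and quantiles, the box in a box plot, and the CDF) can be computed with Python's standard library alone. The sketch below is only an illustration: the measurements are made-up values, not taken from the IRIS or Haberman datasets, and `empirical_cdf` is a hypothetical helper name.

```python
import statistics

# Hypothetical measurements (illustrative values, not from IRIS or Haberman).
values = [4.9, 5.1, 5.4, 5.8, 6.0, 6.3, 6.7, 7.1]

mean = statistics.mean(values)
variance = statistics.pvariance(values)  # population variance
std_dev = statistics.pstdev(values)      # population standard deviation
median = statistics.median(values)

# quantiles(n=4) returns the three quartile cut points: Q1, Q2 (median), Q3.
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1  # interquartile range: the height of the box in a box plot


def empirical_cdf(data, x):
    """Fraction of observations that are less than or equal to x."""
    return sum(1 for v in data if v <= x) / len(data)


print(f"mean={mean:.3f} variance={variance:.3f} std={std_dev:.3f}")
print(f"median={median:.3f} Q1={q1:.3f} Q3={q3:.3f} IQR={iqr:.3f}")
print(f"CDF(6.0)={empirical_cdf(values, 6.0):.3f}")
```

The same quantities are what a histogram, a box plot, or a CDF plot would display graphically, so computing them by hand first can make the plots easier to read.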
Linear algebra gives you the tools to work with the other areas of mathematics that machine learning relies on, and to build better intuition for machine learning algorithms.
- Why learn it?
- Introduction to Vectors (2-D, 3-D, n-D), Row Vector and Column Vector
- Dot Product and Angle between 2 Vectors
- Projection and Unit Vector
- Equation of a Line (2-D), Plane (3-D) and Hyperplane (n-D), Plane Passing through Origin, Normal to a Plane
- Distance of a point from a Plane/Hyperplane, Half-Spaces
- Equation of a Circle (2-D), Sphere (3-D) and Hypersphere (n-D)
- Equation of an Ellipse (2-D), Ellipsoid (3-D) and Hyperellipsoid (n-D)
- Hypercube, Hypercuboid
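A few of the topics listed above (dot product, angle between two vectors, unit vector, projection, and the distance of a point from a hyperplane) reduce to short formulas. The following is a minimal sketch using plain Python lists; the function names and the example vectors are illustrative assumptions, not part of the course material.

```python
import math


def dot(u, v):
    """Dot product of two vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    """Euclidean length of a vector."""
    return math.sqrt(dot(u, u))


def angle_between(u, v):
    """Angle in radians, from cos(theta) = (u . v) / (|u| |v|)."""
    cos_theta = dot(u, v) / (norm(u) * norm(v))
    # Clamp to [-1, 1] to guard against floating-point rounding.
    return math.acos(max(-1.0, min(1.0, cos_theta)))


def unit_vector(u):
    """Vector of length 1 pointing in the same direction as u."""
    length = norm(u)
    return [a / length for a in u]


def projection_length(u, v):
    """Signed length of the projection of u onto v."""
    return dot(u, v) / norm(v)


def signed_distance(w, b, x):
    """Signed distance of point x from the hyperplane w . x + b = 0."""
    return (dot(w, x) + b) / norm(w)


# Hypothetical example vectors.
u = [1.0, 0.0]
v = [1.0, 1.0]
theta = angle_between(u, v)

# Hyperplane 3x + 4y - 5 = 0 with normal vector w.
w = [3.0, 4.0]
b = -5.0
d_point = signed_distance(w, b, [3.0, 4.0])
d_origin = signed_distance(w, b, [0.0, 0.0])
print(math.degrees(theta), d_point, d_origin)
```

The sign of `signed_distance` indicates which half-space a point lies in: points on the side the normal `w` points toward get a positive value, and points on the other side get a negative one.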
Probability and Statistics
Dimensionality Reduction and Visualization
In machine learning and statistics, dimensionality reduction (or dimension reduction) is the process of reducing the number of random variables under consideration by obtaining a set of principal variables. Approaches can be divided into feature selection and feature extraction.
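As a small illustration of feature extraction, the sketch below reduces 2-D points to 1-D by projecting them onto the direction of largest variance, which is the idea behind principal component analysis. It is a hand-rolled version for the 2-D case only, assuming a closed-form eigenvalue of the 2x2 covariance matrix; the function name and the data points are hypothetical, and a real project would normally use a library implementation.

```python
import math
import statistics


def first_principal_component(points):
    """Return (direction, explained_variance, projections) for 2-D points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)

    # Center the data so that each feature has mean zero.
    cx = [x - mean_x for x in xs]
    cy = [y - mean_y for y in ys]
    n = len(points)

    # Sample covariance matrix [[a, b], [b, c]].
    a = sum(x * x for x in cx) / (n - 1)
    c = sum(y * y for y in cy) / (n - 1)
    b = sum(x * y for x, y in zip(cx, cy)) / (n - 1)

    # Largest eigenvalue of a symmetric 2x2 matrix, in closed form.
    lam = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b ** 2)

    # Matching eigenvector; fall back to an axis when the features are uncorrelated.
    if abs(b) > 1e-12:
        vx, vy = b, lam - a
    elif a >= c:
        vx, vy = 1.0, 0.0
    else:
        vx, vy = 0.0, 1.0
    length = math.hypot(vx, vy)
    direction = (vx / length, vy / length)

    # The new 1-D feature: coordinate of each centered point along the direction.
    projections = [x * direction[0] + y * direction[1] for x, y in zip(cx, cy)]
    return direction, lam, projections


# Hypothetical, roughly linear data.
points = [(1.0, 1.2), (2.0, 1.9), (3.0, 3.1), (4.0, 3.8)]
direction, explained_variance, projections = first_principal_component(points)
print(direction, explained_variance, projections)
```

Feature selection, by contrast, would keep one of the original columns unchanged and drop the other instead of building a new combined feature.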
Principal Component Analysis (PCA)
t-distributed stochastic neighbor embedding (t-SNE)
Real world problem: Predict rating given product reviews on Amazon
Classification And Regression Models: K-Nearest Neighbors
Classification algorithms in various situations
Performance measurement of models
Solving optimization problems: Stochastic Gradient Descent
Stack Overflow Tag Prediction
Stack Overflow is the largest, most trusted online community for developers to learn, share their programming knowledge, and build their careers.
Nearly every programmer uses Stack Overflow in one way or another. Each month, over 50 million developers visit the site, which features questions and answers on a wide range of topics in computer programming.
Statement (multilabel classification): A tag is a word or phrase that describes the topic of a question. Every question should have at least one tag and can have up to five. Tags can be newly created by the user (if the user has a reputation above 1500) or chosen from the list of tags available on the site. Tags help experts find relevant questions that they can answer, and they help users find questions that are relevant or interesting to them. Given the huge number of tags, it can be difficult for users to search manually for appropriate tags while posting a question. Also, only users with a good reputation can add new tags, which limits ordinary users from suggesting new ones.
Because searching for the correct tags is often cumbersome, it may be useful to have an auto-tagging system that suggests tags to users based on the content of the question.
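In a multilabel setting, each question maps to a set of tags rather than to a single class, so the labels are commonly encoded as one 0/1 indicator per tag. The sketch below shows that encoding on made-up examples. It assumes that a question's tags arrive as a single space-separated string, which the description above does not specify, and the function names are hypothetical.

```python
def build_vocabulary(tag_strings):
    """Sorted list of the unique tags seen across all questions."""
    return sorted({tag for s in tag_strings for tag in s.split()})


def to_indicator_rows(tag_strings, vocabulary):
    """One row per question; row[i] is 1 if the question has vocabulary[i]."""
    index = {tag: i for i, tag in enumerate(vocabulary)}
    rows = []
    for s in tag_strings:
        row = [0] * len(vocabulary)
        for tag in s.split():
            row[index[tag]] = 1
        rows.append(row)
    return rows


# Hypothetical tag strings (assumed space-separated), one per question.
question_tags = ["python pandas", "java", "python numpy pandas"]

vocabulary = build_vocabulary(question_tags)
labels = to_indicator_rows(question_tags, vocabulary)

for tags, row in zip(question_tags, labels):
    print(f"{tags!r:25} -> {row}")
```

With this representation, an auto-tagging model predicts one yes/no decision per tag, and the rule of one to five tags per question becomes a constraint on how many entries in a row may be 1.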
Data Type: CSV files
train.csv (Id, title, body, tags)
Test.csv (id, title, body)
Data Size: 10 GB
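Files of this size usually do not fit in memory, so one option is to stream the CSV row by row instead of loading it whole. The sketch below counts tag frequencies with the standard `csv` module. It uses a small in-memory stand-in for the training file, with the column names listed above, and assumes space-separated tags; the sample rows and the `count_tags` helper are hypothetical.

```python
import csv
import io
from collections import Counter


def count_tags(rows):
    """Count tag occurrences from an iterable of dict rows with a 'tags' key."""
    counts = Counter()
    for row in rows:
        counts.update(row["tags"].split())
    return counts


# Small in-memory stand-in for the training file (made-up rows).
sample = io.StringIO(
    "Id,title,body,tags\n"
    "1,How do I merge two data frames,Some question text,python pandas\n"
    "2,What is a NullPointerException,Some question text,java\n"
    "3,How do I reshape an array,Some question text,python numpy\n"
)

# DictReader yields one row at a time, so memory use stays small.
reader = csv.DictReader(sample)
tag_counts = count_tags(reader)
print(tag_counts.most_common(2))
```

For the real data, the `io.StringIO` object would be replaced by a file opened with `open(path, newline="", encoding="utf-8")`, where `path` points at the downloaded training file; the encoding is an assumption.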
We are building our course content and teaching methodology to cater to the needs of students at various levels of expertise and with varying background skills. This course can be taken by anyone with a working knowledge of a modern programming language such as C, C++, Java, or Python. We expect the average student to spend at least 5 hours a week over a 6-month period, amounting to 145+ hours of effort. The more the effort, the better the results. Here is a list of people who would benefit from our course:
- Undergrad (BS/BTech/BE) students in engineering and science.
- Grad (MS/MTech/ME/MCA) students in engineering and science.
- Working professionals: Software engineers, Business analysts, Product managers, Program managers, Managers, Startup teams building ML products/services.
- Lectures: 269
- Quizzes: 0
- Duration: 70+ hours
- Skill level: All levels
- Language: English
- Students: 2
- Assessments: Yes