Why should we learn distributions?

Given $n$ observations of real-estate property prices, let H be the random variable with $$h_i \in \mathbb{R}$$ Let the average property price be Rs. 5000000, i.e. $$\mu = 5000000$$ and the standard deviation of the data be Rs. 1500000, i.e. $$ \sigma = 1500000 $$

If the random variable H follows a normal distribution, i.e. $$ H \sim N(\mu = 50L,\ \sigma = 15L) $$

then the distribution curve looks like this:

In [1]:
#plotting the normal distribution curve for H ~ N(mu=50, sigma=15), prices in lakhs
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import norm

mu = 50       # mean price (lakhs)
sigma = 15    # standard deviation (lakhs)

x = np.linspace(mu - 3*sigma, mu + 3*sigma, 200)
plt.plot(x, norm.pdf(x, mu, sigma))
plt.xlabel("price (lakhs)")
plt.ylabel("probability density")
plt.show()

As we know, for normally distributed data, $$ \approx 68\%\ of\ the\ distribution\ lies\ between\ \mu-\sigma\ and\ \mu+\sigma\ (i.e.\ 35L\ to\ 65L) $$ $$ \approx 95\%\ of\ the\ distribution\ lies\ between\ \mu-2\sigma\ and\ \mu+2\sigma\ (i.e.\ 20L\ to\ 80L) $$ $$ \approx 99.7\%\ of\ the\ distribution\ lies\ between\ \mu-3\sigma\ and\ \mu+3\sigma\ (i.e.\ 5L\ to\ 95L) $$
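As a quick check (a minimal sketch using scipy.stats, under the same assumption H ~ N(50, 15) with prices in lakhs), these coverage percentages can be computed directly from the normal CDF:

In [ ]:
# verifying the 68-95-99.7 rule for H ~ N(mu=50, sigma=15), values in lakhs
from scipy.stats import norm

mu, sigma = 50, 15
for k in (1, 2, 3):
    lo, hi = mu - k * sigma, mu + k * sigma
    coverage = norm.cdf(hi, mu, sigma) - norm.cdf(lo, mu, sigma)
    print("within mu +/- %d*sigma (%.0fL to %.0fL): %.1f%%" % (k, lo, hi, coverage * 100))
# within mu +/- 1*sigma (35L to 65L): 68.3%
# within mu +/- 2*sigma (20L to 80L): 95.4%
# within mu +/- 3*sigma (5L to 95L): 99.7%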

Note: if we know that the given data follows one of the known distributions, then it is easy to describe the characteristics of the data with only a few parameters (i.e. in the above example we do not know a single individual price, yet we can still say a lot about the data).

But there is a problem if we use the mean and standard deviation as parameters: corrupted or wrongly entered values directly affect the mean. For example, consider a sample of 6 house prices:

$$ h_1 = 2L,\ h_2 = 23L,\ h_3 = 49L,\ h_4 = 51L,\ h_5 = 75L,\ h_6 = 1010L(i.e \ corrupted\ the\ original\ h_6=100L) $$

$$ mean(\mu) \ = \ \frac{2+23+49+51+75+1010}{6} $$

$$ mean(\mu)\ = 201.667L \ (i.e.\ this\ value\ tells\ us\ nothing\ about\ the\ data) $$

So, when outliers or heavily skewed values are possible, the Mean and Variance must be used with care.
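To see this numerically, here is a minimal sketch with numpy using the six sample prices above (in lakhs); it shows how the corrupted $h_6$ drags the mean far away while the median barely moves:

In [ ]:
# the corrupted h6 drags the mean far from the data, but not the median
import numpy as np

prices = np.array([2, 23, 49, 51, 75, 1010])        # h6 corrupted (original was 100L)
original = np.array([2, 23, 49, 51, 75, 100])

print("mean   (corrupted): %.3f" % prices.mean())        # ~201.667
print("mean   (original) : %.3f" % original.mean())      # 50.0
print("median (corrupted): %.1f" % np.median(prices))    # 50.0
print("median (original) : %.1f" % np.median(original))  # 50.0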

  1. Let us consider another example: suppose we have a list of 10000 property prices for the random variable H and we are told that the data is normally distributed. What are the mean and variance?

given, $$ H \sim N(\mu_h,\ \sigma_h\ ) $$ $$ h_i \in \mathbb{R} $$

approach-1: $$ \mu_h = \frac{\sum_{i=1}^{n} h_i}{n},\ \ n = 10000 $$ this is sensitive to outliers and heavily skewed values.

approach-2: Central Limit Theorem. Randomly pick 100 sample property prices and calculate their mean; call it $\mu_1$. Randomly pick another 100 sample property prices and calculate their mean, $\mu_2$, and so on, up to $\mu_n$. Then calculate the mean of all these means: $$ \mu_h = \frac{\sum_{i=1}^n \mu_i}{n} $$

Note: in this case, if 20% of the data gets corrupted, $\mu_h$ is corrupted as well; as the number of outliers grows, this CLT-based estimate also fails.
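A small simulation of approach-2 (a sketch only: the 10000 prices are simulated here as draws from N(50, 15), and the sample size of 100 follows the description above):

In [ ]:
# approach-2 sketch: the average of many 100-sample means approximates mu_h
import numpy as np

rng = np.random.RandomState(0)
H = rng.normal(loc=50, scale=15, size=10000)   # simulated property prices (lakhs)

sample_means = [rng.choice(H, size=100).mean() for _ in range(200)]
mu_h = np.mean(sample_means)
print("estimated mu_h: %.2f" % mu_h)           # close to 50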

approach-3: Considering the median $h_{n/2}$ (i.e. the middle value after all the $h_i$'s are sorted). If our distribution is Gaussian, then Mean($\mu_h$) = Median($h_i$). Let us follow an incremental approach to estimate the spread around the median: $$ \mu_h - 1L\ \ to\ \ \mu_h + 1L \Rightarrow X_1\%\ of\ the\ data $$ $$ \mu_h - 2L\ \ to\ \ \mu_h + 2L \Rightarrow X_2\% $$ $$ \mu_h - 3L\ \ to\ \ \mu_h + 3L \Rightarrow X_3\% $$ $$ \vdots $$ $$ \mu_h - n \cdot 1L\ \ to\ \ \mu_h + n \cdot 1L \Rightarrow X_n\%\ \ (stop\ when\ X_n \approx 68\%) $$ $$ \Rightarrow n\ lakh\ is\ the\ standard\ deviation\ \sigma_h $$
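A rough sketch of approach-3 (estimate_sigma is a hypothetical helper, and the data is again simulated from N(50, 15)): grow the window around the median by 1L at a time until it covers about 68% of the data.

In [ ]:
# approach-3 sketch: estimate sigma by growing a window around the median
import numpy as np

def estimate_sigma(prices, step=1.0, target=0.68):
    med = np.median(prices)        # for a Gaussian, mean == median
    n = 0
    covered = 0.0
    while covered < target:
        n += 1
        window = (prices >= med - n * step) & (prices <= med + n * step)
        covered = window.mean()    # fraction of data inside the window
    return n * step                # ~ one standard deviation

rng = np.random.RandomState(1)
H = rng.normal(50, 15, size=10000)
print("estimated sigma: %.0fL" % estimate_sigma(H))   # ~15L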

Note: if our distribution looks almost like a Gaussian distribution but is somewhat right-skewed, then the region where our distribution's right tail departs from the Gaussian curve is where the outliers or extreme data occur.

Expectation:

  • Consider an example of the probability distribution of a discrete random variable X:

    x    : 0     1     2
    p(x) : 0.16  0.48  0.36

Here X represents the random variable, "x" represents the values that the random variable takes on, and p(x) is the probability assigned to each value x of X.

Expected value of a Discrete Random Variable: given a large number of repeated trials, the average of the results will be approximately equal to the expected value, written E(X). $$ E(X)\ = \sum_{i=1}^n x_i \cdot p(x_i) $$ $$ x_i\ =\ value\ of\ the\ i^{th}\ outcome $$ $$ p(x_i)\ =\ probability\ of\ the\ i^{th}\ outcome $$ In other words, E(X) is the Mean($\mu$) of the Random Variable.

$$ E(X) = 0 \cdot 0.16\ +\ 1 \cdot 0.48\ +\ 2 \cdot 0.36 $$ $$ E(X)\ =\ 1.2 $$

Note: the result of E(X) need not be one of the given values $x_i$; it is the theoretical Mean of the Random Variable X.

If we need to calculate the expectation of a function g(X), then $$ E(g(X))\ =\ \sum_{i=1}^n g(x_i) \cdot p(x_i) $$
Here we take the values of the function g(x) and multiply each by its individual probability.
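The same calculations in code (a minimal sketch; the choice g(x) = x^2 is just for illustration):

In [ ]:
# expectation of a discrete random variable and of a function of it
import numpy as np

x = np.array([0, 1, 2])
p = np.array([0.16, 0.48, 0.36])

E_X = np.sum(x * p)
print("E(X)    = %.2f" % E_X)      # 1.20

def g(v):
    return v ** 2                  # illustrative choice g(x) = x^2

E_gX = np.sum(g(x) * p)
print("E(g(X)) = %.2f" % E_gX)     # 0*0.16 + 1*0.48 + 4*0.36 = 1.92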

Homework: prove $ E((X\ -\ \mu)^2) =\ \sigma^2 $.

  • Consider another example of two random variables: heights (inches) and weights (pounds). Plot the data using a 2D scatter plot.
In [2]:
#given heights in inches and weights in pounds of a sample of 200 people from birth to 18 years old
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt


#Load heights.csv into a pandas dataFrame.
h_w = pd.read_csv("heights.csv")
In [3]:
# (Q) how many data-points and features are there?
print (h_w.shape)
(200, 2)
In [4]:
#(Q) What are the column names in our dataset?
print (h_w.columns)
Index([u'height(inches)', u'weight(pounds)'], dtype='object')
In [5]:
#2-D scatter plot:
#ALWAYS understand the axis: labels and scale.

h_w.plot(kind='scatter', x='height(inches)', y='weight(pounds)');
plt.show()

If we observe the scatter plot, one axis is measured in "inches" and the other in "pounds". These two measurements are on entirely different scales, so they are not directly comparable.

In [6]:
# therefore we transform the data column-wise so that each value fits in the range 0 to 1 (min-max scaling)
from sklearn import preprocessing
import pandas as pd

x = h_w.values #returns a numpy array
min_max_scaler = preprocessing.MinMaxScaler() # per column j: x[i][j] = (x[i][j] - col_min[j]) / (col_max[j] - col_min[j])
x_scaled = min_max_scaler.fit_transform(x)
h_w = pd.DataFrame(x_scaled)
In [7]:
#check the values whether they are transformed or not.
h_w.head()
Out[7]:
          0         1
0  0.224451  0.247134
1  0.772684  0.632001
2  0.570201  0.902882
3  0.457498  0.727809
4  0.416428  0.759908
In [8]:
h_w.columns
Out[8]:
RangeIndex(start=0, stop=2, step=1)
In [9]:
#Now we have two columns named 0 and 1 (after scaling, the original feature names were dropped; both are now on a 0-to-1 scale).
h_w.plot(kind='scatter', x=0, y=1);
plt.show()

If "D" is our original DataSet After the transformation of our DataSet. Let us consider it as the $D_N$. $$ i.e\ D\ \sim\ D_N $$ $$ let\ us\ say\ the\ Avg_N^0\ as\ the\ new\ Mean\ of\ 0^{th}\ column.\ Avg_N^0\ =\ 0.5 $$ $$ and\ Avg_N^1\ as\ the\ new\ Mean\ of\ 1^{st}\ column.\ Avg_N^1\ =\ 0.5 $$

The transformation of data from different scales to a common scale is referred to as "Data Normalization".

Important Observations about Data Normalization:

  1. Plots of D and $D_N$ look similar because the relative distances (dissimilarities) between points are almost maintained.
  2. There is a major disadvantage with this method: if a new extreme data point (i.e. a new maximum for any feature) arrives in the future, we need to rescale all the values and re-plot, as the sketch after this list shows.
  3. The scale only ranges between 0 and 1.
  4. It makes the data dimensionless, and knowledge of the location and scale of the original data may be lost (i.e. you can see that in the 2D scatter plot of D transformed to $D_N$, the features are renamed to just 0 and 1).
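A small sketch of observation 2 (the numbers here are made up for illustration): when a new maximum arrives, the scaler has to be refit, and every previously scaled value changes.

In [ ]:
# adding a new maximum forces a refit: all previously scaled values shift
import numpy as np
from sklearn import preprocessing

X = np.array([[10.0], [20.0], [30.0]])         # made-up single-feature data
scaler = preprocessing.MinMaxScaler()
print(scaler.fit_transform(X).ravel())         # [0.  0.5 1. ]

X_new = np.vstack([X, [[100.0]]])              # a new extreme point arrives
print(scaler.fit_transform(X_new).ravel())     # old values shift: [0. 0.111 0.222 1.]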

Data Standardization:

  1. Standardizing a dataset involves rescaling the distribution of values so that the Mean($\mu$) of observed values is 0 and the standard-deviation($\sigma$) is 1.
  2. Standardization assumes that your observations fit a Gaussian distribution(i.e bell curve) with a well behaved Mean($\mu$) and standard deviation($\sigma$).
  3. Standardization requires that you know or are able to accurately estimate the Mean and standard-deviation of observable values. You may be able to estimate these values from your training data.

Considering the Height in our given distribution D, we can standardize it as follows: $$ Height_s = \frac {h_i - \mu_{heightD}}{\sigma_{heightD}} $$

Similarly, the Weight is calculated as follows: $$ Weight_s = \frac {w_i - \mu_{weightD}}{\sigma_{weightD}} $$

In [10]:
# here we rescale each column to have mean 0 and standard deviation 1 (standardization)
from sklearn import preprocessing
import numpy as np
import pandas as pd
h_w = pd.read_csv("heights.csv")
X = h_w.values #returns a numpy array
X = preprocessing.scale(X)
h_w = pd.DataFrame(X)
h_w.head()
Out[10]:
          0         1
0 -1.121051 -1.192853
1  1.844583  0.776803
2  0.749262  2.163105
3  0.139602  1.267121
4 -0.082562  1.431399
In [11]:
import matplotlib.pyplot as plt
h_w.plot(kind='scatter', x=0, y=1);
plt.show()

Important Observations about Data Standardization:

  1. Plots of D and $D_S$ look similar because the relative distances (dissimilarities) between points are almost maintained.
  2. It makes the data dimensionless, and knowledge of the location and scale of the original data may be lost (i.e. you can see that in the 2D scatter plot of D transformed to $D_S$, the features are renamed to just 0 and 1).
  3. $\mu =\ 0$ and $\sigma^2 =\ 1$ (verified in the check below).
  4. There is no problem even if we add extreme data points (i.e. the plot can simply extend).
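We can verify observation 3 directly (a minimal check on heights.csv using preprocessing.scale, as in the In [10] cell):

In [ ]:
# after standardization each column should have mean ~0 and std ~1
from sklearn import preprocessing
import pandas as pd

h_w = pd.read_csv("heights.csv")
X = preprocessing.scale(h_w.values)
print(X.mean(axis=0))   # ~ [0. 0.]
print(X.std(axis=0))    # ~ [1. 1.]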