Normal Distribution#

The Normal Distribution is the foundation of inferential statistics. It represents the ideal population distribution for a sample that is approximately normal.

Normality#

Introduction#

Normality arises when observations randomly drawn from a Population are independent and identically distributed. In other words, if a series of experiments is performed where each experiment is the same as the last in every respect, then the outcomes of all the experiments taken together should be approximately normal.

Important

Independence and Identically Distributed are mathematical concepts with precise definitions. We will talk more about them in the section on probability.

In order to explain the origin of normality, it is instructive to consider a simple example. Consider the experiment of rolling a single die. Think about what the ideal relative frequency distribution for this experiment should look like. A die has six sides and each one is equally likely. If we let \mathcal{X} represent the outcome of rolling a single die, we can express the relation of all outcomes being equally likely with the following equation,

P(\mathcal{X}=1) = P(\mathcal{X}=2) = P(\mathcal{X}=3) = P(\mathcal{X}=4) = P(\mathcal{X}=5) = P(\mathcal{X}=6)

To say the same thing in a different way, the probability of all outcomes should be the same,

P(\mathcal{X}=i) = p, \qquad i = 1, 2, 3, 4, 5, 6

The ideal histogram (in other words, the distribution of the population) would look perfectly uniform,


../../_images/04_ex01_die_roll.png
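This uniform ideal can be checked empirically. The following is a minimal sketch using Python's built-in random module (the roll count and seed are arbitrary choices for illustration, not from the text): simulating many rolls of a single fair die gives relative frequencies close to 1/6 for every face.

```python
import random
from collections import Counter

random.seed(1)  # arbitrary seed, for reproducibility

# Simulate many rolls of a single fair die.
rolls = [random.randint(1, 6) for _ in range(60_000)]
counts = Counter(rolls)

# Each face should occur with relative frequency close to 1/6 ~ 0.167.
for face in range(1, 7):
    print(f"P(X = {face}) ~ {counts[face] / len(rolls):.3f}")
```

The more rolls we simulate, the closer the observed relative frequencies get to the ideal uniform distribution.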

Consider now the experiment of rolling 30 dice. The relative frequency of each outcome in the ideal distribution will not change, since each new die being rolled has the same outcomes as the original die; outcomes are added to the experiment in the same proportion.

Now take this experiment of rolling 30 dice and repeat it.

TODO

A departure from normality can suggest several things:

  1. The selection process was not random.

  2. The observations are not independent.

  3. The observations are not being drawn from the exact same population.

Normal Calculations#

When we calculate Normal probabilities, we usually work with Z distributions, where each observation x_i has been converted into a Z Score z_i,

z_i = \frac{x_i - \mu}{\sigma}

The reason for this transformation is easily understood by recalling the data transformation theorems that state the mean of a Z distribution will always be 0 and the standard deviation of a Z distribution will always be 1.
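As a small sketch of this transformation (the sample values below are hypothetical, chosen only for illustration), the standard library's statistics module can verify that the resulting Z scores have mean 0 and standard deviation 1:

```python
import statistics

# Hypothetical sample, used only to illustrate the transformation.
x = [12.0, 15.5, 9.8, 14.2, 11.0, 13.5]

mu = statistics.mean(x)
sigma = statistics.pstdev(x)  # population standard deviation

# Convert each observation x_i into its Z score z_i.
z = [(x_i - mu) / sigma for x_i in x]

# Per the data transformation theorems, the Z distribution
# has mean 0 and standard deviation 1 (up to rounding error).
print(statistics.mean(z), statistics.pstdev(z))
```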

If an observation \mathcal{X} drawn from a population follows a Normal distribution with mean \mu and standard deviation \sigma, we write,

\mathcal{X} \sim \mathcal{N}(\mu, \sigma)

Then, the corresponding Z distribution can be written,

\mathcal{Z} \sim \mathcal{N}(0, 1)

TODO

Cumulative Distribution Function#

The cumulative distribution function (CDF) for the Normal distribution is an extremely important function in mathematics. Symbolically, it is written,

\Phi(z) = P(\mathcal{Z} \leq z) = p

This function represents the area under the density curve to the left of the point z. In other words, this function tells us the percentage p of the Standard Normal distribution that is less than or equal to the point z. To put it yet another way, it tells us what percentage p of the original Normal distribution is less than or equal to z standard deviations away from the mean.

Graphically, we can think of the Normal CDF at a point, \Phi(z) as representing the shaded area to the left of z. For example, the quantity \Phi(0.5) can be visualized as the shaded region under the density curve,


../../_images/normal_distribution_cdf.png
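If Python is available, the shaded area \Phi(0.5) can be computed directly; one option among many is the standard library's statistics.NormalDist class:

```python
from statistics import NormalDist

Z = NormalDist(mu=0, sigma=1)  # the Standard Normal distribution

# Phi(0.5): the proportion of the distribution at or below z = 0.5,
# i.e. the shaded area to the left of 0.5.
p = Z.cdf(0.5)
print(round(p, 4))  # ~ 0.6915
```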

Inverse Cumulative Distribution Function#

Any strictly increasing continuous function has an inverse, and the CDF of the Normal Distribution is no exception. The inverse CDF is denoted,

\Phi^{-1}(p) = z

The CDF tells us, given a value of z, what percent of the distribution is below z. The inverse CDF, on the other hand, tells us, given a value of p, what observation z corresponds to that percentile. It is the point z on the Normal density curve such that the shaded area below z is equal to p.

As an example, if we were interested in the 35th percentile of the Standard Normal distribution, the inverse CDF would tell us the point z such that 35% of the distribution is less than or equal to that point, i.e. the point where the area to the left of z is 35%.


../../_images/normal_distribution_inverse.png
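The same 35th percentile calculation can be sketched with Python's statistics.NormalDist (one tool among many), whose inv_cdf method implements \Phi^{-1}:

```python
from statistics import NormalDist

Z = NormalDist()  # the Standard Normal distribution

# The 35th percentile: the point z with 35% of the area to its left.
z = Z.inv_cdf(0.35)
print(round(z, 3))  # ~ -0.385

# Applying the CDF to that point recovers the percentile.
print(round(Z.cdf(z), 2))  # 0.35
```

Note that the CDF and the inverse CDF undo one another, which is exactly what it means for one function to be the inverse of the other.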

Symmetry#

TODO

Z-Tables#

These days we have calculators that can perform almost any calculation you can imagine, but back in the old days, aspiring mathematicians needed to be familiar with tables. Many functions in trigonometry and algebra do not have closed-form expressions for their exact calculation, so their values had to be looked up in a table.

For example, sin(x) is a trigonometric quantity defined as the ratio of sides in a right triangle. It is, in general, impossible to calculate the exact value of sin(x) for an arbitrary x without more advanced techniques introduced in Calculus. For this reason, before the advent of modern computing, values of sin were tabulated in tables like the following,

(TODO: insert picture)

Similarly, the Standard Normal distribution is defined by a density curve whose area is not easily calculated without a substantial amount of math-power (like horse-power, but with math). In order to aid in calculations, statisticians of the past tabulated the values of the Standard Normal CDF and devised a way of representing it through a two-way table,

../../_images/table_positive_z.png

This table can answer questions like,

P(\mathcal{Z} \leq 1.45)

First, we find the row that corresponds to the two leading digits, 1.4.

../../_images/table_positive_z_example_step1.png

Then, we find the column that corresponds to the last decimal place, 0.05.

../../_images/table_positive_z_example_step2.png

This tells us that 92.65% of the Standard Normal distribution lies at or below 1.45 standard deviations above the mean.
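With modern software the table lookup can be double-checked directly; for instance, with Python's statistics.NormalDist:

```python
from statistics import NormalDist

# P(Z <= 1.45), the same quantity the z-table lookup produced.
p = NormalDist().cdf(1.45)
print(round(p, 4))  # ~ 0.9265
```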

Empirical Rule#

TODO

The Empirical Rule can be visualized through the area underneath the Normal curve,

../../_images/normal_distribution_empirical_rule.png
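The areas shown in this figure can be recovered from the CDF. The following is a minimal Python sketch, assuming the standard 68-95-99.7 statement of the Empirical Rule:

```python
from statistics import NormalDist

Z = NormalDist()

# Area within k standard deviations of the mean, for k = 1, 2, 3.
areas = {k: Z.cdf(k) - Z.cdf(-k) for k in (1, 2, 3)}

for k, area in areas.items():
    print(f"within {k} standard deviation(s): {area:.4f}")
# Approximately 0.68, 0.95 and 0.997: the Empirical Rule.
```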

TODO

Parameters#

Mean#

TODO

Varying the Mean Parameter#

TODO

Standard Deviation#

Varying the Standard Deviation Parameter#

By changing the Standard Deviation, the shape of the distribution changes. As the Standard Deviation increases, the graph spreads out. This is because Standard Deviation is a measure of variation. In other words, Standard Deviation quantifies how the distribution is spread out along the x-axis.


../../_images/normal_distribution_parameters.png
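One way to see the spreading numerically: the density at the mean equals 1 / (\sigma \sqrt{2\pi}), so the peak of the curve shrinks as \sigma grows. A small sketch using Python's statistics.NormalDist (the particular \sigma values are arbitrary illustrations):

```python
from statistics import NormalDist

# Peak density at the mean for several standard deviations.
# A larger sigma spreads the same total area (1) over a wider
# range, so the peak of the curve must get lower.
sigmas = (0.5, 1, 2)
peaks = [NormalDist(mu=0, sigma=s).pdf(0) for s in sigmas]

for s, peak in zip(sigmas, peaks):
    print(f"sigma = {s}: density at the mean ~ {peak:.4f}")
```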

Summary#

To summarize,

Assessing Normality#

TODO

QQ Plots#

A common technique for assessing the normality of a sample distribution is to generate a Quantile-Quantile Plot, or QQ Plot for short. QQ plots provide a visual representation of a sample’s normality by plotting the percentiles of a sample distribution against the percentiles of the theoretical Normal Distribution.

The exact steps for generating a QQ plot are given below,

  1. Find the :ref:`order statistics <order_statistics>` of the distribution. In other words, sort the sample in ascending order.

Note

Step 1 is equivalent to finding the percentiles of the sample distribution.

  2. Standardize the sorted sample, i.e. find each observation’s Z Score.

  3. Find the theoretical percentiles from the Standard Normal Distribution for each ordered observation.

  4. Plot the actual percentiles versus the theoretical percentiles in the x-y plane.

Consider the following simplified example. Let the sample S be given by,

S = \{ 10, 15, 20, 30 \}

The sample statistics for this distribution are given by,

\bar{x} = 18.75

s \approx 8.54

Standardizing each observation and rounding to the second decimal place,

Z = \{ -1.02, -0.44, 0.15, 1.32 \}

Then, we construct the theoretical percentiles of the Standard Normal distribution for a sample of size n = 4. To do so, we take the inverse CDF of the sample percentile,

\Phi^{-1}(\frac{i}{n+1})

for i = 1, 2, ..., n. Note the denominator of n + 1. If it is surprising that the denominator is n + 1 instead of n, read through the order statistics section: there are n observations, but these values divide the number line into n + 1 intervals.

In this example, we would find,

Z_{ \text{theoretical} } = \{ \Phi^{-1}(\frac{1}{5}), \Phi^{-1}(\frac{2}{5}), \Phi^{-1}(\frac{3}{5}), \Phi^{-1}(\frac{4}{5}) \}

Z_{\text{theoretical}} = \{ -0.842, -0.253, 0.253, 0.842 \}

After constructing the theoretical percentiles, we create a scatter plot using the ordered pairs,

( actual percentile, theoretical percentile )

If the sample distribution is Normal, we should observe a linear relationship between the x-value and the y-value of this scatter plot. The following QQ plot summarizes the normality of this example,


../../_images/qq_plot_simple.png
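The steps of this worked example can be reproduced in code. Here is a sketch using only Python's standard library (statistics.NormalDist supplies the inverse CDF \Phi^{-1}):

```python
from statistics import NormalDist, mean, stdev

S = [10, 15, 20, 30]
n = len(S)

# Steps 1-2: sort the sample and standardize it.
x_bar = mean(S)   # 18.75
s = stdev(S)      # ~ 8.54 (sample standard deviation)
z_actual = [(x - x_bar) / s for x in sorted(S)]

# Step 3: theoretical percentiles, Phi^{-1}(i / (n + 1)).
z_theory = [NormalDist().inv_cdf(i / (n + 1)) for i in range(1, n + 1)]

# Step 4: these ordered pairs are the points of the QQ plot.
for a, t in zip(z_actual, z_theory):
    print(f"({a:+.2f}, {t:+.3f})")
```

The printed pairs match the values computed by hand above; plotting them reveals the approximately linear pattern expected of a Normal sample.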

We notice an approximately linear relationship between the observed percentiles and the theoretical percentiles, and thus we conclude there is no evidence to suggest the distribution is not normal.

Important

The phrasing here is important! We have not shown the distribution is Normal. We have only provided evidence to contradict the claim that the distribution is not Normal. In other words, we have demonstrated the falsity of a negative claim; we have not demonstrated the truth of a positive claim.

Relation To Other Distributions#

The Normal Distribution is deeply connected with many different areas of mathematics. It pops up everywhere, from quantum mechanics to finance. The reach of the normal distribution is far and wide.

Normal As An Approximation of the Binomial#

TODO

Normal As An Approximation of the Poisson#

TODO