Correlation#

Up to now we have been dealing with univariate data. In this section, we begin the study of bivariate data.

Definitions#

Bivariate Data#

S = \{ (x_1, y_1), (x_2, y_2), ... , (x_n, y_n) \}

When two variables are measured on one individual, we call such data bivariate.

The x variable is sometimes called the independent or predictor variable, and the y variable the dependent or response variable. This terminology properly belongs to the Linear Regression model introduced in the next section. Before the statistical significance of a relationship has been established, the terminology is misleading, because it implies a relationship between the x and y variables when none may exist. Correlation can be measured between variables that have no relationship whatsoever; in that case we call the variables uncorrelated.

Important

Because we are dealing with randomness, uncorrelated variables will not necessarily have a sample correlation of exactly 0. In fact, a correlation of exactly 0 is almost never observed in practice. Any two sampled variables will typically show some non-zero correlation; the task of statistics is to determine whether this correlation is significant enough to use the outcome of one variable to make predictions about the outcome of the other.
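
The point above can be checked with a small simulation. The sketch below is only an illustration: it assumes the correlation in question is Pearson's sample correlation coefficient (the formal definition has not been given at this point), and the helper names `pearson_r` and `independent_sample_r` are hypothetical. The x and y values are drawn independently, so any non-zero result is due to sampling randomness alone.

```python
import math
import random


def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and cross-deviations.
    # The 1/(n-1) factors cancel in the ratio, so they are omitted.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


def independent_sample_r(n, seed):
    """Draw x and y independently, then return their sample correlation."""
    rng = random.Random(seed)
    xs = [rng.gauss(0, 1) for _ in range(n)]
    ys = [rng.gauss(0, 1) for _ in range(n)]  # generated without reference to xs
    return pearson_r(xs, ys)


# Five independent experiments of 50 points each.
sample_rs = [independent_sample_r(50, seed) for seed in range(5)]
for seed, r in enumerate(sample_rs):
    print(f"seed={seed}: r = {r:+.3f}")
```

Each run should print a small value that is close to, but not exactly, zero.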

Correlation#

Correlation is a measure of the strength of the relationship between two observable variables. Before we begin our study of correlation, let’s make some preliminary definitions that will help us keep everything clear and precise.

Univariate Statistics#

In order to differentiate between the statistics relating to the x and y variables, we introduce some notation.

\bar{x} and \bar{y} are defined as the univariate sample means of the x and y variables. In other words, \bar{y} is the sample mean of the y variable, as if we were observing the y variable in isolation. Similarly for \bar{x}.

s_x and s_y are defined as the univariate standard deviations of the x and y variables. In other words, s_x is the standard deviation of the x variable, as if we were observing the x variable in isolation. Similarly, for s_y.

s_{x}^2 = \frac{1}{n-1} \cdot \sum_{i=1}^{n} (x_i - \bar{x})^2

s_{y}^2 = \frac{1}{n-1} \cdot \sum_{i=1}^{n} (y_i - \bar{y})^2
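
As a minimal sketch of these definitions, the following computes \bar{x}, \bar{y}, s_x and s_y directly from the formulas above. The data set `S` and the function names are hypothetical, chosen only for illustration.

```python
import math


def sample_mean(values):
    """Arithmetic mean of one variable, observed in isolation."""
    return sum(values) / len(values)


def sample_std(values):
    """Sample standard deviation using the 1/(n-1) formula."""
    n = len(values)
    mean = sample_mean(values)
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return math.sqrt(variance)


# Hypothetical bivariate sample S = {(x_1, y_1), ..., (x_n, y_n)}
S = [(1, 2), (2, 4), (3, 5), (4, 4), (5, 5)]

# Split the pairs so each variable can be treated on its own.
xs = [x for x, _ in S]
ys = [y for _, y in S]

x_bar, y_bar = sample_mean(xs), sample_mean(ys)
s_x, s_y = sample_std(xs), sample_std(ys)

print(f"x_bar = {x_bar}, s_x = {s_x:.4f}")
print(f"y_bar = {y_bar}, s_y = {s_y:.4f}")
```

Note that each statistic uses only one coordinate of the pairs; nothing here measures how x and y vary together.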

TODO

Definition#

TODO

TODO: justification. make some plots.

Scatter Plots#

A scatterplot is a simple graphical representation of bivariate data. The x variable is plotted on the horizontal axis and the y variable on the vertical axis. Scatterplots are classified by the direction, the strength and the linearity of the relationship observed.

Direction#

No Correlation#

A scatterplot with no correlation between the x and y variables should appear random:

../../_images/scatterplot_no_correlation.png

Positive Correlation#

A scatterplot with a positive correlation between the x and y variables should have a general upward trend:

../../_images/scatterplot_positive_correlation.png

Negative Correlation#

A scatterplot with a negative correlation between the x and y variables should have a general downward trend:

../../_images/scatterplot_negative_correlation.png
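
Plots like the three above can be produced with a few lines of Python. This is a rough sketch and not the source of the figures shown: it assumes matplotlib is installed, and the function names, the linear-plus-noise model, and the output filename are all hypothetical choices made for illustration.

```python
import random


def make_scatter_data(slope, n=100, noise=1.0, seed=0):
    """Return n (x, y) pairs with y = slope * x + Gaussian noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [slope * x + rng.gauss(0, noise) for x in xs]
    return xs, ys


def plot_directions(path="scatter_directions.png"):
    """Draw one panel per direction and save the figure to `path`."""
    # Imported here so the data helper above works without matplotlib.
    import matplotlib
    matplotlib.use("Agg")  # render to a file; no display needed
    import matplotlib.pyplot as plt

    cases = [
        ("No Correlation", 0.0),         # y does not depend on x at all
        ("Positive Correlation", 1.0),   # y tends to rise with x
        ("Negative Correlation", -1.0),  # y tends to fall as x rises
    ]
    fig, axes = plt.subplots(1, 3, figsize=(12, 4), sharex=True)
    for ax, (title, slope) in zip(axes, cases):
        xs, ys = make_scatter_data(slope)
        ax.scatter(xs, ys, s=10)
        ax.set_title(title)
        ax.set_xlabel("x")
        ax.set_ylabel("y")
    fig.tight_layout()
    fig.savefig(path)
    plt.close(fig)
    return path
```

Calling `plot_directions()` writes a single image with the three panels side by side. Changing the `noise` argument of `make_scatter_data` makes the point cloud tighter or looser around the underlying line.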

Strength#

Strong#

TODO

Weak#

TODO

Linearity#

Linear#

TODO

Non-Linear#

TODO

Time Series#

A time series is similar to a scatter plot in almost all ways, except that the independent variable in a time series is always time. A correlation between time and the observed variable is called a trend.

Positive Trend#

../../_images/timeseries_positive_trend.png

Negative Trend#

../../_images/timeseries_negative_trend.png

No Trend#

../../_images/timeseries_no_trend.png