 Sample Module

Statistics II - Correlation and Linear Regression

Introduction

The content of the prerequisite modules has dealt with the statistics of one variable. That is called univariate analysis because only one variable is analyzed. The business world rarely analyzes just one variable for relevant information. Often, two variables are analyzed. That is called bivariate analysis because two variables are analyzed together. An example is advertising expenditures and sales. Marketing managers are very concerned about using advertising to increase sales. The relationship between the two is found through joint analysis.

The simplest measure of a bivariate relationship is covariance.

Covariance

In the Introduction to Statistics module you learned about variance. Variance is a measure of the dispersion of one variable around its mean. Covariance is a related measure of dispersion, but it measures the joint dispersion of two variables around their respective means simultaneously.

Imagine a Cartesian plane formed by two axes, one horizontal and one vertical, drawn through the means of two variables X and Y. Say the X variable is horizontal and the Y variable is vertical. Four quadrants are formed by the two axes, and the one in which both X and Y are greater than their respective means is labelled quadrant I. Your textbook should have a graph of such a plane. (We have such graphs in our full module, but have omitted them from this sample module.) The other quadrants are labelled counter-clockwise II, III, and IV. Pairs of X and Y values in the form of (x, y) represent points on this plane.

The dispersion of these points graphically demonstrates covariance in three patterns. When the points are predominantly in quadrants I and III, making an upward sloping cloud, the covariance is positive. When the points are predominantly in quadrants II and IV, making a downward sloping cloud, the covariance is negative. When the points are evenly distributed in all four quadrants, making a round cloud with no slope, the covariance is zero.

Computation

Covariance is computed almost the same way as variance. The formula is Cov(X,Y) = Σ(X - X̄) * (Y - Ȳ) / (n - 1). From this formula, you can see that when X and Y are both above their respective means or both below their respective means, the result is a positive covariance value. When X is below its mean while Y is above its mean, and vice versa, the result is a negative covariance. So a positive covariance means that the variables X and Y increase or decrease together. Their values move in the same direction. A negative covariance means that the X and Y values move in opposite directions, one increasing when the other decreases. A covariance of zero means that the values of X and Y do not move in any predictable pattern.

Illustrative Example

```
          X     Y    X - X̄   Y - Ȳ   (X - X̄) * (Y - Ȳ)
          7     3      2      -1            -2
          9     1      4      -3           -12
          2     6     -3       2            -6
          1     7     -4       3           -12
          6     3      1      -1            -1
Total    25    20                          -33

X̄ = 25/5 = 5.0.    Ȳ = 20/5 = 4.0.    Cov(X,Y) = -33.0/(5-1) = -8.25.
```
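The table above can be reproduced with a few lines of Python. This is a minimal sketch using only built-ins, with the sample data from the illustrative example:

```python
# Sample covariance, computed exactly as in the illustrative example.
xs = [7, 9, 2, 1, 6]
ys = [3, 1, 6, 7, 3]

n = len(xs)
x_bar = sum(xs) / n   # 25/5 = 5.0
y_bar = sum(ys) / n   # 20/5 = 4.0

# Sum of the cross-products of deviations, divided by (n - 1).
cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
print(cov)  # -8.25
```

Note that the divisor is n - 1, not n, because this is the sample covariance.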

Correlation

Two serious weaknesses of covariance are the absence of any upper or lower limits on its value and the awkward units of the value. The covariance value can range from negative infinity to positive infinity. This renders the magnitude of covariance meaningless. The covariance value has units that are the product of the units of X and Y. If X is advertising expenditure and Y is sales, the units of covariance are dollars-squared. These units have little intuitive meaning. To address these weaknesses, covariance is scaled by the standard deviations of variables X and Y. The result is correlation, which has no units and whose value must fall between -1.0 and 1.0. The symbol for correlation is the Greek rho (ρ) for the population correlation, and the lower case r for the sample correlation. The formula is Corr(X,Y) = Σ(X - X̄) * (Y - Ȳ) / (√[Σ(X - X̄)²] * √[Σ(Y - Ȳ)²]). The (n - 1) term in the denominator of the covariance formula cancels with the (n - 1) terms in the denominators of the two standard deviations. So correlation is not directly affected by sample size.

Sums of Squares

The formula for correlation shows that its numerator is the numerator of covariance and its denominator is the product of the square roots of the numerators of the variance of X and the variance of Y. Remember that the degrees of freedom, the n - 1 terms, cancel out. The numerator of variance, Σ(X - X̄)², is known as the sum of squares, and so is the numerator of covariance. We use the symbol SS to represent the sum of squares. Its subscript indicates the variable or variables whose deviations are being multiplied and summed. Therefore, SSXX / (n-1) is the variance of X, and SSXY / (n-1) is the covariance of X and Y.

In the Introduction to Statistics module, we showed that there are two forms of the formula for variance. The difference is in the manner of computing the numerator of variance, the sum of squares. That equivalence is Σ(X - X̄)² = ΣX² - (ΣX)²/n.

The sum of squares of X: SSXX = ΣX² - (ΣX)²/n.
The sum of squares of Y: SSYY = ΣY² - (ΣY)²/n.
The sum of squares of X and Y: SSXY = ΣX*Y - (ΣX)*(ΣY)/n.

The formula for sample correlation using the sums of squares terms is: r = SSXY / (√SSXX * √SSYY).

Illustrative Example

```
          X     Y     X²     Y²     X*Y
          7     3     49      9      21
          9     1     81      1       9
          2     6      4     36      12
          1     7      1     49       7
          6     3     36      9      18
Total    25    20    171    104      67

SSXX = 171 - 25²/5 = 46.0.
SSYY = 104 - 20²/5 = 24.0.
SSXY = 67 - 25*20/5 = -33.0.
Corr(X,Y) = r = -33.0 / (√46.0 * √24.0) = -0.9932.
```
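The shortcut sums-of-squares formulas and the resulting correlation can be checked with a short Python sketch, using the same five data pairs as above:

```python
# Correlation via the sums-of-squares shortcut, reproducing the table above.
from math import sqrt

xs = [7, 9, 2, 1, 6]
ys = [3, 1, 6, 7, 3]
n = len(xs)

# SS = sum of products minus (sum)(sum)/n, per the shortcut formulas.
ss_xx = sum(x * x for x in xs) - sum(xs) ** 2 / n                   # 171 - 125 = 46.0
ss_yy = sum(y * y for y in ys) - sum(ys) ** 2 / n                   # 104 - 80  = 24.0
ss_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n  # 67 - 100  = -33.0

r = ss_xy / (sqrt(ss_xx) * sqrt(ss_yy))
print(round(r, 4))  # -0.9932
```

The correlation of -0.9932 is very close to -1.0, reflecting the strong inverse relationship visible in the data.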

Example 1.

A firm that manufactures and sells fookits believes that sales levels will increase should it engage in Internet advertising. In a test program, the data yielded a sum of squares for advertising (SSXX) of \$7,400, a sum of squares for sales (SSYY) of \$134,800, and a sum of squares for advertising times sales (SSXY) of \$26,500.

r = 26,500 / (√7,400 * √134,800) = 0.8390.

Interpretation

Since correlation is scaled to have a value between -1 and +1, -1.0 ≤ r ≤ +1.0. Values of +1.0 and -1.0 reflect a perfect positive, or direct, linear relationship and a perfect negative, or inverse, linear relationship, respectively. Values close to -1 or 1 reflect a strong linear relationship. Values close to 0.0 reflect a weak linear relationship. A value of exactly 0.0 reflects no linear relationship, and the variables X and Y are uncorrelated. The lack of correlation between two variables is evidence, though not proof, that they are independent of each other.

Test of Independence

The most important role for correlation is that of being subject to a test of whether or not its value is significantly different from 0.0. Upon showing that a correlation coefficient has a value that is not significantly different from 0.0, one can presume that the two variables involved are independent of each other. The variance of the correlation coefficient is (1 - r²) / (n - 2). It can be used to construct a confidence interval for the correlation coefficient or to compute a t-score or Z-score, both of which are part of hypothesis testing.

Example 2.

A firm that manufactures and sells fookits believes that sales levels will increase should it engage in Internet advertising. In a test program, the 102 data points yielded a sum of squares for advertising (SSXX) of \$46,200, a sum of squares for sales (SSYY) of \$361,800, and a sum of squares for advertising times sales (SSXY) of \$32,600. What is the correlation between advertising and sales? Are advertising and sales uncorrelated, i.e., independent?

r = 32,600 / (√46,200 * √361,800) = 0.2522. Var(r) = (1 - 0.2522²) / (102 - 2) = 0.0093640, S.D.(r) = √0.0093640 = 0.096768.

The 95% confidence interval for the correlation coefficient is ρ = r ± 1.96 * 0.096768 = 0.2522 ± 0.18967. ρ ∈ [0.0625, 0.4419]. The Z-score, for a test that the correlation coefficient is different from 0.0, is Z = (0.2522 - 0) / 0.096768 = 2.606.

The hypothesis statement for testing that the population correlation is significantly different from 0.0, that sales and advertising are not independent, is:

H0: ρ = 0.       HA: ρ ≠ 0.

The hypothesized value of ρ, 0.0, falls outside its 95% confidence interval; 0.0 < 0.0625. The calculated Z-score of 2.606 is greater than the critical Z-value of 1.960 and so falls into the rejection region. The null hypothesis is rejected. There is sufficient evidence to conclude that the population correlation coefficient is significantly greater than 0.0. The variables advertising and sales have a direct linear relationship.
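Example 2 can be recomputed directly from the given sums of squares. This sketch uses the normal critical value 1.96 for a 95% two-sided test, as in the text:

```python
# Test of independence for Example 2, from the given sums of squares.
from math import sqrt

n = 102
ss_xx, ss_yy, ss_xy = 46_200, 361_800, 32_600

r = ss_xy / (sqrt(ss_xx) * sqrt(ss_yy))   # sample correlation
var_r = (1 - r**2) / (n - 2)              # variance of r
sd_r = sqrt(var_r)

z = r / sd_r                              # test statistic for H0: rho = 0
ci = (r - 1.96 * sd_r, r + 1.96 * sd_r)   # 95% confidence interval for rho

reject_h0 = abs(z) > 1.96                 # True: reject independence
print(round(r, 4), round(z, 3), reject_h0)
```

Because 0.0 lies below the lower confidence limit and |Z| exceeds 1.96, the null hypothesis of independence is rejected.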

Important Caveat

The correlation explained in this module is for bivariate analysis, yielding the simple coefficient of correlation. This correlation describes only linear relationships. There is a correlation, called the multiple coefficient of correlation, that can describe non-linear relationships. It is possible for two variables to have no linear relationship, and therefore appear uncorrelated, but to have a non-linear relationship and be correlated in that manner.

Simple Linear Regression

Correlation tells us the direction and strength of a linear relationship between two variables. It does not tell us the magnitude of that relationship or the nature of that relationship. The magnitude of the relationship will tell us by how much one variable changes when the other increases by one unit. The nature of the relationship will tell us which variable 'reacts' to the change in the other.

Causation

A more technical way of expressing the notion that one variable is reacting to changes in another variable is to say that one variable causes another. The issue of causation is an important one that is addressed in graduate courses. There are techniques for determining and assessing causality. For undergraduate courses, and this module, we assume that one variable, the independent variable, causes the other variable, the dependent variable. We will always identify the independent and dependent variables for you.

When the concept of causation is combined with the notion of magnitude of the relationship, we arrive at the central idea and utility of linear regression: the determination of the size and direction of the change in a dependent variable that is caused by a one-unit increase in the independent variable.

The Sample Regression Line

The equation of a straight line is described by its slope and intercept on the vertical axis. This is the well-known equation y = mx + b. In statistics, this equation is expressed as Y = b0 + b1 * X + e. Y is the dependent variable, X is the independent variable, b0 is the vertical axis intercept, b1 is the slope, and e is a random error term. Regression analysis finds the values for b0 and b1.

Once the value of the slope and intercept are found, they must be interpreted. The interpretation of the intercept is that it is the value of the dependent variable when the value of the independent variable is zero. The interpretation of the slope is that it is the change in the value of the dependent variable when the independent variable increases by one unit.

The Least Squares Technique

Your textbook should have a graph of several points with a line drawn through them as closely as possible. That is a scatter plot with a best fit line. (We have such graphs in our full module, but have omitted them from this sample module.) The difference between that best fit line and each plotted point is called an error. It is the 'e' from the sample regression line. The least squares technique finds the line that makes the sum of squares of those errors as small as possible. That technique yields the formulas for the least squares regression line.

Labelling the dependent variable as 'Y' and the independent variable as 'X', the formulas for computing the slope and intercept are:

Slope: b1 = SSXY / SSXX.     Intercept: b0 = Ȳ - b1 * X̄.

Illustrative Example

```
ΣX = 750  ΣY = 2,500  ΣX² = 8,200  ΣY² = 286,800  ΣX*Y = 64,450  n = 100.

SSXX = 8,200 - 750²/100 = 2,575.0.
SSYY = 286,800 - 2,500²/100 = 224,300.0.
SSXY = 64,450 - 750*2,500/100 = 45,700.0.
X̄ = 750/100 = 7.50.    Ȳ = 2,500/100 = 25.0.
b1 = 45,700 / 2,575 = 17.74757.   b0 = 25.0 - 17.74757 * 7.50 = -108.10680.

The regression line is Y = -108.11 + 17.75 * X.
```
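The slope and intercept formulas can be applied in Python using only the summary statistics, without the raw data. A sketch with the sums from the illustrative example:

```python
# Least-squares slope and intercept from summary statistics alone.
n = 100
sum_x, sum_y = 750, 2_500
sum_x2, sum_xy = 8_200, 64_450

# Sums of squares via the shortcut formulas.
ss_xx = sum_x2 - sum_x**2 / n         # 2,575.0
ss_xy = sum_xy - sum_x * sum_y / n    # 45,700.0

b1 = ss_xy / ss_xx                    # slope
b0 = sum_y / n - b1 * sum_x / n       # intercept: Y-bar minus b1 * X-bar

print(round(b1, 5), round(b0, 4))     # 17.74757 -108.1068
```

Only five totals (ΣX, ΣY, ΣX², ΣX*Y, n) are needed; ΣY² enters later when assessing fit, not when computing the line itself.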

The interpretation of the slope and intercept is strongly influenced by the context of the problem. The context tells us the units of the two variables, any logic that supports causality, and any correspondence to business theory.

Example 1.

The human resources manager of an engineering firm has data on salaries and years of experience for 87 civil engineers from several other firms. She wants to determine if the salaries paid to the firm's own civil engineers are in line with the market salary. The dependent variable (Y) is salary in thousands of dollars. The independent variable (X) is years of experience.

```
ΣX = 1,325  ΣY = 10,440  ΣX² = 21,438  ΣY² = 1,426,740  ΣX*Y = 164,350
n = 87.

SSXX = 21,438 - 1,325²/87 = 1,258.402.
SSYY = 1,426,740 - 10,440²/87 = 173,940.0.
SSXY = 164,350 - 1,325*10,440/87 = 5,350.0.
X̄ = 1,325/87 = 15.23.    Ȳ = 10,440/87 = 120.0.

b1 = 5,350 / 1,258.402 = 4.2514.   b0 = 120.0 - 4.2514 * 15.23 = 55.2508.

The regression line is Y = 55.25 + 4.25 * X.
```

The context tells us that Y is the salary of civil engineers with units of thousands of dollars, and X is the work experience of civil engineers in years. The slope must have units of thousands of dollars per year because it must yield a number that has the same units as Y. Y increases by 4.25 thousand dollars when X increases by 1 year. In plain English, the interpretation of the slope is: The annual salary increase for civil engineers due to productivity increases over the year due to additional work experience is \$4,250. The intercept is the salary when X is 0.0, when a civil engineer has no work experience. The salary for someone with no work experience is called the starting salary. The interpretation of the intercept is: The starting salary of civil engineers is \$55,251.
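The fitted line from Example 1 can be used to estimate the market salary for a given level of experience. This sketch uses the rounded coefficients reported above, with salary in thousands of dollars:

```python
# Predicted salary (in thousands of dollars) from the Example 1 regression line.
def predicted_salary(years: float) -> float:
    b0, b1 = 55.25, 4.25   # rounded intercept and slope from the regression line
    return b0 + b1 * years

print(predicted_salary(0))   # starting salary (no experience): 55.25
print(predicted_salary(10))  # ten years of experience: 97.75
```

As with any regression, such predictions are most trustworthy for experience values within the range of the data used to fit the line.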