Tests of Independence

Topics

Introduction
Tests Concerning Correlation
- Parametric Test of a Correlation
- Non-Parametric Test of Correlation: The Spearman Rank Correlation Coefficient
Tests of Independence Using Contingency Table Data

Introduction

This section covers parametric and non-parametric tests of correlation, and tests of independence based on contingency table data.

Tests Concerning Correlation

The strength of the linear relationship between two variables is measured by the correlation coefficient. Whether that relationship is statistically significant is assessed with hypothesis tests concerning correlation.

The most common setup tests whether the population correlation differs from 0: $H_{0} : ρ = 0$ versus $H_{a} : ρ ≠ 0$, where ρ represents the population correlation coefficient.

We can also set up one-sided hypotheses to check whether the population correlation is positive or negative.
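Written out, the one-sided setups take the following standard form, with the alternative hypothesis carrying the direction we want to detect. The first pair tests for a positive correlation and the second for a negative one:

H_{0} : ρ \leq 0 \quad \text{versus} \quad H_{a} : ρ > 0

H_{0} : ρ \geq 0 \quad \text{versus} \quad H_{a} : ρ < 0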

Parametric Test of a Correlation

As long as the two variables are normally distributed, we can use the sample correlation, r, for hypothesis testing.

The formula for the t-test is

t = \frac{r \sqrt{n - 2}}{\sqrt{1 - r^{2}}}

This statistic follows a t-distribution with n – 2 degrees of freedom if $H_{0}$ is true.
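As a minimal sketch, the formula can be evaluated directly in Python. The function name `correlation_t_stat` and the input values (r = 0.5, n = 27) are hypothetical and chosen only to illustrate the arithmetic.

```python
import math


def correlation_t_stat(r: float, n: int) -> float:
    """t-statistic for H0: rho = 0, to be compared with a t-distribution on n - 2 df."""
    if n <= 2:
        raise ValueError("need n > 2 so that the degrees of freedom are positive")
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r ** 2)


# Hypothetical inputs: r = 0.5 from n = 27 paired observations (25 degrees of freedom).
t = correlation_t_stat(0.5, 27)
print(round(t, 4))  # 2.8868
```

The resulting value is then compared with the critical t-value for n – 2 degrees of freedom at the chosen significance level.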

The magnitude of r needed to reject the null hypothesis ($H_{0} : ρ = 0$) decreases as the sample size n increases, for two reasons:

The number of degrees of freedom increases, so the absolute value of the critical t-value decreases.
The absolute value of the numerator increases, leading to larger-magnitude t-values for the same r.

In other words, as n increases, the probability of Type-II error decreases, all else equal.
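The numerator effect can be made concrete by solving the t formula for r: the test rejects when |r| exceeds t_c / \sqrt{n - 2 + t_c^{2}}, where t_c is the critical value. The sketch below assumes a fixed critical value of 2.0 for every sample size, which is a simplification made only for illustration; the function name `min_abs_r_to_reject` is hypothetical.

```python
import math


def min_abs_r_to_reject(n: int, t_crit: float) -> float:
    """Smallest |r| whose t-statistic reaches t_crit, obtained by solving the t formula for r."""
    return t_crit / math.sqrt(n - 2 + t_crit ** 2)


# t_crit is held at 2.0 for every n purely for illustration (an assumption).
# In practice the critical value also shrinks as n grows, which lowers the
# threshold on |r| even further.
for n in (12, 32, 62, 102):
    print(n, round(min_abs_r_to_reject(n, 2.0), 3))
```

Under this assumption the required |r| falls from roughly 0.53 at n = 12 to 0.25 at n = 62, so a much weaker sample correlation is enough to reject the null hypothesis in larger samples.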

Example

The sample correlation between oil prices and the monthly returns of energy stocks in Country A is 0.7986 for the period from January 2014 through December 2018 (60 monthly observations). Can we reject a null hypothesis that the underlying or population correlation equals 0 at the 0.05 level of significance?

Solution:
$H_{0} : ρ = 0$ → True correlation in the population is 0.
$H_{a} : ρ ≠ 0$ → True correlation in the population is different from 0.

t = \frac{0.7986 \sqrt{60 - 2}}{\sqrt{1 - {0.7986}^{2}}} = 10.1052

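At the 0.05 significance level, the critical values for this two-sided test are ±2.00 (n = 60, degrees of freedom = 58).
Since 10.1052 is greater than 2.00, we reject the null hypothesis and conclude that the population correlation is different from 0.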

Non-Parametric Test of Correlation: The Spearman Rank Correlation Coefficient

If the two variables under consideration are not normally distributed, we can use a test based on the Spearman rank correlation coefficient, $r_{S}$.

The Spearman rank correlation coefficient is equivalent to the usual correlation coefficient but is calculated on the ranks of two variables within their respective samples.

Steps

Sort the X observations from largest to smallest.
Assign the number 1 to the largest value observation, the number 2 to the second largest value observation, and so on.
In the event of a tie, assign the average of the ranks that the tied observations share to each tied observation.
Repeat the procedure for the observations on Y.
Calculate the difference in ranks, $d_{i}$, for each pair of observations on X and Y, and then calculate $d_{i}^{2}$ (the squared difference in ranks).

For a sample of size n, the Spearman rank correlation is:

r_{S} = 1 - \frac{6 \sum_{i = 1}^{n} d_{i}^{2}}{n (n^{2} - 1)}
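The ranking steps and the formula above can be sketched in plain Python as follows. The helper names `average_ranks` and `spearman_rs` and the four-point data set are hypothetical; ties receive the average of the ranks they share, as described in the steps.

```python
def average_ranks(values):
    """Rank from largest (rank 1) to smallest; tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + 1 + end + 1) / 2  # average of the 1-based ranks pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def spearman_rs(x, y):
    """Spearman rank correlation from the sum of squared rank differences."""
    n = len(x)
    rank_x, rank_y = average_ranks(x), average_ranks(y)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


# Hypothetical data in which Y falls whenever X rises.
print(spearman_rs([10, 20, 30, 40], [4, 3, 2, 1]))  # -1.0
```

The result can then be plugged into the same t formula used for the parametric test, with $r_{S}$ in place of r.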

Example

Perform a Spearman rank correlation test based on this sample data. Determine whether to reject the null hypothesis at the 0.05 level of significance if the critical values are ±2.306.

| | Alpha (X) | Expense Ratio (Y) | Rank by X | Rank by Y |
| --- | ----- | ------------- | --------- | --------- |
| 1 | -0.52 | 1.34 | 6 | 6 |
| 2 | -0.13 | 0.4 | 1 | 9 |
| 3 | -0.5 | 1.9 | 5 | 1 |
| 4 | -1.01 | 1.5 | 9 | 2.5 |
| 5 | -0.26 | 1.35 | 3 | 5 |
| 6 | -0.89 | 0.5 | 8 | 8 |
| 7 | -0.42 | 1 | 4 | 7 |
| 8 | -0.23 | 1.5 | 2 | 2.5 |
| 9 | -0.6 | 1.45 | 7 | 4 |
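The sum of squared rank differences needed for $r_{S}$ follows from the last two columns of the table. Taking $d_{i}$ as the rank by X minus the rank by Y for each row, in row order:

\sum_{i = 1}^{9} d_{i}^{2} = 0^{2} + (-8)^{2} + 4^{2} + 6.5^{2} + (-2)^{2} + 0^{2} + (-3)^{2} + (-0.5)^{2} + 3^{2} = 144.5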

r_{S} = 1 - \frac{6 (144.5)}{9 (9^{2} - 1)} = - 0.20416

t = \frac{- 0.20416 \sqrt{9 - 2}}{\sqrt{1 - {0.20416}^{2}}} = - 0.55177

Since −0.55177 lies between the critical values of ±2.306, we fail to reject the null hypothesis of zero correlation.

Tests of Independence Using Contingency Table Data

When dealing with categorical or discrete data presented in the form of a contingency table, we use a chi-squared distributed test statistic.

For example, to test whether a relationship exists between size and investment type, we can perform a test of independence using a chi-squared distributed test statistic.

This non-parametric test compares the actual observed frequencies with those expected on the basis of independence.

The test statistic is calculated as:

χ^{2} = \sum_{i = 1}^{m} \frac{(O_{i j} - E_{i j})^{2}}{E_{i j}}

where:

m = Number of cells in the table
$O_{i j}$ = Observed frequency
$E_{i j}$ = Expected frequency = $\frac{(\text{Total row } i) \times (\text{Total column } j)}{\text{Overall total}}$

This test statistic has degrees of freedom of (r − 1)(c − 1), where r is the number of categories for the first variable and c is the number of categories of the second variable.
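A minimal sketch of the calculation in plain Python is shown below. The function name `chi_squared_independence` and the 2 x 2 table of counts are hypothetical and serve only to illustrate how the expected frequencies, the test statistic, and the degrees of freedom fit together.

```python
def chi_squared_independence(observed):
    """Return (statistic, degrees_of_freedom, expected) for a table of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    # Expected frequency under independence: row total * column total / overall total.
    expected = [[rt * ct / total for ct in col_totals] for rt in row_totals]
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof, expected


# Hypothetical 2 x 2 table of counts (not taken from the text).
stat, dof, expected = chi_squared_independence([[10, 20], [30, 40]])
print(round(stat, 3), dof)  # 0.794 1
```

The statistic is then compared with the chi-squared critical value for the computed degrees of freedom at the chosen significance level.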

Example

Consider the following contingency table which classifies 1,594 ETFs based on two dimensions: size and investment type.

The three values in each cell are the observed number of ETFs, the expected frequency, and the scaled squared deviation, in that order.

| | Small | Medium | Large | Total |
| ------ | --------------------- | ----------------------- | ----------------------- | ----- |
| Value | 50 / 46.703 / 0.233 | 110 / 120.228 / 0.87 | 343 / 336.07 / 0.143 | 503 |
| Growth | 42 / 33.982 / 1.892 | 122 / 87.482 / 13.62 | 202 / 244.536 / 7.399 | 366 |
| Blend | 56 / 67.315 / 1.902 | 149 / 173.290 / 3.405 | 520 / 484.395 / 2.617 | 725 |
| Total | 148 | 381 | 1065 | 1594 |
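As a check on how one cell is built, take the small-cap Value ETFs: the expected frequency uses the Value row total (503), the Small column total (148), and the overall total (1,594), and the scaled squared deviation compares it with the observed count of 50:

E = \frac{503 \times 148}{1594} = 46.703, \qquad \frac{(50 - 46.703)^{2}}{46.703} = 0.233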

Summing the scaled squared deviations across all nine cells gives a chi-squared test statistic of 32.08025.

$H_{0}$: ETF size and investment type are not related, so these classifications are independent.
$H_{a}$: ETF size and investment type are related, so these classifications are not independent.

With (3 − 1) × (3 − 1) = 4 degrees of freedom and a one-sided (right-tail) test at the 5% level of significance, the critical value is 9.4877.

Since the calculated chi-squared test statistic (32.08025) is greater than 9.4877, we reject the null hypothesis of independence and conclude that ETF size and investment type are related.