Tuesday 17 May 2016

Cointegration in R

Cointegration is one of the most appealing and most controversial analyses. It is appealing since it is the basis of the pair trading strategy. And is controversial since it has a property to break, sometimes for an extended period and additionally there are different techniques that from time to time produce conflicting results.
In a nutshell, the approach is:
(1)    run an OLS linear regression (the coefficient beta of the regression is the hedge ratio)
(2)    test the residuals for presence of unit root (via Augmented-Dickey-Fuller Unit Root Test). The residuals of the regression represent the spread. And the spread is in fact: Dependent variable – Hedge ratio * Independent variable. If we reject the Null Hypothesis of Unit Root we conclude that the spread is stationary and the two variables are mean-reverting. This also means that two variables are cointegrated.
An alternative technique to the ADF test of the residuals is the Johansen cointegration test (Trace and Eigenvalue test) applied to the analysed variables. However, in most of the practical guides I’ have read the ADF approach is preferred. I think because it is more intuitive and easier to follow than the Johansen test.


Looking at five ETFs: SPY (SPDR S&P 500), IYY (iShares Dow Jones), IWM (iShares Russel 2000), GLD (SPDR Gold Shares) and GDX (VanEck Vectors Gold Miners). The last two pairs are viewed somehow as a textbook example.
Codes and results are highlighted.
library(quantmod) 
library(fUnitRoots) #for the ADF test

tickers <- c('GDX','GLD', 'IWM' , 'IYY','SPY')

startDate="2013-01-01"
endDate="2016-05-04"

f <- function(x) {
        x1 <- getSymbols(x[1], from=startDate, to=endDate, auto.assign=FALSE)
        x2 <- getSymbols(x[2], from=startDate, to=endDate,auto.assign=FALSE)
        y <- merge(Ad(x1),Ad(x2))
        eqn <- as.formula(paste(colnames(y), collapse=" ~ 0 + "))
        m <- lm(eqn, data=y)
        adf<-adfTest(m$residuals, type="nc")
        cat("Period is from:", startDate, "to:", endDate, "\n", "Dependent variable and independent variable are:", colnames(y), "\n", "The value of the test statistics is:", adf@test$statistic, "\n")
        cat("ADF p-value is", adf@test$p.value, "\n")
}

p <- combn(tickers, 2, FUN=f, simplify=FALSE)

Period is from: 2013-01-01 to: 2016-05-04
 Dependent variable and independent variable are: GDX.Adjusted GLD.Adjusted
 The value of the test statistics is: -3.413356
ADF p-value is 0.01

Period is from: 2013-01-01 to: 2016-05-04
 Dependent variable and independent variable are: GDX.Adjusted IWM.Adjusted
 The value of the test statistics is: -3.717138
ADF p-value is 0.01

Period is from: 2013-01-01 to: 2016-05-04
 Dependent variable and independent variable are: GDX.Adjusted IYY.Adjusted
 The value of the test statistics is: -3.647598
ADF p-value is 0.01

Period is from: 2013-01-01 to: 2016-05-04
 Dependent variable and independent variable are: GDX.Adjusted SPY.Adjusted
 The value of the test statistics is: -3.629596
ADF p-value is 0.01

Period is from: 2013-01-01 to: 2016-05-04
 Dependent variable and independent variable are: GLD.Adjusted IWM.Adjusted
 The value of the test statistics is: -3.073779
ADF p-value is 0.01

Period is from: 2013-01-01 to: 2016-05-04
 Dependent variable and independent variable are: GLD.Adjusted IYY.Adjusted
 The value of the test statistics is: -2.953128
ADF p-value is 0.01

Period is from: 2013-01-01 to: 2016-05-04
 Dependent variable and independent variable are: GLD.Adjusted SPY.Adjusted
 The value of the test statistics is: -2.91571
ADF p-value is 0.01

Period is from: 2013-01-01 to: 2016-05-04
 Dependent variable and independent variable are: IWM.Adjusted IYY.Adjusted
 The value of the test statistics is: -0.9656891
ADF p-value is 0.3085453

Period is from: 2013-01-01 to: 2016-05-04
 Dependent variable and independent variable are: IWM.Adjusted SPY.Adjusted
 The value of the test statistics is: -0.8500551
ADF p-value is 0.3454008

Period is from: 2013-01-01 to: 2016-05-04
 Dependent variable and independent variable are: IYY.Adjusted SPY.Adjusted
 The value of the test statistics is: -0.9760423
ADF p-value is 0.3052455

Looking at the values of test statistics of the ADF test if they are greater in absolute terms than the critical values we can reject the Null Hypothesis of unit root and conclude that the residuals are stationary (we also see the p-values that are smaller than 1%, 5% or 10% levels of significance). Hence, for our 10 pairs we can see there is cointegration between 7 pairs:

GDX and GLD
GDX and IWM
GDX and IYY
GDX and SPY
GLD and IWM
GLD and IYY
GLD and SPY
(Please note that the regression distinguishes between dependent and independent variable in the respective pair).

A way to look at the standard Dickey Fuller test  (differ with the Augmented-Dickey Fuller test with the number of lags -> in the standard Dickey Fuller test the  number of lags is 0) of the residuals from the regression is to run again an OLS regression of first difference of residuals on lagged residuals. Let’s play a bit step-by-step with this and see what we get in R:

For the example, we will use ur.df function in R to get the ADF and standard DF test values.

startDate="2013-01-01"
endDate="2016-05-04"

GLD <- getSymbols("GLD", from=startDate, to=endDate, auto.assign=FALSE)
GDX <- getSymbols("GDX", from=startDate, to=endDate,auto.assign=FALSE)
y <- merge(Ad(GDX), Ad(GLD))
eqn <- as.formula(paste(colnames(y), collapse=" ~ 0 + "))
m <- lm(eqn, data=y)
ur_df_0<-ur.df(m$residuals, type="none", lags=0) #now we run a standard Dickey Fuller test with 0 lags and no constant
summary(ur_df_0) # to see the test stat and p-value
###############################################
# Augmented Dickey-Fuller Test Unit Root Test #
###############################################

Test regression none


Call:
lm(formula = z.diff ~ z.lag.1 - 1)

Residuals:
     Min       1Q   Median       3Q      Max
-1.46514 -0.29108 -0.03333  0.25579  1.60938

Coefficients:
        Estimate Std. Error t value Pr(>|t|)   
z.lag.1 -0.01385    0.00370  -3.744 0.000193 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4456 on 839 degrees of freedom
Multiple R-squared:  0.01643,  Adjusted R-squared:  0.01526
F-statistic: 14.02 on 1 and 839 DF,  p-value: 0.0001935


Value of test-statistic is: -3.744

Critical values for test statistics:
      1pct  5pct 10pct

tau1 -2.58 -1.95 -1.62


#now the long way to do this – via an OLS regression of the first difference of residuals against lagged residuals (lag=1)
n<-lm(diff(m$residuals) ~ lag(m$residuals[-length(m$residuals)], k=1) +0)
summary(n) #to see full output
# or alternatively to see only t-value and p-value:
summary(n)$coef[,3] # to see t-value
-3.743962

summary(n)$coef[,4] # to see p-value

0.0001934995


We have both approaches producing the same result: value of the test statistics is -3.744 and p-value of 0.000193.

Now, the Augmented-Dickey Fuller:
adf<-ur.df(m$residuals, type="none", lags=1)
summary(adf) # to see the test stat and p-value
###############################################
# Augmented Dickey-Fuller Test Unit Root Test #
###############################################

Test regression none


Call:
lm(formula = z.diff ~ z.lag.1 - 1 + z.diff.lag)

Residuals:
     Min       1Q   Median       3Q      Max
-1.45579 -0.29098 -0.03797  0.24630  1.56707

Coefficients:
           Estimate Std. Error t value Pr(>|t|)   
z.lag.1    -0.01263    0.00370  -3.413 0.000673 ***
z.diff.lag -0.09048    0.03408  -2.655 0.008089 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4418 on 837 degrees of freedom
Multiple R-squared:  0.02144,  Adjusted R-squared:  0.01911
F-statistic: 9.171 on 2 and 837 DF,  p-value: 0.0001148


Value of test-statistic is: -3.4134

Critical values for test statistics:
      1pct  5pct 10pct

tau1 -2.58 -1.95 -1.62

#now the long way to do this – via an OLS regression of the first difference of residuals against lagged residuals (lag=1) and the lagged first difference of residuals. Complicated? Yep, a bit but here it is:
z.diff ~ z.lag.1 - 1 + z.diff.lag

n<-lm((diff(m$residuals))[-1] ~ lag(m$residuals[-length(m$residuals)], k=1)[-1] + lag(diff(m$residuals)[-length(diff(m$residuals))], k=1) +0)

summary(n) #to see full output

# or alternatively to see only t-value and p-value:
summary(n)$coef[,3] # to see t-value
       lag(m$residuals[-length(m$residuals)], k = 1)[-1]
                                                -3.413356
lag(diff(m$residuals)[-length(diff(m$residuals))], k = 1)
                                                -2.654686


summary(n)$coef[,4] # to see p-value
        lag(m$residuals[-length(m$residuals)], k = 1)[-1]
                                             0.0006725011
lag(diff(m$residuals)[-length(diff(m$residuals))], k = 1)

                                             0.0080889750 
The value of the test statistics is -3.413 (from ur.df) and the value of the test stat of the term of the lagged residuals from the linear regression is the same: -3.413356. (The same as the result of the ADF test of the 10 pairs).

It should also be noted, as Ernie Chan highlighted (http://epchan.blogspot.bg/2013/11/cointegration-trading-with-log-prices.html), the difference between price spreads and log price spreads. In the price spread the number of shares should be kept fixed, and short the spread when it is much higher than average (say 1.5 standard deviations above the historical average or some other number), and long the spread when it is much lower. For a stationary log price spread themarket values of stocks should be kept fixed, which means that at the end of every bar, we need to rebalance the shares due to price changes.