QuantX Research: May 2016

Tuesday 17 May 2016

Cointegration in R

Cointegration is one of the most appealing and most controversial analyses. It is appealing since it is the basis of the pair trading strategy. And is controversial since it has a property to break, sometimes for an extended period and additionally there are different techniques that from time to time produce conflicting results.

In a nutshell, the approach is:

(1) run an OLS linear regression (the coefficient beta of the regression is the hedge ratio)

(2) test the residuals for presence of unit root (via Augmented-Dickey-Fuller Unit Root Test). The residuals of the regression represent the spread. And the spread is in fact: Dependent variable – Hedge ratio * Independent variable. If we reject the Null Hypothesis of Unit Root we conclude that the spread is stationary and the two variables are mean-reverting. This also means that two variables are cointegrated.

An alternative technique to the ADF test of the residuals is the Johansen cointegration test (Trace and Eigenvalue test) applied to the analysed variables. However, in most of the practical guides I’ have read the ADF approach is preferred. I think because it is more intuitive and easier to follow than the Johansen test.

Looking at five ETFs: SPY (SPDR S&P 500), IYY (iShares Dow Jones), IWM (iShares Russel 2000), GLD (SPDR Gold Shares) and GDX (VanEck Vectors Gold Miners). The last two pairs are viewed somehow as a textbook example.

Codes and results are highlighted.

library(quantmod)

library(fUnitRoots) #for the ADF test

tickers <- c('GDX','GLD', 'IWM' , 'IYY','SPY')

startDate="2013-01-01"

endDate="2016-05-04"

f <- function(x) {

x1 <- getSymbols(x[1], from=startDate, to=endDate, auto.assign=FALSE)

x2 <- getSymbols(x[2], from=startDate, to=endDate,auto.assign=FALSE)

y <- merge(Ad(x1),Ad(x2))

eqn <- as.formula(paste(colnames(y), collapse=" ~ 0 + "))

m <- lm(eqn, data=y)

adf<-adfTest(m$residuals, type="nc")

cat("Period is from:", startDate, "to:", endDate, "\n", "Dependent variable and independent variable are:", colnames(y), "\n", "The value of the test statistics is:", adf@test$statistic, "\n")

cat("ADF p-value is", adf@test$p.value, "\n")

}

p <- combn(tickers, 2, FUN=f, simplify=FALSE)

Period is from: 2013-01-01 to: 2016-05-04

Dependent variable and independent variable are: GDX.Adjusted GLD.Adjusted

The value of the test statistics is: -3.413356

ADF p-value is 0.01

Period is from: 2013-01-01 to: 2016-05-04

Dependent variable and independent variable are: GDX.Adjusted IWM.Adjusted

The value of the test statistics is: -3.717138

ADF p-value is 0.01

Period is from: 2013-01-01 to: 2016-05-04

Dependent variable and independent variable are: GDX.Adjusted IYY.Adjusted

The value of the test statistics is: -3.647598

ADF p-value is 0.01

Period is from: 2013-01-01 to: 2016-05-04

Dependent variable and independent variable are: GDX.Adjusted SPY.Adjusted

The value of the test statistics is: -3.629596

ADF p-value is 0.01

Period is from: 2013-01-01 to: 2016-05-04

Dependent variable and independent variable are: GLD.Adjusted IWM.Adjusted

The value of the test statistics is: -3.073779

ADF p-value is 0.01

Period is from: 2013-01-01 to: 2016-05-04

Dependent variable and independent variable are: GLD.Adjusted IYY.Adjusted

The value of the test statistics is: -2.953128

ADF p-value is 0.01

Period is from: 2013-01-01 to: 2016-05-04

Dependent variable and independent variable are: GLD.Adjusted SPY.Adjusted

The value of the test statistics is: -2.91571

ADF p-value is 0.01

Period is from: 2013-01-01 to: 2016-05-04

Dependent variable and independent variable are: IWM.Adjusted IYY.Adjusted

The value of the test statistics is: -0.9656891

ADF p-value is 0.3085453

Period is from: 2013-01-01 to: 2016-05-04

Dependent variable and independent variable are: IWM.Adjusted SPY.Adjusted

The value of the test statistics is: -0.8500551

ADF p-value is 0.3454008

Period is from: 2013-01-01 to: 2016-05-04

Dependent variable and independent variable are: IYY.Adjusted SPY.Adjusted

The value of the test statistics is: -0.9760423

ADF p-value is 0.3052455

Looking at the values of test statistics of the ADF test if they are greater in absolute terms than the critical values we can reject the Null Hypothesis of unit root and conclude that the residuals are stationary (we also see the p-values that are smaller than 1%, 5% or 10% levels of significance). Hence, for our 10 pairs we can see there is cointegration between 7 pairs:

GDX and GLD

GDX and IWM

GDX and IYY

GDX and SPY

GLD and IWM

GLD and IYY

GLD and SPY

(Please note that the regression distinguishes between dependent and independent variable in the respective pair).

A way to look at the standard Dickey Fuller test (differ with the Augmented-Dickey Fuller test with the number of lags -> in the standard Dickey Fuller test the number of lags is 0) of the residuals from the regression is to run again an OLS regression of first difference of residuals on lagged residuals. Let’s play a bit step-by-step with this and see what we get in R:

For the example, we will use ur.df function in R to get the ADF and standard DF test values.

startDate="2013-01-01"

endDate="2016-05-04"

GLD <- getSymbols("GLD", from=startDate, to=endDate, auto.assign=FALSE)

GDX <- getSymbols("GDX", from=startDate, to=endDate,auto.assign=FALSE)

y <- merge(Ad(GDX), Ad(GLD))

eqn <- as.formula(paste(colnames(y), collapse=" ~ 0 + "))

m <- lm(eqn, data=y)

ur_df_0<-ur.df(m$residuals, type="none", lags=0) #now we run a standard Dickey Fuller test with 0 lags and no constant

summary(ur_df_0) # to see the test stat and p-value

###############################################

# Augmented Dickey-Fuller Test Unit Root Test #

###############################################

Test regression none

Call:

lm(formula = z.diff ~ z.lag.1 - 1)

Residuals:

Min 1Q Median 3Q Max

-1.46514 -0.29108 -0.03333 0.25579 1.60938

Coefficients:

Estimate Std. Error t value Pr(>|t|)

z.lag.1 -0.01385 0.00370 -3.744 0.000193 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4456 on 839 degrees of freedom

Multiple R-squared: 0.01643, Adjusted R-squared: 0.01526

F-statistic: 14.02 on 1 and 839 DF, p-value: 0.0001935

Value of test-statistic is: -3.744

Critical values for test statistics:

1pct 5pct 10pct

tau1 -2.58 -1.95 -1.62

#now the long way to do this – via an OLS regression of the first difference of residuals against lagged residuals (lag=1)

n<-lm(diff(m$residuals) ~ lag(m$residuals[-length(m$residuals)], k=1) +0)

summary(n) #to see full output

# or alternatively to see only t-value and p-value:

summary(n)$coef[,3] # to see t-value

-3.743962

summary(n)$coef[,4] # to see p-value

0.0001934995

We have both approaches producing the same result: value of the test statistics is -3.744 and p-value of 0.000193.

Now, the Augmented-Dickey Fuller:

adf<-ur.df(m$residuals, type="none", lags=1)

summary(adf) # to see the test stat and p-value

###############################################

# Augmented Dickey-Fuller Test Unit Root Test #

###############################################

Test regression none

Call:

lm(formula = z.diff ~ z.lag.1 - 1 + z.diff.lag)

Residuals:

Min 1Q Median 3Q Max

-1.45579 -0.29098 -0.03797 0.24630 1.56707

Coefficients:

Estimate Std. Error t value Pr(>|t|)

z.lag.1 -0.01263 0.00370 -3.413 0.000673 ***

z.diff.lag -0.09048 0.03408 -2.655 0.008089 **

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4418 on 837 degrees of freedom

Multiple R-squared: 0.02144, Adjusted R-squared: 0.01911

F-statistic: 9.171 on 2 and 837 DF, p-value: 0.0001148

Value of test-statistic is: -3.4134

Critical values for test statistics:

1pct 5pct 10pct

tau1 -2.58 -1.95 -1.62

#now the long way to do this – via an OLS regression of the first difference of residuals against lagged residuals (lag=1) and the lagged first difference of residuals. Complicated? Yep, a bit but here it is:

z.diff ~ z.lag.1 - 1 + z.diff.lag

n<-lm((diff(m$residuals))[-1] ~ lag(m$residuals[-length(m$residuals)], k=1)[-1] + lag(diff(m$residuals)[-length(diff(m$residuals))], k=1) +0)

summary(n) #to see full output

# or alternatively to see only t-value and p-value:

summary(n)$coef[,3] # to see t-value

lag(m$residuals[-length(m$residuals)], k = 1)[-1]

-3.413356

lag(diff(m$residuals)[-length(diff(m$residuals))], k = 1)

-2.654686

summary(n)$coef[,4] # to see p-value

lag(m$residuals[-length(m$residuals)], k = 1)[-1]

0.0006725011

lag(diff(m$residuals)[-length(diff(m$residuals))], k = 1)

0.0080889750

The value of the test statistics is -3.413 (from ur.df) and the value of the test stat of the term of the lagged residuals from the linear regression is the same: -3.413356. (The same as the result of the ADF test of the 10 pairs).

It should also be noted, as Ernie Chan highlighted (http://epchan.blogspot.bg/2013/11/cointegration-trading-with-log-prices.html), the difference between price spreads and log price spreads. In the price spread the number of shares should be kept fixed, and short the spread when it is much higher than average (say 1.5 standard deviations above the historical average or some other number), and long the spread when it is much lower. For a stationary log price spread themarket values of stocks should be kept fixed, which means that at the end of every bar, we need to rebalance the shares due to price changes.

Subscribe to: Posts (Atom)