Friday, 5 August 2016

Cointegration: ADF Test Critical Values


The second step in the Engle-Granger cointegration approach is to test whether the residuals from the cointegrating regression have a unit root via the ADF test. When we apply the ADF test to residuals (estimates) rather than to an actual time series, we cannot use the Dickey-Fuller critical values and p-values that are reported by standard software. The critical values for the ADF test on residuals are stricter than the original ones (that is, they are lower, so it is less likely that we reject the null hypothesis of a unit root). There are several sets of such critical values, for example Engle and Yoo (1987), MacKinnon (1991) and Phillips and Ouliaris (1990).
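To make the mechanics concrete, here is a minimal R sketch (on simulated data, not from the original post) of the two-step procedure, in which the ADF statistic from adfTest is compared against a residual-based critical value rather than against the p-value the function reports; the -3.37 threshold used below is the Phillips and Ouliaris 5% value for one explanatory variable with an intercept, discussed further down.

library(fUnitRoots) # for adfTest

set.seed(1)
x <- cumsum(rnorm(500))   # a random walk, i.e. an I(1) series
y <- 2*x + rnorm(500)     # cointegrated with x by construction

step1 <- lm(y ~ x)                            # step 1: cointegrating regression (with intercept)
adf <- adfTest(residuals(step1), type="nc")   # step 2: ADF test on the residuals, no constant
adf@test$statistic                            # compare this to a residual-based critical value...
adf@test$statistic < -3.37                    # ...e.g. the 5% Phillips-Ouliaris value; TRUE = reject unit root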

Phillips and Ouliaris critical values are available here: http://finpko.faculty.ku.edu/myssi/FIN938/Phillips%20%26%20Ouliaris_Asymp%20Props%20of%20Resid%20Based%20Tests%20for%20Coint_Econometrica_1990.pdf – Table IIa (no intercept, no trend), IIb (intercept, no trend) and IIc (both intercept and trend).
We can express the three regression cases formally as follows:
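One plausible reading of the three cases, in terms of the cointegrating regression (with $y_t$ the dependent variable, $x_t$ the explanatory variable, $t$ a linear trend and $u_t$ the residual), is:

(a) no intercept, no trend: $y_t = \beta x_t + u_t$
(b) intercept, no trend: $y_t = \alpha + \beta x_t + u_t$
(c) intercept and trend: $y_t = \alpha + \delta t + \beta x_t + u_t$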


From the Phillips and Ouliaris tables: with one explanatory variable and a constant only, the critical value at the 5% level of significance is -3.37 (regression case b); with both a constant and a trend, the 5% critical value is -3.8.

These are the Phillips and Ouliaris critical values. MacKinnon (http://qed.econ.queensu.ca/working_papers/papers/qed_wp_1227.pdf) approaches the problem differently: he estimates a response surface regression, and the function used to calculate the critical values is:
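The response surface has the general form (a sketch; $CV(p, T)$ is the critical value at significance level $p$ for sample size $T$, and the $\beta$ coefficients are tabulated for each combination of variables and deterministic terms):

$CV(p, T) = \beta_\infty + \beta_1 T^{-1} + \beta_2 T^{-2}$

(in some of the tables an additional cubic term $\beta_3 T^{-3}$ is included).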


Here is a table with the MacKinnon critical values corresponding to the second regression case (intercept, no trend). When comparing the critical values, note that in MacKinnon's approach N is the number of cointegrating variables, while in Phillips and Ouliaris N is the number of explanatory variables:


Based on the same procedure, I interpolated the critical values for the third case (intercept and trend):


From the table with MacKinnon critical values, at the 5% level of significance, with two variables, 200 observations and only a constant included in the ADF regression, the critical value is -3.368; with 500 observations, the critical value is -3.350.
If we have both a constant and a trend, at the 5% level of significance and 200 observations, the MacKinnon critical value is -3.828.
It is generally accepted that if an intercept is included in the cointegrating regression, it is omitted from the ADF test equation on the residuals.
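As an illustration only (this helper is mine, not MacKinnon's code), the finite-sample critical values above can be reproduced from the tabulated response-surface coefficients with a one-liner in R:

mackinnon_cv <- function(b_inf, b1, b2, n_obs) {
        # finite-sample critical value from the response-surface coefficients
        b_inf + b1/n_obs + b2/n_obs^2
}
# e.g. mackinnon_cv(b_inf, b1, b2, n_obs=200), with the coefficients taken from MacKinnon's
# tables for the chosen number of variables, deterministic terms and significance level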




Tuesday, 17 May 2016

Cointegration in R

Cointegration is one of the most appealing and, at the same time, most controversial analyses. It is appealing because it is the basis of the pairs trading strategy. It is controversial because the relationship has a tendency to break down, sometimes for an extended period, and because different techniques can produce conflicting results from time to time.
In a nutshell, the approach is:
(1)    run an OLS linear regression (the slope coefficient of the regression is the hedge ratio);
(2)    test the residuals for the presence of a unit root (via the Augmented Dickey-Fuller unit root test). The residuals of the regression represent the spread, and the spread is in fact: dependent variable – hedge ratio * independent variable (formalized just below the list). If we reject the null hypothesis of a unit root, we conclude that the spread is stationary and the two variables are mean-reverting. This also means that the two variables are cointegrated.
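Written out compactly (with $y_t$ and $x_t$ the two price series, $\beta$ the hedge ratio and $e_t$ the spread):

Step 1: $y_t = \beta x_t + e_t$
Step 2: ADF test of $H_0$: $e_t$ has a unit root; rejecting $H_0$ implies the spread $e_t = y_t - \beta x_t$ is stationary.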
An alternative to the ADF test on the residuals is the Johansen cointegration test (trace and eigenvalue tests) applied to the analysed variables. However, in most of the practical guides I have read, the ADF approach is preferred; I think because it is more intuitive and easier to follow than the Johansen test.


Let us look at five ETFs: SPY (SPDR S&P 500), IYY (iShares Dow Jones), IWM (iShares Russell 2000), GLD (SPDR Gold Shares) and GDX (VanEck Vectors Gold Miners). The last two are often viewed as something of a textbook example of a cointegrated pair.
Code and results are highlighted.
library(quantmod) 
library(fUnitRoots) #for the ADF test

tickers <- c('GDX','GLD', 'IWM' , 'IYY','SPY')

startDate="2013-01-01"
endDate="2016-05-04"

f <- function(x) {
        x1 <- getSymbols(x[1], from=startDate, to=endDate, auto.assign=FALSE)
        x2 <- getSymbols(x[2], from=startDate, to=endDate, auto.assign=FALSE)
        y <- merge(Ad(x1), Ad(x2))                                # adjusted closes of the pair
        eqn <- as.formula(paste(colnames(y), collapse=" ~ 0 + ")) # e.g. GDX.Adjusted ~ 0 + GLD.Adjusted (no intercept)
        m <- lm(eqn, data=y)                                      # step 1: cointegrating regression; the slope is the hedge ratio
        adf <- adfTest(m$residuals, type="nc")                    # step 2: ADF test on the residuals, no constant
        cat("Period is from:", startDate, "to:", endDate, "\n", "Dependent variable and independent variable are:", colnames(y), "\n", "The value of the test statistics is:", adf@test$statistic, "\n")
        cat("ADF p-value is", adf@test$p.value, "\n")
}

p <- combn(tickers, 2, FUN=f, simplify=FALSE)

Period is from: 2013-01-01 to: 2016-05-04
 Dependent variable and independent variable are: GDX.Adjusted GLD.Adjusted
 The value of the test statistics is: -3.413356
ADF p-value is 0.01

Period is from: 2013-01-01 to: 2016-05-04
 Dependent variable and independent variable are: GDX.Adjusted IWM.Adjusted
 The value of the test statistics is: -3.717138
ADF p-value is 0.01

Period is from: 2013-01-01 to: 2016-05-04
 Dependent variable and independent variable are: GDX.Adjusted IYY.Adjusted
 The value of the test statistics is: -3.647598
ADF p-value is 0.01

Period is from: 2013-01-01 to: 2016-05-04
 Dependent variable and independent variable are: GDX.Adjusted SPY.Adjusted
 The value of the test statistics is: -3.629596
ADF p-value is 0.01

Period is from: 2013-01-01 to: 2016-05-04
 Dependent variable and independent variable are: GLD.Adjusted IWM.Adjusted
 The value of the test statistics is: -3.073779
ADF p-value is 0.01

Period is from: 2013-01-01 to: 2016-05-04
 Dependent variable and independent variable are: GLD.Adjusted IYY.Adjusted
 The value of the test statistics is: -2.953128
ADF p-value is 0.01

Period is from: 2013-01-01 to: 2016-05-04
 Dependent variable and independent variable are: GLD.Adjusted SPY.Adjusted
 The value of the test statistics is: -2.91571
ADF p-value is 0.01

Period is from: 2013-01-01 to: 2016-05-04
 Dependent variable and independent variable are: IWM.Adjusted IYY.Adjusted
 The value of the test statistics is: -0.9656891
ADF p-value is 0.3085453

Period is from: 2013-01-01 to: 2016-05-04
 Dependent variable and independent variable are: IWM.Adjusted SPY.Adjusted
 The value of the test statistics is: -0.8500551
ADF p-value is 0.3454008

Period is from: 2013-01-01 to: 2016-05-04
 Dependent variable and independent variable are: IYY.Adjusted SPY.Adjusted
 The value of the test statistics is: -0.9760423
ADF p-value is 0.3052455

Looking at the values of the ADF test statistics: if they are greater in absolute terms than the critical values, we can reject the null hypothesis of a unit root and conclude that the residuals are stationary (we also see that the reported p-values are smaller than the 1%, 5% and 10% levels of significance). Hence, out of our 10 pairs, we see cointegration for 7 pairs:

GDX and GLD
GDX and IWM
GDX and IYY
GDX and SPY
GLD and IWM
GLD and IYY
GLD and SPY
(Please note that the regression distinguishes between the dependent and the independent variable in each pair.)

A way to look at the standard Dickey-Fuller test of the residuals from the regression (it differs from the Augmented Dickey-Fuller test only in the number of lags: in the standard Dickey-Fuller test the number of lags is 0) is to run an OLS regression of the first difference of the residuals on the lagged residuals. Let's play with this step by step and see what we get in R:
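In equation form, the regression behind the standard DF test on the residuals $e_t$ is the following, and the test statistic is simply the t-ratio on $\rho$:

$\Delta e_t = \rho\, e_{t-1} + \varepsilon_t$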

For the example, we will use the ur.df function (from the urca package) to get the ADF and standard DF test values.

library(urca) # for ur.df (quantmod is assumed to be loaded already for getSymbols and Ad)

startDate="2013-01-01"
endDate="2016-05-04"

GLD <- getSymbols("GLD", from=startDate, to=endDate, auto.assign=FALSE)
GDX <- getSymbols("GDX", from=startDate, to=endDate, auto.assign=FALSE)
y <- merge(Ad(GDX), Ad(GLD))
eqn <- as.formula(paste(colnames(y), collapse=" ~ 0 + ")) # GDX.Adjusted ~ 0 + GLD.Adjusted
m <- lm(eqn, data=y)
ur_df_0 <- ur.df(m$residuals, type="none", lags=0) #now we run a standard Dickey-Fuller test with 0 lags and no constant
summary(ur_df_0) # to see the test stat and p-value
###############################################
# Augmented Dickey-Fuller Test Unit Root Test #
###############################################

Test regression none


Call:
lm(formula = z.diff ~ z.lag.1 - 1)

Residuals:
     Min       1Q   Median       3Q      Max
-1.46514 -0.29108 -0.03333  0.25579  1.60938

Coefficients:
        Estimate Std. Error t value Pr(>|t|)   
z.lag.1 -0.01385    0.00370  -3.744 0.000193 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4456 on 839 degrees of freedom
Multiple R-squared:  0.01643,  Adjusted R-squared:  0.01526
F-statistic: 14.02 on 1 and 839 DF,  p-value: 0.0001935


Value of test-statistic is: -3.744

Critical values for test statistics:
      1pct  5pct 10pct

tau1 -2.58 -1.95 -1.62


#now the long way to do this: an OLS regression of the first difference of the residuals on the lagged residuals (lag=1)
n<-lm(diff(m$residuals) ~ lag(m$residuals[-length(m$residuals)], k=1) +0)
summary(n) #to see full output
# or alternatively to see only t-value and p-value:
summary(n)$coef[,3] # to see t-value
-3.743962

summary(n)$coef[,4] # to see p-value

0.0001934995


Both approaches produce the same result: the value of the test statistic is -3.744 with a p-value of 0.000193.

Now, the Augmented Dickey-Fuller test (here with one lagged difference):
adf<-ur.df(m$residuals, type="none", lags=1)
summary(adf) # to see the test stat and p-value
###############################################
# Augmented Dickey-Fuller Test Unit Root Test #
###############################################

Test regression none


Call:
lm(formula = z.diff ~ z.lag.1 - 1 + z.diff.lag)

Residuals:
     Min       1Q   Median       3Q      Max
-1.45579 -0.29098 -0.03797  0.24630  1.56707

Coefficients:
           Estimate Std. Error t value Pr(>|t|)   
z.lag.1    -0.01263    0.00370  -3.413 0.000673 ***
z.diff.lag -0.09048    0.03408  -2.655 0.008089 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4418 on 837 degrees of freedom
Multiple R-squared:  0.02144,  Adjusted R-squared:  0.01911
F-statistic: 9.171 on 2 and 837 DF,  p-value: 0.0001148


Value of test-statistic is: -3.4134

Critical values for test statistics:
      1pct  5pct 10pct

tau1 -2.58 -1.95 -1.62

#now the long way to do this: an OLS regression of the first difference of the residuals on the lagged residuals (lag=1) and on the lagged first difference of the residuals. Complicated? Yep, a bit, but here it is:
z.diff ~ z.lag.1 - 1 + z.diff.lag
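In other words, the formula above corresponds to the regression below (z.lag.1 is $e_{t-1}$, z.diff.lag is $\Delta e_{t-1}$, and the ADF statistic is again the t-ratio on $\rho$):

$\Delta e_t = \rho\, e_{t-1} + \gamma\, \Delta e_{t-1} + \varepsilon_t$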

n<-lm((diff(m$residuals))[-1] ~ lag(m$residuals[-length(m$residuals)], k=1)[-1] + lag(diff(m$residuals)[-length(diff(m$residuals))], k=1) +0)

summary(n) #to see full output

# or alternatively to see only t-value and p-value:
summary(n)$coef[,3] # to see t-value
       lag(m$residuals[-length(m$residuals)], k = 1)[-1]
                                                -3.413356
lag(diff(m$residuals)[-length(diff(m$residuals))], k = 1)
                                                -2.654686


summary(n)$coef[,4] # to see p-value
        lag(m$residuals[-length(m$residuals)], k = 1)[-1]
                                             0.0006725011
lag(diff(m$residuals)[-length(diff(m$residuals))], k = 1)

                                             0.0080889750 
The value of the test statistic from ur.df is -3.4134, and the t-statistic on the lagged-residuals term from the linear regression is the same: -3.413356 (this also matches the GDX/GLD result from the ADF test over the 10 pairs above).

It should also be noted, as Ernie Chan highlighted (http://epchan.blogspot.bg/2013/11/cointegration-trading-with-log-prices.html), that there is a difference between price spreads and log price spreads. For a price spread the number of shares is kept fixed: we short the spread when it is much higher than average (say, 1.5 standard deviations above the historical average or some other threshold) and go long when it is much lower. For a stationary log price spread, the market values of the stocks should be kept fixed, which means that at the end of every bar we need to rebalance the shares due to price changes.
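In symbols (my notation, not Chan's): with prices $P^A_t$, $P^B_t$ and hedge ratio $h$, the price spread is $P^A_t - h\, P^B_t$, where $h$ is a fixed number of shares, while the log-price spread is $\log P^A_t - h\, \log P^B_t$, where keeping $h$ fixed means keeping the ratio of market values fixed and therefore rebalancing the share counts at the end of every bar.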

Thursday, 24 March 2016

Romanian Equities: Python-based Risk Optimization

I have always wanted to apply the risk optimization techniques previously mentioned on this blog to a market that is not extensively researched from a risk perspective: the Romanian stock market, for instance. Now that I have some time (and hopefully I will have time to follow the development of the model portfolio), I am providing the results. Previous posts on this blog have already outlined the steps and the underlying idea of risk optimization.

So, directly to the implementation. I use a short period of time: 21 January 2015 to 21 March 2016 (I usually tend to work with a longer period, but for this post I decided to narrow the sample). Raw stock prices are adjusted for stock dividends and also for cash dividends (i.e. prices before the ex-dividend days are adjusted), so I avoid breaks in the price series data.

The portfolio amount is RON 1.0 mln. I work with a 99% confidence interval.




Results from optimization:

One-day 99% VaR:
Portfolio VaR (equal weighted): RON20356.59 
Optimal VaR (optimized weights): RON18146.25
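The numbers above follow the standard parametric (delta-normal) VaR decomposition implemented in the Python code further below; with $w$ the vector of weights, $\Sigma$ the covariance matrix of daily returns, $V$ the portfolio value and $z$ the standard normal quantile at the chosen confidence level (norm.ppf(CI) in the code):

$\sigma_p = \sqrt{w^{\top} \Sigma w}$, Portfolio VaR $= z\,\sigma_p\,V$, Individual VaR$_i$ $= z\,\sigma_i\,w_i\,V$
$\beta_i = (\Sigma w)_i / \sigma_p^2$, Marginal VaR$_i$ $= z\,(\Sigma w)_i / \sigma_p$, Component VaR$_i$ $= w_i\,\beta_i\,z\,\sigma_p\,V$

The optimization at the end minimizes $z\,V\sqrt{w^{\top} \Sigma w}$ subject to $\sum_i w_i = 1$ and $0 \le w_i \le 1$.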


First, I standardize the daily returns (by subtracting the mean from each daily return and then dividing by the standard deviation) to obtain the number of standard deviations from the mean, i.e. to show how “unusual” the daily returns are during the analysed period.
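In symbols, for each stock's daily return series $r_t$ with sample mean $\bar{r}$ and standard deviation $\hat{\sigma}$, the standardized return is $z_t = (r_t - \bar{r}) / \hat{\sigma}$.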


In terms of Python code:

import pandas as pd
import numpy as np
from numpy import *
import matplotlib.pyplot as plt
import scipy.optimize as scopt
from scipy.stats import norm

%matplotlib inline

Value=1e6 # RON1,000,000
CI=0.99 # set the confidence interval

data_all=pd.read_csv('RO.csv', delimiter=',') #the file also contains a trading-day column
data=data_all.iloc[:,1:14] #exclude the day column before calculating returns (.iloc replaces the deprecated .ix)

tickers =['SNG', 'SNP', 'TGN', 'TEL', 'BRD', 'TLV', 'FP', 'EL', 'SIF1', 'SIF2', 'SIF3', 'SIF4', 'SIF5']
numbers=len(tickers)

ret=data/data.shift(1)-1 # calculate the simple returns
ret=ret[1:] # remove the first row since it contains NA

covariances=ret.cov() #gives the covariance of returns
variances=np.diag(covariances) #extracts variances of the individual shares from covariance matrix
volatility=np.sqrt(variances) #gives standard deviation
annualized_ret=ret.mean()*252
annualized_vol=ret.std()*sqrt(252) #annualized volatility

weights = np.ones(data.columns.size)/data.columns.size #starting point is equal weighted portfolio
weights

# 1. Calculate the standardized returns
standardized=(ret-mean(ret))/std(ret) #standardized returns
day=pd.DataFrame(data_all['Day'][1:]) #takes the trading day from the database
data2=pd.DataFrame.join(day, standardized)

# We can plot the standardized returns, for instance here is the chart for Romgaz (SNG):
data2.plot('Day','SNG')
plt.xlabel('Day')
plt.title('Romgaz')
plt.xticks(rotation=20)
plt.ylabel('number of standard deviations from average')


# 2. Now heading for the real exercise
Pf_variance=np.dot(weights.T,np.dot(ret.cov(),weights)) #Portfolio variance
Pf_volatility=np.sqrt(Pf_variance) #Portfolio standard deviation

Moneyvariance=np.square(Value)*Pf_variance
Moneyvolatility=np.sqrt(Moneyvariance)

covariance_asset_portfolio=np.dot(weights.T,covariances)
cov_money=np.multiply(covariance_asset_portfolio,Value)
beta=np.divide(covariance_asset_portfolio,Pf_variance)

def VaR():
    # this code calculates Portfolio Value-at-risk (Pf_VaR) in RON-terms and Individual Value-at-risk (IndividualVaR) of shares in portfolio.
    Pf_VaR=norm.ppf(CI)*Moneyvolatility
    IndividualVaR=np.multiply(volatility,Value*weights)*norm.ppf(CI)
    IndividualVaR = [ 'RON%.2f' % elem for elem in IndividualVaR ]
    print ('Portfolio VaR: ', 'RON%0.2f' %Pf_VaR)
    print ('Individual VaR: ',[[tickers[i], IndividualVaR[i]] for i in range (min(len(tickers), len(IndividualVaR)))])

VaR()


def marginal_component_VaR():
     # this code calculates Marginal Value-at-risk in RON-terms and Component Value-at-risk of shares in portfolio.
    marginalVaR=np.divide(cov_money,Moneyvolatility)*norm.ppf(CI)
    componentVaR=np.multiply(weights,beta)*Moneyvolatility*norm.ppf(CI)
    marginalVaR = [ '%.3f' % elem for elem in marginalVaR ]
    componentVaR=[ 'RON%.2f' % elem for elem in componentVaR ]
    print ('Marginal VaR:', [[tickers[i], marginalVaR[i]] for i in range (min(len(tickers), len(marginalVaR)))])
    print ('Component VaR: ', [[tickers[i], componentVaR[i]] for i in range (min(len(tickers), len(componentVaR)))])

marginal_component_VaR()


# redefine VaR as a function of the weights so it can be passed to the optimizer
def VaR(weights):
    return norm.ppf(CI)*np.sqrt(np.square(Value)*(np.dot(weights.T,np.dot(ret.cov(),weights))))


def minVaR(VaR):
    nstocks = data.columns.size
    bounds = [(0.,1.) for i in np.arange(nstocks)] #non-negative weights
    constraints = ({'type': 'eq',
                        'fun': lambda weights: np.sum(weights) - 1}) #sum of weights to equal 100%
    results = scopt.minimize(VaR, weights,
                                 method='SLSQP',
                                 constraints = constraints,
                                 bounds = bounds)
    result_weights=np.round(results.x, 4)
    result_weights= [ '%.8f' % elem for elem in result_weights ]
    new_VaR=VaR(results['x'])
             
    print ('Optimal VaR: ', new_VaR)
    print ('Optimal Weights: ', [[tickers[i], result_weights[i]] for i in range (len(tickers))])

minVaR(VaR)
beta

This publication is for information purposes only and should not be construed as a solicitation or an offer to buy or sell any securities.