Tuesday, 17 May 2016

Cointegration in R

Cointegration is one of the most appealing and most controversial analyses. It is appealing since it is the basis of the pair trading strategy. And is controversial since it has a property to break, sometimes for an extended period and additionally there are different techniques that from time to time produce conflicting results.
In a nutshell, the approach is:
(1)    run an OLS linear regression (the coefficient beta of the regression is the hedge ratio)
(2)    test the residuals for presence of unit root (via Augmented-Dickey-Fuller Unit Root Test). The residuals of the regression represent the spread. And the spread is in fact: Dependent variable – Hedge ratio * Independent variable. If we reject the Null Hypothesis of Unit Root we conclude that the spread is stationary and the two variables are mean-reverting. This also means that two variables are cointegrated.
An alternative technique to the ADF test of the residuals is the Johansen cointegration test (Trace and Eigenvalue test) applied to the analysed variables. However, in most of the practical guides I’ have read the ADF approach is preferred. I think because it is more intuitive and easier to follow than the Johansen test.


Looking at five ETFs: SPY (SPDR S&P 500), IYY (iShares Dow Jones), IWM (iShares Russel 2000), GLD (SPDR Gold Shares) and GDX (VanEck Vectors Gold Miners). The last two pairs are viewed somehow as a textbook example.
Codes and results are highlighted.
library(quantmod) 
library(fUnitRoots) #for the ADF test

tickers <- c('GDX','GLD', 'IWM' , 'IYY','SPY')

startDate="2013-01-01"
endDate="2016-05-04"

f <- function(x) {
        x1 <- getSymbols(x[1], from=startDate, to=endDate, auto.assign=FALSE)
        x2 <- getSymbols(x[2], from=startDate, to=endDate,auto.assign=FALSE)
        y <- merge(Ad(x1),Ad(x2))
        eqn <- as.formula(paste(colnames(y), collapse=" ~ 0 + "))
        m <- lm(eqn, data=y)
        adf<-adfTest(m$residuals, type="nc")
        cat("Period is from:", startDate, "to:", endDate, "\n", "Dependent variable and independent variable are:", colnames(y), "\n", "The value of the test statistics is:", adf@test$statistic, "\n")
        cat("ADF p-value is", adf@test$p.value, "\n")
}

p <- combn(tickers, 2, FUN=f, simplify=FALSE)

Period is from: 2013-01-01 to: 2016-05-04
 Dependent variable and independent variable are: GDX.Adjusted GLD.Adjusted
 The value of the test statistics is: -3.413356
ADF p-value is 0.01

Period is from: 2013-01-01 to: 2016-05-04
 Dependent variable and independent variable are: GDX.Adjusted IWM.Adjusted
 The value of the test statistics is: -3.717138
ADF p-value is 0.01

Period is from: 2013-01-01 to: 2016-05-04
 Dependent variable and independent variable are: GDX.Adjusted IYY.Adjusted
 The value of the test statistics is: -3.647598
ADF p-value is 0.01

Period is from: 2013-01-01 to: 2016-05-04
 Dependent variable and independent variable are: GDX.Adjusted SPY.Adjusted
 The value of the test statistics is: -3.629596
ADF p-value is 0.01

Period is from: 2013-01-01 to: 2016-05-04
 Dependent variable and independent variable are: GLD.Adjusted IWM.Adjusted
 The value of the test statistics is: -3.073779
ADF p-value is 0.01

Period is from: 2013-01-01 to: 2016-05-04
 Dependent variable and independent variable are: GLD.Adjusted IYY.Adjusted
 The value of the test statistics is: -2.953128
ADF p-value is 0.01

Period is from: 2013-01-01 to: 2016-05-04
 Dependent variable and independent variable are: GLD.Adjusted SPY.Adjusted
 The value of the test statistics is: -2.91571
ADF p-value is 0.01

Period is from: 2013-01-01 to: 2016-05-04
 Dependent variable and independent variable are: IWM.Adjusted IYY.Adjusted
 The value of the test statistics is: -0.9656891
ADF p-value is 0.3085453

Period is from: 2013-01-01 to: 2016-05-04
 Dependent variable and independent variable are: IWM.Adjusted SPY.Adjusted
 The value of the test statistics is: -0.8500551
ADF p-value is 0.3454008

Period is from: 2013-01-01 to: 2016-05-04
 Dependent variable and independent variable are: IYY.Adjusted SPY.Adjusted
 The value of the test statistics is: -0.9760423
ADF p-value is 0.3052455

Looking at the values of test statistics of the ADF test if they are greater in absolute terms than the critical values we can reject the Null Hypothesis of unit root and conclude that the residuals are stationary (we also see the p-values that are smaller than 1%, 5% or 10% levels of significance). Hence, for our 10 pairs we can see there is cointegration between 7 pairs:

GDX and GLD
GDX and IWM
GDX and IYY
GDX and SPY
GLD and IWM
GLD and IYY
GLD and SPY
(Please note that the regression distinguishes between dependent and independent variable in the respective pair).

A way to look at the standard Dickey Fuller test  (differ with the Augmented-Dickey Fuller test with the number of lags -> in the standard Dickey Fuller test the  number of lags is 0) of the residuals from the regression is to run again an OLS regression of first difference of residuals on lagged residuals. Let’s play a bit step-by-step with this and see what we get in R:

For the example, we will use ur.df function in R to get the ADF and standard DF test values.

startDate="2013-01-01"
endDate="2016-05-04"

GLD <- getSymbols("GLD", from=startDate, to=endDate, auto.assign=FALSE)
GDX <- getSymbols("GDX", from=startDate, to=endDate,auto.assign=FALSE)
y <- merge(Ad(GDX), Ad(GLD))
eqn <- as.formula(paste(colnames(y), collapse=" ~ 0 + "))
m <- lm(eqn, data=y)
ur_df_0<-ur.df(m$residuals, type="none", lags=0) #now we run a standard Dickey Fuller test with 0 lags and no constant
summary(ur_df_0) # to see the test stat and p-value
###############################################
# Augmented Dickey-Fuller Test Unit Root Test #
###############################################

Test regression none


Call:
lm(formula = z.diff ~ z.lag.1 - 1)

Residuals:
     Min       1Q   Median       3Q      Max
-1.46514 -0.29108 -0.03333  0.25579  1.60938

Coefficients:
        Estimate Std. Error t value Pr(>|t|)   
z.lag.1 -0.01385    0.00370  -3.744 0.000193 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4456 on 839 degrees of freedom
Multiple R-squared:  0.01643,  Adjusted R-squared:  0.01526
F-statistic: 14.02 on 1 and 839 DF,  p-value: 0.0001935


Value of test-statistic is: -3.744

Critical values for test statistics:
      1pct  5pct 10pct

tau1 -2.58 -1.95 -1.62


#now the long way to do this – via an OLS regression of the first difference of residuals against lagged residuals (lag=1)
n<-lm(diff(m$residuals) ~ lag(m$residuals[-length(m$residuals)], k=1) +0)
summary(n) #to see full output
# or alternatively to see only t-value and p-value:
summary(n)$coef[,3] # to see t-value
-3.743962

summary(n)$coef[,4] # to see p-value

0.0001934995


We have both approaches producing the same result: value of the test statistics is -3.744 and p-value of 0.000193.

Now, the Augmented-Dickey Fuller:
adf<-ur.df(m$residuals, type="none", lags=1)
summary(adf) # to see the test stat and p-value
###############################################
# Augmented Dickey-Fuller Test Unit Root Test #
###############################################

Test regression none


Call:
lm(formula = z.diff ~ z.lag.1 - 1 + z.diff.lag)

Residuals:
     Min       1Q   Median       3Q      Max
-1.45579 -0.29098 -0.03797  0.24630  1.56707

Coefficients:
           Estimate Std. Error t value Pr(>|t|)   
z.lag.1    -0.01263    0.00370  -3.413 0.000673 ***
z.diff.lag -0.09048    0.03408  -2.655 0.008089 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4418 on 837 degrees of freedom
Multiple R-squared:  0.02144,  Adjusted R-squared:  0.01911
F-statistic: 9.171 on 2 and 837 DF,  p-value: 0.0001148


Value of test-statistic is: -3.4134

Critical values for test statistics:
      1pct  5pct 10pct

tau1 -2.58 -1.95 -1.62

#now the long way to do this – via an OLS regression of the first difference of residuals against lagged residuals (lag=1) and the lagged first difference of residuals. Complicated? Yep, a bit but here it is:
z.diff ~ z.lag.1 - 1 + z.diff.lag

n<-lm((diff(m$residuals))[-1] ~ lag(m$residuals[-length(m$residuals)], k=1)[-1] + lag(diff(m$residuals)[-length(diff(m$residuals))], k=1) +0)

summary(n) #to see full output

# or alternatively to see only t-value and p-value:
summary(n)$coef[,3] # to see t-value
       lag(m$residuals[-length(m$residuals)], k = 1)[-1]
                                                -3.413356
lag(diff(m$residuals)[-length(diff(m$residuals))], k = 1)
                                                -2.654686


summary(n)$coef[,4] # to see p-value
        lag(m$residuals[-length(m$residuals)], k = 1)[-1]
                                             0.0006725011
lag(diff(m$residuals)[-length(diff(m$residuals))], k = 1)

                                             0.0080889750 
The value of the test statistics is -3.413 (from ur.df) and the value of the test stat of the term of the lagged residuals from the linear regression is the same: -3.413356. (The same as the result of the ADF test of the 10 pairs).

It should also be noted, as Ernie Chan highlighted (http://epchan.blogspot.bg/2013/11/cointegration-trading-with-log-prices.html), the difference between price spreads and log price spreads. In the price spread the number of shares should be kept fixed, and short the spread when it is much higher than average (say 1.5 standard deviations above the historical average or some other number), and long the spread when it is much lower. For a stationary log price spread themarket values of stocks should be kept fixed, which means that at the end of every bar, we need to rebalance the shares due to price changes.

Thursday, 24 March 2016

Romanian Equities: Python-based Risk Optimization

I always wanted to apply the previously mentioned in this blog risk optimization techniques to a market that is not extensively researched from risk perspective. Romanian stock market, for instance. Now, that I have some time (and hopefully I will have time to follow the development of the model portfolio) I am providing the results. Previous posts on this blog have already outlined steps and underlying idea about risk optimization.

So, directly to implementation: I use a short period of time: Jan 21, 2015 – March 21, 2016 (I usually tend to work with longer period but for this post I decided to narrow the sample). Raw stock prices are updated for stock dividends but also for cash dividends (i.e. prices before the ex-dividend days are adjusted) so I am avoiding breaks in price series data.

The portfolio amount is RON1.0 mln. I work with 99% confidence interval.




Results from optimization:

One-day 99% VaR:
Portfolio VaR (equal weighted): RON20356.59 
Optimal VaR (optimized weights): RON18146.25


First, I am standardizing daily returns (by subtracting the mean from the daily return and then normalizing the result by the standard deviation) to yield number of standard deviations from the mean, i.e. showing how “unusual” are the daily returns during the analysed period. 


In terms of Python code:

import pandas as pd
import numpy as np
from numpy import *
import matplotlib.pyplot as plt
import scipy.optimize as scopt
from scipy.stats import norm

%matplotlib inline

Value=1e6 # RON1,000,000
CI=0.99 # set the confidence interval

data_all=pd.read_csv('RO.csv', delimiter=',') #database contains also a trading day
data=data_all.ix[:,1:14] #here I exclude the day to calculate returns

tickers =['SNG', 'SNP', 'TGN', 'TEL', 'BRD', 'TLV', 'FP', 'EL', 'SIF1', 'SIF2', 'SIF3', 'SIF4', 'SIF5']
numbers=len(tickers)

ret=data/data.shift(1)-1 # calculate the simple returns
ret=ret[1:] # remove first row since it contanins NA

covariances=ret.cov() #gives the covariance of returns
variances=np.diag(covariances) #extracts variances of the individual shares from covariance matrix
volatility=np.sqrt(variances) #gives standard deviation
annualized_ret=ret.mean()*252
annualized_vol=ret.std()*sqrt(252) #anualized volatility

weights = np.ones(data.columns.size)/data.columns.size #starting point is equal weighted portfolio
weights

# 1. Calculate the standardized returns
standardized=(ret-mean(ret))/std(ret) #standardized returns
day=pd.DataFrame(data_all['Day'][1:]) #takes the trading day from the database
data2=pd.DataFrame.join(day, standardized)

# We can plot the standardized returns, for instance here is the chart for Romgaz (SNG):
 data2.plot('Day','SNG')
plt.xlabel('Day')
plt.title('Romgaz')
plt.xticks(rotation=20)
plt.ylabel('number of standard deviations from average')


# 2. Now heading for the real exercise
Pf_variance=np.dot(weights.T,np.dot(ret.cov(),weights)) #Portfolio variance
Pf_volatility=np.sqrt(Pf_variance) #Portfolio standard deviation

Moneyvariance=np.square(Value)*Pf_variance
Moneyvolatility=np.sqrt(Moneyvariance)

covariance_asset_portfolio=np.dot(weights.T,covariances)
cov_money=np.multiply(covariance_asset_portfolio,Value)
beta=np.divide(covariance_asset_portfolio,Pf_variance)

def VaR():
    # this code calculates Portfolio Value-at-risk (Pf_VaR) in RON-terms and Individual Value-at-risk (IndividualVaR) of shares in portfolio.
    Pf_VaR=norm.ppf(CI)*Moneyvolatility
    IndividualVaR=np.multiply(volatility,Value*weights)*norm.ppf(CI)
    IndividualVaR = [ 'RON%.2f' % elem for elem in IndividualVaR ]
    print ('Portfolio VaR: ', 'RON%0.2f' %Pf_VaR)
    print ('Individual VaR: ',[[tickers[i], IndividualVaR[i]] for i in range (min(len(tickers), len(IndividualVaR)))])

VaR()


def marginal_component_VaR():
     # this code calculates Marginal Value-at-risk in RON-terms and Component Value-at-risk of shares in portfolio.
    marginalVaR=np.divide(cov_money,Moneyvolatility)*norm.ppf(CI)
    componentVaR=np.multiply(weights,beta)*Moneyvolatility*norm.ppf(CI)
    marginalVaR = [ '%.3f' % elem for elem in marginalVaR ]
    componentVaR=[ 'RON%.2f' % elem for elem in componentVaR ]
    print ('Marginal VaR:', [[tickers[i], marginalVaR[i]] for i in range (min(len(tickers), len(marginalVaR)))])
    print ('Component VaR: ', [[tickers[i], componentVaR[i]] for i in range (min(len(tickers), len(componentVaR)))])

marginal_component_VaR()


def VaR(weights):
    return norm.ppf(CI)*np.sqrt(np.square(Value)*(np.dot(weights.T,np.dot(ret.cov(),weights))))


def minVaR(VaR):
    nstocks = data.columns.size
    bounds = [(0.,1.) for i in np.arange(nstocks)] #non-negative weights
    constraints = ({'type': 'eq',
                        'fun': lambda weights: np.sum(weights) - 1}) #sum of weights to equal 100%
    results = scopt.minimize(VaR, weights,
                                 method='SLSQP',
                                 constraints = constraints,
                                 bounds = bounds)
    result_weights=np.round(results.x, 4)
    result_weights= [ '%.8f' % elem for elem in result_weights ]
    new_VaR=VaR(results['x'])
             
    print ('Optimal VaR: ', new_VaR)
    print ('Optimal Weights: ', [[tickers[i], result_weights[i]] for i in range (len(tickers))])

minVaR(VaR)
beta

This publication is for information purposes only and should not be construed as a solicitation or an offer to buy or sell any securities.

Thursday, 31 December 2015

Itô's Lemma Applied in Finance

It would not be exaggerated to say that Itô's Lemma is one of the building blocks in the stochastic analysis. Itô's Lemma is essentially the chain rule for stochastic functions.  The lemma is an important part in valuing derivatives since a derivative is a function of the price of the underlying and time. Changes in a variables, for example change in stock price, involve a deterministic component which is a function of time and a stochastic component.

For instance:

x is the stock price at time t
dx is the change in x over the interval of time dt.
dz is the change in the random variable z over this interval of time is dz (stated briefly, dz is a Wiener process).

The change in stock price is:
dx=adt+bdz,
where a and b are functions of x. The variable x has a drift a and standard deviation b.

Under the Itô's Lemma the function F of x and t follows the process (we denote here as partial derivative; according to Wikipedia partial derivatice  of a function of several variables is its derivative with respect to one of those variables, with the pthers held constant, as opposed to the total derivative, in which all variables are allowed to vary)
:
Looking at perspective, applying the Itô's Lemma in valuing a forward contract on shares:

So we have:













This is the last post for 2015. I am wishing you an Ispiring New Year!

Thursday, 10 December 2015

Data Cleaning: Minimum Covariance Determinant and Winsorization

“The real voyage of discovery consists not in seeking new landscapes, but in having new eyes.” Marcel Proust


Robust statistics aims at identifying the core of the data. Put in a more detailed way – “the general principle of robust statistical estimation is to give full weights to observations assumed to come from the main body of the data, but to reduce or completely eliminate weights for the observations from tails of the contaminated data. Treating extreme values (outliers) is very important and requires testing different strategies. Below is just one approach how to deal with outliers.
The two parameters that are cornerstone are location vector and scatter matrix. In a univariate setting, the median is the well-known parameter for the location, for the scale we have for instance two – interquartile range and MAD. For a multivariate setting the situation with identifying and treating outliers gets a bit complicated. So, (eventually) the minimum covariance determinant (MCD) method, introduced by Rousseeuw in 1985, solves the issue.
Basically, there are three important steps in cleaning the data from outliers (using the Boudt and Peterson approach):
(1)    Find localtion and scatter via minimum covariance determinant (MCD) method;
(2)    Use the estimated location and scatter in step 1 to estimate the squared Mahalanobis distance. Mahalanobis distance is calculated by:
Where mu is the location and S is the scatter (covariance)
(1)    Define the alpha most extreme observations as outliers. Multivariate outliers are defined as observations having a large squared Mahalanobis distance. For this purpose, a quantile of the chi-squared distribution (in our case this is 99%) is considered.
(2)    Clean data but not via removing extreme observations (trimming, truncation) but via winsorization (following Kahn). Winsorization is a transformation that limits the extreme values of observations.  It is different than trimming which excludes the extreme values.The new value of the outliers is:


where rt is the original observation. The cleaned return vector has the same orientation as the original return vector, but its magnitude is smaller.

Here is a short example of the implemented in R (following clean.boudt function: https://r-forge.r-project.org/scm/viewvc.php/*checkout*/pkg/PerformanceAnalytics/R/Return.clean.R?revision=1956&root=returnanalytics):
 
library(quantmod)
library(PerformanceAnalytics)
library(robustbase)

alpha=0.01
trim=0.001

#example with two shares (Microsoft and Apple);, working with adjusted prices
symbol.vec = c("MSFT", "AAPL")
getSymbols(symbol.vec, from ="2001-01-01", to = "2015-12-04")
MSFT = MSFT[, "MSFT.Adjusted", drop=F]
AAPL = AAPL[, "AAPL.Adjusted", drop=F]

#calculating the log-returns and removing the first NAs
MSFT.ret = CalculateReturns(MSFT, method="log")
AAPL.ret = CalculateReturns(AAPL, method="log")
MSFT.ret = MSFT.ret[-1,]
AAPL.ret = AAPL.ret[-1,]
colnames(MSFT.ret) ="MSFT"
colnames(AAPL.ret) = "AAPL"

#create one database by combining the two shares
data = cbind(MSFT.ret,AAPL.ret)
data=checkData(data,method="zoo")

T=dim(data)[1]
N=dim(data)[2]
date=c(1:T)

MCD = covMcd(as.matrix(data),alpha=1-alpha)
mu = MCD$raw.center #no reweighting
sigma = MCD$raw.cov
invSigma = solve(sigma);
vd2t = c();
cleaneddata = data
outlierdate = c()

for(t in c(1:T) )
{
        d2t = as.matrix(data[t,]-mu)%*%invSigma%*%t(as.matrix(data[t,]-mu));
        vd2t = c(vd2t,d2t);
}

out = sort(vd2t,index.return=TRUE)
sortvd2t = out$x;
sortt = out$ix;


empirical.threshold = sortvd2t[floor((1-alpha)*T)];

T.alpha = floor(T * (1-alpha))+1
cleanedt=sortt[c(T.alpha:T)]

for(t in cleanedt ){
        if(vd2t[t]>qchisq(1-trim,N)){
                # print(c("Observation",as.character(date[t]),"is detected as outlier and cleaned") );
                cleaneddata[t,] = sqrt( max(empirical.threshold,qchisq(1-trim,N))/vd2t[t])*data[t,];
                outlierdate = c(outlierdate,date[t]) } }

print(list(cleaneddata,outlierdate)) 

write.csv(cleaneddata, file = "data.csv",row.names=FALSE)

all<-cbind(data, cleaneddata) #to see how raw and robust returns look like
plot(all$MSFT.data) #plot raw returns
plot(all$MSFT.cleaneddata) #plot cleaned returns