“The real voyage of discovery consists not in seeking new landscapes,
but in having new eyes.” Marcel Proust
Robust statistics aims at identifying the core
of the data. In more detail: “the general principle of
robust statistical estimation is to give full weights to observations assumed
to come from the main body of the data, but to reduce or completely eliminate
weights for the observations from tails of the contaminated data”. Treating extreme values (outliers) is
important, and it is worth testing different strategies. Below is just one approach
to dealing with outliers.
The two cornerstone parameters are the location
vector and the scatter matrix. In a univariate setting, the median is the
well-known robust estimate of location; for scale there are, for instance, two:
the interquartile range and the median absolute deviation (MAD). In a multivariate
setting, identifying and treating outliers gets more complicated. The
minimum covariance determinant (MCD) method, introduced by Rousseeuw in 1985,
addresses this: it estimates location and scatter from the subset of observations
whose covariance matrix has the smallest determinant.
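To illustrate the univariate case, here is a minimal sketch in base R. The sample values are made up for illustration; it only shows how the median, the interquartile range and the MAD react to a single gross outlier compared with the mean and the standard deviation.

```r
# Synthetic sample with one gross outlier (values are illustrative only)
x <- c(2.1, 1.9, 2.0, 2.2, 1.8, 2.0, 15.0)

# Classical estimates are pulled toward the outlier
classical <- c(location = mean(x), scale = sd(x))

# Robust counterparts stay close to the bulk of the data
# (mad() rescales by 1.4826 by default for consistency under normality)
robust <- c(location = median(x), iqr = IQR(x), mad = mad(x))

print(classical)
print(robust)
```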
Basically, there are four important steps in
cleaning the data from outliers (using the Boudt and Peterson approach):
(1) Find location and scatter via the minimum covariance determinant (MCD) method.
(2) Use the location and scatter estimated in step 1 to compute the squared Mahalanobis
distance of each observation:
$$d_t^2 = (r_t - \mu)^\top S^{-1} (r_t - \mu)$$
where $r_t$ is the observation at time $t$, $\mu$ is the location
and $S$ is the scatter (covariance).
(3) Define the alpha most extreme observations as outliers. Multivariate outliers are defined
as observations having a large squared Mahalanobis distance. For this purpose,
a quantile of the chi-squared distribution is considered. In the code below, alpha = 0.01,
so the 1% of observations with the largest distances are candidates, and a candidate is
cleaned only if its distance also exceeds the 99.9% chi-squared quantile with N degrees
of freedom (trim = 0.001).
(4) Clean the data not by removing extreme observations (trimming, truncation) but by
winsorization (following Kahn). Winsorization is a transformation that limits
the extreme values of observations; it differs from trimming, which excludes them.
The new value of an outlier is:
$$\tilde{r}_t = r_t \sqrt{\frac{\max\left(c_{1-\alpha},\ \chi^2_{N,\,0.999}\right)}{d_t^2}}$$
where $r_t$ is the original observation, $c_{1-\alpha}$ is the empirical (1 - alpha)
quantile of the squared distances, and $\chi^2_{N,\,0.999}$ is the 99.9% chi-squared
quantile with N degrees of freedom. The cleaned return vector has the same orientation
as the original return vector, but its magnitude is smaller.
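Steps 2 to 4 can be sketched on synthetic data with base R alone. This is a simplified illustration, not the clean.boudt function itself: the data, the planted outlier and the names `r`, `shrink` and `r_clean` are made up, and the location and scatter are assumed known instead of being estimated by MCD. Flagging every row whose distance exceeds the combined cutoff is used here as a shorthand for the two-stage candidate selection.

```r
set.seed(1)

# Synthetic bivariate "returns": 200 clean rows plus one planted outlier.
# All numbers are illustrative, not taken from real data.
N <- 2
r <- matrix(rnorm(200 * N, sd = 0.01), ncol = N)
r <- rbind(r, c(0.08, -0.07))
n <- nrow(r)

# Stand-ins for the MCD estimates of step 1 (assumed known here).
mu <- c(0, 0)
S  <- diag(0.01^2, N)

# Step 2: squared Mahalanobis distance of every row.
d2 <- mahalanobis(r, center = mu, cov = S)

# Step 3: cutoff is the larger of the empirical (1 - alpha) quantile of the
# distances and the chi-squared (1 - trim) quantile with N degrees of freedom.
alpha <- 0.01
trim  <- 0.001
empirical <- sort(d2)[floor((1 - alpha) * n)]
cutoff <- max(empirical, qchisq(1 - trim, df = N))
is_out <- d2 > cutoff

# Step 4: winsorize by shrinking flagged rows toward the origin.
# 'shrink' has one entry per row, so it recycles down each column.
shrink  <- ifelse(is_out, sqrt(cutoff / d2), 1)
r_clean <- r * shrink

print(which(is_out))
print(rbind(original = r[n, ], cleaned = r_clean[n, ]))
```

Both components of a flagged row are multiplied by the same factor, so its direction is unchanged. Because the stand-in location is zero in this sketch, the squared distance of a cleaned row lands exactly on the cutoff; with a non-zero location that would hold only approximately.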
Here is a short example of the implementation in R (following the clean.boudt function: https://r-forge.r-project.org/scm/viewvc.php/*checkout*/pkg/PerformanceAnalytics/R/Return.clean.R?revision=1956&root=returnanalytics):
library(quantmod)
library(PerformanceAnalytics)
library(robustbase)
alpha=0.01
trim=0.001
# Example with two shares (Microsoft and Apple), working with adjusted prices
symbol.vec = c("MSFT", "AAPL")
getSymbols(symbol.vec, from = "2001-01-01", to = "2015-12-04")
MSFT = MSFT[, "MSFT.Adjusted", drop = FALSE]
AAPL = AAPL[, "AAPL.Adjusted", drop = FALSE]
# Calculate the log-returns and remove the first row, which is NA
MSFT.ret = CalculateReturns(MSFT, method = "log")
AAPL.ret = CalculateReturns(AAPL, method = "log")
MSFT.ret = MSFT.ret[-1,]
AAPL.ret = AAPL.ret[-1,]
colnames(MSFT.ret) = "MSFT"
colnames(AAPL.ret) = "AAPL"
# Create one data set by combining the two shares
data = cbind(MSFT.ret, AAPL.ret)
data=checkData(data,method="zoo")
T=dim(data)[1]
N=dim(data)[2]
date=c(1:T)
MCD = covMcd(as.matrix(data),alpha=1-alpha)
mu = MCD$raw.center #no reweighting
sigma = MCD$raw.cov
invSigma = solve(sigma);
vd2t = c();
cleaneddata = data
outlierdate = c()
for(t in c(1:T) )
{
d2t = as.matrix(data[t,]-mu)%*%invSigma%*%t(as.matrix(data[t,]-mu));
vd2t = c(vd2t,d2t);
}
out = sort(vd2t,index.return=TRUE)
sortvd2t = out$x;
sortt = out$ix;
empirical.threshold = sortvd2t[floor((1-alpha)*T)];
T.alpha = floor(T * (1-alpha))+1
cleanedt=sortt[c(T.alpha:T)]
for(t in cleanedt){
  # Clean a candidate only if it also exceeds the chi-squared quantile
  if(vd2t[t] > qchisq(1-trim, N)){
    # print(c("Observation", as.character(date[t]), "is detected as outlier and cleaned"));
    cleaneddata[t,] = sqrt(max(empirical.threshold, qchisq(1-trim, N))/vd2t[t])*data[t,];
    outlierdate = c(outlierdate, date[t])
  }
}
print(list(cleaneddata,outlierdate))
write.csv(cleaneddata, file = "data.csv",row.names=FALSE)
all<-cbind(data, cleaneddata) # to see what the raw and robust returns look like
plot(all$MSFT.data) #plot raw returns
plot(all$MSFT.cleaneddata) #plot cleaned returns
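As a side note, the explicit loop that builds vd2t row by row could probably be replaced by base R's mahalanobis() function. The sketch below checks on synthetic data that the two give the same distances. The matrix `X` and the use of colMeans() and cov() as stand-ins for MCD$raw.center and MCD$raw.cov are assumptions made only to keep the example self-contained.

```r
set.seed(42)

# Synthetic stand-in for the return matrix (50 rows, 2 columns)
X <- matrix(rnorm(50 * 2), ncol = 2)
mu <- colMeans(X)        # stand-in for MCD$raw.center
sigma <- cov(X)          # stand-in for MCD$raw.cov
invSigma <- solve(sigma)

# Loop version, mirroring the row-by-row quadratic form
d2_loop <- numeric(nrow(X))
for (i in seq_len(nrow(X))) {
  dev <- matrix(X[i, ] - mu, nrow = 1)
  d2_loop[i] <- dev %*% invSigma %*% t(dev)
}

# Vectorized version: one call for all rows
d2_vec <- mahalanobis(X, center = mu, cov = sigma)

print(all.equal(d2_loop, d2_vec))
```

The vectorized call also avoids growing a vector inside the loop, which matters more as the number of observations increases.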