Thursday, 10 December 2015

Data Cleaning: Minimum Covariance Determinant and Winsorization

“The real voyage of discovery consists not in seeking new landscapes, but in having new eyes.” Marcel Proust


Robust statistics aims at identifying the core of the data. In more detail: “the general principle of robust statistical estimation is to give full weights to observations assumed to come from the main body of the data, but to reduce or completely eliminate weights for the observations from tails of the contaminated data.” Treating extreme values (outliers) is very important and requires testing different strategies. Below is just one approach to dealing with outliers.
The two cornerstone parameters are the location vector and the scatter matrix. In a univariate setting, the median is the well-known estimator of location; for scale there are, for instance, the interquartile range and the median absolute deviation (MAD). In a multivariate setting, identifying and treating outliers gets more complicated. The minimum covariance determinant (MCD) method, introduced by Rousseeuw in 1985, addresses this by providing robust estimates of both location and scatter.
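For the univariate case, a minimal sketch in base R shows why the median, interquartile range and MAD are preferred over the mean and standard deviation. The numbers are made up for illustration and are not taken from the return data used later in this post.

```r
# Hypothetical sample: five well-behaved values plus one gross outlier
x_clean <- c(9.8, 10.1, 10.0, 9.9, 10.2)
x_dirty <- c(x_clean, 100)

# Classical estimates are pulled towards the outlier
classical <- c(mean = mean(x_dirty), sd = sd(x_dirty))

# Robust estimates stay close to the bulk of the data
robust <- c(median = median(x_dirty), IQR = IQR(x_dirty), MAD = mad(x_dirty))

print(classical)
print(robust)
```

A single contaminated value is enough to move the mean far away from the other observations, while the median stays close to 10.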
Basically, there are four important steps in cleaning the data from outliers (using the Boudt and Peterson approach):
(1)    Find the location and scatter via the minimum covariance determinant (MCD) method;
(2)    Use the location and scatter estimated in step 1 to compute the squared Mahalanobis distance of each observation:
$d_t^2 = (r_t - \mu)' S^{-1} (r_t - \mu)$
where $r_t$ is the observation at time $t$, $\mu$ is the location and $S$ is the scatter (covariance);
(3)    Define the alpha most extreme observations as outliers. Multivariate outliers are defined as observations having a large squared Mahalanobis distance. For this purpose, the distances are compared with a quantile of the chi-squared distribution with N degrees of freedom (the code below uses alpha = 1% and, through trim = 0.001, the 99.9% quantile);
(4)    Clean the data, not by removing the extreme observations (trimming, truncation) but by winsorization (following Kahn). Winsorization is a transformation that limits the extreme values of observations. It differs from trimming, which excludes the extreme values. The new value of an outlier is:
$\tilde{r}_t = r_t \sqrt{\max\left(d^2_{(\lfloor (1-\alpha) T \rfloor)},\ \chi^2_{N,\,0.999}\right) / d_t^2}$
where $r_t$ is the original observation, $d^2_{(\lfloor (1-\alpha) T \rfloor)}$ is the empirical $(1-\alpha)$ quantile of the squared distances and $\chi^2_{N,\,0.999}$ is the chi-squared quantile. The cleaned return vector has the same orientation as the original return vector, but its magnitude is smaller.
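Steps 2 to 4 can be sketched on toy data with base R only. This is a simplified illustration under stated assumptions, not the clean.boudt implementation: mu and S are hard-coded hypothetical values standing in for the MCD output of step 1, the observations in r are invented, and only the chi-squared cutoff is used (the empirical quantile of the distances is left out).

```r
# Hypothetical inputs: in practice mu and S would come from the MCD step
mu <- c(0, 0)
S  <- matrix(c(1.0, 0.5,
               0.5, 1.0), nrow = 2, byrow = TRUE)

# Four invented observations (rows); the last one is far from the centre
r <- rbind(c( 0.5,  0.2),
           c(-1.0, -0.4),
           c( 0.3,  1.1),
           c( 6.0, -5.0))

# Step 2: squared Mahalanobis distance of every row
d2 <- mahalanobis(r, center = mu, cov = S)

# Step 3: flag rows beyond the 99.9% chi-squared quantile (df = number of columns)
cutoff <- qchisq(0.999, df = ncol(r))
is_outlier <- d2 > cutoff

# Step 4: winsorize, i.e. shrink each flagged row towards the origin so that
# its squared distance (with mu = 0 here) equals the cutoff
cleaned <- r
cleaned[is_outlier, ] <- r[is_outlier, , drop = FALSE] * sqrt(cutoff / d2[is_outlier])

print(round(d2, 2))
print(cleaned)
```

Only the last row is rescaled; it keeps its direction, and the other rows are left untouched.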

Here is a short example implemented in R (following the clean.boudt function: https://r-forge.r-project.org/scm/viewvc.php/*checkout*/pkg/PerformanceAnalytics/R/Return.clean.R?revision=1956&root=returnanalytics):
 
library(quantmod)
library(PerformanceAnalytics)
library(robustbase)

alpha=0.01
trim=0.001

#example with two shares (Microsoft and Apple), working with adjusted prices
symbol.vec = c("MSFT", "AAPL")
getSymbols(symbol.vec, from ="2001-01-01", to = "2015-12-04")
MSFT = MSFT[, "MSFT.Adjusted", drop=FALSE]
AAPL = AAPL[, "AAPL.Adjusted", drop=FALSE]

#calculating the log-returns and removing the first NAs
MSFT.ret = CalculateReturns(MSFT, method="log")
AAPL.ret = CalculateReturns(AAPL, method="log")
MSFT.ret = MSFT.ret[-1,]
AAPL.ret = AAPL.ret[-1,]
colnames(MSFT.ret) ="MSFT"
colnames(AAPL.ret) = "AAPL"

#create one database by combining the two shares
data = cbind(MSFT.ret,AAPL.ret)
data=checkData(data,method="zoo")

T=dim(data)[1]
N=dim(data)[2]
date=c(1:T)

MCD = covMcd(as.matrix(data),alpha=1-alpha)
mu = MCD$raw.center #no reweighting
sigma = MCD$raw.cov
invSigma = solve(sigma);
vd2t = c();
cleaneddata = data
outlierdate = c()

for(t in c(1:T) )
{
        d2t = as.matrix(data[t,]-mu)%*%invSigma%*%t(as.matrix(data[t,]-mu));
        vd2t = c(vd2t,d2t);
}

out = sort(vd2t,index.return=TRUE)
sortvd2t = out$x;
sortt = out$ix;


empirical.threshold = sortvd2t[floor((1-alpha)*T)];

T.alpha = floor(T * (1-alpha))+1
cleanedt=sortt[c(T.alpha:T)]

for(t in cleanedt ){
        if(vd2t[t]>qchisq(1-trim,N)){
                # print(c("Observation",as.character(date[t]),"is detected as outlier and cleaned") );
                cleaneddata[t,] = sqrt( max(empirical.threshold,qchisq(1-trim,N))/vd2t[t])*data[t,];
                outlierdate = c(outlierdate,date[t]) } }

print(list(cleaneddata,outlierdate)) 

write.csv(cleaneddata, file = "data.csv",row.names=FALSE)

all<-cbind(data, cleaneddata) #combine to compare the raw and the cleaned (robust) returns
plot(all$MSFT.data) #plot raw returns
plot(all$MSFT.cleaneddata) #plot cleaned returns

