Thursday 16 July 2015

AstroML for Creating Histograms in Python

Creating histograms is not that simple as it may seem! There are of course default approaches for creating a histogram, but it is nevertheless better to know what you get instead of just looking at the resulted chart. Besides, we need not only the chart per se, but also the frequency of our data linked to respective bins. It should also be noted while it is useful to customize binning there is no guarantee it yields better result than the automatically binning.

Python offers wide range of possibilities to create histograms. But with this post I would like to introduce AstroML. AstroML (http://www.astroml.org/) is a Python module for machine learning and data mining for astronomy. 

First, let’s present some theoretical background on bin width methods. First, the two most popular rules of thumb for defining bin-width, i.e. Freedman-Diaconis and Scott and second – rules that use fitness functions, i.e. Bayesian blocks and Knuth.

The bin-width (h) and number of bins (W) under Freedman-Diaconis and Scott rules are calculated as follows: 


The other two – Bayesian blocks and Knuth’s rules – are more computationally challenging as they require minimization of a cost function. Astropy (http://astropy.readthedocs.org/en/latest/index.html) gives a brief info: Knuth’s rule chooses a constant bin size which minimizes the error of the histogram’s approximation to the data, while the Bayesian Blocks uses a more flexible method which allows varying bin widths.

Example: I use daily price data for Fondul Proprietatea (ticker: FP), a Bucharest Stock Exchange listed company. My data is stored as csv-file.


!pip install astroML

import pandas as pd
import numpy as np
from astroML.plotting import hist

data=pd.read_csv('FP.csv', delimiter=';')
ret=data/data.shift(1)-1 #calculate simple return
ret= ret['FP'][~np.isnan(ret['FP'])] #remove NaN in the first row of the column (since we calculate returns we get NaN in the first day) and ‘FP’  in this code indicates the name of the column

hist(ret, bins='knuth') #Knuth’s rule
hist(ret, bins='blocks') #Bayesian blocks
hist(ret, bins='freedman') #Freedman-Diaconis
hist(ret, bins='scott') #Scott

We have our return series divided into different number of bins: ranging from 11 bins (Bayesian blocks) to 46 (Knuth’s rule).

Just for the nice math we can create our own code for bins calculation: let’s do something for the Freedman-Diaconis rule:

binwidth=2*(np.percentile(ret,75)-np.percentile(ret,25))*len(ret)**(-1.0/3.0) # bin-width is calculated in accordance with formula above.
bins=np.subtract(ret.max(),ret.min())/binwidth #this is the number of bins
plt.hist(ret.values, bins=np.linspace(min(ret), max(ret), bins))

Or alternatively (with slightly different result):
plt.hist(ret.values, bins=np.arange(ret.min(), ret.max() + binwidth, binwidth))