Creating histograms is not that simple as it may seem! There are
of course default approaches for creating a histogram, but it is nevertheless better
to know what you get instead of just looking at the resulted chart. Besides, we
need not only the chart per se, but
also the frequency of our data linked to respective bins. It should also be
noted while it is useful to customize binning there is no guarantee it yields
better result than the automatically binning.
Python offers wide range of possibilities to create
histograms. But with this post I would like to introduce AstroML. AstroML (http://www.astroml.org/) is a Python module for machine learning and data mining for
astronomy.
First, let’s present some theoretical background on
bin width methods. First, the two most popular rules of thumb for defining bin-width,
i.e. Freedman-Diaconis and Scott and second – rules that use fitness functions,
i.e. Bayesian blocks and Knuth.
The bin-width (h) and number of bins (W) under Freedman-Diaconis
and Scott rules are calculated as follows:
The other two – Bayesian blocks and Knuth’s rules –
are more computationally challenging as they require minimization of a cost
function. Astropy (http://astropy.readthedocs.org/en/latest/index.html)
gives a brief
info: Knuth’s rule chooses a constant bin size which minimizes the error of the
histogram’s approximation to the data, while the Bayesian Blocks uses a more
flexible method which allows varying bin widths.
Example: I use daily price data for Fondul
Proprietatea (ticker: FP), a Bucharest Stock Exchange listed company. My data
is stored as csv-file.
!pip
install astroML
import
pandas as pd
import
numpy as np
from
astroML.plotting import hist
data=pd.read_csv('FP.csv',
delimiter=';')
ret=data/data.shift(1)-1
#calculate simple return
ret=
ret['FP'][~np.isnan(ret['FP'])] #remove NaN in the first row of
the column (since we calculate returns we get NaN in the first day) and ‘FP’ in this code indicates the name of the column
hist(ret,
bins='knuth') #Knuth’s rule
hist(ret,
bins='blocks') #Bayesian blocks
hist(ret,
bins='freedman') #Freedman-Diaconis
hist(ret,
bins='scott') #Scott
We
have our return series divided into different number of bins: ranging from 11
bins (Bayesian blocks) to 46 (Knuth’s rule).
Just
for the nice math we can create our own code for bins calculation: let’s do something
for the Freedman-Diaconis rule:
binwidth=2*(np.percentile(ret,75)-np.percentile(ret,25))*len(ret)**(-1.0/3.0)
# bin-width is calculated in accordance with formula above.
bins=np.subtract(ret.max(),ret.min())/binwidth #this is the number of bins
plt.hist(ret.values, bins=np.linspace(min(ret),
max(ret), bins))
Or alternatively (with slightly different result):
plt.hist(ret.values, bins=np.arange(ret.min(),
ret.max() + binwidth, binwidth))