QuantX Research: AstroML for Creating Histograms in Python

Creating a histogram is not as simple as it may seem. There are default approaches, of course, but it is better to know what you are getting than to just look at the resulting chart. Besides, we often need not only the chart itself but also the frequencies of the data in the respective bins. Note also that while customizing the binning is useful, there is no guarantee that it yields a better result than automatic binning.

Python offers a wide range of options for creating histograms. In this post I would like to introduce AstroML. AstroML (http://www.astroml.org/) is a Python module for machine learning and data mining in astronomy.

First, some theoretical background on bin-width methods. They fall into two groups: the two most popular rules of thumb for choosing the bin width, Freedman-Diaconis and Scott, and two rules that optimize a fitness function, Bayesian blocks and Knuth.

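The bin width (h) and the number of bins (W) under the Freedman-Diaconis and Scott rules are calculated as follows, where n is the number of observations, IQR is the interquartile range and σ is the standard deviation of the data:

Freedman-Diaconis: $h = 2 \cdot \mathrm{IQR} \cdot n^{-1/3}$

Scott: $h = 3.5 \cdot \sigma \cdot n^{-1/3}$

Number of bins (both rules): $W = \lceil (\max(x) - \min(x)) / h \rceil$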
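As a rough illustration, the two rules of thumb can be computed by hand with NumPy. This is a minimal sketch that assumes the standard textbook definitions (spelled out in the docstrings); the function names are hypothetical, and the sample is synthetic rather than the FP data used later in this post.

```python
import numpy as np


def freedman_diaconis_width(x):
    """Bin width h = 2 * IQR / n^(1/3)."""
    x = np.asarray(x, dtype=float)
    q75, q25 = np.percentile(x, [75, 25])
    return 2.0 * (q75 - q25) / np.cbrt(x.size)


def scott_width(x):
    """Bin width h = 3.5 * sigma / n^(1/3), sigma = standard deviation."""
    x = np.asarray(x, dtype=float)
    return 3.5 * np.std(x) / np.cbrt(x.size)


def number_of_bins(x, h):
    """Number of bins W = ceil((max - min) / h)."""
    x = np.asarray(x, dtype=float)
    return int(np.ceil((x.max() - x.min()) / h))


# Synthetic stand-in for daily returns (an assumption, not the FP data).
rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0, scale=0.01, size=1000)

for name, width_fn in [("Freedman-Diaconis", freedman_diaconis_width),
                       ("Scott", scott_width)]:
    h = width_fn(sample)
    print(f"{name}: h = {h:.5f}, W = {number_of_bins(sample, h)}")
```

Because the Freedman-Diaconis rule relies on the interquartile range instead of the standard deviation, it tends to be less sensitive to outliers, which matters for fat-tailed return series.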

The other two, Bayesian blocks and Knuth’s rule, are more computationally demanding because they require optimizing a cost function. The Astropy documentation (http://astropy.readthedocs.org/en/latest/index.html) gives a brief description: Knuth’s rule chooses a constant bin size which minimizes the error of the histogram’s approximation to the data, while Bayesian blocks is a more flexible method which allows varying bin widths.
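To make the idea of a fitness function concrete, here is a minimal sketch of Knuth's rule. It scores each candidate number of equal-width bins with Knuth's log-posterior and keeps the best one. The brute-force grid search, the function names and the `max_bins` cap are my own simplifications for illustration; library implementations usually run a numerical optimizer instead, and the sample is synthetic.

```python
import math

import numpy as np


def knuth_log_posterior(x, m):
    """Knuth's relative log-posterior for m equal-width bins (constants dropped)."""
    n = x.size
    counts, _ = np.histogram(x, bins=m)
    return (
        n * math.log(m)
        + math.lgamma(0.5 * m)
        - m * math.lgamma(0.5)
        - math.lgamma(n + 0.5 * m)
        + sum(math.lgamma(c + 0.5) for c in counts)
    )


def knuth_bins_grid(x, max_bins=100):
    """Return the bin count in 1..max_bins with the highest log-posterior."""
    x = np.asarray(x, dtype=float)
    scores = [knuth_log_posterior(x, m) for m in range(1, max_bins + 1)]
    return 1 + int(np.argmax(scores))


# Synthetic data (an assumption, not the FP returns).
rng = np.random.default_rng(1)
sample = rng.normal(size=1000)

best_m = knuth_bins_grid(sample)
counts, edges = np.histogram(sample, bins=best_m)
print("number of bins:", best_m, "bin width:", edges[1] - edges[0])
```

Bayesian blocks follows the same logic of maximizing a fitness function, but it searches over the positions of the bin edges themselves, which is why its bins can have different widths.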

Example: I use daily price data for Fondul Proprietatea (ticker: FP), a company listed on the Bucharest Stock Exchange. My data is stored in a CSV file.
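The listing below assumes a semicolon-delimited file with a price column named 'FP'. Since the file itself is not included in the post, here is a small sketch of that layout and of the simple-return calculation. The dates, the prices and the 'Date' column are hypothetical values invented for illustration.

```python
import io

import pandas as pd

# Hypothetical file contents: a date column plus prices in a column named 'FP'.
csv_text = """Date;FP
2015-01-05;100.0
2015-01-06;110.0
2015-01-07;99.0
"""

data = pd.read_csv(io.StringIO(csv_text), delimiter=";")

# Simple return: r_t = P_t / P_{t-1} - 1.
ret = data["FP"] / data["FP"].shift(1) - 1

# The first day has no previous price, so its return is NaN and is dropped.
ret = ret.dropna()
print(ret.tolist())
```

Selecting the 'FP' column before dividing keeps the calculation from failing on non-numeric columns such as dates.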

!pip install astroML

import pandas as pd

import numpy as np

from astroML.plotting import hist

data = pd.read_csv('FP.csv', delimiter=';')  # semicolon-separated file with a price column named 'FP'

ret = data['FP'] / data['FP'].shift(1) - 1  # simple daily return of the 'FP' column

ret = ret.dropna()  # the first day has no previous price, so its return is NaN

hist(ret, bins='knuth')  # Knuth’s rule

hist(ret, bins='blocks')  # Bayesian blocks

hist(ret, bins='freedman')  # Freedman-Diaconis

hist(ret, bins='scott')  # Scott
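As noted at the start, we usually want the frequencies per bin and not only the chart. The sketch below shows one way to get them, using NumPy's own histogram function as a stand-in: its 'fd' and 'scott' options cover the two rules of thumb, but it has no Knuth or Bayesian blocks option. The `frequency_table` helper is hypothetical and the returns are synthetic.

```python
import numpy as np


def frequency_table(x, rule):
    """Return (left_edge, right_edge, count) for each bin chosen by `rule`."""
    counts, edges = np.histogram(x, bins=rule)
    return [(float(lo), float(hi), int(c))
            for lo, hi, c in zip(edges[:-1], edges[1:], counts)]


# Synthetic stand-in for `ret` (an assumption, not the FP returns).
rng = np.random.default_rng(42)
returns = rng.normal(loc=0.0, scale=0.01, size=500)

for rule in ("fd", "scott"):
    table = frequency_table(returns, rule)
    print(f"{rule}: {len(table)} bins")
    # Bins are half-open [left, right), except the last one, which includes its right edge.
    for left, right, count in table:
        print(f"  {left:+.4f} to {right:+.4f}: {count}")
```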