Robust statistics

Introduction

One of the most common tasks when analyzing data from the lab is to estimate the mean and standard deviation of some set of samples, usually with a Gaussian distribution. This is very straightforward nowadays, we just ask to the computer to compute these quantities and that’s it. And when the samples come from a Gaussian distribution we are estimating µ and σ respectively. That sounds easy.

Recently, however, I came across the problem of dealing with outliers, i.e. samples that are definitely wrong. For example in the list of numbers [1,2,3,4,5,6,99] there is clearly one outlier[1]The 99, in case you didn’t noticed it. which we can remove by hand and then analyze the rest of the numbers. These outliers are usually very easy to spot and remove, but we don’t want to do this “by hand”, or by hardcoding a cut in our data analysis program. There has to be something better… And there is!

If we are dealing with a symmetric distribution, which is very often the case, we can replace the mean by the median. This, I think, is a very well known trick. The median will provide a number close to the mean but is much more robust against outliers. Note that this works only for symmetric distributions.

What about the standard deviation? I recently came across the median absolute deviation that provides a measure of the variability of a set of samples (in the same way as the standard deviation) but is way more robust against outliers than the former. Of course it is a different quantity, so we wont expect it will converge to the same value as the standard deviation. But under certain conditions, common in practice, they are related by an easy equation. Specifically, in the case of a set of samples from a Gaussian distribution it they relate by[2]See here.

\[\text{STD} \sim k \times \text{MAD}\]

where \(k\approx1.4826\), which is really useful.

There is a third method usually employed to get the µ and the σ from sampled data that consists in fitting a Gaussian function to the histogram, I will also study this method today.

To summarize, today I will be comparing

  1. Comparison of estimators of µ:
    • Mean (the usual estimator).
    • Median (the robust estimator).
    • µ from a Gaussian fit to the histogram (the sophisticated estimator).
  2. Comparison of estimators of σ:
    • Standard deviation (std) (the usual estimator).
    • Median absolute deviation (MAD) multiplied by k (the robust estimator).
    • σ from a Gaussian fit to the histogram (the sophisticated estimator).

When there are a lot of samples and no outliers

I will first compare what happens when we are in a favorable condition with a lot of samples and no outliers, i.e. the “happy scenario”. The following plot is a histogram of a single set of 9999 samples from a Gaussian distribution with µ=0 and σ=1. On the same plot I have indicated the different statistics I am interested in, namely the mean, the median, the std, the median absolute deviation (MAD) and a Gaussian function fitted to the histogram.

This is the ideal case: a lot of samples without outliers, so things will work as expected. Under these conditions the distribution of each statistic itself is shown in the following two plots:

For µ we can see that the three estimators are practically the same, maybe the mean is slightly better than the other two. For σ we see that std is slightly better than the other two, and the other two behave practically the in the same way. So in “the happy scenario” we can (of course) trust in the mean and the std estimators. We all knew this.

A lot of samples but with outliers

A single outlier

How do things change if we keep all the same but add a single outlier? Now I repeated the same procedure but using 9998 samples from a normal distribution and in the sample number 9999 I forced manually an outlier with a value of 99. So this is still a “very good scenario” in which we have 9999 samples with just a single outlier which is very easy to spot.

One single example of such a distribution is this:

Zooming into the interesting region we already notice something: the standard deviation is not what we expect, due to the outlier. The distribution of each of the statistics themselves is shown below:

Here we see something interesting and is that the standard deviation is completely wrong. You will say if 0.4 is too much or it is acceptable. But one single outlier in a population of 9999 samples (this is 0.01 % of the data) was enough to break it. The other two estimators of σ remain unchanged, which is good. With respect to µ we see that the mean was also affected but not so much.

Several outliers

Let’s get more severe now and multiply the last 99 samples by 99, thus producing 99 outliers. So now our population has 9900 “good samples” and 99 outliers, this represents 1% of corrupted data. It is still a decent situation and, more important to me, one that I am finding quite often in the lab.

One example of such a distribution is shown below:

If a human were to analyze such distribution he/she could easily spot the Gaussian in the middle (feel free to zoom in the plot by drag and drop with your mouse) and conclude that the population comes from a normal distribution with some outliers. We see, however, that the standard deviation is, again, failing at this, with a value ~10 times bigger than expected[3]Well, the standard deviation is properly doing its job, what is wrong here is that there are outliers…. If we look at the distribution of each of the statistics we are studying we get this:

It is evident (right plot) that the standard deviation is failing to tell us the value of σyet but the MAD is still working! (please zoom in to see this) It is only slightly shifted to 1.01 instead of 1. Impressive. The Gaussian fit to the histogram itself is still intact, so it is the most robust method under these conditions. In the case of the mean, we don’t expect it to be shifted because the outliers are symmetric with respect to the the µ of the original Gaussian, however we see that it spreads much more. The median and the Gaussian fit, instead, still remain intact.

Few samples without outliers

The last case I want to analyze is that with few samples and no outliers. So this is “a good measurement but with low statistics”. Can we still rely on the median, the MAD and the Gaussian fit?

So now I repeat exactly the same but with a set of 44 samples from a normal distribution. One of such experiments with 44 samples looks like this:

I bet that many humans will have a hard time looking at the histogram and trying to tell whether these samples are coming from a Gaussian or not. Anyway, if we look at how each of the statistics perform in these conditions we get this:

Here again we see a surprise: if we don’t zoom in horizontally we cannot tell what is going on… So please zoom in horizontally before reading more, you can do this by click and drag with your mouse horizontally… If you did zoom in, you already know what is going on: the mean, the median, the std and the MAD are all working good but the Gaussian fit has this time produced some outliers. The reason for this is that we have so little statistics that the histogram is hardly recognizable as coming from a Gaussian, and the fit is trying to fit a Gaussian to it, so it is prompt to fail. It is still a good method, most of the times it works, but if you are unlucky your data may be one of those cases in which the fit either produces one outlier (i.e. it tells you that µ=200, or σ=150, or both) or even worse the fit may not converge (for sets with 44 samples this was happening to me approximately 0.4% of the times).

Conclusion

If dealing with Gaussian distributed data, both the median and the median absolute deviation multiplied by 1.4826 are nice alternatives for the mean and the standard deviation when you want to be robust to outliers. They are a workhorse which will provide a good answer practically in any scenario[4]when your data is Gaussian!. A Gaussian fit is also very robust, in fact is the most robust when the data set is big enough. However it requires more effort to be implemented[5]i.e. it is more complicated than MAD=scipy.stats.median_abs_deviation(your_data), and when the dataset is small it is prompt to fail and produce disparate results or even not converging.

Appendix: Software

Just for reference and for the curious reader who wants to explore in more detail, here I left the Python code used to produce the plots:

import numpy as np
import grafica # https://github.com/SengerM/grafica
from scipy.stats import median_abs_deviation
import pandas
from scipy.optimize import curve_fit

MAD_TO_STD_FACTOR = 1.4826 # Constant from Wikipedia.

def gaussian(x, mu, sigma, amplitude=1):
	return amplitude/sigma/(2*np.pi)**.5*np.exp(-((x-mu)/sigma)**2/2)

def fit_gaussian(x_values, y_values):
	popt, pcov = curve_fit(
		gaussian,
		x_values,
		y_values,
		p0 = [np.median(samples),median_abs_deviation(samples)*MAD_TO_STD_FACTOR,max(y_values)],
	)
	return popt[0], popt[1], popt[2]

def draw_center_and_band_lines(plotlyfig, color, y_position, center: float, center_name: str=None, band: float=None, band_name: str=None):
	plotlyfig.add_vline(
		x = center,
		annotation_text = f'{center_name} = {center:.2f}' if center_name is not None else '',
		line_color = color,
	)
	if band is not None:
		for s in [-band, band]:
			plotlyfig.add_vline(
				x = center + s,
				line_color = color,
				line_dash = 'dash',
			)
		plotlyfig.add_annotation(
			x = center,
			y = y_position,
			ax = center + band,
			ay = y_position,
			yref = 'y domain',
			showarrow = True,
			axref = "x", ayref='y',
			arrowhead = 3,
			arrowwidth = 1.5,
		)
		plotlyfig.add_annotation(
			x = center + band,
			y = y_position,
			ax = center,
			ay = y_position,
			yref = 'y domain',
			showarrow = True,
			axref = "x", ayref='y',
			arrowhead = 3,
			arrowwidth = 1.5,
		)
		plotlyfig.add_annotation(
			text = f'{band_name} = {band:.2f}' if band_name is not None else '',
			ax = center + band/2,
			ay = y_position+.02,
			x = center + band/2,
			y = y_position+.02,
			yref = "y domain",
			axref = "x", ayref='y',
		)

def draw_histogram_and_analysis_of_single_experiment(samples, bins):
	fig = grafica.new(
		ylabel = 'Count',
	)
	fig.histogram(
		samples,
		bins = bins,
		label = f'{len(samples)} samples',
	)
	draw_center_and_band_lines(
		plotlyfig = fig.plotly_figure,
		center = samples.mean(),
		center_name = 'Mean',
		band = samples.std(),
		band_name = 'std',
		color = 'black',
		y_position = .1,
	)
	draw_center_and_band_lines(
		plotlyfig = fig.plotly_figure,
		center = np.median(samples),
		center_name = 'Median',
		band = median_abs_deviation(samples),
		band_name = 'MAD',
		color = '#fcba03',
		y_position = .4,
	)
	draw_center_and_band_lines(
		plotlyfig = fig.plotly_figure,
		center = np.median(samples),
		band = median_abs_deviation(samples)*MAD_TO_STD_FACTOR,
		band_name = f'MAD*{MAD_TO_STD_FACTOR}',
		color = '#47a130',
		y_position = .7,
	)
	return fig

results_df = pandas.DataFrame()
for n_experiment in range(9999): # This is the number of experiments.
	samples = np.random.randn(9999) # Change the number of samples.
	# ~ samples[-99:] *= 99 # Produce outliers with this line.
	samples = np.array(samples)
	# Fit a Gaußian to the histogram ---
	bins = np.linspace(min(samples),max(samples),(max(samples)-min(samples))/.2)
	hist, bins_edges = np.histogram(
		samples, 
		bins = bins,
	)
	x_values = bins_edges[:-1] + np.diff(bins_edges)[0]/2
	y_values = hist
	try:
		fitted_mu, fitted_sigma, fitted_amplitude = fit_gaussian(x_values, y_values)
	except RuntimeError: # This happens when the fit fails because there are very few samples.
		fitted_mu, fitted_sigma, fitted_amplitude = (float('NaN'),)*3
	results_df = results_df.append(
		{
			'mean': samples.mean(),
			'std': samples.std(),
			'MAD': median_abs_deviation(samples),
			'MAD*k': median_abs_deviation(samples)*MAD_TO_STD_FACTOR,
			'median': np.median(samples),
			'mu fit': fitted_mu,
			'sigma fit': fitted_sigma,
		},
		ignore_index = True,
	)

print(f'Number of failed Gaussian fits: {results_df["mu fit"].isna().sum()}')
fig = draw_histogram_and_analysis_of_single_experiment(samples, bins)
fig.scatter(
	x = x_values,
	y = gaussian(x_values, fitted_mu, fitted_sigma, fitted_amplitude),
	label = f'Fitted Gaußian (μ={fitted_mu:.2f}, σ={fitted_sigma:.2f})',
)
fig.title = 'Single experiment example'

for comparison in ['Estimating mu','Estimating sigma']:
	fig = grafica.new(
		title = comparison,
		ylabel = 'Count',
	)
	for col in (['mean','median','mu fit'] if comparison == 'Estimating mu' else ['std','MAD*k','sigma fit']):
		fig.histogram(
			results_df[col],
			label = col,
		)
	fig.plotly_figure.add_vline(
		x = 0 if 'mu' in comparison else 1,
		annotation_text = 'µ' if 'mu' in comparison else 'σ',
	)

grafica.save_unsaved(mkdir='plots')

References

References
1 The 99, in case you didn’t noticed it.
2 See here.
3 Well, the standard deviation is properly doing its job, what is wrong here is that there are outliers…
4 when your data is Gaussian!
5 i.e. it is more complicated than MAD=scipy.stats.median_abs_deviation(your_data)