# Nightly bin RVs

One of the most common operations performed on time series of radial velocities
is to *bin* a number of measurements performed on the same night.
In a previous post, I briefly talked about
scipy's `binned_statistic` function, which can be used to bin data.
But it turns out this function is missing one important feature
which is necessary for the specific RV use case.

In this post, I will present a modification of `binned_statistic`
which allows measurement uncertainties to be used as weights
during the binning calculation.
This modified function will then be applied to RV time series.

First, some brief words of motivation.

The *reason* why we want to bin is not the most important thing^{1}.
On a given night, we might have obtained shorter exposures
of a very bright star in order not to saturate our detector,
and finally want to combine the measurements.
Or we might be trying to average out stellar oscillations
by combining two observations from the same night.
Whatever the reason, each RV measurement comes with an associated uncertainty,
which we might want to use during the binning.
This allows us to *weight* each RV point individually.
In addition, we are also interested in combining the times of the observations,
as well as the uncertainties themselves.

## changes to `binned_statistic`

The implementation of this function in scipy can be found here.
After checking some of its arguments, the `binned_statistic` function
calls `binned_statistic_dd` and builds a custom namedtuple
with the result^{2} before returning.

We need to rewrite both `binned_statistic` and `binned_statistic_dd`
if we want to include an extra parameter, `weights`.
In the following, I will leave out the modifications to the doc strings,
as they are quite lengthy.
At the end of the post, you can find
the full implementation including these changes.

For our script to be self-contained, we first need some imports

```
import numpy as np
# compatibility helpers from scipy's private modules
# (as in the scipy version this post is based on)
from scipy._lib.six import callable, xrange
from scipy._lib._numpy_compat import suppress_warnings
```

Now, the modified `binned_statistic`

function:

```
def binned_statistic(x, values, statistic='mean', bins=10,
                     range=None, weights=None):
    try:
        N = len(bins)
    except TypeError:
        N = 1

    if N != 1:
        bins = [np.asarray(bins, float)]

    if range is not None:
        if len(range) == 2:
            range = [range]

    medians, edges, binnumbers = binned_statistic_dd(
        [x], values, statistic, bins, range, weights)

    return medians, edges[0], binnumbers
```

Yes, that’s anticlimactic.
The only modifications to the original function are the inclusion of
`weights` in the arguments and in the call to `binned_statistic_dd`.

So, it’s `binned_statistic_dd` which is doing all the work.
But, even there, not much really changes:

```
def binned_statistic_dd(sample, values, statistic='mean', bins=10,
                        range=None, weights=None,
                        expand_binnumbers=False):
    ## ...
    ## nothing changes
    ## ...
    values = np.atleast_2d(values)
    if weights is not None:
        weights = np.atleast_2d(weights)
    ## ...
    ## nothing changes
    ## ...
    elif callable(statistic):
        with np.errstate(invalid='ignore'), suppress_warnings() as sup:
            sup.filter(RuntimeWarning)
            try:
                if weights is None:
                    null = statistic([])
                else:
                    null = statistic([], weights=[])
            except:
                null = np.nan
            result.fill(null)
            for i in np.unique(binnumbers):
                for vv in xrange(Vdim):
                    which = binnumbers == i
                    if weights is None:
                        result[vv, i] = statistic(values[vv, which])
                    else:
                        result[vv, i] = statistic(values[vv, which],
                                                  weights=weights[vv, which])
    ## ...
    ## nothing changes
    ## ...
    return result, edges, binnumbers
```

Besides adding `weights` as an argument
and making sure it has the correct shape,
we only change the code for the case when `statistic` is a callable.
In that case, we need to provide a function `f(values, weights)`
that will do the actual binning.

These small modifications already make `binned_statistic`
able to calculate a weighted average within the bins,
by using the appropriate statistic^{3}:

```
result = binned_statistic(x, values, statistic=np.average,
                          weights=error_values)
```
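As an aside, if you would rather not modify scipy at all, an equivalent weighted bin-average can be sketched with the stock `binned_statistic`, by binning the weighted sums and the weights separately (this is an alternative of my own, not part of the modified function; the data are made up):

```python
import numpy as np
from scipy.stats import binned_statistic

def weighted_bin_mean(x, values, errors, bins):
    # inverse-variance weights
    w = 1.0 / errors**2
    # per-bin weighted mean = sum(w * v) / sum(w)
    num, _, _ = binned_statistic(x, w * values, statistic='sum', bins=bins)
    den, _, _ = binned_statistic(x, w, statistic='sum', bins=bins)
    return num / den

x = np.array([0.1, 0.2, 1.1, 1.3])
v = np.array([1.0, 3.0, 10.0, 20.0])
e = np.array([0.5, 1.0, 1.0, 1.0])
print(weighted_bin_mean(x, v, e, bins=[0, 1, 2]))  # [ 1.4 15. ]
```

This avoids touching scipy internals, but it only gives the weighted mean, not an arbitrary weighted statistic.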

## radial velocities

Now let’s go into our application to radial velocity time series.

The typical dataset will have times, radial velocities, and uncertainties

```
time, rv, err = data
```

and we want a function to nightly bin the three arrays. The first step will be to define the bins, based on the integer part of the array of times:

```
intt = time.astype(int)
_, indices = np.unique(intt, return_index=True)
```

We will have to be a bit careful to create one extra bin
so that the last time is included.
If the last time is “alone” and doesn’t actually need to be binned,
we need to make sure the bin edge is not exactly equal to `time[-1]`

```
if indices[-1] == time.size - 1:
    bins = np.r_[time[indices], time[-1] + 1e-10]
else:
    bins = np.r_[time[indices], time[-1]]
```
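To make the edge logic concrete, here is a toy example with made-up times spanning three nights, where the last observation sits alone in its night:

```python
import numpy as np

time = np.array([100.1, 100.3, 101.2, 102.4])  # hypothetical times
intt = time.astype(int)
_, indices = np.unique(intt, return_index=True)
# indices -> [0, 2, 3]; the last time is "alone" in its night
if indices[-1] == time.size - 1:
    bins = np.r_[time[indices], time[-1] + 1e-10]
else:
    bins = np.r_[time[indices], time[-1]]
print(bins)  # one edge per night, plus a final edge just past the last time
```

The first bin then collects the two observations of the first night, and the lone last observation falls inside the final bin instead of on its edge.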

That was the hard part. Now, we define a function to calculate a variance-weighted average, which we will use as the default for the times and radial velocities when the uncertainties are provided.

```
# variance-weighted mean (the second argument is named `weights`
# because binned_statistic_dd calls the statistic as f(a, weights=...))
wmean = lambda a, weights: np.average(a, weights=1/weights**2)
```
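As a quick sanity check of the inverse-variance weighting, with made-up numbers: a measurement with half the uncertainty gets four times the weight.

```python
import numpy as np

# variance-weighted mean, as in the post
wmean = lambda a, e: np.average(a, weights=1/e**2)

# two measurements, the first twice as precise as the second
print(wmean(np.array([1.0, 3.0]), np.array([0.5, 1.0])))  # (4*1 + 1*3)/5 = 1.4
```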

We use the new `binned_statistic` to bin the radial velocities
using the weighted mean
```
stat = wmean
brv = binned_statistic(time, rv, statistic=stat, bins=bins,
                       weights=err)
```

and the times, also using the weighted mean

```
tstat = wmean
btime = binned_statistic(time, time, statistic=tstat, bins=bins,
                         weights=err)
```

For binning the uncertainties, we have a few choices. I think the most statistically justified option is to add the uncertainties quadratically and divide by the number of points in each bin, like this

```
add_quadratically = lambda a: np.sqrt(np.add.reduce(a**2))
mean_quadratically = lambda a: add_quadratically(a) / a.size
estat = mean_quadratically
berr = binned_statistic(time, err, statistic=estat, bins=bins)
```
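With made-up numbers, for two uncertainties of 3 and 4 in the same bin, the quadratic sum is 5 and the binned uncertainty 2.5:

```python
import numpy as np

add_quadratically = lambda a: np.sqrt(np.add.reduce(a**2))
mean_quadratically = lambda a: add_quadratically(a) / a.size

e = np.array([3.0, 4.0])
print(add_quadratically(e))   # 5.0
print(mean_quadratically(e))  # 2.5
```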

Note that `add_quadratically` and `mean_quadratically`
are only separated for clarity; they could very easily be merged.
If we assume each radial-velocity measurement to be a Gaussian random variable
with variance given by the square of the associated uncertainty,
then this is the correct approach when binning.
But sometimes you might be interested in performing another operation
on the uncertainties, such as a simple average:

```
berr = binned_statistic(time, err, statistic='mean', bins=bins)
```

In the end, the binned times, radial velocities, and uncertainties
will be the first elements of `btime`, `brv`, and `berr`, respectively.
In summary, joining everything in a function:

```
def binRV(time, rv, err, stat='wmean', tstat='wmean', estat='addquad'):
    """ Bin a dataset of radial-velocity observations.

    Parameters
    ----------
    time : array
        The array of times where the radial velocity is measured.
        This function does nightly binning, based on the integer part
        of this array.
    rv : array
        The radial-velocity values.
    err : array (optional)
        The uncertainty on the radial-velocity values.
    stat : string or callable (optional)
        The statistic to compute for the radial velocities. By default,
        this is the weighted mean ('wmean'), in which case the routine
        does inverse-variance averaging, using 1/`err`**2 as weights.
    tstat : string or callable (optional)
        The statistic to compute for the times. The default is the same
        as for the radial velocities.
    estat : string or callable (optional)
        The statistic to compute for the errors. The default is to add
        them quadratically and divide by the number in each bin.

    Notes
    -----
    Arguably, the most justified way to perform binning is to use the
    defaults. This will lead to a weighted average of the times and the
    radial velocities, and quadratically added errors.
    Other options for `stat`, `tstat`, and `estat` are:
    'mean', 'median', 'count', 'sum', 'min', 'max', in addition to
    any callable of the form `f(array, weights)`.
    """
    intt = time.astype(int)
    _, indices = np.unique(intt, return_index=True)

    # one extra bin edge, so that the last time is included
    if indices[-1] == time.size - 1:
        bins = np.r_[time[indices], time[-1] + 1e-10]
    else:
        bins = np.r_[time[indices], time[-1]]

    # variance-weighted mean (the second argument is named `weights`
    # because binned_statistic_dd calls the statistic as f(a, weights=...))
    wmean = lambda a, weights: np.average(a, weights=1 / weights**2)

    if stat == 'wmean':
        stat = wmean
    brv = binned_statistic(time, rv, statistic=stat, bins=bins,
                           weights=err)

    if tstat == 'wmean':
        tstat = wmean
    btime = binned_statistic(time, time, statistic=tstat, bins=bins,
                             weights=err)

    if estat == 'addquad':
        add_quadratically = lambda a: np.sqrt(np.add.reduce(a**2))
        mean_quadratically = lambda a: add_quadratically(a) / a.size
        estat = mean_quadratically
        berr = binned_statistic(time, err, statistic=estat, bins=bins)
    else:
        berr = binned_statistic(time, err, statistic=estat, bins=bins,
                                weights=err)

    return btime[0], brv[0], berr[0]
```
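`binRV` relies on the patched scipy functions above. As an independent sanity check of its defaults, here is a self-contained pure-numpy sketch of mine (not part of the post's script) that does the same nightly binning by grouping on the integer part of the times:

```python
import numpy as np

def bin_rv_numpy(time, rv, err):
    """Nightly binning: inverse-variance means for times and RVs,
    quadratically averaged errors, grouped on int(time)."""
    nights = time.astype(int)
    btime, brv, berr = [], [], []
    for n in np.unique(nights):
        m = nights == n
        w = 1.0 / err[m]**2
        btime.append(np.average(time[m], weights=w))
        brv.append(np.average(rv[m], weights=w))
        berr.append(np.sqrt(np.sum(err[m]**2)) / m.sum())
    return np.array(btime), np.array(brv), np.array(berr)

# toy dataset: two nights, made-up values
t = np.array([100.1, 100.3, 101.2])
v = np.array([1.0, 3.0, 10.0])
e = np.array([0.5, 1.0, 1.0])
bt, bv, be = bin_rv_numpy(t, v, e)
print(bv)  # first night: (4*1 + 1*3)/5 = 1.4
```

For small datasets the explicit loop over nights is perfectly adequate; the `binned_statistic`-based version mainly buys generality in the choice of statistic.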

## wrap up

And there we have it, a relatively simple function to nightly bin
a dataset of radial velocities.
I have not done any serious performance tests^{4},
but I expect this function to be fast enough for most of your needs.
Let me know in the comments if there’s something missing.

### scripts

The full script can be found here.