Nightly bin RVs
One of the most common operations performed on a time series of radial velocities
is to bin measurements taken on the same night.
In a previous post, I briefly talked about
scipy’s binned_statistic function, which can be used to bin data.
But it turns out this function is missing one important feature
which is necessary for the specific RV use case.
In this post, I will present a modification of binned_statistic
which allows measurement uncertainties to be used as weights
during the binning calculation.
This modified function will then be applied to RV time series.
First, some brief words of motivation.
The exact reason why we want to bin is not the most important part1.
On a given night, we might have obtained shorter exposures
of a very bright star in order not to saturate our detector,
and now want to combine the individual measurements.
Or we might be trying to average out stellar oscillations
by combining two observations from the same night.
Whatever the reason, each RV measurement comes with an associated uncertainty,
which we might want to use during the binning.
This will allow us to weight each RV point individually.
We are also interested in combining the times of the observations,
as well as the uncertainties themselves.
changes to binned_statistic
The implementation of this function in scipy can be found here.
After checking some of the arguments, the binned_statistic function
calls binned_statistic_dd and builds a custom namedtuple
with the result2 before returning.
If we want to include an extra parameter “weights”, we need to
re-write both binned_statistic and binned_statistic_dd.
In the following, I will leave out the modifications to the doc strings,
as they are quite lengthy.
At the end of the post, you can find
the full implementation including these changes.
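For reference, this is roughly how the stock scipy.stats.binned_statistic is used
before any modification (the numbers below are made up, just to show the call
and its return values):
import numpy as np
from scipy.stats import binned_statistic as scipy_binned_statistic

# five made-up points falling into three bins
x = np.array([0.1, 0.2, 1.1, 1.3, 2.4])
values = np.array([1.0, 3.0, 2.0, 4.0, 5.0])

# returns the per-bin statistic, the bin edges, and the bin number of each
# point; note there is no way to pass per-point weights
stat, edges, binnumber = scipy_binned_statistic(x, values,
                                                statistic='mean', bins=3)
print(stat)  # [2. 3. 5.]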
For our script to be self-contained, we first need some imports:
import numpy as np
from scipy._lib.six import callable, xrange
from scipy._lib._numpy_compat import suppress_warnings
Now, the modified binned_statistic function:
def binned_statistic(x, values, statistic='mean', bins=10,
                     range=None, weights=None):
    try:
        N = len(bins)
    except TypeError:
        N = 1

    if N != 1:
        bins = [np.asarray(bins, float)]

    if range is not None:
        if len(range) == 2:
            range = [range]

    medians, edges, binnumbers = binned_statistic_dd(
        [x], values, statistic, bins, range, weights)

    return medians, edges[0], binnumbers
Yes, that’s anticlimactic.
The only modifications to the original function are the inclusion of weights
in the arguments and in the call to binned_statistic_dd.
So, it’s binned_statistic_dd which is doing all the work.
But, even there, not much really changes:
def binned_statistic_dd(sample, values, statistic='mean', bins=10,
                        range=None, weights=None,
                        expand_binnumbers=False):
    ## ...
    ## nothing changes
    ## ...

    values = np.atleast_2d(values)
    if weights is not None:
        weights = np.atleast_2d(weights)

    ## ...
    ## nothing changes
    ## ...

    elif callable(statistic):
        with np.errstate(invalid='ignore'), suppress_warnings() as sup:
            sup.filter(RuntimeWarning)
            try:
                if weights is None:
                    null = statistic([])
                else:
                    null = statistic([], weights=[])
            except:
                null = np.nan
        result.fill(null)
        for i in np.unique(binnumbers):
            for vv in xrange(Vdim):
                which = binnumbers == i
                if weights is None:
                    result[vv, i] = statistic(values[vv, which])
                else:
                    result[vv, i] = statistic(values[vv, which],
                                              weights=weights[vv, which])

    ## ...
    ## nothing changes
    ## ...

    return result, edges, binnumbers
Besides adding weights as an argument
and making sure it has the correct shape,
we only change the code for the cases when statistic is a callable.
In these cases, we will need to provide a function f(values, weights)
that will do the actual binning.
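As an illustration of this interface, any function accepting a weights keyword
argument will do. Here is a hypothetical example (the weighted_rms function
below is mine, for illustration only; it is not part of scipy):
import numpy as np

def weighted_rms(values, weights):
    # weighted root-mean-square of the values, only to illustrate the
    # f(values, weights) signature expected by the modified function
    values = np.asarray(values, dtype=float)
    return np.sqrt(np.average(values**2, weights=weights))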
These small modifications already make binned_statistic
able to calculate a weighted average within the bins,
by using the appropriate statistic3:
result = binned_statistic(x, values, statistic=np.average,
                          weights=error_values)
radial velocities
Now let’s go into our application to radial velocity time series.
The typical dataset will have times, radial velocities, and uncertainties
time, rv, err = data
and we want a function to nightly bin the three arrays. The first step will be to define the bins, based on the integer part of the array of times:
intt = time.astype(int)
_, indices = np.unique(intt, return_index=True)
We will have to be a bit careful to create one extra bin
so that the last time is included.
If the last time is “alone” and doesn’t actually need to be binned,
we need to make sure the bin edge is not exactly equal to time[-1]:
if indices[-1] == time.size - 1:
    bins = np.r_[time[indices], time[-1] + 1e-10]
else:
    bins = np.r_[time[indices], time[-1]]
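To make this concrete, here is a small worked example with made-up times,
chosen so that the last night contains a single observation:
import numpy as np

# three made-up nights of observations; the last night has a single point
time = np.array([7510.45, 7510.52, 7511.48, 7511.55, 7512.50])

intt = time.astype(int)                          # [7510 7510 7511 7511 7512]
_, indices = np.unique(intt, return_index=True)  # first index of each night: [0 2 4]

# here the last time is "alone", so the final edge gets the tiny offset
bins = np.r_[time[indices], time[-1] + 1e-10]
print(bins.size - 1)  # 3 bins, one per night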
That was the hard part. Now, we define a function to calculate a variance-weighted average, which we will use as the default for the times and radial velocities when the uncertainties are provided.
# weighted mean; the second argument receives the uncertainties
wmean = lambda a, weights: np.average(a, weights=1/weights**2)
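As a quick sanity check (with made-up numbers), this reproduces the explicit
inverse-variance formula, sum(a/e**2) / sum(1/e**2):
import numpy as np

a = np.array([10.0, 12.0, 11.0])   # made-up values
e = np.array([1.0, 2.0, 0.5])      # made-up uncertainties

wmean = lambda a, weights: np.average(a, weights=1/weights**2)
manual = np.sum(a / e**2) / np.sum(1 / e**2)
print(np.isclose(wmean(a, e), manual))  # True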
We use the new binned_statistic to bin the radial velocities
using the weighted mean
stat = wmean
brv = binned_statistic(time, rv, statistic=stat, bins=bins,
                       weights=err)
and the times, also using the weighted mean
tstat = wmean
btime = binned_statistic(time, time, statistic=tstat, bins=bins,
                         weights=err)
For binning the uncertainties, we have a few choices. I think the most statistically justified option is to add the uncertainties quadratically, like this:
add_quadratically = lambda a: np.sqrt(np.add.reduce(a**2))
mean_quadratically = lambda a: add_quadratically(a) / a.size
estat = mean_quadratically
berr = binned_statistic(time, err, statistic=estat, bins=bins)
Note that add_quadratically and mean_quadratically are only separated
for clarity; they could very easily be merged.
If we assume each radial-velocity measurement to be a Gaussian random variable
with variance given by the square of the associated uncertainty,
then this is the correct approach when binning.
But sometimes you might be interested in performing another operation
on the uncertainties, such as a simple average:
berr = binned_statistic(time, err, statistic='mean', bins=bins)
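To see the difference between the two choices, consider N points with the same
uncertainty sigma (made-up numbers): adding in quadrature and dividing by N
gives sigma/sqrt(N), as expected for the mean of N independent measurements,
while a plain average simply returns sigma.
import numpy as np

sigma, N = 2.0, 4
err = np.full(N, sigma)  # four made-up points, all with uncertainty 2.0

add_quadratically = lambda a: np.sqrt(np.add.reduce(a**2))
mean_quadratically = lambda a: add_quadratically(a) / a.size

print(mean_quadratically(err))  # 1.0, i.e. sigma / sqrt(N)
print(err.mean())               # 2.0, i.e. sigma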
By the end, the binned times, radial velocities, and uncertainties
will be the first elements of btime, brv, and berr.
In summary, joining everything in a function:
def binRV(time, rv, err, stat='wmean', tstat='wmean', estat='addquad'):
    """ Bin a dataset of radial-velocity observations.

    Parameters
    ----------
    time : array
        The array of times where the radial velocity is measured.
        This function does nightly binning, based on the integer part
        of this array.
    rv : array
        The radial-velocity values.
    err : array
        The uncertainty on the radial-velocity values.
    stat : string or callable (optional)
        The statistic to compute for the radial velocities. By default,
        this is the weighted mean ('wmean') in which case the routine
        does inverse variance averaging, using 1/`err`**2 as weights.
    tstat : string or callable (optional)
        The statistic to compute for the times. The default is the same
        as for the radial velocities.
    estat : string or callable (optional)
        The statistic to compute for the errors. The default is to add
        them quadratically and divide by the number in each bin.

    Notes
    -----
    Arguably, the most justified way to perform binning is to use the
    defaults. This will lead to a weighted average of the times and the
    radial velocities, and quadratically added errors.

    Other options for `stat`, `tstat`, and `estat` are:
    'mean', 'median', 'count', 'sum', 'min', 'max', in addition to
    any callable of the form `f(array, weights)`.
    """
    # nightly bins, from the integer part of the times
    intt = time.astype(int)
    _, indices = np.unique(intt, return_index=True)

    if indices[-1] == time.size - 1:
        bins = np.r_[time[indices], time[-1] + 1e-10]
    else:
        bins = np.r_[time[indices], time[-1]]

    # weighted mean; the second argument receives the uncertainties
    wmean = lambda a, weights: np.average(a, weights=1 / weights**2)

    if stat == 'wmean':
        stat = wmean
    brv = binned_statistic(time, rv, statistic=stat, bins=bins,
                           weights=err)

    if tstat == 'wmean':
        tstat = wmean
    btime = binned_statistic(time, time, statistic=tstat, bins=bins,
                             weights=err)

    if estat == 'addquad':
        add_quadratically = lambda a: np.sqrt(np.add.reduce(a**2))
        mean_quadratically = lambda a: add_quadratically(a) / a.size
        estat = mean_quadratically
        berr = binned_statistic(time, err, statistic=estat, bins=bins)
    else:
        berr = binned_statistic(time, err, statistic=estat, bins=bins,
                                weights=err)

    return btime[0], brv[0], berr[0]
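To close, here is a quick usage sketch with made-up data; it assumes binRV and
the full modified binned_statistic (as in the script linked below) are available:
import numpy as np

# made-up dataset: five observations over three nights
time = np.array([7510.45, 7510.52, 7511.48, 7511.55, 7512.50])
rv = np.array([31.20, 31.50, 30.90, 31.10, 31.40])     # km/s
err = np.array([0.10, 0.12, 0.09, 0.11, 0.10])         # km/s

btime, brv, berr = binRV(time, rv, err)
print(btime.size)  # 3, one binned point per night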
wrap up
And there we have it, a relatively simple function to nightly bin a dataset of radial velocities. I have not done any serious performance tests,4 but I expect this function to be fast enough for most of your needs. Let me know in the comments if there’s something missing.
scripts
The full script can be found here.