I frequently acquire and analyze very long time series with millions of data points. My workflow requires interactive plots of the data with zoom and pan and the ability to quickly place cursors on individual datapoints to read off their precise values and to find their indices within the time series.

Here I show how to use HoloViews, Bokeh, and DataShader to build a class that

  1. Allows the interactive plotting of very large datasets and
  2. Simultaneously lets you drag and drop cursors onto individual datapoints.

You can find a full implementation here: https://github.com/tobiasbartsch/plotting/blob/master/plots_with_cursors.py

Note that I took some inspiration from https://github.com/pyviz/holoviews/issues/3248.

Interactive plotting of very large datasets

First, let's generate a small time trace to show how plotting with HoloViews works -- Gaussian noise will do.

In [1]:
import numpy as np
import xarray as xr

dvals = np.random.normal(2,3,size=int(1e2)) #Gaussian-distributed noise with mean = 2 and sdev = 3, 100 points.

#HoloViews requires the data to be annotated. We will use an xarray.DataArray, but you could also use a pandas.DataFrame.
timeseries = xr.DataArray(dvals,
                   dims=['time'],
                   coords = {'time': np.arange(len(dvals))},
                   name='Gaussian Noise')

HoloViews can now create an appropriate Bokeh plot from the annotated data; Bokeh then automatically takes care of interactivity (zoom/pan/etc).

In [2]:
import holoviews as hv
hv.extension('bokeh')

hv.Scatter(timeseries).opts(width=600)
Out[2]:

The interactivity of this plot is exactly what we wanted. Note, however, that this is a very small data set: Bokeh cannot display $10^6$ datapoints in the same manner (trying to do so would probably crash your browser). The workaround is to use DataShader, which bins the points of the graph into individual pixels and returns an image (rather than a large vector object) for Bokeh to display.
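
To make the binning idea concrete, here is a minimal sketch that uses only NumPy, not DataShader itself. It counts how many points fall into each cell of a pixel-sized grid; the 600 x 400 grid size and the variable names are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)           # seeded only so the sketch is reproducible
n_points = 1_000_000
t = np.arange(n_points)                  # sample index, used as the time axis
y = rng.normal(2, 3, size=n_points)      # same noise parameters as in the cells of this post

# Bin all points into a grid with one cell per screen pixel
# (600 x 400 is an assumed plot size, not a DataShader default).
width_px, height_px = 600, 400
counts, t_edges, y_edges = np.histogram2d(t, y, bins=(width_px, height_px))

print(counts.shape)        # one count per pixel
print(int(counts.sum()))   # every point lands in exactly one pixel
```

The browser then only has to draw an image of `width_px` by `height_px` pixels, no matter how many points went into it.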

Let's make a large data set to try out DataShader.

In [3]:
dvals = np.random.normal(2,3,size=int(1e6)) #Gaussian-distributed noise with mean = 2 and sdev = 3, 1e6 points.

timeseries = xr.DataArray(dvals,
                   dims=['time'],
                   coords = {'time': np.arange(len(dvals))},
                   name='Gaussian Noise')

HoloViews contains a datashader module:

In [4]:
import holoviews.operation.datashader as hd
hd.shade.cmap=["darkblue"]

hd.datashade(hv.Scatter(timeseries)).opts(width=600)
Out[4]:

Note that we pass the Scatter object to the datashade operation rather than displaying it directly. Do not display hv.Scatter(timeseries) on its own for a dataset of this size: Bokeh would try its best to render every point as a vector glyph, which would probably crash your browser. As long as your Jupyter notebook has an active Python kernel, DataShader recomputes the displayed image after every zoom and pan and thus supports Bokeh's interactivity. For very large datasets these computations can become demanding, resulting in less responsive interactivity.

The datashaded graph also contains information on how many datapoints ended up in each pixel of the rendered image, which we can visualize by changing the hd.shade.cmap color map.

In [5]:
hd.shade.cmap=["lightblue", "darkblue"]
hd.datashade(hv.Scatter(timeseries)).opts(width=600)
Out[5]:
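
As a rough illustration of how per-pixel counts can become colors, the sketch below linearly interpolates between the two colors of the color map. This is an assumption made for clarity: DataShader applies its own normalization to the counts, which this sketch does not reproduce, and `counts_to_rgb` is a hypothetical helper rather than part of any library.

```python
import numpy as np

# CSS color values for the two names used in the color map.
LIGHTBLUE = np.array([173, 216, 230], dtype=float)
DARKBLUE = np.array([0, 0, 139], dtype=float)

def counts_to_rgb(counts):
    """Map counts to colors between LIGHTBLUE (fewest points) and DARKBLUE (most points).

    Hypothetical helper for illustration; empty pixels are left white.
    """
    counts = np.asarray(counts, dtype=float)
    occupied = counts > 0
    rgb = np.full(counts.shape + (3,), 255.0)        # start with a white image
    if occupied.any():
        lo, hi = counts[occupied].min(), counts[occupied].max()
        span = hi - lo if hi > lo else 1.0           # avoid division by zero
        frac = (counts[occupied] - lo) / span        # 0 = sparsest, 1 = densest
        rgb[occupied] = LIGHTBLUE + frac[:, None] * (DARKBLUE - LIGHTBLUE)
    return rgb.round().astype(np.uint8)

demo = counts_to_rgb([[0, 1], [5, 9]])
print(demo[0, 1], demo[1, 1])   # sparsest and densest occupied pixel
```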

Cursors for individual data points

HoloViews and Bokeh support the selection of data from plots. In principle you can read out the values and indices of the selected points from a data stream (see for example http://holoviews.org/reference/streams/bokeh/Selection1D_points.html). Unfortunately, this does not work for datashaded plots, since Bokeh does not actually know the location of any of the data points (recall that DataShader only passes a 2D histogram of binned data to Bokeh).

As a workaround we will do the following:

  • allow the user to draw a point (the "cursor") onto the datashaded plot.
  • from the x-coordinate of the cursor find the corresponding data value in the (not datashaded) time series.
  • move the cursor to the location of the data point determined in the previous step (which "snaps" the cursor onto the graph).
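
The last two steps can be sketched as a small standalone function, independent of HoloViews. This is an illustration under stated assumptions, not the implementation used later in this post: `snap_to_sample`, `t0`, and `dt` are hypothetical names, the samples are assumed to be equidistant, and the sketch rounds to the nearest sample and clips out-of-range cursors, whereas the class further down uses `floor(x/dt)` without clipping.

```python
import numpy as np

def snap_to_sample(x, values, t0=0.0, dt=1.0):
    """Return (x_snapped, y, index) for the sample nearest to cursor position x.

    Assumes samples at t0, t0 + dt, t0 + 2*dt, ... (hypothetical helper).
    """
    index = int(np.rint((x - t0) / dt))               # nearest sample index
    index = int(np.clip(index, 0, len(values) - 1))   # keep cursors dropped off the ends in range
    return t0 + index * dt, float(values[index]), index

values = np.array([10.0, 11.0, 12.0, 13.0])
print(snap_to_sample(1.2, values))     # cursor dropped between samples 1 and 2
print(snap_to_sample(99.0, values))    # cursor dropped past the end of the trace
```

Returning the index alongside the value is what later allows a cursor to report its position within the time series.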

HoloViews provides a stream of points that the user can draw on the screen; we can pass this stream into a dynamic map and display it overlaid on our datashaded curve.

In [6]:
def _snap(data): 
    '''callback method for the Dynamic map. Return a list of points based on the data in the stream.'''
    #logic will go here
    return hv.Points(data).opts(color='red')

cursor_stream = hv.streams.PointDraw(data={'x': [], 'y': []})
hd.datashade(hv.Scatter(timeseries)).opts(width=600) * hv.DynamicMap(_snap, streams=[cursor_stream])
Out[6]:

Let's first pack everything into a class before we work on the logic of the callback function:

In [7]:
class DataShadedWithCursors(object):
    '''Combines a datashaded plot with dynamic cursors. Assumes that the data was sampled at a set rate and that adjacent data points are equidistant along the time axis.'''
    
    def __init__(self, timeseries):
        '''Construct a DataShadedWithCursors object.
            Args:
                timeseries (xarray.DataArray): the time series
        '''
        if(type(timeseries) != xr.core.dataarray.DataArray):
            raise ValueError("data must be an xarray.DataArray.")

        self._timeseries = timeseries
        _coord = list(self._timeseries.coords.keys())[0] #key of the first coordinate axis in the xarray
        self._dt = float(timeseries[_coord][1]-timeseries[_coord][0]) #find increment between successive steps of the coordinate axis.
        
    def _snap(self, data):
        '''The callback function'''
        
        #logic will go here
        return hv.Points(data).opts(size=10, color='red') #placeholder: return the drawn points unchanged
    
    @property
    def view(self):
        dshade = hd.datashade(hv.Curve(self._timeseries)).opts(width=800)
        cursor_stream = hv.streams.PointDraw(data={'x': [], 'y': []})
        cursor_dmap = hv.DynamicMap(self._snap, streams=[cursor_stream])
        return (dshade * cursor_dmap)

Instances of this class have access to self._timeseries, to which we want to snap the cursors, and expose the plot through the view property. Let's now compute the point(s) in self._timeseries to which we want to move the cursor(s). The index of each point is computed as floor(x/dt), which assumes that the time axis starts at zero. Only the _snap method was edited in the following block of code.

In [8]:
class DataShadedWithCursors(object):
    '''Combines a datashaded plot with dynamic cursors. Assumes that the data was sampled at a set rate and that adjacent data points are equidistant along the time axis.'''
    
    def __init__(self, timeseries):
        '''Construct a DataShadedWithCursors object.
            Args:
                timeseries (xarray.DataArray): the time series
        '''
        if(type(timeseries) != xr.core.dataarray.DataArray):
            raise ValueError("data must be an xarray.DataArray.")

        self._timeseries = timeseries
        _coord = list(self._timeseries.coords.keys())[0] #key of the first coordinate axis in the xarray
        self._dt = float(timeseries[_coord][1]-timeseries[_coord][0]) #find increment between successive steps of the coordinate axis.
        
    def _snap(self, data):
        '''Snap cursors (PointDraw stream) to the underlying data of the graph'''
        self.pnts_snapped = []
        for x in data['x']:
            index = int(np.floor(x/self._dt))
            self.pnts_snapped.append([float(x), float(self._timeseries.values[index]), index])
        pnts_dict = {'x': [p[0] for p in self.pnts_snapped], 'y': [p[1] for p in self.pnts_snapped], 'index': [p[2] for p in self.pnts_snapped]}
        return hv.Points(pnts_dict, vdims='index').opts(size=10, color='red')

    @property
    def view(self):
        dshade = hd.datashade(hv.Curve(self._timeseries)).opts(width=800)
        cursor_stream = hv.streams.PointDraw(data={'x': [], 'y': []})
        cursor_dmap = hv.DynamicMap(self._snap, streams=[cursor_stream])
        return (dshade * cursor_dmap)

Try it out! Remember that the "snapping" requires a Jupyter server with an active Python kernel, so you will need to run this in a Jupyter notebook (it won't work if you are looking at it on my blog).

In [9]:
hd.shade.cmap=["lightblue", "darkblue"]
pwc = DataShadedWithCursors(timeseries)
pwc.view
Out[9]:
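
The snapping arithmetic is only correct if the samples really are equidistant, as the class docstring assumes. A minimal check that one might run on the coordinate axis before constructing the object could look like the sketch below; `is_uniformly_sampled` and its tolerance are hypothetical and not part of the class above.

```python
import numpy as np

def is_uniformly_sampled(coord, rtol=1e-6):
    """Return True if successive coordinate values are (nearly) equidistant.

    Hypothetical helper; rtol is an assumed tolerance for floating-point jitter.
    """
    steps = np.diff(np.asarray(coord, dtype=float))   # increments between neighbors
    return steps.size > 0 and bool(np.allclose(steps, steps[0], rtol=rtol, atol=0.0))

print(is_uniformly_sampled(np.arange(0, 5, 0.5)))    # evenly spaced axis
print(is_uniformly_sampled([0.0, 1.0, 2.5, 3.0]))    # irregular axis
```

For an xarray.DataArray like the one used here, the values of the time coordinate could be passed in as `coord`.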

Now that we have cursors that snap onto our data, we would like to know the values of the marked data points. HoloViews lets you create a table and link it to the cursor data; the code below adds this to the view property.

In [10]:
from holoviews.plotting.links import DataLink


class DataShadedWithCursors(object):
    '''Combines a datashaded plot with dynamic cursors. Assumes that the data was sampled at a set rate and that adjacent data points are equidistant along the time axis.'''
    
    def __init__(self, timeseries):
        '''Construct a DataShadedWithCursors object.
            Args:
                timeseries (xarray.DataArray): the time series
        '''
        if(type(timeseries) != xr.core.dataarray.DataArray):
            raise ValueError("data must be an xarray.DataArray.")

        self._timeseries = timeseries
        _coord = list(self._timeseries.coords.keys())[0] #key of the first coordinate axis in the xarray
        self._dt = float(timeseries[_coord][1]-timeseries[_coord][0]) #find increment between successive steps of the coordinate axis.
        
    def _snap(self, data):
        '''Snap cursors (PointDraw stream) to the underlying data of the graph'''
        self.pnts_snapped = []
        for x in data['x']:
            index = int(np.floor(x/self._dt))
            self.pnts_snapped.append([float(x), float(self._timeseries.values[index]), index])
        pnts_dict = {'x': [p[0] for p in self.pnts_snapped], 'y': [p[1] for p in self.pnts_snapped], 'index': [p[2] for p in self.pnts_snapped]}
        return hv.Points(pnts_dict, vdims='index').opts(size=10, color='red')

    @property
    def view(self):
        dshade = hd.datashade(hv.Curve(self._timeseries)).opts(width=800)
        cursor_stream = hv.streams.PointDraw(data={'x': [], 'y': []})
        cursor_dmap = hv.DynamicMap(self._snap, streams=[cursor_stream])
        table = hv.Table(cursor_dmap, ['x', 'y']).opts(editable=True)
        DataLink(cursor_dmap, table)
        return (dshade * cursor_dmap) + table

In [11]:
pwc = DataShadedWithCursors(timeseries)
pwc.view
Out[11]: