Tags: Machine Learning, NYC Subway, Single Molecule Mechanics

Crossing the Manhattan Bridge (or applying single-molecule models to transit data)

About a year ago I moved from the Upper East Side (where Rockefeller University and my research job are located) to Brooklyn. This move came at the expense of a much longer commute on the Q line of the infamous NYC subway system. In recent years subway delays had gotten so out of hand that in 2017 New York Governor Andrew Cuomo declared a state of emergency to accelerate repairs of the decaying system. Even though improvements have been made since then, I often find myself waiting for trains that do not arrive, or stuck on trains that stall for no apparent reason somewhere between Brooklyn and Manhattan. The MTA publishes the status of its subway lines (“good service”, “delays”, etc.), but I have found that it can take 20 min from the onset of delays until the “good service” message disappears, which is much too long during rush hour. In this post (and, hopefully, in many that will follow) I set out to understand how the MTA runs its subway system, to build tools that can report delays in real time, and perhaps to use historical data to predict delays in the future.

The data

I was delighted to find that a real-time feed of the status of all trains exists, which I have been mining at ~30 second intervals for the past three months. I plan to write a separate post on the data models I used to extract transit information from the feed; for now, let’s just assume that we know the time stamps at which each train stopped at the stations along its line. For example, shown below is a data set of the transit times between Avenue U and Kings Hwy, computed from all trains that traversed the north-bound Q line during our time frame of observation.
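If you would like to poll the feed yourself, here is a minimal sketch of what one polling cycle could look like. The feed URL and authentication header are assumptions based on the MTA’s GTFS-realtime service at the time of writing (both have changed over the years, so check the current developer documentation), and this is not the exact code behind my data pipeline:

import time
import requests
from google.transit import gtfs_realtime_pb2  # pip install gtfs-realtime-bindings

# Assumed endpoint for the N/Q/R/W feed and assumed auth header; verify
# against the MTA developer documentation before using.
FEED_URL = 'https://api-endpoint.mta.info/Dataservice/mtagtfsfeeds/nyct%2Fgtfs-nqrw'
API_KEY = 'your MTA API key'

def poll_feed():
    """Fetch one snapshot of the real-time feed and extract
    (trip id, stop id, predicted arrival time) tuples."""
    feed = gtfs_realtime_pb2.FeedMessage()
    response = requests.get(FEED_URL, headers={'x-api-key': API_KEY})
    feed.ParseFromString(response.content)
    arrivals = []
    for entity in feed.entity:
        if entity.HasField('trip_update'):
            for update in entity.trip_update.stop_time_update:
                arrivals.append((entity.trip_update.trip.trip_id,
                                 update.stop_id,
                                 update.arrival.time))
    return arrivals

while True:
    snapshot = poll_feed()
    # ... persist the snapshot; comparing consecutive snapshots yields the
    #     time stamps at which each train stopped at each station ...
    time.sleep(30)  # the MTA refreshes the feed roughly every 30 s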

We can plot such time series for any pair of stations along any line in the system (I will show more examples below); I chose this pair because it demonstrates a few features that hold for most station pairs:

  1. There are four large (and several small) gaps in the data: one in early April, two in May, and one in June. These gaps are my fault: I have been running the data-mining software on an old Raspberry Pi, which sometimes freezes, and at times such hang-ups went undetected for a while.
  2. On the surface the data appear to be largely Gaussian distributed (see right panel), albeit with a heavy tail comprising the transit times of trains that stall between the two stations. For the data shown above the overall mean is 89 s with a standard deviation of 28 s. Most of the data’s variance can likely be attributed to insufficient sampling: the MTA only updates the real-time data feed every 30 s, so we only know a train’s arrival time at a station to within ~30 s.
  3. Closer inspection of the data reveals that, contrary to a Gaussian process with constant mean, the time trace is only piecewise constant, with sudden shifts in the mean. The standard deviation appears to remain constant across these shifts. To convince yourself that this is true, use the “box zoom tool” (located above the graph) to zoom into the data, in particular between mid-April and mid-May. The shifts in the mean likely occur due to scheduled (or unscheduled) maintenance of the subway system.

So far I have repeatedly mentioned “delays”, but what does it actually mean for a train to be delayed? A possible approach is to compute the mean transit time between two stations and its standard deviation; a delayed train is then one that takes several standard deviations longer than the mean to travel between the stations.
However, it follows from observation (3) above that modeling the data as a random process with constant mean is not appropriate: a “fastest” average transit time between two stations exists, but scheduled changes in transit time (caused, for example, by maintenance) can shift that fastest mean to higher values. The values of these longer transit times are not random: on all days during which the same maintenance is scheduled, the mean may well shift to the same value. It follows that the mean transit time between two stations is not a continuously distributed variable but is rather quantized (it can only assume one of a set of predetermined values).

Rather than modeling the entire trace with one unique mean value, it is hence more useful to describe the data as piecewise constant. One possible model (found by the algorithm described in the remainder of this post) is shown in the graph below.

Modeling piecewise constant signals

Piecewise constant signals with large variance frequently occur in “my research life”, in which I study the mechanics of single molecules and soft matter. In brief, a single molecule can exist in a set of “conformations” (arrangements of its constituent atoms). Under certain conditions the molecule can emit light, and the light’s intensity depends on the molecule’s conformation. A rearrangement of the molecule’s atoms (a “hop” between conformations) can then be measured as a change in the mean intensity of the light that the molecule emits. Different intensities therefore reflect different states of the molecule. Various algorithms exist to fit these data; in particular, such algorithms can answer the following questions (a toy simulation of such a signal follows the list below):

  • How many different states (= mean intensity levels) are there in the data?
  • Where in the data are the change points at which the intensity switched to a new mean?
  • What is the rate of transitioning from one mean to another?
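To make this concrete, here is a toy simulation (my own illustration, not code from any paper) of such a noisy two-state intensity trace, assuming exponentially distributed dwell times in each conformation and Gaussian measurement noise:

import numpy as np

rng = np.random.default_rng(0)

def telegraph_signal(means=(0.0, 200.0), mean_dwell=100, n_points=1000, sigma=20.0):
    """Simulate a noisy piecewise-constant 'intensity' trace: the molecule
    hops between two conformations, dwelling in each for an exponentially
    distributed number of samples before switching."""
    signal = np.empty(n_points)
    state, i = 0, 0
    while i < n_points:
        dwell = max(1, int(rng.exponential(mean_dwell)))
        signal[i:i + dwell] = means[state]  # constant segment at this state's mean
        state = 1 - state                   # hop to the other conformation
        i += dwell
    return signal + rng.normal(0, sigma, n_points)  # add measurement noise

noisy_trace = telegraph_signal()

A fitting algorithm must recover the two means, the change points, and the transition rate from exactly this kind of trace.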

The STaSI algorithm

One of my favorite algorithms for fitting piecewise-constant data is fast Step Transition and State Identification (STaSI), formulated by B. Shuang et al. (2014) at Rice University. Even though “Stasi” is a rather problematic name (in particular if you know some German history), the algorithm does a great job of learning the smallest set of states and change points that still adequately describes the data. To convince yourself that this should be the goal of the algorithm, consider for a moment a model comprising an arbitrarily large number of states and change points. The best-fitting such model would be exactly identical to the noisy time series and hence devoid of any meaning: each datum in the time series would be its own change point and its own state (with its own mean value). The model would hopelessly overfit the data.

The STaSI algorithm first determines all statistically significant change points by comparing the mean of the time series to the left of each data point to that on its right using a t-test. The segments between the identified change points are modeled as having constant means. This initial model is very complex (it has many parameters). For example, consider a piecewise constant process that switches back and forth between two states with ground-truth means of, say, 0 and 200. Due to the noise on the data, the algorithm would find a mean for each segment close but not identical to the ground truth, say [0.0001, -0.06, -0.0002, 0.005, ...] for the segments that correspond to ground truth 0, and [199.899, 200.01, 200.0005, ...] for the segments that belong to ground truth 200. Ideally we want our algorithm to group all of these highly similar segments and assign each group to the same state (with one unique mean). STaSI achieves this by iteratively generating ever simpler models: in each iteration it merges the two most similar states and assigns them their average mean. It then learns the ideal number of states and change points by minimizing the minimum description length (MDL), which is the sum of the loss function F (the distance between the noisy data and the model in the L1 norm) and a function G that measures the complexity of the fitted model,

\[\mbox{MDL} = \mbox{F} + \mbox{G}.\]

F and G are defined as (see B. Shuang et al. (2014))

\[F = \frac{\sum_{i=1}^{N} \vert y(t_i) - y_{\mbox{fit}}(t_i) \vert}{2 \sigma},\]

\[G = \frac{k}{2} \ln{\frac{1}{2 \pi}} + k \ln{\frac{V}{\sigma}} + \frac{N_{\mbox{cp}}}{2} \ln{N} + \frac{1}{2} \left( \sum_{i=1}^k \ln{n_i} + \sum_{j=1}^{N_{\mbox{cp}}} \ln{\frac{T_j^2}{\sigma^2}} \right),\]

where σ is the standard deviation of the data, y(ti) and yfit(ti) are the values of the data and of the fit at each time point ti, N is the total number of points in the time series, k is the number of distinct states, Ncp is the number of change points at which the model switches from one state to another, V = ymax – ymin is the range of the data, ni is the number of data points in state i, and Tj is the difference between the state means at change point j.

Making the model more complex (by increasing the number of states and by adding more change points) decreases the loss F but increases the complexity G. The minimum description length MDL is minimal at the ideal balance between these two functions (see the Supplemental Information of B. Shuang et al. (2014)).
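To make the definitions above concrete, here is a minimal sketch of how the MDL of a given piecewise-constant fit could be evaluated. The function name and the (start, stop, state) segment representation are illustrative choices of mine and do not mirror the internals of the implementation discussed below:

import numpy as np

def mdl(data, fit, means, segments, sigma):
    """Evaluate MDL = F + G for a piecewise-constant fit.
    'fit' is the model evaluated at every time point; 'segments' is a list
    of (start, stop, state) tuples with 0-indexed states; 'means' maps each
    state index to its fitted mean."""
    N = len(data)
    k = len(means)                                # number of distinct states
    V = data.max() - data.min()                   # range of the data
    F = np.sum(np.abs(data - fit)) / (2 * sigma)  # L1 loss
    # step sizes T_j at the change points between consecutive segments
    T = np.array([abs(means[nxt[2]] - means[prev[2]])
                  for prev, nxt in zip(segments[:-1], segments[1:])])
    # n_i: the number of data points assigned to each state
    n = np.zeros(k)
    for start, stop, state in segments:
        n[state] += stop - start + 1
    G = (k / 2 * np.log(1 / (2 * np.pi)) + k * np.log(V / sigma)
         + len(T) / 2 * np.log(N)
         + 0.5 * (np.sum(np.log(n)) + np.sum(np.log(T**2 / sigma**2))))
    return F + G

Scanning this quantity over the sequence of ever simpler models generated by the merging procedure, and keeping the model with the smallest MDL, is exactly the balance between F and G described above.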

Python source code and example with synthesized data

The authors of B. Shuang et al. (2014) provide STaSI as a Matlab script, which was inconvenient for me since the rest of my data analysis is done in Python. I therefore implemented my own Python version of their algorithm. If you would like to try it out, just clone my “algorithms” repository at https://github.com/tobiasbartsch/algorithms, add it to your path, and you are good to go. Here is a usage example with synthesized data:

import numpy as np
import holoviews as hv #importing holoviews for painless plotting
hv.extension('bokeh') #if you are executing this code in a Jupyter notebook you probably want to use bokeh

import sys
sys.path.append('your path to the cloned repository')
import algorithms.STaSI as st

#Create some noisy data (standard deviation = 2) that we can later fit.
#The data contains the following sequence:
#200 points: mean 0
#100 points: mean 2
#100 points: mean 0
#150 points: mean 3

data = np.concatenate((np.random.normal(0, 2, 200),
                       np.random.normal(2, 2, 100),
                       np.random.normal(0, 2, 100),
                       np.random.normal(3, 2, 150)))

#fit the STaSI model to the data
fit, means, results, MDLs = st.fitSTaSIModel(data)

#plot the model on top of the data
hv.Scatter(data) * hv.Scatter(fit) + (hv.Curve(MDLs).opts(norm={'axiswise': True})).opts(ylabel='MDL', xlabel='model')

The algorithm correctly identifies the three states and finds their means in good agreement with the ground truth:

The function st.fitSTaSIModel(data) also returns ‘means’, a list of the identified means in the best-fitting model, and ‘results’, a DataFrame listing the identified constant segments with their start and end indices, assigned states, and assigned mean values. Finally, the MDLs array contains the minimum description lengths of all models that were internally generated by the algorithm. Only the model that minimizes the MDL is returned to the user, in this case model 1 (see the right panel in the figure above).
The ‘results’ DataFrame for this example is in good agreement with the ground truth of our generated data:

   start  stop      mean  state
0      0   199 -0.115628      1
1    199   309  1.851682      2
2    309   399 -0.115628      1
3    399   549  3.064355      3

Average transit times for the north-bound Q line

Using the output of the STaSI algorithm we can now model the mean transit time between any two stations in the subway system, which enables a more careful definition of train delays: a train is delayed if its transit time exceeds the current (piecewise-constant) mean transit time by several standard deviations. The global mean transit time can have a very different value and is largely meaningless for describing train delays.
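As a minimal sketch of this definition (the helper name and the three-sigma threshold are my own choices), flagging delayed trains amounts to comparing each observed transit time to the local mean given by the fit:

import numpy as np

def flag_delays(transit_times, fit, sigma, n_sigma=3):
    """Return a boolean array that is True for trains whose transit time
    exceeds the local (piecewise-constant) mean by more than n_sigma
    standard deviations. 'fit' is the STaSI model evaluated at each
    observation, as returned by st.fitSTaSIModel."""
    return transit_times > fit + n_sigma * sigma

#e.g., for the Avenue U -> Kings Hwy data shown above (sigma ~ 28 s):
#delayed = flag_delays(transit_times, fit, sigma=28)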

I will conclude this post by showing graphs of the mean transit times between all pairs of stations on the north-bound Q line. I will discuss these data in a future blog post, but the graph is too beautiful not to include here as well: