data visualization NYC Subway

Towards a realtime map of subway delays

I have long wished a map of the NYC subway system existed that showed delays in realtime, much like modern mapping software displays delays on street maps. Using our understanding of the states a subway line can exist in (including the current state’s mean transit times between stations and their standard deviations) and our ability to plot the NYC subway lines, we are now in a position to build such a map.

Since the rest of this post is rather mathematical and technical I’d like to show you the final result first. Here are a few snapshots of the map — you can pick a different time using the pull down menu. The color code runs from blue through black to red and indicates the probability that a train will arrive at a station with a delay. A large red circle means that a train is certainly late; a mid sized black dot indicates a 50% chance of a delay. Grey circles indicate stations that currently do not have any incoming trains.

I have implemented a continuously updating version of this map in a Jupyter notebook (not shown here). Using that code, in the future I hope to build a webpage that in realtime shows the current status of the subway system.

So how was this map built? First, we have to define what it means for a train to be delayed. Passengers waiting at a station to be picked up will call a train ‘delayed’ when they have waited at the station for an ‘unreasonable’ amount of time. Such long wait times, however, can have two different origins:

  1. Trains may be traveling through the subway system without encountering any hold ups — but the spacing between trains may be uneven. Consider the Q line during rush hour, during which there should be an ~8 min gap between trains. If the distribution of trains becomes uneven, long wait times can arise. For example, if two Q trains are spaced by a 1 minute interval, the third Q train, following the closely spaced pair, likely won’t arrive until 15 minutes after the second train — a very unreasonable wait time during rush hour.
  2. On the other hand, a train may encounter a real issue between two stations. Commonly, for example, a traffic jam might be slowing the train’s progress. Under such circumstances the train will then complete its transit between the two stations within a time that is much longer than the mean transit time. It is easy to see that delays of this second type can eventually give rise to delays of type 1.

It is no wonder that both types of delays frequently occur in the NYC subway system. In the following we will focus on delays of type 2; delays of type 1 will be the subject of a future blog post.

How can we tell whether a train is running ‘unreasonably’ late due to a type 2 delay? The mean transit times between adjacent subway stations are almost normally distributed and the distribution’s heavy tail comprises the abnormally long transit times of delayed trains. We can hence set an empirical cutoff, say at three standard deviations, to separate trains that are “on schedule” from those that are running with delays. Therefore, our map could be color coded using this threshold: if a train has taken longer than three standard deviations beyond the mean transit time to travel between two stations, then its status at the destination station is “delayed”. Therefore,
\[delay(\tau) = 0 \quad \mbox{for} \quad \tau < \mu + 3 \sigma \\ delay(\tau) = 1 \quad \mbox{for} \quad \tau > \mu + 3 \sigma,\]
where τ is the observed transit time between two stations, and μ and σ are the mean transit time and its standard deviation. Recall that μ and σ depend on the current state of the connection between the two stations, i.e. these values depend on factors such as scheduled maintenance or reduced speeds due to damaged tracks. For this analysis we will use last μ and σ that was observed for each pair of stations.

So far I have described a binary color code of ‘on time’ or ‘delay’, but we can do better than that: the time t0 that a passenger has already waited for a train can be used to compute the probability that that train will be delayed (i.e. that it will arrive more than three standard deviations late). We can then introduce a color code for this delay probability and color our map accordingly. All we need to do is compute
\[p(delay | t > t_0) = \frac{p(delay) \quad p(t > t_0 | delay)}{p(t > t_0)}.\]
If we approximate the transit times as Gaussian distributed, then the right hand side of this equation can be evaluated analytically as various integrals over the Gaussian probability density. You can find a full implementation in my Github repository, in particular in “class delayOfTrainsInLine_Bayes” which you can find here. The following hence only discusses relevant code snippets that you should view in context of that class.

The prior, p(delay==True), is simply equal to the area under the far right hand side of the distribution, from σ = 3 to ∞, whereas p(delay==False) can be evaluated as the complementary area, from σ = -∞ to 3. These areas can be easily computed by the cumulative distribution function:

def _p_H(self, n, delayed):
    '''compute the prior.
    
    Args:
        n (float): number of sdevs that define a threshold beyond which we treat a train as delayed (delay for t > n).
        delayed (boolean): the hypothesis: is the train delayed? True/False
    '''
    if(delayed):
        return 1-norm.cdf(n)
    else:
        return norm.cdf(n)

The likelihood, p(t > t0 | delay), can be computed using similar arguments as:

def _likelihood(self, t0, n, delayed):
    '''compute the likelihood of observing t>t0 given that the current train is (not) delayed.
    
    Args:
        t0 (float): current wait time for the inbound train (expressed in multiples of sdev and with zero mean).
        n (float): number of sdevs that define a threshold beyond which we treat a train as delayed (delay for t > n).
        delayed (boolean): the hypothesis: is the train delayed? True/False
        
    Returns:
        likelihood (float)
    '''
    
    like = delayed * (1/(1-norm.cdf(n)) * (1-norm.cdf(np.max([n, t0])))) + (1-delayed) * (1/norm.cdf(n) * (norm.cdf(np.max([t0,n])) - norm.cdf(t0)))
    
    return like

Finally, the probability of a delay is:

def _probability_of_delay(self, t0, mu, sigma, n):
    '''compute the posterior probability: what is the predicted delay probability after we have waited a time t0 for a train.
    
    Args:
        t0 (float): time in seconds since the train left the origin station 
        mu (float): average transit time (in seconds) between the stations
        sigma (float): standard devition (in seconds) of the transit time between stations
        n (float): number of sdevs that define a threshold beyond which we treat a train as delayed (delay for (t_arrival-mu)/sdev > n).
    '''
    #normalize t0
    t0 = (t0-mu)/sigma
    
    return self._p_H(n, delayed=True) * self._likelihood(t0, n, delayed=True) / (self._p_H(n, delayed=True) * self._likelihood(t0, n, delayed=True) + self._p_H(n, delayed=False) * self._likelihood(t0, n, delayed=False))

All that is left to do is to assign a color based on the probability and to update the plot of the subway map as previously described.