
Analysis of one month of subway data – part 2: correlation between the total number of trains and the probability of delays

When I first decided to track trains in the NYC subway system I was sure I would find a correlation between the number of trains in the system and the percentage of trains running with delay. I was also sure it would be a positive relationship. The more trains there are in the system, the more traffic jams, and the more delays one should find — simple, right?

Surprisingly, as we will see in the following, I was quite wrong.

The fraction of trains running with delays correlates negatively with the total number of trains in the subway system

When we plot the fraction of delayed trains (normalized by its standard deviation) versus the number of trains in the system (expressed in z-scores), we find hints of a negative relationship (left panel of the graph below). When plotted on a log-linear scale, this relationship becomes fairly obvious (right panel).
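A figure like this is straightforward to reproduce. Here is a minimal matplotlib sketch, assuming a pandas DataFrame called dataframe that already holds the normalized columns numtrains_norm and delaypercent_norm used later in this post (the styling of the original figure is not reproduced):

import matplotlib.pyplot as plt

fig, (ax_lin, ax_log) = plt.subplots(1, 2, figsize=(10, 4))

#left panel: linear scale, hints of a negative relationship
ax_lin.scatter(dataframe['numtrains_norm'], dataframe['delaypercent_norm'],
               s=5, alpha=0.3)
ax_lin.set_xlabel('number of trains (z-score)')
ax_lin.set_ylabel('fraction of delayed trains (normalized)')

#right panel: log-linear scale, the trend becomes obvious
ax_log.scatter(dataframe['numtrains_norm'], dataframe['delaypercent_norm'],
               s=5, alpha=0.3)
ax_log.set_yscale('log')
ax_log.set_xlabel('number of trains (z-score)')
ax_log.set_ylabel('fraction of delayed trains (log scale)')

plt.tight_layout()
plt.show()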

For the remainder of this post, all variables should be considered normalized as described above, even when this is not explicitly mentioned.
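For concreteness, the normalization could be computed along the following lines. The raw column names delaypercent and numtrains are hypothetical; only the normalized columns appear in the code later in this post:

#fraction of delayed trains, scaled by its standard deviation
#(not centered, so the values stay positive for the LogNormal model)
dataframe['delaypercent_norm'] = (dataframe['delaypercent']
                                  / dataframe['delaypercent'].std())

#number of trains expressed as z-scores
dataframe['numtrains_norm'] = ((dataframe['numtrains'] - dataframe['numtrains'].mean())
                               / dataframe['numtrains'].std())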
We conclude that 1) subway delays are well approximated by a LogNormal distribution, and 2) a negative correlation exists between the number of trains in the system and the probability that any one train runs late. What is the causal origin of this relation? Unfortunately, without more data, this is hard to say. For example, it might be that a reduced number of trains is at times not sufficient to transport a high number of would-be passengers, resulting in overcrowded subways for which the boarding process at each station takes an unreasonably long time. This situation certainly does occur from time to time, but I am unsure whether it is the main cause of the observed negative correlation. Another possible explanation is that the MTA schedules track repairs during off-peak hours, the same hours during which a reduced number of trains operate. This would create a correlation between a reduced number of trains and delays due to scheduled maintenance, which agrees with the observed effect.

The following analysis quantifies this observed correlation.

A hierarchical Bayesian model of subway-delay probability

In the following we will model the means of the log-fraction of delays as a linear function (defined by a y-intercept and a slope) of the number of trains in the subway system. How, though, should we treat the different categories of data (south- vs. northbound trains, weekends vs. weekdays; see part 1 of this analysis)? Our plots above contain data from all of these categories, but it is interesting to analyze how the delay probability in each of these categories varies with the total number of trains. Our model therefore contains four y-intercepts and four slopes, one of each for every category. It is unlikely, however, that these parameters are entirely independent of one another: for example, if we know the mean of one of the y-intercepts, it is a safe bet that the means of the other y-intercepts are of a similar magnitude. In other words, there is a population distribution of intercepts (and a different population distribution of slopes) that we can infer. A model in which the priors of some (or all) of the parameters depend on the distribution of other parameters is called "hierarchical" or "multilevel". Such priors are also referred to as adaptive priors. Here is the model we will fit to the data:

\begin{align}
\text{delayprob}_i &\sim \text{Normal}(\mu_i, \sigma) \\
\mu_i &= \text{intercept}_i + \text{a\_num\_trains}_i \cdot N_{\text{trains}} \\
\text{intercept}_i &\sim \text{Normal}(\text{intercept\_pop\_mean},\, 2) \\
\text{a\_num\_trains}_i &\sim \text{Normal}(\text{a\_num\_trains\_pop\_mean},\, 2) \\
\sigma &\sim \text{HalfNormal}(2) \\
\text{intercept\_pop\_mean} &\sim \text{Normal}(0,\, 0.5) \\
\text{a\_num\_trains\_pop\_mean} &\sim \text{Normal}(0,\, 0.5)
\end{align}

The index i runs from 0 to 3 and indicates the category of each datum. Remember that 0: southbound/weekend, 1: northbound/weekend, 2: southbound/weekday, 3: northbound/weekday. Note that delayprob stands for the log of the normalized delay fraction, which is why the Normal likelihood above corresponds to the Lognormal likelihood in the code below.

Translated to pymc3:

import pymc3 as pm
from theano import shared 

#create theano shared variables (see part 1 of this series of blog posts)
#normalized delays
dels = shared(dataframe['delaypercent_norm'].values) 

#normalized number of trains
numtrains_norm = shared(dataframe['numtrains_norm'].values) 

#category
categories = shared(dataframe['category'].values) 

with pm.Model() as model:
    #prior for the population mean of the intercepts
    intercept_pop_mean = pm.Normal('intercept_pop_mean', mu=0, sd=0.5)

    #prior for the population mean of the slopes
    a_numtrains_pop_mean = pm.Normal('a_numtrains_pop_mean', mu=0, sd=0.5)

    #adaptive prior for the intercepts, one for each category
    intercept = pm.Normal('intercept', mu=intercept_pop_mean, sd=2, shape=4)

    #adaptive prior for the slopes, one for each category
    a_num_trains = pm.Normal('a_num_trains', mu=a_numtrains_pop_mean, sd=2, shape=4)

    #linear relationship
    mu_delays = intercept[categories] + a_num_trains[categories] * numtrains_norm

    #standard deviation
    sigma = pm.HalfNormal('sigma', sd=2)

    #likelihood; observed must be the shared variable defined above
    delays = pm.Lognormal(name='delays', mu=mu_delays, sd=sigma, observed=dels)

    #draw samples from the posterior
    trace = pm.sample(1000, tune=1000, chains=2, cores=2)

In contrast to the brief introduction to pymc3 in part 1 of this series, we do not specify a sampler in this code. In this case pymc3 automatically chooses a suitable algorithm for us, here a variant of Hamiltonian Monte Carlo called the NUTS sampler. Details of this algorithm are beyond the scope of this post, but I have found Richard McElreath's book Statistical Rethinking to provide a very accessible introduction to Hamiltonian Monte Carlo sampling. It is always a good idea to inspect the traces of draws from the posterior distributions:
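In pymc3 this inspection is a one-liner (presumably how the figure below was generated):

#plot marginal posterior distributions (left) and sampling traces (right)
pm.traceplot(trace)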

posterior distributions (left) and trace plots (right)
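Eyeballing the traces can be complemented with numerical diagnostics. pymc3's summary function reports, among other things, the effective sample size and the Gelman-Rubin statistic for every parameter:

#numerical convergence diagnostics: effective sample size (n_eff)
#and Gelman-Rubin statistic (Rhat) for every parameter
print(pm.summary(trace))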

The traces look well behaved and the autocorrelation time is much shorter than the chain length, an advantage of NUTS over the Metropolis sampler. We can next plot the means and 94% credible intervals of our parameters by calling pm.plot_forest(trace):

Means and 94% credible intervals for the parameters in our model.
0: southbound/weekend, 1: northbound/weekend,
2: southbound/weekday, 3: northbound/weekday.

What are the major results from this plot? First, the northbound system (categories 1 and 3) has a lower baseline of delays than the southbound system: its intercepts are more negative than those of categories 0 and 2. The southbound system, on the other hand, shows a stronger negative correlation between the total number of trains and the number of subway delays, on both weekends and weekdays (slopes "a_num_trains" for categories 0 and 2). Lastly, we do not learn much about the population means: their credible intervals are huge and even include positive values. This is perhaps not too surprising: with only four categories, there is too little data to overwhelm the rather uninformative priors on the population means.
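For readers who prefer numbers to forest plots, the per-category posterior means can be read straight from the trace. A small sketch, assuming the trace object sampled above:

labels = ['southbound/weekend', 'northbound/weekend',
          'southbound/weekday', 'northbound/weekday']

#posterior means per category, straight from the trace
intercept_means = trace['intercept'].mean(axis=0)
slope_means = trace['a_num_trains'].mean(axis=0)

for label, b, a in zip(labels, intercept_means, slope_means):
    print('{:<20s} intercept: {:+.2f}, slope: {:+.2f}'.format(label, b, a))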