Bayesian Inference Machine Learning NYC Subway

One month of subway tracking – part 1: an overview of the recorded data

What is the best time to catch a subway to beat rush-hour delays? What predictors have the largest impact on whether your train will get to its destination on time? To answer these questions I have been tracking the positions of all trains in the NYC subway system for several months in 20 s intervals. From these data, trains that run with delays can be identified, and, after some data-wrangling (not shown here), the following DataFrame can be constructed (note: we will only discuss data from July 2019 in this blog post. The analysis of an entire year of data will be subject of a future post).

The data

time delaypercent precip tempC day hour numtrains weekday north category
0 1561939313 2.272727 0.0 23.0 6 20 473 0 1 1
1 1561939333 1.666667 0.0 23.0 6 20 479 0 1 1
2 1561939393 0.793651 0.0 23.0 6 20 476 0 0 0
3 1561939413 0.757576 0.0 23.0 6 20 476 0 0 0
4 1561939433 0.649351 0.0 23.0 6 20 475 0 1 1
276475 1564849764 0.793651 0.0 30.0 5 12 523 0 0 0
276476 1564849783 1.746032 0.0 30.0 5 12 526 0 0 0
276477 1564849804 2.171717 0.0 30.0 5 12 527 0 0 0
276478 1564849824 1.515152 0.0 30.0 5 12 526 0 0 0
276479 1564849824 0.833333 0.0 30.0 5 12 526 0 1 1

Time is the time in seconds since Jan 1, 1970 (UTC), delaypercent is the percentage of trains in either the north- or southbound subsystem (see column north) that will arrive at their next station with a delay, precip is the precipitation in millimeters per square meter, tempC is the temperature in degrees Celsius, day indicates the day of the week (where 0 = Monday and 6 = Sunday), hour is the hour past midnight, numtrains is the total number of trains in the subway system, weekday is a Boolean flag indicating whether the datum is from a weekday or weekend, north is ‘1’ if the delays are for the north-bound trains and ‘0’ if they refer to the south-bound trains.

North- vs southbound trains and weekend vs weekday define four categories of data

The data naturally subdivide into four categories (column category in the data frame): weekend/southbound (category 0), weekend/northbound (category 1), weekday/southbound (category 2), weekday/northbound (category 3)

Our goal is to predict the percentage of delayed trains from the predictor variables (number of trains, temperature, and precipitation). In this initial post, we will take a closer look at the predictors before we proceed (in future posts of this series) to build statistical models for Bayesian inference.

The number of trains in the NYC subway system varies dramatically with time

We can plot the number of trains vs time using Holoviews. The only complication is that our time axis is given in UNIX Epoch time (seconds since Jan 1, 1970), and we must find a way to convert this into a time axis that Holoviews understands. First, we use pandas’ to_datetime function to convert the dataframe’s time column to the datetime data type. To deal with time zones, we then let pandas know that our new time stamps are in the UTC time zone, and subsequently convert to US/Eastern by calling .dt.tz_localize(‘UTC’).dt.tz_convert(‘US/Eastern’). Unfortunately, at the time of this blog post, if we would pass this column of time stamps to Holoviews, the time zone information would be discarded, resetting our time stamps to UTC. To circumvent this problem, before handing the data to Holoviews, we call .dt.tz_localize(None) which instructs pandas to retain the converted date and time but to discard the attached time zone information (the new data type will be datetime64[ns], which is not timezone aware).

time = pd.to_datetime(delaysdf['time'],unit='s')\
.dt.tz_localize('UTC')\
.dt.tz_convert('US/Eastern')\
.dt.tz_localize(None)

Plotting with Holoviews is then trivial:

hv.extension('bokeh')
hv.Curve((time, dataframe['numtrains'])).opts(width=800, ylabel='number of trains', xlabel='date')

The total number of trains in the subway system ramps up during rush hour at the beginning and at the end of each business day and remains reduced during weekends and holidays.

A statistical model for the mean number of trains in the NYC subway system

In this post (and those that follow it) we will use pymc3 to find the posterior probability distributions of our regression parameters using Markov Chain Monte Carlo (MCMC). To get our feet wet with pymc3, we can use Bayesian inference to find the posterior distribution of the number of trains traversing the subway system at a particular time (defined in hours past midnight) of a work or weekend day. Here is the model we will use: \[ \begin{align} N_{h, w} &\sim \mbox{Normal}(\mu_{h, w}, \sigma) \qquad &\text{likelihood} \\
\mu_{h, w} &\sim \mbox{Normal}(300, 200) \qquad &\text{prior for means} \\
\sigma &\sim \mbox{HalfNormal}(20) \qquad &\text{prior for standard deviation} \end{align} \]

where the hour index h runs from 0 to 24, and the workday flag w is either 0 or 1 (for a weekend day or a weekday). We can the formulate this model in pymc3 model:

import pymc3 as pm
from theano import shared 

#create theano shared variables. This later allows us to later change their values, for example for posterior prediction.
weekday = shared(dataframe['weekday'].values)
hours = shared(dataframe['hour'].values)
ntrains_obs = shared(dataframe['numtrains'].values)

with pm.Model() as model_numtrains:
    #prior for the mean number of trains, shape = (24, 2)
    num_trains_means = pm.Normal('num_trains_means', mu=300, sd=200, shape=(24, 2))
    
    #prior for the standard deviation of the number of trains
    sigma = pm.HalfNormal('sigma', sd=20)
    
    #access the correct probability distribution based on time of day and whether it is a weekend or weekday
    mu = num_trains_means[(hours,weekday)]
    
    #likelihood
    numtrains = pm.Normal(name='numtrains', mu=mu, sd=sigma, observed =ntrains_obs)
    
    #define the stepper for the MCMC algorithm (here we just use a simple Metropolis step)
    step = pm.Metropolis()
    
    #sample from the posterior distributions
    trace_numtrains = pm.sample(5000, step=step, tune=5000, chains=2, cores=2)

In lines 5-7 we create theano shared variables and set their values to the weekday, hour, and numtrains columns of our dataframe. These variables can later be reset to other values for posterior prediction without us having to redefine the model. Lines 11-20 define the statistical model as discussed above. Line 23 sets the stepper by which we explore the posterior probability distribution of our parameters — here we use a simple Metropolis algorithm. Line 26, finally, instructs pymc3 to run two Markov chains, each drawing 5000 samples from the posterior. You can (and should) make sure that the trace of samples for each parameter looks “reasonable”: the output of pm.traceplot(trace_numtrains) for each parameter should look like confined Brownian motion with an autocorrelation time short compared to the length of the chain. As an example consider the trace of the mean number of trains at noon on weekdays:

Even though the autocorrelation of samples in this trace is a bit high, there are enough independent samples to get a good approximation of the posterior probability distribution of this parameter (see the Gaussian-shaped histogram on the right side of the plot). pm.summary(trace_numtrains) can estimate how many independent samples are contained in each trace: for this example there are only 58 (out of 5000 total) samples.

pm.summary(trace_numtrains)

Posterior prediction of the number of trains

Using the posterior samples in trace_numtrains we can now predict the number of trains in the subway system for an arbitrary time during a weekday or weekend:

#set the theano shared variables: 48 values total. The fist 24 are 0 to 24h on weekdays, the remaining values 0-24h on weekends.
weekday.set_value(np.concatenate((np.repeat(1, 24), np.repeat(0,24))))
hours.set_value(np.concatenate((np.arange(24, dtype='int32'), np.arange(24, dtype='int32'))))

#simulate the number of trains in the system
numtrains_posterior_predictive = pm.sample_posterior_predictive(samples=10000, model=model_numtrains, trace=trace_numtrains, var_names=['numtrains'])

We can then plot the predicted number of trains:

Posterior prediction of the number of trains in the NYC subway system for weekdays (blue) and weekends (red). Shaded areas are 95% confidence intervals. The scatter plot are the recorded data.

Precipitation and temperature during July 2019

Both precipitation and temperature could influence the delay probability in the NYC subway system. A downpour, for example, could drive a larger than usual crowd into the subway stations, which may cause delays during the boarding of trains. The following graphs show the data recorded in the precip and tempC columns of the dataframe.

That’s it! In the next posts we will explore how well these variables do at predicting subway delays.