
Detecting and predicting delays in the NYC subway system

Figure 1: Transit time between pairs of stations along the northbound Q line.

Blue: Each datum corresponds to one observed train. Red: Fit to the data of a model based on minimum description length (MDL), borrowed from single-molecule physics. The table shows the number of identified states and their mean transit times.

Figure 2: Identification of delayed trains (transit time > mean + 3 standard deviations).

Number of delayed trains detected by my algorithm (blue) compared with the number reported by the MTA (red). On average, my algorithm reports six times as many delays as the MTA's algorithm. Trains that both algorithms classify as delayed are identified about 90 seconds sooner by my algorithm (confirmed by cross-correlation, data not shown).
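The delay criterion in Figure 2 (transit time > mean + 3 standard deviations) can be illustrated with a minimal sketch. It assumes that the mean and standard deviation are taken over the observed transit times for a single pair of stations. The function name `flag_delays` and the sample values are hypothetical and are not taken from the original analysis.

```python
import numpy as np


def flag_delays(transit_times, n_sigma=3.0):
    """Flag trains whose transit time exceeds mean + n_sigma * std.

    Assumption: mean and standard deviation are computed from the
    same sample of transit times (one station pair, in seconds).
    Returns a boolean mask and the threshold that was applied.
    """
    times = np.asarray(transit_times, dtype=float)
    threshold = times.mean() + n_sigma * times.std()
    return times > threshold, threshold


# Hypothetical data: 49 ordinary trips of 120 s and one slow trip of 400 s.
mask, threshold = flag_delays([120.0] * 49 + [400.0])
print(int(mask.sum()), "delayed train(s) above", round(threshold, 1), "s")
```

Because a large delay inflates both the mean and the standard deviation of the sample it belongs to, a threshold of this kind needs a reasonably large sample to flag anything.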

Figure 3: Posterior predictive plot of delay probabilities

Average “delay probability” vs “time of day” for a typical business day in the northbound subway system. Computed by Bayesian inference (MCMC with PyMC3) using a log-normal likelihood function and the following features: total number of trains, time of day, and day of week. The model was trained on 276,480 observations of delays from July 2019. The shaded area indicates the 95% credible interval. The probability of encountering a delay increases by 25% between 8 am and 10 am.
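A shaded 95% credible interval like the one in Figure 3 can be computed from posterior draws roughly as follows. This is a sketch of the post-processing step only, not of the model itself. It assumes the draws are stored as an array with one row per MCMC draw and one column per time-of-day bin, and it uses an equal-tailed interval. The function name `summarize_posterior` and the simulated draws are hypothetical.

```python
import numpy as np


def summarize_posterior(samples, level=0.95):
    """Return the posterior mean and an equal-tailed credible interval.

    Assumption: `samples` has shape (n_draws, n_time_bins), where each
    row is one posterior draw of the delay probability per time bin.
    """
    samples = np.asarray(samples, dtype=float)
    tail = (1.0 - level) / 2.0 * 100.0  # percent of draws cut from each tail
    mean = samples.mean(axis=0)
    lower = np.percentile(samples, tail, axis=0)
    upper = np.percentile(samples, 100.0 - tail, axis=0)
    return mean, lower, upper


# Hypothetical draws for 24 hourly bins, simulated only for illustration.
rng = np.random.default_rng(0)
draws = rng.beta(2.0, 8.0, size=(4000, 24))
mean, lower, upper = summarize_posterior(draws)
print(mean.shape, lower.shape, upper.shape)
```

The `lower` and `upper` arrays are what a plotting call would shade around the `mean` curve.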

Backup

Figure 4: Variation in spacing between trains