The data for this project were retrieved every minute from the real-time Citi Bike station status feed [1] between April 7, 2018 and April 16, 2018. The feed scans for updates every 10 seconds and reports the availability of bikes and docks at each station in the Citi Bike system. The LaGuardia Pl & W 3 St Citi Bike station was selected because its capacity of 35 bikes is slightly above average (\(\mu = 30.3\)) and because it appeared to be used consistently (Figure 1 & Figure 2a).
The field "num_bikes_available" was used for this project; it contains the number of bikes available for rental at each station at any given time [2]. Since the feed only updates when there is a change (e.g. when a bike is rented or returned), the timestamp used for this project is the time of data retrieval (i.e. every minute) rather than the reported timestamp. This ensures evenly spaced time intervals.
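The effect of using retrieval times can be illustrated with a minimal sketch. It assumes that each poll sees the most recent state reported at or before the poll time; the function name `sample_every_minute` and the event values are hypothetical and are not taken from the project.

```python
from bisect import bisect_right

def sample_every_minute(events, start, end, step=60):
    """Return (poll_time, bikes) pairs on an evenly spaced grid.

    events: list of (reported_time_in_seconds, num_bikes_available),
    sorted by time. Each poll carries forward the latest reported value.
    """
    times = [t for t, _ in events]
    samples = []
    for poll_time in range(start, end + 1, step):
        idx = bisect_right(times, poll_time) - 1  # latest event <= poll_time
        if idx >= 0:  # skip polls made before the first known state
            samples.append((poll_time, events[idx][1]))
    return samples

# Hypothetical change events: (seconds since start, bikes available).
events = [(0, 20), (45, 19), (200, 21)]
samples = sample_every_minute(events, start=0, end=240)
```

The reported timestamps (0, 45, 200) are unevenly spaced, while the sampled series has exactly one value per minute.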
To determine which ARIMA model is best suited for this dataset (\(n = 12,995\)), we plot the number of bikes available, its log, and its difference (Figure 2). In the plot of the number of bikes available by minute, there appears to be some seasonality: the time series dips and rises in a roughly periodic manner, but the pattern is not consistent. For example, on some days the lowest point is around 8:00 UTC (e.g. April 9), on other days it is around 12:00 UTC (e.g. April 12), and on April 16 it was at 3:00 UTC. The cycles do not appear to match a weekly cadence either: Monday, April 9 had its lowest point around 8:00 UTC compared to 3:00 UTC on Monday, April 16, and Saturday, April 7 had its lowest point around 14:00 UTC compared to 5:00 UTC on Saturday, April 14. Since it is difficult to discern the seasonal pattern, we will not make seasonal adjustments. It is also possible that temperature had an impact on the time series; for example, days with larger dips (e.g. April 12-14) also happened to be warmer [3]. Finally, the time series appears to be more volatile when more bikes are available, which is suggestive of level-dependent volatility and heteroskedasticity, so it may be necessary to take logs of the data.
Based on the ACF/PACF plots of log bikes available, we should take differences (Figure 3a & 3b). Referring back to the plots of the differenced data over time, both appear to be mean reverting (Figure 2c & 2d). The ACF of the twice-differenced series shows a negative value at lag 1, which suggests overdifferencing, so we set \(d = 1\). In the ACF/PACF plots of the differenced log bikes available, both plots appear to die down, which is suggestive of an ARMA model. Following the principle of parsimony, we will try values of 3 or less for \(p\) and \(q\).
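As a rough illustration of the quantities read off these plots, the sketch below computes the first difference of the log series and its sample autocorrelations. The counts are hypothetical, and the code assumes strictly positive counts (the log is undefined at zero; how zeros were handled in the project is not stated).

```python
import math

def log_diff(series):
    """First difference of the natural log; needs strictly positive values."""
    logs = [math.log(v) for v in series]
    return [b - a for a, b in zip(logs, logs[1:])]

def acf(x, max_lag):
    """Sample autocorrelations r_1, ..., r_max_lag."""
    n = len(x)
    mean = sum(x) / n
    c0 = sum((v - mean) ** 2 for v in x)
    return [
        sum((x[t] - mean) * (x[t - k] - mean) for t in range(k, n)) / c0
        for k in range(1, max_lag + 1)
    ]

counts = [20, 19, 19, 21, 22, 22, 20, 18]  # hypothetical bike counts
x = log_diff(counts)
r = acf(x, max_lag=3)

# An alternating series is an extreme case of the strongly negative
# lag-1 autocorrelation that the text associates with overdifferencing.
r_alt = acf([1, -1] * 4, max_lag=1)
```

For the alternating toy series the lag-1 autocorrelation is -7/8, far more negative than anything expected from a correctly differenced series.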
To help determine the exact \(p\) and \(q\), and whether or not to include a constant, we use AICc.
The candidate model with the lowest AICc is ARIMA(3,1,3) without a constant (AICc = -47050.73; Figure 4).
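The selection rule can be sketched as follows. This uses one common definition of AICc (AIC plus a small-sample correction); the software used for this project may use a different but related formula, so the values below are not expected to reproduce Figure 4. The log-likelihoods, parameter counts, and sample size are hypothetical.

```python
def aicc(log_likelihood, k, n):
    """AICc = AIC + 2k(k+1)/(n-k-1), with k estimated parameters and n observations."""
    aic = -2.0 * log_likelihood + 2.0 * k
    return aic + (2.0 * k * (k + 1)) / (n - k - 1)

def best_model(candidates, n):
    """candidates maps a label to (log_likelihood, k); returns the label with the lowest AICc."""
    scores = {label: aicc(ll, k, n) for label, (ll, k) in candidates.items()}
    return min(scores, key=scores.get), scores

# Hypothetical fits: k counts AR and MA coefficients plus the innovation variance.
candidates = {
    "ARIMA(1,1,1)": (-120.0, 3),
    "ARIMA(3,1,3)": (-110.0, 7),
}
best, scores = best_model(candidates, n=200)
```

In this toy comparison the larger model wins because its gain in log-likelihood outweighs the penalty for the four extra parameters.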
Figure 5. Final Estimates of ARIMA Parameters
Thus, we will select these parameters and obtain the model:
\(x_{t} = -0.6394x_{t-1} + 0.1644x_{t-2} + 0.599x_{t-3} + \varepsilon _{t} + 0.7143\varepsilon_{t-1} - 0.0968\varepsilon_{t-2} - 0.0667\varepsilon_{t-3} \)
where \(x_{t}\) is log bikes_available\(_{t}\) - log bikes_available\(_{t - 1}\), i.e. the first difference of the log number of bikes available.
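To make the fitted equation concrete, the sketch below evaluates its one-step-ahead prediction, taking the coefficients and signs exactly as printed above. The input values are hypothetical, and the sign convention for the moving-average terms is an assumption, since it differs between software packages.

```python
# Coefficients as printed in the fitted equation.
AR = [-0.6394, 0.1644, 0.599]    # multiply x_{t-1}, x_{t-2}, x_{t-3}
MA = [0.7143, -0.0968, -0.0667]  # multiply eps_{t-1}, eps_{t-2}, eps_{t-3}

def predict_diff(last_x, last_eps):
    """Predicted x_t given the three most recent x values and residuals (newest first)."""
    ar_part = sum(a * x for a, x in zip(AR, last_x))
    ma_part = sum(m * e for m, e in zip(MA, last_eps))
    return ar_part + ma_part

def predict_log_level(last_log_level, last_x, last_eps):
    """Undo the differencing: predicted log level = last log level + predicted x_t."""
    return last_log_level + predict_diff(last_x, last_eps)

# Hypothetical recent values of the differenced log series and residuals.
x_hat = predict_diff([0.01, -0.02, 0.0], [0.005, 0.0, 0.0])
level_hat = predict_log_level(2.0, [0.01, -0.02, 0.0], [0.005, 0.0, 0.0])
```

The future innovation \(\varepsilon_{t}\) has mean zero, so it drops out of the point prediction.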
To determine whether the model is reasonable, we use the Modified Box-Pierce (Ljung-Box) Chi-Square statistic and examine the residual plots. The statistic is significant at all reported lags (Figure 6), which is strong evidence that the model is inadequate.
Figure 6. Modified Box-Pierce (Ljung-Box) Chi-Square Statistic for ARIMA(3,1,3) Model
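For reference, the Ljung-Box statistic is \(Q = n(n+2)\sum_{k=1}^{m} r_{k}^{2}/(n-k)\), where \(r_{k}\) is the lag-\(k\) sample autocorrelation of the residuals. A minimal sketch of the computation follows; the residuals are a toy series, not the project's. In practice \(Q\) is compared to a chi-square distribution with \(m - p - q\) degrees of freedom, a step this sketch omits.

```python
def acf(x, max_lag):
    """Sample autocorrelations r_1, ..., r_max_lag."""
    n = len(x)
    mean = sum(x) / n
    c0 = sum((v - mean) ** 2 for v in x)
    return [
        sum((x[t] - mean) * (x[t - k] - mean) for t in range(k, n)) / c0
        for k in range(1, max_lag + 1)
    ]

def ljung_box(residuals, max_lag):
    """Ljung-Box Q statistic over lags 1..max_lag."""
    n = len(residuals)
    r = acf(residuals, max_lag)
    return n * (n + 2) * sum(r_k ** 2 / (n - k) for k, r_k in enumerate(r, start=1))

# Toy residuals with obvious structure, so Q is large even at one lag.
q1 = ljung_box([1, -1] * 4, max_lag=1)
```

A large \(Q\) relative to the chi-square reference indicates that the residuals are still autocorrelated.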
Additionally, the time series plot of the residuals shows that the variance is not constant, possibly indicating conditional heteroscedasticity (Figure 7). The ACF and PACF of the residuals look like white noise at earlier lags and do not appear significant until much larger lags, suggesting that the residuals are uncorrelated (Figure 7a & 7b). The ACF and PACF of the squared residuals, however, show substantial autocorrelation even at earlier lags, suggesting that the squared residuals are correlated (Figure 7c & 7d). Since the residuals appear uncorrelated while their squares are correlated, the residuals are not independent. There appears to be some structure in the data that is not captured by the ARIMA model, suggesting that a nonlinear model may be appropriate.
Figure 7. ARIMA Residual Plots
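The pattern described above (uncorrelated residuals with correlated squares) is what an ARCH process produces. The sketch below simulates a hypothetical ARCH(1) series and compares the lag-1 autocorrelation of the series with that of its squares. The parameters `omega = 1.0` and `alpha = 0.3`, the sample size, and the seed are arbitrary choices for illustration and are unrelated to the fitted model.

```python
import math
import random

def simulate_arch1(n, omega, alpha, seed):
    """Simulate eps_t = sqrt(h_t) * z_t with h_t = omega + alpha * eps_{t-1}^2."""
    rng = random.Random(seed)
    eps_prev = 0.0
    out = []
    for _ in range(n):
        h = omega + alpha * eps_prev ** 2
        eps_prev = math.sqrt(h) * rng.gauss(0.0, 1.0)
        out.append(eps_prev)
    return out

def lag1_acf(x):
    """Lag-1 sample autocorrelation."""
    n = len(x)
    mean = sum(x) / n
    c0 = sum((v - mean) ** 2 for v in x)
    return sum((x[t] - mean) * (x[t - 1] - mean) for t in range(1, n)) / c0

eps = simulate_arch1(n=5000, omega=1.0, alpha=0.3, seed=42)
r_levels = lag1_acf(eps)                     # expected to be near zero
r_squares = lag1_acf([e ** 2 for e in eps])  # expected to be clearly positive
```

The series itself shows little linear dependence, while its squares do, which is why the squared-residual ACF is the relevant diagnostic here.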
To determine which ARCH or GARCH model to use, we use AICc to evaluate candidate models (Figure 8).
Figure 8a. AICc for Candidate ARCH Models
Figure 8b. AICc for Candidate GARCH(1,1) Model
The GARCH(1,1) model has the lowest AICc (-61607.05). Thus, we will use this model, which yields \(h_{t} = 0.0000005255 + 0.02037\varepsilon^{2}_{t-1} + 0.9812h_{t-1}\). Based on the model output, all estimates are statistically significant with \(p < 2.2 \times 10^{-16}\).
Figure 9. Model Outputs for GARCH(1,1) Model
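The fitted variance equation can be iterated directly. The sketch below applies the recursion with the coefficients as printed; the starting variance `h0` and the residual values are hypothetical, and in practice the fitting software chooses its own initialization.

```python
# Coefficients as printed in the fitted GARCH(1,1) equation.
OMEGA = 0.0000005255
ALPHA = 0.02037
BETA = 0.9812

def conditional_variance(residuals, h0):
    """Return [h_0, h_1, ..., h_n] where h_t = OMEGA + ALPHA * eps_{t-1}^2 + BETA * h_{t-1}.

    The last element is the one-step-ahead variance after the final residual.
    """
    h = [h0]
    for eps in residuals:
        h.append(OMEGA + ALPHA * eps ** 2 + BETA * h[-1])
    return h

h = conditional_variance([0.01, -0.02], h0=1e-4)  # hypothetical inputs
persistence = ALPHA + BETA  # about 1.0016 with the printed, rounded estimates
```

Because \(h_{t-1}\) carries a weight of 0.9812, a shock to the variance decays very slowly. With the rounded estimates as printed, the two coefficients sum to slightly more than 1, so the usual long-run variance formula \(\omega/(1-\alpha-\beta)\) does not apply to them.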
The 95% one-step-ahead forecast interval is (1.865661, 2.020754) for the ARIMA model and (1.608771, 2.277645) for the ARIMA-ARCH model. The ARIMA-ARCH forecast interval is wider.
Figure 10. Forecast Intervals for ARIMA and ARIMA-ARCH Model
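The two intervals can be compared numerically. The sketch below assumes both are symmetric normal intervals of the form center ± 1.96 × standard deviation on the log scale (as the title of Figure 12 suggests); the implied standard deviations are back-calculated under that assumption and are not reported in the source.

```python
import math

Z_95 = 1.96  # normal quantile assumed for a 95% interval

def summarize(interval):
    """Return (center, width, implied standard deviation) of a symmetric interval."""
    lower, upper = interval
    center = (lower + upper) / 2
    width = upper - lower
    return center, width, width / (2 * Z_95)

arima = (1.865661, 2.020754)
arima_arch = (1.608771, 2.277645)

c1, w1, s1 = summarize(arima)
c2, w2, s2 = summarize(arima_arch)
ratio = w2 / w1            # how much wider the ARIMA-ARCH interval is
bikes = math.exp(c1)       # point forecast back on the original scale
```

Both intervals share the same center, so the models agree on the point forecast (roughly 7 bikes) and differ only in the uncertainty attached to it; the ARIMA-ARCH interval is a little over four times as wide.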
Examining the conditional variance plot (Figure 11), periods of high conditional variance coincide with periods of high volatility in the original plot (Figure 2a). There are spikes around April 12 - April 14 that coincide with the large dips in the time series.
Figure 11. Conditional Variance of ARCH Model
Looking at the forecast from the ARIMA(3,1,3) model (Figure 12a), the interval appears to be too narrow: compared against the historical values, the ARIMA forecast intervals would not encompass the majority of values. In contrast, the ARIMA-ARCH forecast intervals encompass the majority of values, failing only 738 of 12955 times (5.7%).
Figure 12. Log Bikes Available Forecast Intervals
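The failure rate quoted above is the share of observations that fall outside their forecast interval. A minimal sketch of that count follows; the toy values are hypothetical, and only the final line uses the figures from the text.

```python
def miss_rate(actual, lower, upper):
    """Count observations outside [lower, upper] and return (misses, share)."""
    misses = sum(
        1 for a, lo, hi in zip(actual, lower, upper) if not (lo <= a <= hi)
    )
    return misses, misses / len(actual)

# Hypothetical log values and interval bounds.
actual = [1.0, 2.0, 3.0, 4.0]
lower = [0.5, 2.5, 2.5, 3.0]
upper = [1.5, 3.5, 3.5, 3.5]
misses, share = miss_rate(actual, lower, upper)

# Share implied by the counts reported in the text, in percent.
reported_pct = 738 / 12955 * 100
```

For a well-calibrated 95% interval the share of misses should be close to 5%.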
The normality plot of the ARCH residuals (Figure 13) suggests leptokurtosis.
Figure 13. Normality Plot for ARCH Residuals
A large number of data points (outliers) curve away from the line, trailing off on both the left and right sides of the plot. This suggests that the model did not adequately describe the leptokurtosis, because if the model were adequate, the normality plot would be close to a straight line.
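One way to put a number on what the normality plot shows is the sample excess kurtosis, which is zero for a normal distribution and positive for a leptokurtic one. The sketch below is an illustrative supplement rather than part of the original analysis, and the two toy samples are hypothetical.

```python
def excess_kurtosis(x):
    """Sample excess kurtosis: m4 / m2^2 - 3, using central moments."""
    n = len(x)
    mean = sum(x) / n
    m2 = sum((v - mean) ** 2 for v in x) / n
    m4 = sum((v - mean) ** 4 for v in x) / n
    return m4 / m2 ** 2 - 3.0

# Mostly small values with two large outliers: heavy tails.
heavy = [0.0] * 8 + [-3.0, 3.0]
# Values of equal magnitude: no tails at all.
flat = [-1.0, 1.0] * 5

k_heavy = excess_kurtosis(heavy)
k_flat = excess_kurtosis(flat)
```

Standardized residuals with clearly positive excess kurtosis would be consistent with the curved tails seen in Figure 13.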