Amelia N Chu - Data Scientist - Humans. Data. Design.

Hello, there! This is the final project for my forecasting time series class: constructing an ARIMA-ARCH model. I'm still working on improving readability/usability. Feel free to peruse in the meantime!

Introduction

The data for this project was retrieved every minute via the real-time Citi Bike station status feed¹ from April 7, 2018 to April 16, 2018. This feed scans for updates every 10 seconds and contains information on the availability of bikes and docks at each station in the Citi Bike system. The LaGuardia Pl & W 3 St Citi Bike station was selected because its capacity of 35 bikes is slightly above the system average (\(\mu = 30.3\)), and because it appeared to be consistently used (Figure 1 & Figure 2a).

Figure 1. Distribution of Station Bike Capacity

Figure 2. Time Series of Bikes Available

2a. Number of Bikes Available by Minute

2b. Log of Bikes Available by Minute

2c. Diff Log of Bikes Available by Minute

2d. 2 Diff Log of Bikes Available by Minute

The field "num_bikes_available" was used for this project; this field contains the number of bikes available for rental at each station at any given time². Since the feed only updates when there is a change (e.g. when a bike is rented or returned), the timestamp used for this project is the time of data retrieval (i.e. every minute), rather than the reported timestamp. This was done to ensure evenly spaced time intervals.

Part 1: Identify Potential p, d, q for an ARIMA(p, d, q) Model

To determine which ARIMA model is best suited for this dataset (\(n = 12,995\)), we plot the number of bikes available, its log, and the first and second differences of the log (Figure 2). The plot of bikes available by minute does appear to show some seasonality: the time series dips and rises in a roughly periodic manner, but the pattern is not consistent. For example, the lowest point of the day is sometimes around 8:00 UTC (e.g. April 9), sometimes around 12:00 UTC (e.g. April 12), and on April 16 it was at 3:00 UTC. The cycles do not appear to match a weekly cadence either: Monday, April 9 had its lowest point around 8:00 UTC compared to 3:00 UTC on Monday, April 16, and Saturday, April 7 had its lowest point around 14:00 UTC compared to 5:00 UTC on Saturday, April 14. Since it is difficult to discern the seasonal pattern, we will not make seasonal adjustments.

It is also possible that temperature had an impact on the time series. For example, days with larger dips (e.g. April 12-14) also happened to have warmer weather³. It also appears that the time series is more volatile when more bikes are available, which suggests level-dependent volatility and heteroskedasticity; thus it may be necessary to take logs of the data.

Based on the ACF/PACF plots of log bikes available, we should take differences (Figure 3a & 3b). Referring back to the plots of the differenced data over time, both the first and second differences appear to be mean reverting (Figure 2c & 2d). However, the ACF of the twice-differenced series shows a negative value at lag 1, which suggests overdifferencing (Figure 3e). This suggests that we should set \(d = 1\). In the ACF/PACF plots of the differenced log bikes available (Figure 3c & 3d), both plots die down, which is suggestive of an ARMA model. Following the principle of parsimony, we will try values of 3 or less for \(p\) and \(q\).
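The overdifferencing check can be sketched in a few lines. This is a minimal illustration on simulated data, not the bike series: it assumes a hypothetical Gaussian random walk standing in for the log series, and the helper names `difference` and `sample_acf` are made up for this sketch. Differencing a random walk once leaves white noise; differencing it a second time pushes the lag-1 autocorrelation toward -0.5, which is the kind of negative lag-1 value noted for Figure 3e.

```python
import random

def difference(series):
    """Return the first difference x[t] - x[t-1]."""
    return [b - a for a, b in zip(series, series[1:])]

def sample_acf(series, lag):
    """Sample autocorrelation of `series` at the given lag."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((x - mean) ** 2 for x in series)
    numer = sum((series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n))
    return numer / denom

# Hypothetical stand-in for the log series: a Gaussian random walk.
rng = random.Random(0)
log_series = [0.0]
for _ in range(5000):
    log_series.append(log_series[-1] + rng.gauss(0.0, 0.01))

d1 = difference(log_series)  # analogous to Figure 2c
d2 = difference(d1)          # analogous to Figure 2d

print("lag-1 ACF, first difference: ", round(sample_acf(d1, 1), 3))
print("lag-1 ACF, second difference:", round(sample_acf(d2, 1), 3))
```

On the real data the first difference is not white noise (it still has ARMA structure), so the numbers will differ; the sketch only shows why a strongly negative lag-1 value after the second difference points back to \(d = 1\).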

Figure 3. ACF and PACF Plots for Log and Differences Bikes Available

3a. ACF: Hanging

3b. PACF: Drops off at Lag 1

3c. ACF: Dying Down

3d. PACF: Dying Down

3e. ACF: Negative ACF at Lag 1 Suggests Overdifferencing

3f. PACF

Part 2: Using AICC to Identify Best p, q for ARIMA(p, 1, q) Model

To help determine the exact \(p\) and \(q\), and whether or not to include a constant, we use AIC_c.

Figure 4. AIC_C for Candidate ARIMA Models

The candidate model with the lowest AIC_c is ARIMA(3,1,3) without constant (-47050.73; Figure 4).
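For readers who want to reproduce this kind of comparison, here is a minimal sketch of ranking candidates by AIC_c. The exact AIC_c formula used by the software behind Figure 4 is not stated here, so this assumes the common small-sample correction of AIC. The function names, the parameter counts (AR terms plus MA terms plus the innovation variance), the log-likelihood values, and `n_obs = 12994` (one observation lost to differencing) are all assumptions for illustration.

```python
def aicc(log_likelihood, n_params, n_obs):
    """Corrected AIC: AIC plus a penalty that grows as n_params approaches n_obs."""
    if n_obs - n_params - 1 <= 0:
        raise ValueError("need n_obs > n_params + 1")
    aic = -2.0 * log_likelihood + 2.0 * n_params
    return aic + (2.0 * n_params * (n_params + 1)) / (n_obs - n_params - 1)

def best_by_aicc(candidates, n_obs):
    """`candidates` maps a label to (log_likelihood, n_params); return the lowest-AICc label."""
    return min(candidates, key=lambda name: aicc(*candidates[name], n_obs))

# Made-up log-likelihoods, for illustration only (not the values behind Figure 4).
candidates = {
    "ARIMA(1,1,1)": (23500.0, 3),
    "ARIMA(3,1,3)": (23532.0, 7),
}
print(best_by_aicc(candidates, n_obs=12994))
```

With this many observations the correction term is tiny, so AIC_c and AIC would rank the candidates almost identically; the correction matters mainly for short series.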

Figure 5. Final Estimates of ARIMA Parameters

Thus, we will select these parameters and obtain the model:

\(x_{t} = -0.6394x_{t-1} + 0.1644x_{t-2} + 0.599x_{t-3} + \varepsilon _{t} + 0.7143\varepsilon_{t-1} - 0.0968\varepsilon_{t-2} - 0.0667\varepsilon_{t-3} \)

where \(x_{t}\) is log bikes_available\(_{t}\) - log bikes_available\(_{t - 1}\)
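To make the fitted equation concrete, the sketch below evaluates a one-step forecast from it. The coefficients are copied from the equation above; the function names, the argument ordering, and the sample inputs are assumptions made for this illustration. Exponentiating the log forecast is a naive back-transform to the bike-count scale.

```python
import math

# Coefficients copied from the fitted ARIMA(3,1,3) equation above.
AR = (-0.6394, 0.1644, 0.599)
MA = (0.7143, -0.0968, -0.0667)

def forecast_diff(last_x, last_eps):
    """One-step forecast of x_t from the three most recent x and residual values.

    Both arguments are ordered most recent first: (t-1, t-2, t-3).
    The future shock eps_t has mean zero, so it drops out of the forecast.
    """
    ar_part = sum(phi * x for phi, x in zip(AR, last_x))
    ma_part = sum(theta * e for theta, e in zip(MA, last_eps))
    return ar_part + ma_part

def forecast_bikes(last_bikes, last_x, last_eps):
    """Undo the differencing and the log to get a forecast on the bike-count scale."""
    log_forecast = math.log(last_bikes) + forecast_diff(last_x, last_eps)
    return math.exp(log_forecast)

# Hypothetical recent history, for illustration only.
print(forecast_diff(last_x=(0.01, 0.0, 0.0), last_eps=(0.0, 0.0, 0.0)))
print(forecast_bikes(7, last_x=(0.0, 0.0, 0.0), last_eps=(0.0, 0.0, 0.0)))
```

When the recent differences and residuals are all zero, the forecast simply carries the last observed count forward, which is what a model with \(d = 1\) and no constant implies.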

Part 3: Residuals of the ARIMA Model

To determine if we have a reasonable model, we use the Modified Box-Pierce (Ljung-Box) Chi-Square statistic and examine the residual plots. The statistic is significant at all reported lags (Figure 6), which is strong evidence that the model is inadequate.

Figure 6. Modified Box-Pierce (Ljung-Box) Chi-Square Statistic for ARIMA(3,1,3) Model
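For reference, the textbook form of the Ljung-Box statistic is given below. This assumes the software behind Figure 6 uses the standard definition; the symbols \(n\), \(m\), and \(\hat{r}_{k}\) are introduced here and do not appear elsewhere in this write-up.

\[ Q(m) = n(n+2) \sum_{k=1}^{m} \frac{\hat{r}_{k}^{2}}{n-k} \]

Here \(n\) is the number of residuals, \(m\) is the number of lags tested, and \(\hat{r}_{k}\) is the sample autocorrelation of the residuals at lag \(k\). Under the hypothesis that the model is adequate, \(Q(m)\) is approximately chi-square distributed with \(m - p - q\) degrees of freedom (here \(m - 6\)), so a small p-value at a given lag is evidence of leftover autocorrelation.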

Additionally, in the time series plot of the residuals, the variance is not constant, possibly indicating conditional heteroscedasticity (Figure 7). The ACF and PACF of the residuals look like white noise at early lags and do not appear significant until much larger lags, suggesting that the residuals are uncorrelated (Figure 7a & 7b). The ACF and PACF of the squared residuals, however, show substantial autocorrelation even at early lags, suggesting that the squared residuals are correlated (Figure 7c & 7d). Residuals that are uncorrelated but whose squares are correlated are not independent. There appears to be some structure in the data that is not captured by the ARIMA model, suggesting that a nonlinear model may be appropriate.
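The idea of a series that is uncorrelated yet not independent can be illustrated with a small simulation. This is a sketch on made-up data, not the model residuals: it assumes a hypothetical ARCH(1) process with arbitrary parameters `omega = 1.0` and `alpha = 0.3`, and the helper `sample_acf` is defined here only for the example.

```python
import math
import random

def sample_acf(series, lag):
    """Sample autocorrelation of `series` at the given lag."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((x - mean) ** 2 for x in series)
    numer = sum((series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n))
    return numer / denom

# Hypothetical ARCH(1) process: eps_t = sqrt(h_t) * z_t with h_t = omega + alpha * eps_{t-1}^2.
rng = random.Random(1)
omega, alpha = 1.0, 0.3
eps = [0.0]
for _ in range(5000):
    h = omega + alpha * eps[-1] ** 2
    eps.append(math.sqrt(h) * rng.gauss(0.0, 1.0))
eps = eps[1:]  # drop the artificial starting value
squared = [e * e for e in eps]

print("lag-1 ACF of the series:        ", round(sample_acf(eps, 1), 3))
print("lag-1 ACF of the squared series:", round(sample_acf(squared, 1), 3))
```

The series itself should show a lag-1 autocorrelation near zero while its squares are clearly autocorrelated, which is the same pattern described for Figures 7a-7d.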

Figure 7. ARIMA Residual Plots

7a. ACF of Residuals

7b. PACF of Residuals

7c. ACF of Squared Residuals

7d. PACF of Squared Residuals

Part 4: GARCH Model Selected

To determine which ARCH or GARCH model to use, we use AIC_c to evaluate candidate models (Figure 8).

Figure 8a. AIC_C for Candidate ARCH Models

Figure 8b. AIC_C for Candidate GARCH(1,1) Model

The GARCH(1,1) model has the lowest AIC_c (-61607.05). Thus, we will use this model, which yields \(h_{t} = 0.0000005255 + 0.02037\varepsilon^{2}_{t-1} + 0.9812h_{t-1}\). Based on the model output, all estimates are statistically significant with \(p < 2.2 \times 10^{-16}\).

Figure 9. Model Outputs for GARCH(1,1) Model
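The conditional variance equation can be run forward as a simple recursion. The sketch below uses the three estimates from the fitted equation above; the function name, the starting variance `h0`, and the residual values are assumptions made for illustration only.

```python
# Estimates copied from the fitted GARCH(1,1) equation above.
OMEGA = 0.0000005255
ALPHA = 0.02037
BETA = 0.9812

def conditional_variances(residuals, h0):
    """Return h_1..h_n, where h_t = OMEGA + ALPHA * eps_{t-1}^2 + BETA * h_{t-1}.

    `residuals` holds eps_0..eps_{n-1}; `h0` is an assumed starting variance.
    """
    variances = []
    h_prev = h0
    for eps_prev in residuals:
        h_prev = OMEGA + ALPHA * eps_prev ** 2 + BETA * h_prev
        variances.append(h_prev)
    return variances

# Hypothetical residuals: a quiet stretch, one large shock, then quiet again.
shocks = [0.0, 0.0, 0.2, 0.0, 0.0]
for t, h in enumerate(conditional_variances(shocks, h0=1e-4), start=1):
    print(t, f"{h:.6e}")
```

The variance jumps right after the large shock and then decays slowly, because the coefficient on \(h_{t-1}\) is close to 1. Note that with the rounded estimates the two coefficients sum to about 1.0016, so the usual long-run variance formula \(\omega / (1 - \alpha - \beta)\) does not give a positive value here; this may be partly an artifact of rounding.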

Part 5: Compare One-Step Ahead Forecast Intervals of ARIMA vs. ARIMA-ARCH Model

The 95% one-step ahead forecast interval for ARIMA is (1.865661, 2.020754) and for ARIMA-ARCH is (1.608771, 2.277645). The ARIMA-ARCH forecast interval is wider.

Figure 10. Forecast Intervals for ARIMA and ARIMA-ARCH Model
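A quick way to compare the two intervals is to look at their centres and widths. The sketch below uses only the interval endpoints reported above; the `implied_sd` helper assumes each interval has the symmetric normal form centre ± 1.96 × sd, which is an assumption about how the intervals were built rather than something stated in the model output.

```python
# 95% one-step-ahead intervals reported above (log bikes available).
arima = (1.865661, 2.020754)
arima_arch = (1.608771, 2.277645)

def width(interval):
    """Upper endpoint minus lower endpoint."""
    lower, upper = interval
    return upper - lower

def implied_sd(interval, z=1.96):
    """Standard deviation implied by a symmetric interval of the form centre ± z * sd."""
    return width(interval) / (2 * z)

for name, interval in (("ARIMA", arima), ("ARIMA-ARCH", arima_arch)):
    centre = sum(interval) / 2
    print(name, "centre:", round(centre, 4),
          "width:", round(width(interval), 4),
          "implied sd:", round(implied_sd(interval), 4))

print("width ratio:", round(width(arima_arch) / width(arima), 2))
```

Both intervals share essentially the same centre (about 1.9432), so the two models agree on the point forecast; the ARIMA-ARCH interval is roughly 4.3 times as wide because it uses the conditional variance at the forecast origin instead of a constant variance.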

Part 6: Conditional Variances

Examining the conditional variance plot (Figure 11), periods of high variance do coincide with periods of high volatility in the original plot (Figure 2a). There are spikes around April 12 - April 14 that coincide with the large dips in the time series.

Figure 11. Conditional Variance of ARCH Model

Part 7 & 9: Forecast Intervals of the ARIMA/ARIMA-ARCH Model

Looking at the forecast from the ARIMA(3,1,3) model (Figure 12a), the interval appears to be too narrow: compared against the historical values, the ARIMA forecast intervals would not encompass the majority of values. In contrast, the ARIMA-ARCH forecast intervals encompass the majority of values, failing only 738 of 12,955 times (5.7%).
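Counting how often observations fall outside their forecast intervals can be sketched as follows. The function name and the toy numbers are made up for illustration; they are not the values plotted in Figure 12.

```python
def miss_rate(actual, lower, upper):
    """Fraction of observations that fall outside their forecast interval."""
    misses = sum(1 for y, lo, hi in zip(actual, lower, upper) if y < lo or y > hi)
    return misses / len(actual)

# Toy values on the log scale, for illustration only.
actual = [2.0, 2.1, 1.4, 2.3, 2.2]
lower = [1.8, 1.8, 1.8, 1.9, 1.9]
upper = [2.2, 2.3, 2.2, 2.4, 2.5]

print(miss_rate(actual, lower, upper))  # one miss out of five -> 0.2
```

For well-calibrated 95% intervals, the miss rate should be close to 5%, which is the benchmark the ARIMA-ARCH failure count above can be compared against.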

Figure 12. Log Bikes Available Forecast Intervals

12a. ARIMA Forecast Intervals

12b. ARIMA-ARCH Forecast Intervals

n.b. if the plot does not render completely, try adjusting the window screen...

Part 8: Residuals of the ARIMA-ARCH Model

The normality plot for ARCH Residuals (Figure 13) appears to exhibit leptokurtosis.

Figure 13. Normality Plot for ARCH Residuals

A large number of data points (outliers) curve away from the line and trail off at both the left and right ends of the plot. This suggests that the model did not adequately capture the leptokurtosis: if the model were adequate, the normality plot would be close to a straight line.
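Leptokurtosis can also be checked numerically with the sample excess kurtosis, which is about 0 for normal data and positive for heavy-tailed data. The sketch below is a minimal illustration on toy values, not the actual ARCH residuals; the function name is an assumption made for this example.

```python
def excess_kurtosis(values):
    """Sample excess kurtosis: fourth central moment over squared variance, minus 3."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / m2 ** 2 - 3.0

# Toy standardized residuals: mostly small values plus two extreme ones.
heavy_tailed = [0.0] * 98 + [10.0, -10.0]
print(excess_kurtosis(heavy_tailed))  # 47.0
```

A clearly positive value for the standardized residuals would agree with the curved tails in the normality plot.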