In this article, I will introduce the topic of time series forecasting and provide a high-level overview of the concepts and practices used by forecasters when building time series models.
What is Time Series Data?
Time series data is simply any time-ordered dataset. The time component serves as the primary axis and the remaining data can be either univariate or multivariate (single-stream or multi-stream). Typically when we think of time series we think of equally spaced, discrete measurements taken successively over time. But there are cases where the time intervals would be inconsistent.
For example, when sampling from sensor data the records can be either time-driven or event-driven. A time-driven record is where a measurement or reading is taken at evenly-spaced time intervals. An event-driven record is (just as the name implies) when a measurement or reading is triggered by some event. In an event-driven sampling scenario we could expect to see inconsistent time intervals in the data.
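One common way to deal with event-driven records is to aggregate them into fixed-width time buckets. Here is a minimal sketch of that idea in plain Python. The function name `to_regular_intervals`, the sample readings, and the choice to average within a bucket and leave empty buckets as `None` are all assumptions made for illustration.

```python
from collections import defaultdict

def to_regular_intervals(events, step):
    """Average event-driven readings into fixed-width time buckets.

    events: list of (timestamp_seconds, value) pairs, in any order.
    step:   bucket width in seconds.
    Returns (bucket_start, mean_value) pairs covering every bucket from
    the first to the last event; buckets with no events get None.
    """
    buckets = defaultdict(list)
    for ts, value in events:
        buckets[int(ts // step)].append(value)
    first, last = min(buckets), max(buckets)
    series = []
    for b in range(first, last + 1):
        values = buckets.get(b)  # .get() does not create missing keys
        mean = sum(values) / len(values) if values else None
        series.append((b * step, mean))
    return series

# Hypothetical event-driven sensor readings: (seconds, temperature)
events = [(1, 20.0), (4, 22.0), (12, 21.0), (31, 25.0)]
regular = to_regular_intervals(events, step=10)
print(regular)
```

How to fill the empty buckets (carry the last value forward, interpolate, or leave the gap) depends on what the measurement represents.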
Time series data is among the fastest-growing data categories, driven by the proliferation of sensors, IoT devices, and mobile technologies. The granularity is often high-resolution. Sometimes, as in the case of autonomous vehicles, new information is generated by the millisecond.
Time series are typically plotted as line charts (or run sequence plots) with time on the x-axis. Below is an example of a time series plot of financial trade activity measured every 5 minutes.
The Fallacies of Forecasting
Forecasting has been around a long time. We have been trying to predict the future since ancient times by enlisting the help of prophets, oracles, and soothsayers. Their predictions came at a high cost and were ambiguous and unreliable at best. But now, thanks to mathematics and statistics, we have a way to gain insights about the future without the high cost of a drug-induced maiden.
Today, all we need is data, a statistician, and a computer to be able to make inferences about the future. This is true if (and only if) you have the right data and you are asking the right question. In reality, forecasting can be incredibly complex and difficult even for the most sophisticated statistician. It all depends on the nature and behavior of the data and what you are trying to achieve with it.
Generally speaking, reliable and well-behaved data produces reliable and well-behaved forecasts. Less well-behaved data requires additional information or alternative modeling techniques in order to explain the behavior. Sometimes forecasts are little better than random guessing. But let’s look at the scenarios where forecasting has proven useful.
Who Needs Forecasting?
Forecasting is a well-known tool in the business world, where it is used heavily in financial and operational planning. If you are buying parts or products from an offshore factory, and it takes 8 weeks for them to manufacture and deliver to your door, then you want to have some idea of how many “widgets” you need to buy. If you are a financial portfolio manager, you want to know the best time to buy, sell, and trade your assets. If you are the CEO of a company, you are interested in knowing the forecasted financial health of your business. Below are some of the most common forecasting use cases in practice today:
- Retail Operations: Predicting demand for a product in a physical or online store.
- Service Operations: Predicting airline flight capacity or taxi fleet activity.
- Warehousing: Predicting raw material requirements and inventory SKU counts.
- Staffing: Predicting workforce requirements and hiring activities.
- IT: Predicting IT infrastructure utilization and compute requirements.
- Internet: Predicting web-traffic patterns.
- Business: Predicting revenue, sales, expenses and cash flow.
- Industrial IoT: Predicting machine maintenance and failure.
- Financial: Predicting financial indicator performance from stocks, bonds, funds, and exchange rates.
Properties of Time Series Data
When we talk about time series there are a few properties we need to look for and take into consideration when we formulate a forecast.
Stationary data is data that maintains the same statistical distribution over time, with a constant mean and standard deviation. We can identify stationary data by looking at the run sequence plot and observing whether a trend line exists or whether there are any seasonal patterns. In forecasting, it is preferred to work with stationary data. If the data is non-stationary, we can perform a differencing operation, which replaces each data point with the difference between it and the value observed a fixed number of time intervals earlier.
When we make our data stationary we are essentially removing the effect of time and turning our dataset into a standard statistical distribution, which is much easier to work with when building statistically based forecasting models.
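As a quick numerical complement to eyeballing the plot, you can compare the mean and standard deviation of the first and second halves of a series. The sketch below is only a rough heuristic of my own choosing, not a formal stationarity test, and the helper name `split_half_summary` and both sample series are made up for illustration.

```python
from statistics import mean, pstdev

def split_half_summary(values):
    """Compare mean and standard deviation across the two halves of a series.

    Large differences between halves hint that the series is not stationary.
    """
    half = len(values) // 2
    first, second = values[:half], values[half:]
    return {
        "mean_first": mean(first),
        "mean_second": mean(second),
        "std_first": pstdev(first),
        "std_second": pstdev(second),
    }

trending = [float(t) for t in range(20)]  # steadily increasing: not stationary
flat = [1.0, -1.0] * 10                   # oscillates around a constant mean

print(split_half_summary(trending))  # the two means differ by 10
print(split_half_summary(flat))      # both halves have the same mean and spread
```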
Identifying a trend is easy to do from a run sequence plot. Is the data increasing or decreasing over time? If there is a lot of noise in the data, you can perform a smoothing operation (such as a moving average) to make the trend easier to identify.
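A moving average is simple enough to sketch directly. The version below uses a trailing window, which is an assumption on my part (a centered window is equally common), and the sample values are invented.

```python
def moving_average(values, window):
    """Trailing simple moving average.

    Returns len(values) - window + 1 points, one per full window.
    """
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    running = sum(values[:window])
    out = [running / window]
    for i in range(window, len(values)):
        # Slide the window: add the newest value, drop the oldest
        running += values[i] - values[i - window]
        out.append(running / window)
    return out

noisy = [3, 5, 4, 6, 8, 7, 9, 11, 10, 12]
smooth = moving_average(noisy, window=3)
print(smooth)  # [4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0]
```

The zig-zag in the raw values disappears and the upward trend becomes obvious.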
Seasonality is when cyclical patterns are detected in the data. There are many ways to verify and check for seasonality. It can often be identified with a simple run sequence plot, but you may also want to check with a seasonal subseries plot or multiple boxplots. Sometimes an autocorrelation plot can also indicate seasonality.
Autocorrelation occurs when data is correlated with itself, meaning the value of any given data point is correlated with preceding values at a specified time interval. This is often the case in time series. You can evaluate autocorrelation by looking at an autocorrelation plot or performing a Durbin-Watson test.
The plot shown below is an autocorrelation plot with a 95% confidence band. This plot shows a strong, positive, and slow-decay autocorrelation relationship between time intervals.
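The numbers behind an autocorrelation plot are straightforward to compute. The sketch below uses the usual sample autocorrelation estimate and the common large-sample approximation of ±1.96/√n for the 95% band. The function names and the purely trending sample series are assumptions for illustration.

```python
import math

def autocorrelation(values, max_lag):
    """Sample autocorrelation for lags 0..max_lag."""
    n = len(values)
    avg = sum(values) / n
    dev = [v - avg for v in values]
    denom = sum(d * d for d in dev)
    acf = []
    for lag in range(max_lag + 1):
        num = sum(dev[t] * dev[t - lag] for t in range(lag, n))
        acf.append(num / denom)
    return acf

def confidence_bound(n, z=1.96):
    """Approximate 95% band for the autocorrelation of white noise."""
    return z / math.sqrt(n)

series = [float(t) for t in range(50)]  # a pure upward trend
acf = autocorrelation(series, max_lag=5)
bound = confidence_bound(len(series))

for lag, r in enumerate(acf):
    flag = "significant" if abs(r) > bound else ""
    print(f"lag {lag}: {r:.3f} {flag}")
```

A trending series like this one produces the strong, positive, slowly decaying pattern described above.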
Partial autocorrelation differs subtly from total autocorrelation in that it takes into account the correlations of the intervening lag values and removes them from the equation. A partial autocorrelation plot strives to show the direct correlation of a lag with any given observation.
The plot shown below indicates significant positive correlations at lag 1 and 2, with a (potential) negative correlation at lag 10. The correlations at other lags are assumed to contribute only indirectly to the total correlation at that time.
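One standard way to get partial autocorrelations from ordinary autocorrelations is the Durbin-Levinson recursion. The sketch below is my own compact version of it. The input is a hypothetical autocorrelation sequence that decays as 0.5^k, which is what an AR(1) process with coefficient 0.5 would produce in theory.

```python
def partial_autocorrelation(acf, max_lag):
    """Partial autocorrelations for lags 0..max_lag via Durbin-Levinson.

    acf[k] is the autocorrelation at lag k, with acf[0] == 1.
    """
    pacf = [1.0]
    prev = []  # coefficients of the order k-1 model: phi_{k-1,1..k-1}
    for k in range(1, max_lag + 1):
        num = acf[k] - sum(prev[j] * acf[k - 1 - j] for j in range(k - 1))
        den = 1.0 - sum(prev[j] * acf[j + 1] for j in range(k - 1))
        phi_kk = num / den
        # Update the lower-order coefficients, then append the new one
        current = [prev[j] - phi_kk * prev[k - 2 - j] for j in range(k - 1)]
        current.append(phi_kk)
        pacf.append(phi_kk)
        prev = current
    return pacf

acf = [0.5 ** k for k in range(6)]  # decays slowly across all lags
pacf = partial_autocorrelation(acf, max_lag=5)
print(pacf)  # only lag 1 carries a direct correlation
```

Every lag has a non-zero autocorrelation here, but once the lag-1 effect is removed, nothing is left at lags 2 and beyond.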
Preparing Time Series for Forecasting
The initial task before we start modeling is to understand the nature of the dataset and see if we can reduce it to its simplest, most elemental pieces. This is called decomposition.
Decomposition is the process of extracting out all the information from a time series that can be explained by trends, seasonality, and cyclical patterns. What is left should be random and statistically stationary.
The goal of decomposition is to model each of the components. A decomposition model can be assembled as an additive or multiplicative model. An additive model is used when variations around the trend do not vary with the level (or amplitude) of the observation. If a proportional relationship exists then a multiplicative model would be an appropriate choice.
Forecasting with decomposition may or may not work well. It all depends on the behavior of the data. However it is a good place to start to understand your dataset and can lead you to selecting the right forecast model.
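To make the additive case concrete, here is a minimal sketch of a classical additive decomposition: a centered moving average estimates the trend, and the detrended values are averaged by position in the cycle to estimate the seasonal component. The function name and the synthetic series are assumptions. A multiplicative version would divide where this one subtracts.

```python
def decompose_additive(values, period):
    """Classical additive decomposition: value = trend + seasonal + remainder."""
    n = len(values)
    half = period // 2
    trend = [None] * n  # undefined at the edges, where the window does not fit
    for t in range(half, n - half):
        window = values[t - half:t + half + 1]
        if period % 2:  # odd period: plain centered window
            trend[t] = sum(window) / period
        else:           # even period: half weight on the two end points
            trend[t] = (sum(window) - 0.5 * (window[0] + window[-1])) / period
    # Average the detrended values by position within the cycle
    buckets = [[] for _ in range(period)]
    for t in range(n):
        if trend[t] is not None:
            buckets[t % period].append(values[t] - trend[t])
    seasonal_means = [sum(b) / len(b) for b in buckets]
    offset = sum(seasonal_means) / period  # center seasonal effects on zero
    seasonal = [seasonal_means[t % period] - offset for t in range(n)]
    remainder = [
        None if trend[t] is None else values[t] - trend[t] - seasonal[t]
        for t in range(n)
    ]
    return trend, seasonal, remainder

# Synthetic series: linear trend plus a repeating 4-step pattern, no noise
pattern = [2.0, -1.0, 0.0, -1.0]
values = [0.5 * t + pattern[t % 4] for t in range(24)]
trend, seasonal, remainder = decompose_additive(values, period=4)
print(seasonal[:4])  # recovers the repeating pattern
```

With real data the remainder will not be zero, and checking whether it looks random is the useful part of the exercise.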
Differencing is a method of transforming a time series to remove seasonality and trends in order to make the dataset stationary. Differencing can be performed multiple times on the same dataset to remove both seasonality and trend cycles.
Data differencing is an important transformation technique that does two things: It makes the data stationary and stabilizes the mean of the time series.
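Differencing is a one-liner in practice. The sketch below shows a first difference removing a linear trend, and a seasonal difference (lag equal to the season length) followed by a first difference removing both seasonality and trend. The helper name and both sample series are made up for illustration.

```python
def difference(values, lag=1):
    """Replace each value with its change from the value `lag` steps earlier."""
    return [values[t] - values[t - lag] for t in range(lag, len(values))]

# A linear trend becomes a constant after one round of differencing
trend_series = [2 * t + 5 for t in range(8)]
print(difference(trend_series))  # [2, 2, 2, 2, 2, 2, 2]

# Trend plus a 4-step seasonal pattern
pattern = [0, 5, 0, -5]
seasonal_series = [t + pattern[t % 4] for t in range(12)]
deseasonalized = difference(seasonal_series, lag=4)  # removes the pattern
print(deseasonalized)              # [4, 4, 4, 4, 4, 4, 4, 4]
print(difference(deseasonalized))  # [0, 0, 0, 0, 0, 0, 0]
```

Each round of differencing shortens the series by `lag` points, so the forecasts have to be "un-differenced" (cumulatively summed) to get back to the original scale.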
Time Series Forecast Modeling Techniques
There are a lot of forecasting techniques, and each technique comes with its own set of variations that are used depending on the data. Below I provide extremely brief descriptions of the most common forecasting techniques used in practice today.
The ARIMA Family
- AR(p) – Autoregressive: The output variable depends linearly on its own previous values and on a stochastic term. Used on univariate data that is not always stationary. Parameter ‘p’ is the number of time-unit lags that significantly impact the current observation value.
- VAR(p) – Vector Autoregressive: The output is a generalized AR model that captures the linear interdependencies among multiple time series by allowing for more than one evolving variable. Basically, a version of AR that allows for multiple time series to be expressed in a vector. It is used on multivariate data that may, or may not, be stationary. Parameter ‘p’ is the number of time-unit lags that significantly impact the current observation value.
- MA(q) – Moving Average: The output variable depends linearly on the current and previous values of a stochastic (error) term. Parameter ‘q’ is the number of lagged error terms included in the model. Despite the name, this is not the same thing as moving-average smoothing, the noise-reducing “filter” whose variations include “simple”, “cumulative”, “weighted”, and “exponential”.
- ARMA(p, q) – Autoregressive Moving Average: The output variable is a weighted sum of its own past values (the AR part) and past error terms (the MA part). Works well for short-term forecasting. Parameter ‘p’ is the number of time-unit lags that significantly impact the current observation value and parameter ‘q’ is the number of lagged error terms.
- VARMA(p, q) – Vector Autoregressive Moving Average: Output is a generalized multivariate ARMA model. Parameter ‘p’ is the number of time-unit lags that significantly impact the current observation value and parameter ‘q’ is the number of lagged error terms.
- ARIMA(p, d, q) – Autoregressive Integrated Moving Average: Similar to an ARMA model but uses differencing to make the data stationary. An ARIMA model can be viewed as a filter that separates the signal from the noise, such that the signal can be extrapolated into the future to obtain forecasts. Parameter ‘p’ is the number of time-unit lags that significantly impact the current observation value, parameter ‘q’ is the number of lagged error terms, and parameter ‘d’ is the number of times the data is differenced.
- VARIMA – Vector Autoregressive Integrated Moving Average: Similar to a VARMA model but adjusts for trends in the input data by using differencing.
- SARIMA(p, d, q)(P, D, Q)m – Seasonal Autoregressive Integrated Moving Average: Similar to an ARIMA model but has the ability to account for seasonality and repeating signals. Parameters ‘p’, ‘d’, and ‘q’ are the autoregressive lags, the degree of differencing, and the moving-average order for addressing the trend in the data, while ‘P’, ‘D’, ‘Q’ and ‘m’ are the seasonal parameters, with ‘m’ as the number of time steps in a single seasonal period.
- FARIMA or ARFIMA – Fractional Autoregressive Integrated Moving Average: Similar to an ARIMA model but allows for non-integer values of the differencing parameter. Suited to series with long memory, where the autocorrelations decay very slowly.
- ARCH & GARCH – (Generalized) Autoregressive Conditional Heteroskedasticity: Output describes the variance of the current error term as a function of the actual sizes of the previous periods’ error terms. In a retail inventory application, a forecaster would use GARCH to measure the uncertainty of the sales forecasts and use that as a safety stock value.
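To give a feel for the simplest member of this family, here is a minimal sketch of fitting and forecasting an AR(1) model, x_t = c + phi * x_{t-1} + e_t, with ordinary least squares in plain Python. The function names and the noise-free sample series are assumptions for illustration. In practice you would reach for a statistics library rather than hand-rolling this.

```python
def fit_ar1(values):
    """Least-squares estimate of c and phi in x_t = c + phi * x_{t-1} + e_t."""
    x_prev, x_next = values[:-1], values[1:]
    n = len(x_prev)
    mean_prev = sum(x_prev) / n
    mean_next = sum(x_next) / n
    cov = sum((a - mean_prev) * (b - mean_next) for a, b in zip(x_prev, x_next))
    var = sum((a - mean_prev) ** 2 for a in x_prev)
    phi = cov / var
    c = mean_next - phi * mean_prev
    return c, phi

def forecast_ar1(last_value, c, phi, steps):
    """Iterate the fitted equation forward, feeding each forecast back in."""
    out = []
    x = last_value
    for _ in range(steps):
        x = c + phi * x
        out.append(x)
    return out

# Noise-free series generated by x_t = 2 + 0.5 * x_{t-1}, starting at 10
series = [10.0]
for _ in range(9):
    series.append(2.0 + 0.5 * series[-1])

c, phi = fit_ar1(series)
forecast = forecast_ar1(series[-1], c, phi, steps=50)
print(round(c, 6), round(phi, 6))
print(round(forecast[-1], 6))  # settles at the long-run mean c / (1 - phi)
```

The forecasts decay toward the long-run mean, which is one reason these models tend to be most useful over short horizons.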
The Exponential Smoothing Family
- ETS(𝞪) – Exponential Smoothing: The output variable is a weighted sum of past terms, with an exponentially decreasing weight applied to all past observations. Think of it as a low-pass filter to remove high-frequency noise. Requires stationary data and a smoothing parameter alpha ‘𝞪’ (between 0 and 1).
- Double ETS – Double Exponential Smoothing: A double recursive version of ETS that accounts for trends in the data (also known as Holt’s linear method).
- Triple ETS – Triple Exponential Smoothing: A triple recursive version of ETS that accounts for seasonality as well as trend (also known as the Holt-Winters method).
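The first two members of this family fit in a few lines each. The sketch below shows simple exponential smoothing and a basic form of double exponential smoothing with a level and a trend term. The initialization choices (first value as the starting level, first difference as the starting trend) and the sample data are assumptions, and other conventions exist.

```python
def simple_exponential_smoothing(values, alpha):
    """Each smoothed value blends the new observation with the previous level."""
    level = values[0]
    smoothed = [level]
    for v in values[1:]:
        level = alpha * v + (1 - alpha) * level
        smoothed.append(level)
    return smoothed

def holt_forecast(values, alpha, beta, steps):
    """Double exponential smoothing: track a level and a trend, then extrapolate."""
    level = values[0]
    trend = values[1] - values[0]
    for v in values[1:]:
        prev_level = level
        level = alpha * v + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
    return [level + (h + 1) * trend for h in range(steps)]

print(simple_exponential_smoothing([4.0, 8.0, 6.0], alpha=0.5))  # [4.0, 6.0, 6.0]
print(holt_forecast([3.0, 5.0, 7.0, 9.0, 11.0], alpha=0.5, beta=0.5, steps=3))
# [13.0, 15.0, 17.0]
```

A higher alpha reacts faster to new observations, while a lower alpha smooths more aggressively.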
The Neural Net Family
- RNN – Recurrent Neural Network: A network with feedback connections that carries a hidden state from one time step to the next. Training and test windows are randomly sampled from the dataset, and lagged data is fed in to account for seasonality.
- MDN – Mixture Density Network: Probabilistic deep neural network where the output is a mixture of Gaussians for all points in the forecast horizon. Good for data with seasonality and a large number of peaks and valleys. Large spikes are an indication of having many associative variables.
- LSTM – Long Short-Term Memory Network: A type of RNN that has a long-term memory capability, controlled by a set of “gates”, that mitigates the vanishing gradient problem typical of RNNs. The vanishing gradient problem is the counterpart of the “exploding” gradient problem: gradients either shrink toward zero or grow without bound as they are propagated back through many time steps.
- TDNN – Time Delay Neural Network: A feed-forward neural network that breaks the input into chunks and feeds them in one at a time.
Other Forecasting Models
- Non-Parametric Time Series: Predicts the future value distribution of a given time series by sampling from past observations. Useful when data is intermittent or sparse, and can handle data with variable time deltas.
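Whatever network you choose, the series usually has to be cut into windows of lagged inputs paired with a target value before training. The sketch below shows that preparation step only, not a network. The function name `make_lagged_windows` and its parameters are assumptions for illustration.

```python
def make_lagged_windows(values, n_lags, horizon=1):
    """Turn a series into (input window, target) pairs for supervised training.

    Each input holds n_lags consecutive values; the target is the value
    `horizon` steps after the end of that window.
    """
    inputs, targets = [], []
    for end in range(n_lags, len(values) - horizon + 1):
        inputs.append(values[end - n_lags:end])
        targets.append(values[end + horizon - 1])
    return inputs, targets

inputs, targets = make_lagged_windows([1, 2, 3, 4, 5, 6], n_lags=3)
for x, y in zip(inputs, targets):
    print(x, "->", y)
# [1, 2, 3] -> 4
# [2, 3, 4] -> 5
# [3, 4, 5] -> 6
```

Setting `n_lags` to at least one full seasonal period gives the model a chance to see the repeating pattern.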
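The sampling idea behind the non-parametric approach can be sketched very simply: draw repeatedly from the observed history and read quantiles off the result. This is a deliberately naive version that treats every past observation as equally likely. The function names, the fixed seed, and the intermittent-demand sample data are all assumptions.

```python
import random

def sample_forecast_distribution(history, n_samples=1000, seed=0):
    """Build an empirical forecast distribution by resampling past observations."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return sorted(rng.choice(history) for _ in range(n_samples))

def quantile(sorted_samples, q):
    """Value below which roughly a fraction q of the samples fall."""
    index = min(int(q * len(sorted_samples)), len(sorted_samples) - 1)
    return sorted_samples[index]

# Hypothetical intermittent demand: mostly zero with occasional sales
history = [0, 0, 3, 0, 0, 0, 5, 0, 1, 0]
samples = sample_forecast_distribution(history)

for q in (0.1, 0.5, 0.9):
    print(f"{int(q * 100)}th percentile: {quantile(samples, q)}")
```

The output is a range of plausible values rather than a single number, which is often more useful for sparse data than a point forecast.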
The time component of time series data poses a unique challenge for a data scientist and it requires them to take a different approach to forecast modeling. In this article, I briefly introduced the topic of time series and reviewed common properties and techniques to consider. In the next installment (Part II) I will go into model selection and evaluation with a walk-through example.