The name time series comes from the data having an ordered time index (e.g. days, weeks, quarters), usually (and preferably) equally spaced. It has a notion of past, present and future, which means you cannot randomly sample from the data the way you can with cross-sectional data.
Let’s start with some non-time series data, such as the swiss data set (it is included in base R). You will see the row indexes are Swiss regions, and you can order them in any way you want or take random samples.
head(swiss[,c("Agriculture","Catholic")])
## Agriculture Catholic
## Courtelary 17.0 9.96
## Delemont 45.1 84.84
## Franches-Mnt 39.7 93.40
## Moutier 36.5 33.77
## Neuveville 43.5 5.16
## Porrentruy 35.3 90.57
ggplot(swiss,aes(x=Agriculture,y=Catholic)) + geom_point()
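Since the row order carries no information here, shuffling or sampling the rows is perfectly fine. A minimal sketch (the seed and sample size are arbitrary):
set.seed(58) # arbitrary seed, just for reproducibility
swiss[sample(nrow(swiss), 5), c("Agriculture","Catholic")] # five random rows, still a valid data set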
Now, let’s see some time series data. The data we are going to use is electricity consumption data, which can be found on the EPİAŞ Şeffaflık / EXIST Transparency platform (click here). You can also download the sample data, covering 2015 to yesterday (May 6, 2018), from here.
Let’s check the consumption data. The first column is a time index in “year-month-day hour:minute:second” format and the second column is the value column, in this case hourly electricity consumption of Turkey.
print(consumption_data)
## # A tibble: 20,904 x 2
## date_time consumption
## <dttm> <dbl>
## 1 2015-12-18 00:00:00 30070
## 2 2015-12-18 01:00:00 28311
## 3 2015-12-18 02:00:00 27351
## 4 2015-12-18 03:00:00 26740
## 5 2015-12-18 04:00:00 26526
## 6 2015-12-18 05:00:00 27230
## 7 2015-12-18 06:00:00 28319
## 8 2015-12-18 07:00:00 29790
## 9 2015-12-18 08:00:00 34311
## 10 2015-12-18 09:00:00 37337
## # ... with 20,894 more rows
It is also useful to know some date-time transformations when working with this kind of data, but we will not cover the topic in depth here. You can pick it up quickly from these two sources: date time with R and the lubridate tutorial. This tutorial uses lubridate functions (they are amazing).
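As a tiny taste (a side sketch, separate from the consumption data pipeline), lubridate parses strings into date-times and extracts their components:
library(lubridate)
x <- ymd_hms("2015-12-18 09:00:00", tz="Turkey") # parse a date-time string
year(x) # 2015
wday(x, week_start=1) # 5, i.e. Friday
hour(x) # 9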
Let’s plot the consumption data. This time, instead of a scatter plot, we use a line plot, because every data point is connected to the previous and next data points in time.
# consumption_data <-
# readr::read_csv("~/Downloads/GercekZamanliTuketim-01012015-06052018.csv",
# skip=1,col_names=FALSE,
# locale=readr::locale(decimal_mark=",",grouping_mark=".",tz="Turkey")) %>% rename(Date=X1,Hour=X2,consumption=X3) %>% transmute(date_time=lubridate::parse_date_time(Date,orders="dmY")+Hour,consumption)
# save(consumption_data,file="/Users/berkorbay/Dropbox/Courses_given/BOUN_ETM_58D_2017S/Lecture_Notes/week_13/consumption_data.RData")
ggplot(consumption_data,aes(x=date_time,y=consumption)) +
labs(x="Date/Time (Hourly)",
y="Consumption (MWh)",
title="Hourly Electricity Consumption of Turkey") +
geom_line() +
theme_minimal()
Time series data is messy. Do you see the zero value? It is caused by the summer/winter time shifts. You need to clean or fix this kind of peculiarity before you can work with time series data; a quick way to at least spot such hours is sketched below. Today, though, instead of fixing them we are going to aggregate the hourly MWh consumption data into weekly GWh values.
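A minimal inspection sketch (it only flags the suspect hours, it does not repair them):
# hours with zero or missing consumption, caused by the daylight saving shifts
consumption_data %>% filter(consumption == 0 | is.na(consumption))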
consumption_weekly <-
consumption_data %>%
group_by(Date=lubridate::as_date(date_time)) %>%
summarise(consumption=sum(consumption,na.rm=T)/1000) %>%
ungroup() %>%
filter(Date >= "2015-12-21") %>%
# arrange(desc(Date)) %>%
mutate(wday=lubridate::wday(Date,week_start=1)) %>%
group_by(wday) %>%
mutate(weeknum=row_number()) %>%
ungroup() %>%
group_by(weeknum) %>%
summarise(consumption=sum(consumption,na.rm=T)) %>%
ungroup()
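Before moving on, a quick sanity check (a side calculation, not needed for the rest): the distinct days from 2015-12-21 onward should split evenly into 7-day weeks.
n_days <- consumption_data %>%
mutate(Date=lubridate::as_date(date_time)) %>%
filter(Date >= "2015-12-21") %>%
distinct(Date) %>%
nrow()
n_days / 7 # a whole number (124), matching the length of the weekly series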
The reason we went through such a lengthy transformation is to make sure every week covers exactly 7 days. Now let’s plot our weekly data. It is a fun roller coaster (i.e. seasonality). Purple dashed lines indicate year starts.
cons_plot <-
ggplot(consumption_weekly,aes(x=weeknum,y=consumption)) +
labs(x="Week",
y="Consumption (MWh)",
title="Weekly Electricity Consumption of Turkey") +
geom_line() +
geom_vline(xintercept=c(2,55,107),color="purple",size=0.5,linetype="dashed") +
theme_minimal()
cons_plot
Time series data has its own characteristics. Let’s start by identifying the sudden drops in consumption. What might cause them? Keep in mind it is always possible that the data is simply wrong. Also check whether consumption is increasing over time (i.e. a trend).
cons_plot + geom_smooth(method="lm",se=FALSE)
Even though they can be used for other purposes such as anomaly detection, the main purpose of time series analysis is forecasting. We make forecasts because we need to know about the future. What will the weather be tomorrow? Where will the stock market be next year? How many units can I sell in the next quarter? If we can see the future with some degree of clarity, we can plan ahead. For instance, we can arrange our production capacity, invest in the most profitable stocks, or just take an umbrella on the way out.
Forecasting is never easy, and real-world series are rarely stationary. There are many methods and many pitfalls. Today we increasingly use sophisticated tools and mathematical models for forecasting, but in the past it was all about human intuition or divination methods such as haruspicy.
Forecasting starts with the analysis and identification of components. The fundamental components are trend (e.g. increasing consumption) and seasonality (e.g. ice cream sales). There is also cyclicality (e.g. bear and bull markets). There are special days (e.g. holidays) and their interactions (e.g. a weekend vs a 4-day holiday). Other external variables are also important (e.g. temperature, daylight). Events planned in response to forecasts (e.g. discount campaigns) can in turn affect the forecast itself.
Beware: there is also noise, pure randomness. Sometimes it looks like it can be modeled, but it actually cannot, and it can be hard to distinguish noise from signal.
You can (try to) forecast almost anything that will happen in the future, and R is your best friend. There are amazing R functions that make your life much easier and superb R packages for almost any kind of forecasting method. Rob Hyndman’s forecast package and its associated book (see here) are very fine resources, which we will partly follow here.
So let’s load the libraries. Install them if you do not have the libraries.
library(forecast)
library(fma)
library(fpp)
Autocorrelation is ordinary correlation, but instead of measuring it between different variables, we measure it between a series and its own lagged values (e.g. today vs yesterday).
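For example, the lag-1 autocorrelation is (almost) the plain correlation between the series and itself shifted by one week; a quick sketch:
x <- consumption_weekly$consumption
cor(x[-1], x[-length(x)]) # roughly the first bar of the ACF plot below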
cons_ts <- as.ts(consumption_weekly$consumption) # a plain series with frequency 1 (see the printout below); we set a proper frequency with ts() later
print(cons_ts)
## Time Series:
## Start = 1
## End = 124
## Frequency = 1
## [1] 5461.870 5151.477 5351.725 5278.316 5520.635 5539.354 5257.580
## [8] 5277.803 4931.996 4989.307 4874.317 4877.194 5123.416 5000.016
## [15] 5034.354 4922.256 4965.309 4919.686 5064.609 5018.301 5040.058
## [22] 4901.160 4881.155 5010.810 4973.707 5249.778 5893.636 5691.209
## [29] 4405.583 5804.768 5853.943 5974.774 6075.727 6043.160 5848.992
## [36] 5851.157 5608.761 5448.794 4042.791 4962.589 4945.646 4960.470
## [43] 4927.765 4959.392 4946.075 5074.122 5079.516 5312.894 5414.136
## [50] 5541.272 5621.493 5727.062 5851.498 5498.768 5629.653 5707.258
## [57] 5651.176 5764.715 5742.655 5577.219 5775.314 5479.987 5350.096
## [64] 5287.304 5499.080 5221.240 5184.441 5165.101 5134.689 5091.152
## [71] 5162.964 5094.551 5254.048 5159.827 5121.293 5112.272 5219.161
## [78] 5343.421 5105.879 5419.578 6194.232 6314.642 6285.170 6395.449
## [85] 6380.908 6517.536 6356.117 6104.603 4866.081 5513.819 5944.302
## [92] 5807.028 5450.772 5255.386 5231.599 5224.247 5227.445 5436.747
## [99] 5427.664 5425.230 5733.742 5716.017 5851.673 5783.542 5898.963
## [106] 5732.149 5589.169 5889.455 5930.545 5970.350 5751.984 5631.136
## [113] 5729.768 5719.665 5759.217 5405.711 5329.232 5393.010 5494.758
## [120] 5382.244 5335.816 5364.685 5273.187 5232.958
ggAcf(cons_ts)
Basically, bars that stay between the blue lines indicate autocorrelation that is not statistically significant. In our case, up to lag 5 we can say there is some degree of autocorrelation. Between lags 11 and 14 we see negative values, marking the turn of the cyclical activity.
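The blue dashed lines themselves are the approximate 95% significance bounds, roughly \(\pm 1.96/\sqrt{n}\); a quick sketch of where they fall for our 124 weeks:
n <- length(cons_ts)
c(-1, 1) * 1.96 / sqrt(n) # about -0.18 and 0.18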
Now let’s draw the lag plots. Lag plots are scatterplots of the consumption level against its value X periods (e.g. weeks) earlier.
lag.plot(cons_ts, lags=6, do.lines=TRUE)
We talked about the seasonality and trend components of the time series. Now let’s build a new time series object, this time specifying the frequency and the starting point.
cons_ts2<-ts(consumption_weekly$consumption,start=c(2015,52),frequency=52)
print(cons_ts2)
## Time Series:
## Start = c(2015, 52)
## End = c(2018, 19)
## Frequency = 52
## [1] 5461.870 5151.477 5351.725 5278.316 5520.635 5539.354 5257.580
## [8] 5277.803 4931.996 4989.307 4874.317 4877.194 5123.416 5000.016
## [15] 5034.354 4922.256 4965.309 4919.686 5064.609 5018.301 5040.058
## [22] 4901.160 4881.155 5010.810 4973.707 5249.778 5893.636 5691.209
## [29] 4405.583 5804.768 5853.943 5974.774 6075.727 6043.160 5848.992
## [36] 5851.157 5608.761 5448.794 4042.791 4962.589 4945.646 4960.470
## [43] 4927.765 4959.392 4946.075 5074.122 5079.516 5312.894 5414.136
## [50] 5541.272 5621.493 5727.062 5851.498 5498.768 5629.653 5707.258
## [57] 5651.176 5764.715 5742.655 5577.219 5775.314 5479.987 5350.096
## [64] 5287.304 5499.080 5221.240 5184.441 5165.101 5134.689 5091.152
## [71] 5162.964 5094.551 5254.048 5159.827 5121.293 5112.272 5219.161
## [78] 5343.421 5105.879 5419.578 6194.232 6314.642 6285.170 6395.449
## [85] 6380.908 6517.536 6356.117 6104.603 4866.081 5513.819 5944.302
## [92] 5807.028 5450.772 5255.386 5231.599 5224.247 5227.445 5436.747
## [99] 5427.664 5425.230 5733.742 5716.017 5851.673 5783.542 5898.963
## [106] 5732.149 5589.169 5889.455 5930.545 5970.350 5751.984 5631.136
## [113] 5729.768 5719.665 5759.217 5405.711 5329.232 5393.010 5494.758
## [120] 5382.244 5335.816 5364.685 5273.187 5232.958
We are going to use a method called STL (Seasonal and Trend decomposition using Loess). Loess (locally weighted scatterplot smoothing) fits a smooth, nonlinear curve through the data; it is similar in spirit to regression, but local and nonlinear.
stl(cons_ts2,t.window=5,s.window="periodic", robust=TRUE) %>% autoplot()
STL actually does not do badly at all. There are some large errors at a few localized points, but the rest looks OK.
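The same decomposition can also be turned into forecasts; the forecast package can extend an stl fit (the 10-week horizon below is just an illustrative choice):
stl(cons_ts2,t.window=5,s.window="periodic", robust=TRUE) %>%
forecast(h=10) %>%
autoplot()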
A simple moving average (\(MA(X)\)) just takes the average of the last X observations. Let’s try to forecast the next week with MA(3) and MA(5). The forecast seems a bit lagged, doesn’t it?
the_plot <-
consumption_weekly %>%
# rollmean() comes from the zoo package (load it with library(zoo) if it is not already attached);
# lag(..., 1) shifts each moving average forward so that week t is forecast only from earlier weeks
mutate(ma_3=lag(rollmean(consumption,3,align="right",fill=NA),1),
ma_5=lag(rollmean(consumption,5,align="right",fill=NA),1)) %>%
gather(key=series,value,-weeknum) %>%
ggplot(.,aes(x=weeknum,y=value,color=series)) +
geom_line()
the_plot
## Warning: Removed 8 rows containing missing values (geom_path).
Autoregression is regression of the series on its own lagged values. The formula for AR(k) is given below.
\[x_t = \beta_0 + \beta_1 x_{t-1} + \beta_2 x_{t-2} + \dots + \beta_k x_{t-k} + \epsilon_t\]
You can fit an AR(k) model with the ar
function. Let’s try AR(5). The model spits out the coefficients.
the_model <- ar(cons_ts, aic=FALSE, order.max=5)
print(the_model)
##
## Call:
## ar(x = cons_ts, aic = FALSE, order.max = 5)
##
## Coefficients:
## 1 2 3 4 5
## 0.6382 0.0496 0.0879 0.0838 -0.1063
##
## Order selected 5 sigma^2 estimated as 86093
Let’s predict the consumption for the last 5 periods, using only the data before them.
the_prediction <- predict(the_model,newdata=cons_ts[-(120:124)],n.ahead=5)
print(the_prediction)
## $pred
## Time Series:
## Start = 120
## End = 124
## Frequency = 1
## [1] 5423.962 5420.590 5437.342 5443.394 5431.044
##
## $se
## Time Series:
## Start = 120
## End = 124
## Frequency = 1
## [1] 293.4161 348.0843 373.0156 392.0446 411.4034
Now let’s finally plot it vs the actual values.
the_plot <-
consumption_weekly %>% left_join(.,
tibble(weeknum=120:124,
forecast=as.vector(the_prediction$pred),
hi=forecast+the_prediction$se,lo=forecast-the_prediction$se),by="weeknum") %>%
ggplot(.,aes(x=weeknum)) +
geom_line(aes(y=consumption,color="Actual")) +
geom_ribbon(aes(ymin=lo,ymax=hi),fill="blue",alpha=0.3) +
geom_line(aes(y=forecast,color="Forecast"))
the_plot
There are multiple measures of forecast error. Some are meaningful in certain contexts, some are not; analysts usually follow a few metrics suited to their specific problem. The two main types are absolute scale errors and relative (percentage) scale errors.
For absolute scale errors, the simple difference (Actual - Forecast) is calculated. Its absolute value is the Absolute Error, and the mean of the absolute errors is the Mean Absolute Error (MAE).
Some cases, such as financial time series or series with very different base levels, require scaled errors. Then we scale the error by the actual value (e.g. (Actual - Forecast)/Actual); this is called the relative or percentage error. The mean of the absolute percentage errors is the Mean Absolute Percentage Error (MAPE).
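A tiny made-up example (the numbers are purely illustrative): with actuals of 100 and 200 and forecasts of 110 and 190, the errors are -10 and +10, so the MAE is 10 and the MAPE is (10/100 + 10/200)/2 = 7.5%.
actual <- c(100, 200) # illustrative numbers
fc <- c(110, 190)
mean(abs(actual - fc)) # MAE = 10
mean(abs(actual - fc) / actual) # MAPE = 0.075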
Here is a handy function to get the descriptive error statistics of your forecasts.
accu=function(actual,forecast){
  n=length(actual) #Length of the vector, number of periods
  error=actual-forecast #Raw errors (residuals), NOT A METRIC by themselves
  meanval=mean(actual) #Mean of the actual values
  stdev=sd(actual) #Standard deviation of the actual values
  CV=stdev/meanval #Coefficient of Variation
  AD=abs(actual-meanval) #Absolute deviation from the mean, NOT A METRIC
  R2=1-sum(error^2)/sum((actual-meanval)^2) #R-squared
  #AdjR2=1-(1-R2)*(n-1)/(n-k-1)
  DB=sum(diff(error)^2)/sum(error^2) #Durbin-Watson statistic of the errors
  #FE=sqrt(sum(error^2)/(n-k))
  FBias=sum(error)/sum(actual) #Forecast Bias (relative total error)
  MPE=sum(error/actual)/n #Mean Percentage Error
  MAPE=sum(abs(error/actual))/n #Mean Absolute Percentage Error
  RMSE=sqrt(sum(error^2)/n) #Root Mean Squared Error
  MAD=sum(abs(error))/n #Mean Absolute Deviation (same as MAE)
  MADP=sum(abs(error))/sum(abs(actual)) #Mean Absolute Deviation Percentage
  MASE=MAD/mean(abs(diff(actual))) #Mean Absolute Scaled Error
  RAE=sum(abs(error))/sum(abs(actual-meanval)) #Relative Absolute Error
  WMAPE=MAD/meanval #Weighted MAPE
  l=data.frame(n,meanval,stdev,CV,R2,DB,FBias,MPE,MAPE,RMSE,MAD,MADP,MASE,RAE,WMAPE)
  return(l)
}
Let’s compute these error measures for the in-sample fit of the STL model.
stl_model <- stl(cons_ts2,t.window=5,s.window="periodic", robust=TRUE)
comparison_table <-
stl_model$time.series %>% as.tibble() %>% transmute(prediction=seasonal+trend) %>% bind_cols(consumption_weekly,.)
accu(comparison_table$consumption,comparison_table$prediction)
## n meanval stdev CV R2 DB FBias
## 1 124 5431.405 421.4474 0.07759455 0.8399875 2.2028 -0.0007076739
## MPE MAPE RMSE MAD MADP MASE RAE
## 1 -0.001646773 0.01271655 167.9044 66.13872 0.01217709 0.3713248 0.1954779
## WMAPE
## 1 0.01217709
ARIMA, ETS and Prophet models are briefly introduced in this part.
ARIMA(p,d,q) is the Autoregressive Integrated Moving Average model for time series forecasting. The model combines autoregressive and moving average components: \(p\) is the order of the AR part, \(d\) the degree of differencing and \(q\) the order of the MA part. See this tutorial for more information.
Some special cases of ARIMA are equivalent to ETS.
You can automatically find the best-fitting ARIMA model with the auto.arima function in R.
arima_model <- auto.arima(cons_ts)
summary(arima_model)
## Series: cons_ts
## ARIMA(0,1,1)
##
## Coefficients:
## ma1
## -0.3235
## s.e. 0.0948
##
## sigma^2 estimated as 91957: log likelihood=-876.97
## AIC=1757.94 AICc=1758.04 BIC=1763.57
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -2.191192 300.7885 178.9205 -0.2665584 3.38283 1.004519
## ACF1
## Training set 0.03639094
For our time series data, ARIMA(0,1,1) is the best fit, which means first-order differencing (the I part) combined with MA(1). The best ARIMA order is chosen with a metric called AIC (Akaike Information Criterion), which penalizes the log-likelihood by the number of parameters used.
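To see AIC in action, we can fit a couple of candidate models by hand with the Arima function from the forecast package and compare (the alternative order below is just an example):
Arima(cons_ts, order=c(0,1,1))$aic # the model auto.arima selected
Arima(cons_ts, order=c(1,1,0))$aic # an alternative candidate; the lower AIC wins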
Exponential smoothing of time series comes in three main flavors: simple exponential smoothing, double exponential smoothing (Holt’s method) and Holt-Winters. In the most basic model, the forecast for the next period is a weighted average of the current and past observations.
\[\hat{y}_{t+1} = \alpha y_t + \alpha(1-\alpha) y_{t-1} + \alpha(1-\alpha)^2 y_{t-2} + \dots\]
The \(\alpha\) parameter between 0 and 1 is called the smoothing parameter.
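Equivalently, the forecast can be written recursively, which makes clear that the new forecast is just a weighted average of the latest observation and the previous forecast:
\[\hat{y}_{t+1} = \alpha y_t + (1-\alpha)\hat{y}_t\]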
Simple exponential smoothing produces a flat forecast: every future period gets the same point forecast as the first one, and only the upper/lower prediction limits widen with the horizon.
ses(cons_ts)
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 125 5255.688 4870.214 5641.161 4666.156 5845.219
## 126 5255.688 4790.992 5720.384 4544.996 5966.379
## 127 5255.688 4723.433 5787.942 4441.675 6069.700
## 128 5255.688 4663.533 5847.842 4350.065 6161.310
## 129 5255.688 4609.159 5902.216 4266.907 6244.468
## 130 5255.688 4559.015 5952.360 4190.220 6321.156
## 131 5255.688 4512.247 5999.129 4118.693 6392.682
## 132 5255.688 4468.251 6043.124 4051.407 6459.968
## 133 5255.688 4426.586 6084.789 3987.686 6523.689
## 134 5255.688 4386.917 6124.458 3927.018 6584.357
Holt’s method adds a trend parameter to SES. It is also called double exponential smoothing.
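In component form (standard textbook notation, where \(\ell_t\) is the level, \(b_t\) the trend, \(\beta^*\) the trend smoothing parameter and \(h\) the forecast horizon):
\[\hat{y}_{t+h|t} = \ell_t + h b_t\]
\[\ell_t = \alpha y_t + (1-\alpha)(\ell_{t-1} + b_{t-1})\]
\[b_t = \beta^* (\ell_t - \ell_{t-1}) + (1-\beta^*) b_{t-1}\]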
holt(cons_ts)
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 125 5254.098 4868.598 5639.599 4664.526 5843.671
## 126 5253.019 4788.288 5717.751 4542.274 5963.765
## 127 5251.940 4719.625 5784.256 4437.834 6066.047
## 128 5250.862 4658.608 5843.115 4345.088 6156.635
## 129 5249.783 4603.107 5896.458 4260.778 6238.787
## 130 5248.704 4551.829 5945.578 4182.926 6314.481
## 131 5247.625 4503.919 5991.330 4110.225 6385.025
## 132 5246.546 4458.775 6034.317 4041.754 6451.337
## 133 5245.467 4415.956 6074.977 3976.840 6514.094
## 134 5244.388 4375.128 6113.648 3914.970 6573.806
Holt-Winters’ method also adds seasonality parameters on top of the trend. We will forecast with it using monthly consumption data.
consumption_monthly <- consumption_weekly %>%
group_by(month=ceiling(weeknum/4)) %>% # "months" here are simply 4-week blocks, not calendar months
summarise(consumption=sum(consumption)) %>%
ungroup()
cons_ts3 <- ts(consumption_monthly$consumption,start=c(2015,12),frequency=12)
print(cons_ts3)
## Jan Feb Mar Apr May Jun Jul
## 2015
## 2016 21595.37 19672.81 20080.04 19967.91 19833.18 21808.33 22039.07
## 2017 22687.18 22735.76 21892.70 21069.86 20483.36 20647.44 21088.04
## 2018 23084.98 23109.74 23284.01 22614.36 21599.24 21206.65
## Aug Sep Oct Nov Dec
## 2015 21243.39
## 2016 23819.04 20062.93 19793.27 20412.61 22303.96
## 2017 25189.49 25359.17 22131.23 21162.00 21517.09
## 2018
hw(cons_ts3)
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Jul 2018 23093.18 21678.44 24507.93 20929.51 25256.86
## Aug 2018 25962.95 24548.21 27377.70 23799.28 28126.63
## Sep 2018 24412.27 22997.52 25827.02 22248.60 26575.94
## Oct 2018 22395.15 20980.40 23809.90 20231.48 24558.82
## Nov 2018 22272.95 20858.20 23687.70 20109.27 24436.62
## Dec 2018 23501.18 22086.43 24915.93 21337.50 25664.85
## Jan 2019 24319.72 22904.97 25734.47 22156.04 26483.39
## Feb 2019 23896.28 22481.53 25311.03 21732.61 26059.96
## Mar 2019 23809.01 22394.25 25223.76 21645.33 25972.68
## Apr 2019 23240.26 21825.51 24655.02 21076.59 25403.94
## May 2019 22581.96 21167.21 23996.71 20418.28 24745.64
## Jun 2019 22998.65 21583.90 24413.41 20834.97 25162.33
## Jul 2019 24058.46 22495.49 25621.43 21668.11 26448.81
## Aug 2019 26928.23 25365.26 28491.20 24537.87 29318.58
## Sep 2019 25377.54 23814.57 26940.51 22987.19 27767.90
## Oct 2019 23360.42 21797.45 24923.40 20970.07 25750.78
## Nov 2019 23238.22 21675.25 24801.19 20847.86 25628.58
## Dec 2019 24466.45 22903.47 26029.43 22076.09 26856.82
## Jan 2020 25284.99 23722.01 26847.97 22894.62 27675.36
## Feb 2020 24861.56 23298.58 26424.54 22471.19 27251.93
## Mar 2020 24774.28 23211.30 26337.26 22383.90 27164.66
## Apr 2020 24205.54 22642.55 25768.52 21815.16 26595.92
## May 2020 23547.23 21984.24 25110.22 21156.85 25937.62
## Jun 2020 23963.93 22400.93 25526.92 21573.53 26354.32
If you want R to determine the best parameter setting automatically, use the ets
function. It chooses the nature of the error, trend and seasonality components automatically and returns the best model it can find.
ets(cons_ts3)
## ETS(M,N,N)
##
## Call:
## ets(y = cons_ts3)
##
## Smoothing parameters:
## alpha = 0.9999
##
## Initial states:
## l = 21011.9714
##
## sigma: 0.0646
##
## AIC AICc BIC
## 561.5532 562.4421 565.8552
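Like the other models, the fitted ets object can be passed to forecast and plotted; a short sketch with an assumed 12-month horizon:
ets(cons_ts3) %>% forecast(h=12) %>% autoplot()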
Prophet is Facebook’s own forecasting library. In their own words, “At its core, the Prophet procedure is an additive regression model with four main components”: a trend, yearly and weekly seasonality, and holiday effects.
You can see the details from here.
library(prophet)
consumption_prophet <- consumption_weekly %>% transmute(ds=seq(from=as.Date("2015-12-21"),to=as.Date("2018-05-06"),by=7),y=consumption)
prophet_model <- prophet(consumption_prophet,weekly.seasonality=TRUE)
## Disabling daily seasonality. Run prophet with daily.seasonality=TRUE to override this.
## Initial log joint probability = -2.3182
## Optimization terminated normally:
## Convergence detected: relative gradient magnitude is below tolerance
future <- make_future_dataframe(prophet_model,periods=10) # note: by default this steps in days; for 10 future weeks you would pass freq="week"
prophet_forecast <- predict(prophet_model, future)
the_plot <- plot(prophet_model,prophet_forecast)
the_plot
prophet_plot_components(prophet_model,prophet_forecast)
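The forecast data frame itself holds the point forecasts and uncertainty intervals; a minimal peek at the standard columns:
prophet_forecast %>% select(ds, yhat, yhat_lower, yhat_upper) %>% tail(10)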