-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathTractor-Sales-prediction-Arima.Rmd
166 lines (105 loc) · 6.62 KB
/
Tractor-Sales-prediction-Arima.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
---
title: "Forecast Tractor Sales using Time Series ARIMA Model"
author: "Neha More"
date: "December 3, 2017"
output: html_document
---
###**AIM**
Forecast tractor sales for the next 36 months.
###**PROBLEM STATEMENT:**
PowerHorse, a tractor and farm equipment manufacturing company, was established a few years after World War II. The company has shown a consistent growth in its revenue from tractor sales since its inception. However, over the years the company has struggled to keep it's inventory and production cost down because of variability in sales and tractor demand. The management at PowerHorse is under enormous pressure from the shareholders and board to reduce the production cost.
You have to develop an ARIMA model to forecast sale / demand of tractors for next 3 years.
###**DATA DICTIONARY**
The dataset consists of 144 observations having the total monthwise sales data of Tractors for a period of past 12 years.
###Initial setup:
```{r warning=FALSE}
library(dplyr)
library(forecast)
```
###Step 1: Read and Plot tractor sales data as time series
```{r}
data=read.csv("Tractor-Sales.csv")
#144 obs and 2 variables
```
```{r}
data = ts(data[,2],start = c(2003,1),frequency = 12)
#data is converted into time series.
#NOTE:we only take second column(number of tractor sold) for further analysis.not first column.
plot(data, xlab='Years', ylab = 'Tractor Sales')
```
Clearly the above chart has an upward trend for tractors sales and there is also a seasonal component.
###Step 2: Difference data to make data stationary on mean (remove trend)
The next thing to do is to make the series stationary. This to remove the upward trend through 1st order differencing.
```{r}
plot(diff(data),ylab='Differenced Tractor Sales')
#hence trend is removed.
```
Now, the above series is not stationary on variance i.e. variation in the plot is increasing as we move towards the right of the chart. We need to make the series stationary on variance to produce reliable forecasts through ARIMA models.
###Step 3: log transform data to make data stationary on variance
One of the best ways to make a series stationary on variance is through transforming the original series through log transform. We will go back to our original tractor sales series and log transform it to make it stationary on variance.
Notice, the following series is not stationary on mean since we are using the original data without differencing.
```{r}
plot(log10(data),ylab='Log (Tractor Sales)')
```
Now the above series looks stationary on variance.
###Step 4: Difference log transform data to make data stationary on both mean and variance.
Let us look at the differenced plot for log transformed series to reconfirm if the series is actually stationary on both mean and variance.
```{r}
plot(diff(log10(data)),ylab='Differenced Log (Tractor Sales)')
```
Yes, now the above series looks stationary on both mean and variance.
This also gives us the clue that I or integrated part of our ARIMA model will be equal to 1 as 1st difference is making the series stationary.
###Step 5: Plot ACF and PACF to identify potential AR and MA model
Let us create autocorrelation factor (ACF) and partial autocorrelation factor (PACF) plots to identify patterns in the above data which is stationary on both mean and variance.
The idea is to identify presence of AR and MA components in the residuals.
```{r}
par(mfrow = c(1,2))
```
```{r}
acf(ts(diff(log10(data))),main='ACF Tractor Sales')
```
```{r}
pacf(ts(diff(log10(data))),main='PACF Tractor Sales')
```
Observation by acf and pacf plots:
Since, there are enough spikes in the plots outside the insignificant zone (dotted horizontal lines) we can conclude that the residuals are not random.
This implies that there is juice or information available in residuals to be extracted by AR and MA models.
Also, there is a seasonal component available in the residuals at the lag 12 (represented by spikes at lag 12). This makes sense since we are analyzing monthly data that tends to have seasonality of 12 months because of patterns in tractor sales.
###Step 6: Identification of best fit ARIMA model
Auto arima function in forecast package in R helps us identify the best fit ARIMA model on the fly.
```{r}
library(tseries)
ARIMAfit = auto.arima(log10(data), approximation=FALSE,trace=FALSE)
summary(ARIMAfit)
```
Observations:
The best fit model is selected based on Akaike Information Criterion (AIC) , and Bayesian Information Criterion (BIC) values. The idea is to choose a model with minimum AIC and BIC values.
As expected, our model has I (or integrated) component equal to 1. This represents differencing of order 1.
There is additional differencing of lag 12 in the above best fit model. Moreover, the best fit model has MA value of order 1. Also, there is seasonal MA with lag 12 of order 1.
###Step 7: Forecast sales using the best fit ARIMA model
The next step is to predict tractor sales for next 3 years i.e. for 2015, 2016, and 2017 through the above model.
```{r}
par(mfrow = c(1,1))
pred = predict(ARIMAfit, n.ahead = 36) #36 months makes 3 years
pred
```
```{r}
plot(data,type='l',xlim=c(2004,2018),ylim=c(1,1600),xlab = 'Year',ylab = 'Tractor Sales')
lines(10^(pred$pred),col='blue')
lines(10^(pred$pred+2*pred$se),col='orange')
lines(10^(pred$pred-2*pred$se),col='orange')
```
The above is the output with forecasted values of tractor sales in blue. Also, the range of expected error (i.e. 2 times standard deviation) is displayed with orange lines on either side of predicted blue line.
Assumptions while forecasting:
Forecasts for a long period of 3 years is an ambitious task. The major assumption here is that the underlining patterns in the time series will continue to stay the same as predicted in the model. A short-term forecasting model, say a couple of business quarters or a year, is usually a good idea to forecast with reasonable accuracy. A long-term model like the one above needs to evaluated on a regular interval of time (say 6 months). The idea is to incorporate the new information available with the passage of time in the model.
###Step 8: Plot ACF and PACF for residuals of ARIMA model to ensure no more information is left for extraction.
Finally, let's create an ACF and PACF plot of the residuals of our best fit ARIMA model i.e. ARIMA(0,1,1)(0,1,1)[12].
```{r}
par(mfrow=c(1,2))
acf(ts(ARIMAfit$residuals),main='ACF Residual')
```
```{r}
pacf(ts(ARIMAfit$residuals),main='PACF Residual')
```
Since there are no spikes outside the insignificant zone for both ACF and PACF plots we can conclude that residuals are random with no information or juice in them. Hence our ARIMA model is working good and predictions were successfully made.
### *The End*