[arima] smp2

BOTTLE6 2021. 3. 21. 23:42
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime

import statsmodels.api as sm
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.stattools import acf, pacf
from statsmodels.tsa.seasonal import seasonal_decompose
# The old statsmodels.tsa.arima_model.ARIMA was deprecated in statsmodels 0.12 and later removed
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.statespace.sarimax import SARIMAX
from matplotlib.pyplot import rcParams
rcParams['figure.figsize'] = 10, 6

import itertools

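# Load the SMP workbook; the column header sits on the second row
path = './smp/smp.xlsx'
df = pd.read_excel(path, header=1)
df.head(3)

# Keep the mainland ('육지') price as 'smp'; drop Jeju ('제주'), combined ('통합') and the extra unnamed column
df = df.rename(columns={'Unnamed: 0': 'ym', '육지': 'smp'})
df = df.drop(['제주', '통합', 'Unnamed: 4'], axis=1)
df['ym'] = pd.to_datetime(df.ym, format='%Y-%m')
df = df[df.smp > 0]                            # keep rows with a positive price only
df = df.sort_values(by='ym').set_index('ym')   # chronological order, date index

# Plot the raw series
ax = df.plot()
plt.title("SMP")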

 

 

 

1. Checking stationarity

# determine rolling statistics

rolmean = df.rolling(window=12).mean()
rolstd = df.rolling(window=12).std()
print(rolmean, rolstd)

# plot rolling statistics
orig = plt.plot(df, color='blue', label='Original')
mean = plt.plot(rolmean, color='red', label='Rolling Mean')
std = plt.plot(rolstd, color='black', label='Rolling Std')
plt.legend(loc='best')
plt.title('Rolling Mean & Standard Deviation')
plt.show(block=False)

 

The plot shows that the rolling mean has a trend, even though the rolling standard deviation stays fairly constant over time. For a time series to be stationary, both rolling statistics (the mean and the standard deviation) must stay constant over time, so both curves should run parallel to the x-axis. Here the rolling mean does not.

To back up the impression that the series is non-stationary, let us run the Augmented Dickey-Fuller (ADF) test. Its null hypothesis is that the series has a unit root, so a p-value above 0.05 means non-stationarity cannot be rejected.

 

 

# Perform Augmented Dickey-Fuller test:
print("Result of Dickey Fuller Test:")
dftest = adfuller(df['smp'], autolag='AIC')

dfoutput = pd.Series(dftest[0:4], index=['Test Statistic','p-value', '#Lags Used', 'Number of Observations Used'])
for key, value in dftest[4].items():
    dfoutput['Critical Value (%s)'%key] = value
print(dfoutput)

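# Candidate transformations for removing the trend
df['smp_log'] = np.log(df.smp)                                # log transform
df['smp_log_ma'] = df['smp_log'].rolling(window=12).mean()    # 12-month rolling mean of the log series
df['smp_log_std'] = df['smp_log'].rolling(window=12).std()    # 12-month rolling std of the log series
df['smp_log_ma_diff'] = df['smp_log'] - df['smp_log_ma']      # log series minus its rolling mean
df['smp_diff'] = df['smp'] - df['smp'].shift(1)               # first difference of the raw series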
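
The derived columns start with missing values, which is why `dropna()` is needed before testing them. A small sketch on a hypothetical 24-month toy series (`toy` stands in for `df`; the values are made up) shows how many NaNs each transformation introduces, and that the shift-and-subtract difference matches `Series.diff()`.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly series standing in for df['smp']
idx = pd.date_range('2015-01-01', periods=24, freq='MS')
toy = pd.DataFrame({'smp': np.linspace(80.0, 126.0, 24)}, index=idx)

toy['smp_log'] = np.log(toy['smp'])
toy['smp_log_ma'] = toy['smp_log'].rolling(window=12).mean()
toy['smp_log_ma_diff'] = toy['smp_log'] - toy['smp_log_ma']
toy['smp_diff'] = toy['smp'] - toy['smp'].shift(1)

# Shift-and-subtract gives the same values as Series.diff()
same_as_diff = np.allclose(toy['smp_diff'], toy['smp'].diff(), equal_nan=True)

# Leading NaNs: window - 1 for the rolling column, 1 for the first difference
n_nan_ma_diff = int(toy['smp_log_ma_diff'].isna().sum())
n_nan_diff = int(toy['smp_diff'].isna().sum())
print(same_as_diff, n_nan_ma_diff, n_nan_diff)  # True 11 1
```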

def test_stationarity(timeseries):
    """Plot the 12-period rolling mean/std of a series and print its ADF test result."""
    #Determine rolling statistics
    movingAverage = timeseries.rolling(window=12).mean()
    movingSTD = timeseries.rolling(window=12).std()
    
    #Plot rolling statistics
    orig = plt.plot(timeseries, color='blue', label='Original')
    mean = plt.plot(movingAverage, color='red', label='Rolling Mean')
    std = plt.plot(movingSTD, color='black', label='Rolling Std')
    plt.legend(loc='best')
    plt.title('Rolling Mean & Standard Deviation')
    plt.show(block=False)
    
    #Perform Dickey–Fuller test:
    print('Results of Dickey Fuller Test:')
    dftest = adfuller(timeseries, autolag='AIC')
    dfoutput = pd.Series(dftest[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])
    for key,value in dftest[4].items():
        dfoutput['Critical Value (%s)'%key] = value
    print(dfoutput)

test_stationarity(df.smp_log)

▶ smp_log: p-value > 0.05 => non-stationary

 

test_stationarity(df.smp_log_ma_diff.dropna())

▶ smp_log_ma_diff: p-value < 0.05 => stationary

test_stationarity(df.smp_diff.dropna())

▶ smp_diff: p-value < 0.05 => stationary

 

2. Plotting the ACF and PACF

 

# Differenced data plot

plt.figure(figsize=(12,10))
plt.subplot(411)
plt.plot(df.smp)
plt.legend(["Raw SMP (Non Stationary)"])

plt.subplot(412)
plt.plot(df.smp_log,'orange')
plt.legend(["log transformed SMP (Non Stationary)"])

plt.subplot(413)
plt.plot(df.smp_log_ma_diff,'pink')
plt.legend(["log SMP minus 12-month rolling mean (stationary)"], loc='upper right')

plt.subplot(414)
plt.plot(df.smp_diff,'green')
plt.legend(["differenced SMP (Stationary)"], loc='upper right')

 

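## Interpreting smp_diff
### ACF : cuts off after lag 0
### PACF: cuts off after lag 0

### ARIMA(0,1,0)
# d=1 already applies the first difference, so fit on the raw series
# (fitting on smp_diff with d=1 would difference the data twice)
model2 = ARIMA(df.smp.values, order=(0, 1, 0))
model2_fit = model2.fit()
model2_fit.summary()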

 
