Regression Model for Comsumption Prediction

COVID-19 pandemic has induced many economical problems. From the BEA published dataset in year 2021, we know that the personal income decreased $216.2 billion, or 1.0 percent at a monthly rate, while consumer spending increased $93.4 billion, or 0.6 percent, in September. The decrease in personal income primarily reflected the winding down of pandemic-related assistance programs (BEA, https://www.bea.gov/data/income-saving/personal-income).

For this project, we are interested in exploring how the COVID-19 pandemic affected people’s personal consumption (PCE) patterns. We know that one’s consumption is associate with one’s income, so we will build a model for PCE expenditure using income data and we want to show that there is significant change in PCE during the pandemic that cannot be explained by the income model, which works in pre-pandemic periods.

Model: Predict PCE based on multiple sources of personal income

We are interested in exploring the how the changes in the following sources of personal income might affect PCE:

Wages and salaries: wages and salaries received from employers
Supplements to wages and salaries: supplemental payments received from employers
Social Security: benefits include old-age, survivors, and disability insurance benefits that are distributed from the federal old-age and survivors insurance trust fund and the disability insurance trust fund
Medicare: benefits include hospital and supplementary medical insurance benefits that are distributed from the federal hospital insurance trust fund and the supplementary medical insurance trust fund
Personal Current Taxes: total personal tax payment

PCE:

Personal consumption expenditures (on goods and services)

We want to look at data from 1967 Quarter 1 to 2020 Quarter 4:

format_function <- function(input) {
  str_c(str_split(input, 'q')[[1]][1], ' Quarter ', str_split(input, 'q')[[1]][2])
}

pce_4720 <- read_excel("data/pce_1947_2020.xlsx", sheet = 'T20100-Q', 
                      range = 'A8:KM54') %>% 
  janitor::clean_names() %>%
  select(x2, x1967q1:x2020q4) %>%
  drop_na() %>%
  # We only want to look at several variables:
  filter(
      x2 == 'Wages and salaries' |
      x2 == 'Supplements to wages and salaries' |
      x2 == 'Social security' |
      x2 == 'Medicare' |
      x2 == 'Less: Personal current taxes' |
      x2 == 'Personal consumption expenditures'
  ) %>%
  pivot_longer(
    x1967q1:x2020q4,
    names_to = 'time',
    names_prefix = 'x',
    values_to = 'millions_of_dollars'
  ) %>%
  rename(variable = x2) %>%
  mutate(
    variable = ifelse(variable == 'Less: Personal current taxes', 'Personal current taxes', variable)
  )
  #separate(time, c("year", "quarter"), "q") %>%
  
pce_4720_formatted <- pce_4720 %>%
  mutate(
    time = map(.x = time, format_function)
  )

head(pce_4720_formatted) %>% knitr::kable(caption = 'Example of the Data from BEA')

Example of the Data from BEA
variable	time	millions_of_dollars
Wages and salaries	1967 Quarter 1	418841
Wages and salaries	1967 Quarter 2	423604
Wages and salaries	1967 Quarter 3	432009
Wages and salaries	1967 Quarter 4	441583
Wages and salaries	1968 Quarter 1	454210
Wages and salaries	1968 Quarter 2	465923

Data Overview

Let’s take a brief look at the changes of incomes and PCE over time:

fig_1 <- pce_4720_formatted %>%
  plot_ly(x = ~time, y = ~millions_of_dollars, type = "scatter", mode = "lines", color = ~variable) %>%
  layout(title = "<b> Spaghetti Plot for Personal Income and Dispositions v.s. PCE <b>") %>%
  layout(yaxis = list(title = "<i> Millions of Dollars (M) <i>"), barmode = "stack", 
         xaxis = list(title = "<i> Time <i>"),
         legend = list(title = list(text = '<b> Dispositions </b>'))) %>% 
  layout(legend = list(orientation = 'h', x = 0, y = -0.2))
fig_1

For all categories, there is a exponentially increasing trend as time proceeds. Note that there is a drop for incomes and PCE at year 2020.

We can also look at the time v.s. logarithm dollars graph. Clearly, there is a increasing trend as time proceeded, and there is a drop for multiple source of incomes at year 2020.

fig_2 <- pce_4720_formatted %>%
  plot_ly(x = ~time, y = ~log(millions_of_dollars), type = "scatter", mode = "lines", color = ~variable) %>%
  layout(title = "<b> Spaghetti Plot for log(Personal Income and Dispositions) v.s. PCE </b>", yaxis = list(title = "<i> log(Dollars) <i>"), barmode = "stack", 
         xaxis = list(title = "<i> Time <i>"),
         legend = list(title = list(text = '<i> Dispositions </i>'))) %>% 
  layout(legend = list(orientation = 'h', x = 0, y = -0.2))

fig_2

Model fitting

We hypothesize that the personal consumption expenditure pattern has changed with respect to personal income for pre- v.s. in-pandemic periods.

To test this hypothesis, we will fit a linear model using data from year 1967-2018, the “pre-pandemic” period. We will then use this model to predict for PCE outcomes for a pre-pandemic year, 2019, and a in-pandemic year, 2020, respectively. Then we will compare if there is significant difference between the root mean square errors (RMSE) between the pre-pandemic and in-pandemic periods. If there is significant difference between them, we can conclude that there is enough evidence showing that the personal consumption expenditures patterns has changed during the pandemic.

Fit a MLR model using data from 1967 - 2018 (pre-COVID-19 pandemic)

pce_4718_by_dis <-
  pce_4720 %>%
  filter(!str_detect(time, '2019|2020')) %>%
  pivot_wider(
    names_from = variable,
    values_from = millions_of_dollars
  ) %>%
  janitor::clean_names()

test1 <- lm(personal_consumption_expenditures ~ wages_and_salaries + social_security + supplements_to_wages_and_salaries + personal_current_taxes + medicare, data = pce_4718_by_dis)

test1 %>% broom::tidy() %>% knitr::kable(digits = 3, caption = "MLR model PCE v.s. income sources")

MLR model PCE v.s. income sources
term	estimate	std.error	statistic	p.value
(Intercept)	21183.961	15610.049	1.357	0.176
wages_and_salaries	0.878	0.052	16.757	0.000
social_security	-0.456	0.186	-2.452	0.015
supplements_to_wages_and_salaries	1.966	0.172	11.449	0.000
personal_current_taxes	-0.176	0.070	-2.526	0.012
medicare	4.030	0.154	26.231	0.000

The adjusted $R^2$ value for the MLR model is 0.9998252, meaning that the our model ‘fits’ the data well.

To further improve the model, we want to determine whether we should transform our variables for a better fit result or not. To check this, we will use Box-Cox transformation:

plot(test1)

MASS::boxcox(test1)

The QQ-plot indicates the distribution of residuals are normal. The Scale-location plot shows that the residuals are spread equally along the range of predictors. And the residuals-leverage plot shows that no outlier is influential.

Therefore, the relationships between income sources and PCE are best approximated by linear function. We can use MLR model in the following analysis.

Prediction

We will predict the PCE for year 2019 and year 2020 using our MLR model:

pce_2019 <-
  pce_4720 %>%
  filter(
    str_detect(time, '2019')
  ) %>%
  pivot_wider(
    names_from = variable,
    values_from = millions_of_dollars
  ) %>%
  janitor::clean_names()
  
pce_2019_pred <- predict(test1,pce_2019)

#RMSQ
rmse19 <- rmse(test1, pce_2019)

pce_2020 <-
  pce_4720 %>%
  filter(
    str_detect(time, '2020')
  ) %>%
  pivot_wider(
    names_from = variable,
    values_from = millions_of_dollars
  ) %>%
  janitor::clean_names()
  
pce_2020_pred <- predict(test1,pce_2020)

#RMSQ
rmse20 <- rmse(test1, pce_2020)

RMSE of year 2019 prediction, in dollars: 1.4277563^{5}
RMSE of year 2020 prediction, in dollars: 7.7584642^{5}

Testing on the difference

Hypothesis: there is significant difference between our model prediction of PCEs for pre- and in-pandemic periods (year 2019, year 2020)

We need to perform a paired t-test to evaluate if there is difference between the two RMSE. To do this, we need to bootstrap a set of 1000 samples from each year first.

# Bootstrapping
pce_4720_raw <- read_excel("data/pce_1947_2020.xlsx", sheet = 'T20100-Q', 
                      range = 'A8:KM54') %>% 
  janitor::clean_names() %>%
  select(x2, x1967q1:x2020q4) %>%
  drop_na() %>%
  # We only want to look at several variables:
  filter(
      x2 == 'Wages and salaries' |
      x2 == 'Supplements to wages and salaries' |
      x2 == 'Social security' |
      x2 == 'Medicare' |
      x2 == 'Less: Personal current taxes' |
      x2 == 'Personal consumption expenditures'
  ) %>%
  rename(variable = x2)

rownames(pce_4720_raw) <- pull(pce_4720_raw, variable)

pce_4720_raw <- mutate(pce_4720_raw, variable = NULL)

pce_4720_wide <- as.data.frame(t(as.matrix(pce_4720_raw))) %>%
  janitor::clean_names() %>%
  rename(personal_current_taxes = less_personal_current_taxes)

# Select sample dataset
pce_2019_df <- subset(pce_4720_wide, str_detect(rownames(pce_4720_wide), '2019'))
pce_2020_df <- subset(pce_4720_wide, str_detect(rownames(pce_4720_wide), '2020'))

boot_straps_19 = 
  pce_2019_df %>% 
  modelr::bootstrap(n = 1000)

boot_straps_20 = 
  pce_2020_df %>% 
  modelr::bootstrap(n = 1000)

boot_straps_19$strap[[1]] %>% 
  knitr::kable(caption = "A bootstrap example for year 2019") %>%
  kableExtra::kable_styling(latex_options="scale_down")

A bootstrap example for year 2019
	wages_and_salaries	supplements_to_wages_and_salaries	social_security	medicare	personal_current_taxes	personal_consumption_expenditures
x2019q1	9228655	2106633	1018942	767353	2170693	14276587
x2019q4	9422482	2142356	1043049	797912	2221193	14759183
x2019q3	9311314	2126644	1034273	789893	2197141	14645313
x2019q1.1	9228655	2106633	1018942	767353	2170693	14276587

boot_straps_20$strap[[1]] %>% 
  knitr::kable(caption = "A bootstrap example for year 2020") %>%
  kableExtra::kable_styling(latex_options="scale_down")

A bootstrap example for year 2020
	wages_and_salaries	supplements_to_wages_and_salaries	social_security	medicare	personal_current_taxes	personal_consumption_expenditures
x2020q2	8908788	2040680	1075420	824058	2096512	13097348
x2020q4	9546037	2158109	1089577	860594	2242281	14537004
x2020q1	9526063	2148335	1068472	804655	2252395	14545460
x2020q2.1	8908788	2040680	1075420	824058	2096512	13097348

Then we can compute the RMSE and perform a t-test:

#function for RMSE
rmse_compute <- function(data){
  data <- as.data.frame(data)
  return(rmse(test1, data))
}

# compute RMSE
rmse_19 <- boot_straps_19 %>%
  mutate(
    rmse = map_dbl(strap, rmse_compute)
  ) %>%
  select(rmse)
  
rmse_20 <- boot_straps_20 %>%
  mutate(
    rmse = map_dbl(strap, rmse_compute)
  ) %>%
  select(rmse)  
# Perform test
t.test(rmse_19,rmse_20) %>% broom::tidy() %>% knitr::kable(caption = "T-Test Table: Welch Two Sample t-test") %>% kableExtra::kable_styling(latex_options="scale_down")

T-Test Table: Welch Two Sample t-test
estimate	estimate1	estimate2	statistic	p.value	parameter	conf.low	conf.high	method	alternative
-619703.1	139800	759503	-112.8252	0	1093.99	-630480.3	-608925.8	Welch Two Sample t-test	two.sided

We reject the null hypothesis and conclude that there is significant difference between our model prediction of PCE for year 2019 and year 2020. This indicates that people’s spending pattern has changed during the pandemic, and this change cannot fully explain by variations of their incomes.

Therefore, there is evidence showing that the COVID-19 pandemic has affected people’s spending patterns.