# Using the CVEK R package

## Short Description

Using a library of base kernels, CVEK learns the generating function from data by directly minimizing the ensemble model’s error, and tests whether the data is generated by the RKHS under the null hypothesis. Part I presents a simple example to conduct Gaussian process regression and hypothesis testing using cvek function on simulated data. Part II shows a real-world application where we use CVEK to understand whether the per capita crime rate impacts the relationship between the local socioeconomic status and the housing price at Boston, MA, U.S.A. Finally, Part III provides code showing how to visualize the interaction effect from a cvek model.

library(CVEK)
library(ggplot2)
#> Warning: package 'ggplot2' was built under R version 3.6.2
library(ggrepel)

## Tutorial using simulated dataset

### Generate Data and Define Model

We generate a simulated dataset using the $$"linear"$$ kernel, and set the relative interaction strength to be $$0.2$$. The outcome $$y_i$$ is generated as, \begin{align*} y_i=h_1(\mathbf{x}_{i, 1})+h_2(\mathbf{x}_{i, 2})+0.2 * h_{12}(\mathbf{x}_{i, 1}, \mathbf{x}_{i, 2})+\epsilon_i, \end{align*} where $$h_1$$, $$h_2$$, $$h_{12}$$ are sampled from RKHSs $$\textit{H}_1$$, $$\textit{H}_2$$, $$\textit{H}_{12}$$, generated using the corresponding linear kernel. We standardize all sampled functions to have unit form, so that $$0.2$$ represents the strength of interaction relative to the main effect.

set.seed(0726)
n <- 60 # including training and test
d <- 4
int_effect <- 0.2
data <- matrix(rnorm(n * d), ncol = d)
Z1 <- data[, 1:2]
Z2 <- data[, 3:4]

kern <- generate_kernel(method = "linear")
w <- rnorm(n)
w12 <- rnorm(n)
K1 <- kern(Z1, Z1)
K2 <- kern(Z2, Z2)
K1 <- K1 / sum(diag(K1)) # standardize kernel
K2 <- K2 / sum(diag(K2))
h0 <- K1 %*% w + K2 %*% w
h0 <- h0 / sqrt(sum(h0 ^ 2)) # standardize main effect

h1_prime <- (K1 * K2) %*% w12 # interaction effect

# standardize sampled functions to have unit norm, so that 0.2
# represents the interaction strength relative to main effect
Ks <- svd(K1 + K2)
len <- length(Ks$d[Ks$d / sum(Ks$d) > .001]) U0 <- Ks$u[, 1:len]
h1_prime_hat <- fitted(lm(h1_prime ~ U0))
h1 <- h1_prime - h1_prime_hat

h1 <- h1 / sqrt(sum(h1 ^ 2)) # standardize interaction effect
Y <- h0 + int_effect * h1 + rnorm(1) + rnorm(n, 0, 0.01)
data <- as.data.frame(cbind(Y, Z1, Z2))
colnames(data) <- c("y", paste0("z", 1:d))

data_train <- data[1:40, ]
data_test <- data[41:60, ]

The resulting data look as follows.

knitr::kable(head(data_train, 5))
y z1 z2 z3 z4
1.206494 -0.3540220 -0.8477765 -1.9983111 1.3628378
1.595668 -1.3521966 0.9002141 1.0221309 -0.7187911
1.469933 0.5276426 -0.8567797 0.0372108 0.4386086
1.593581 -1.0576862 0.7018883 0.9085519 -0.8034505
1.363133 0.9926991 0.7144279 -0.9475966 -0.2037185

Now we can apply the cvek function to conduct Gaussian process regression. Below table is a detailed list of all the arguments of the function cvek.

knitr::include_graphics("table1.pdf", auto_pdf = TRUE)

Suppose we want our kernel library to contain three kernels: $$"linear"$$, $$"polynomial"$$ with $$p=2$$, and $$"rbf"$$ with $$l=1$$ (the effective parameter for $$"polynomial"$$ is $$p$$ and the effective parameter for $$"rbf"$$ is $$l$$, so we can set anything to $$l$$ for $$"polynomial"$$ kernel and $$p$$ for $$"rbf"$$ kernel). We then first apply define_library.

kern_par <- data.frame(method = c("linear", "polynomial", "rbf"),
l = rep(1, 3), p = 1:3, stringsAsFactors = FALSE)
# define kernel library
kern_func_list <- define_library(kern_par)

The null model is then $$y \sim z1 + z2 + k(z3, z4)$$.

formula <- y ~ z1 + z2 + k(z3, z4)

### Estimation and Testing

With all these parameters specified, we can conduct Gaussian process regression.

est_res <- cvek(formula, kern_func_list = kern_func_list,
data = data_train)
est_res$lambda #> [1] 4.539993e-05 est_res$u_hat
#> [1] 0.994864707 0.000000000 0.005135293

We can see that the ensemble weight assigns $$0.99$$ to the $$"linear"$$ kernel, which is the true kernel. This illustrates the accuracy and efficiency of the CVEK method.

We next specify the testing procedure. Note that we can use the same function cvek to perform hypothesis testing, as we did for estimation, but we need to provide $$formula\_test$$, which is the user-supplied formula indicating the additional alternative effect (e.g., interactions) to test for. Specifically, we will first show how to conduct the classic score test by specifying $$test="asymp"$$, followed by a bootstrap test where we specify $$test="boot"$$, and the number of bootstrap samples $$B=200$$.

formula_test <- y ~ k(z1, z2):k(z3, z4)

cvek(formula, kern_func_list = kern_func_list,
data = data_train, formula_test = formula_test,
mode = "loocv", strategy = "stack",
beta_exp = 1, lambda = exp(seq(-10, 5)),
test = "asymp", alt_kernel_type = "ensemble",
verbose = FALSE)$pvalue #> [,1] #> [1,] 1.493613e-08 cvek(formula, kern_func_list = kern_func_list, data = data_train, formula_test = formula_test, mode = "loocv", strategy = "stack", beta_exp = 1, lambda = exp(seq(-10, 5)), test = "boot", alt_kernel_type = "ensemble", B = 200, verbose = FALSE)$pvalue
#> [1] 0

Both tests come to the same conclusion. At the significance level $$0.05$$, we reject the null hypothesis that there’s no interaction effect, which matches our data generation mechanism. Additionally, we can prediction new outcomes based on estimation result est_res.

y_pred <- predict(est_res, data_test[, 2:5])
data_test_pred <- cbind(y_pred, data_test)
y_pred y z1 z2 z3 z4
41 1.459658 1.455218 0.4551541 0.8838175 -0.1064603 -0.7294840
42 1.522598 1.592729 -0.3551146 -0.7374137 -1.0941214 -2.1262211
43 1.499522 1.519744 -2.9798328 -1.0640729 -0.2713319 -0.2268063
44 1.493907 1.517606 0.0145789 -0.2485406 -0.4154637 -0.9278958
45 1.486986 1.530861 -0.2603291 -1.0785118 -0.8478335 -0.8378207

## Detecting Nonlinear Interaction in Boston Housing Price

In this part, we show an example of using cvek test to detect nonlinear interactions between socioeconomic factors that contribute to housing price in the city of Boston, Massachusetts, USA. We consider the Boston dataset (available in the MASS package), which is collected by the U.S Census Service about the median housing price ($$medv$$) in Boston, along with additional variables describing local socioeconomic information such as per capita crime rate, proportion of non-retail business, number of rooms per household, etc. Below table lists the $$14$$ variables.

knitr::include_graphics("table2.pdf", auto_pdf = TRUE)

Here we use cvek to study whether the per capita crime rate ($$crim$$) impacts the relationship between the local socioeconomic status ($$lstat$$) and the housing price. The null model is, \begin{align*} medv \sim \mathbf{x}^\top \boldsymbol{\beta} + k(crim) + k(lstat), \end{align*} where $$\mathbf{x}^\top=(1, zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, black)$$, and $$k()$$ is specified as a semi-parametric model with a model library that includes linear and rbf kernels with $$l=1$$. This inclusion of nonlinearity (i.e., the rbf kernel) is important, since per classic results in the macroeconmics literature, the crime rates and socioeconomic status of local community are known to have nonlinear association with the local housing price harrison_hedonic_1978.

kern_par <- data.frame(method = c("linear", "rbf"),
l = rep(1, 2), p = 1:2, stringsAsFactors = FALSE)
# define kernel library
kern_func_list <- define_library(kern_par)

To this end, the hypothesis regarding whether the crime rate ($$crim$$) impacts the association between local socioeconomic status ($$lstat$$) and the housing price ($$medv$$) is equivalent to testing whether there exists a nonlinear interaction between $$crim$$ and $$lstat$$ in predicting $$medv$$, i.e., \begin{align*} \mathcal{H}_0: &\; medv \sim \mathbf{x}^\top \boldsymbol{\beta} + k(crim) + k(lstat), \\ \mathcal{H}_a: &\; medv \sim \mathbf{x}^\top \boldsymbol{\beta} + k(crim) + k(lstat) + k(crim):k(lstat). \end{align*}

To test this hypothesis using cvek, we specify the null model using formula, and specify the additional interaction term ($$k(crim):k(lstat)$$) in the alternative model using formula_test, as shown below:

formula <- medv ~ zn + indus + chas + nox + rm + age + dis +
rad + tax + ptratio + black + k(crim) + k(lstat)
formula_test <- medv ~ k(crim):k(lstat)
fit_bos <- cvek(formula, kern_func_list = kern_func_list, data = Boston,
formula_test = formula_test,
lambda = exp(seq(-3, 5)), test = "asymp")

Given the fitted object (fit_bos), the p-value of the cvek test can be extracted as below:

fit_bos$pvalue #> [,1] #> [1,] 4.614106e-06 Since $$p<0.05$$, we reject the null hypothesis that there’s no $$crim:lstat$$ interaction, and conclude that the data does suggest an impact of the crime rate on the relationship between the local socioeconomic status and the housing price. ## Visualizing Nonlinear Interaction in Boston Housing Price A versatile and sometimes the most intepretable method for understanding interaction effects is via plotting. Next we show the Boston example of how to visualize the fitted interaction from a cvek model. We visualize the interaction effects by creating five datasets: Fix all confounding variables to their means, vary $$lstat$$ in a reasonable range (i.e., from $$12.5$$ to $$17.5$$, since the original range of $$lstat$$ in Boston dataset is $$(1.73, 37.97)$$), and respectively set $$crim$$ value to its $$5\%, 25\%, 50\%, 75\%$$ and $$95\%$$ quantiles. # first fit the alternative model formula_alt <- medv ~ zn + indus + chas + nox + rm + age + dis + rad + tax + ptratio + black + k(crim):k(lstat) fit_bos_alt <- cvek(formula = formula_alt, kern_func_list = kern_func_list, data = Boston, lambda = exp(seq(-3, 5))) # mean-center all confounding variables not involved in the interaction # so that the predicted values are more easily interpreted pred_name <- c("zn", "indus", "chas", "nox", "rm", "age", "dis", "rad", "tax", "ptratio", "black") covar_mean <- apply(Boston, 2, mean) pred_cov <- covar_mean[pred_name] pred_cov_df <- t(as.data.frame(pred_cov)) lstat_list <- seq(12.5, 17.5, length.out = 100) crim_quantiles <- quantile(Boston$crim, probs = c(.05, .25, .5, .75, .95))

# crim is set to its 5% quantile
data_test1 <- data.frame(pred_cov_df, lstat = lstat_list,
crim = crim_quantiles[1])
#> Warning in data.frame(pred_cov_df, lstat = lstat_list, crim =
#> crim_quantiles[1]): row names were found from a short variable and have been
data_test1_pred <- predict(fit_bos_alt, data_test1)

# crim is set to its 25% quantile
data_test2 <- data.frame(pred_cov_df, lstat = lstat_list,
crim = crim_quantiles[2])
#> Warning in data.frame(pred_cov_df, lstat = lstat_list, crim =
#> crim_quantiles[2]): row names were found from a short variable and have been
data_test2_pred <- predict(fit_bos_alt, data_test2)

# crim is set to its 50% quantile
data_test3 <- data.frame(pred_cov_df, lstat = lstat_list,
crim = crim_quantiles[3])
#> Warning in data.frame(pred_cov_df, lstat = lstat_list, crim =
#> crim_quantiles[3]): row names were found from a short variable and have been
data_test3_pred <- predict(fit_bos_alt, data_test3)

# crim is set to its 75% quantile
data_test4 <- data.frame(pred_cov_df, lstat = lstat_list,
crim = crim_quantiles[4])
#> Warning in data.frame(pred_cov_df, lstat = lstat_list, crim =
#> crim_quantiles[4]): row names were found from a short variable and have been
data_test4_pred <- predict(fit_bos_alt, data_test4)

# crim is set to its 95% quantile
data_test5 <- data.frame(pred_cov_df, lstat = lstat_list,
crim = crim_quantiles[5])
#> Warning in data.frame(pred_cov_df, lstat = lstat_list, crim =
#> crim_quantiles[5]): row names were found from a short variable and have been
data_test5_pred <- predict(fit_bos_alt, data_test5)

# combine five sets of prediction data together
medv <- rbind(data_test1_pred, data_test2_pred, data_test3_pred,
data_test4_pred, data_test5_pred)
data_pred <- data.frame(lstat = rep(lstat_list, 5), medv = medv,
crim = rep(c("5% quantile", "25% quantile",
"50% quantile", "75% quantile",
"95% quantile"), each = 100))
data_pred$crim <- factor(data_pred$crim,
levels = c("5% quantile", "25% quantile",
"50% quantile", "75% quantile",
"95% quantile"))

data_label <- data_pred[which(data_pred$lstat == 17.5), ] data_label$value <- c("0.028%", "0.082%", "0.257%", "3.677%", "15.789%")
data_label$value <- factor(data_label$value, levels =
c("0.028%", "0.082%", "0.257%",
"3.677%", "15.789%"))

ggplot(data = data_pred, aes(x = lstat, y = medv, color = crim)) +
geom_point(size = 0.1) +
geom_text_repel(aes(label = value), data = data_label,
color = "black", size = 3.6) +
scale_colour_manual(values = c("firebrick1", "chocolate2",
"darkolivegreen3", "skyblue2",
"purple2")) +
geom_line() + theme_set(theme_bw()) +
theme(panel.grid = element_blank(),
axis.title.x = element_text(size = 12),
axis.title.y = element_text(size = 12),
legend.title = element_text(size = 12, face = "bold"),
legend.text = element_text(size = 12)) +
labs(x = "percentage of lower status",
y = "median value of owner-occupied homes (\$1000)",
col = "per capita crime rate")