Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                

Introduction to adaptMT package

Lihua Lei

2018-07-30

Introduction

adaptMT package is an R implementation of Adaptive P-value Thresholding (Lei and Fithian 2016). adaptMT package has two main components: a generic interface that allows users to specify the working model and algorithms to fit them, as well as a pool of easy-to-use end-to-end wrappers. The former is captured by function adapt. The latter includes adapt_glm, adapt_gam and adapt_glmnet for using GLM, GAM and L1-regularized GLM.

# install the "adaptMT" package from github.
# will be submitted to CRAN very soon.
# devtools::install_github("lihualei71/adaptMT")
library("adaptMT")

AdaPT is a flexible iterative protocol that allows almost arbitrary exploration of data while still guaranteeing the False Discovery Rate (FDR) control in finite samples under standard assumptions in literature. The pipeline of AdaPT is as follows:

The three red components allow unlimited user-flexibility, as the theoretical results in (Lei and Fithian 2016) guarantees the FDR control regardless of the implementation. To get rid of over-complication and minimize the user inputs, adaptMT package implements a slightly restrictive , though still flexible enough, pipeline which

We refer the readers to (Lei and Fithian 2016) for more details.

Convenient Wrappers in adaptMT Package

adaptMT package provides three convenient wrappers adapt_glm, adapt_gam and adapt_glmnet, which generate the results in Section 5 of (Lei and Fithian 2016). In later versions we will add adapt_gbm (generalized boosting machine), adapt_rf (random forest), etc.. We recommend reading our paper first to familiarize with the basic concepts and scanning the documentation of them; See ?adapt_glm, ?adapt_gam, ?adapt_glmnet.

Using adapt_glm

We illustrate one of the main function adapt_glm, for AdaPT with logistic-Gamma GLM as the working model, on estrogen dataset, a gene/drug response dataset from NCBI Gene Expression Omnibus (GEO). estrogen dataset consists of gene expression measurements for \(n = 22283\) genes, in response to estrogen treatments in breast cancer cells for five groups of patients, with different dosage levels and 5 trials in each. The task is to identify the genes responding to a low dosage. The p-values pi for gene i is obtained by a one-sided permutation test which evaluates evidence for a change in gene expression level between the control group (placebo) and the low-dose group. \(\{p_i : i \in [n]\}\) are then ordered according to permutation t-statistics comparing the control and low-dose data, pooled, against data from a higher dosage (with genes that appear to have a strong response at higher dosages placed earlier in the list). The code to compute p-values and the ordering can be found in Rina Barber’s website.

Here we subsample the top 5000 genes for illustration. The results for the full dataset are given in Section 5.1 of (Lei and Fithian 2016).

# load the data.
data("estrogen")
# Take the first 5000 genes
if (!requireNamespace("dplyr")){
  install.packages("dplyr")
}
#> Loading required namespace: dplyr
library("dplyr")
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
estrogen <- select(estrogen, pvals, ord_high) %>% 
  filter(ord_high <= 5000)
rownames(estrogen) <- NULL
head(estrogen, 5)
#>        pvals ord_high
#> 1 0.05011062      366
#> 2 0.71404053     4772
#> 3 0.06675970     3562
#> 4 0.40392007      790
#> 5 0.40614415     2291
summary(estrogen)
#>      pvals             ord_high   
#>  Min.   :0.000011   Min.   :   1  
#>  1st Qu.:0.076082   1st Qu.:1251  
#>  Median :0.238279   Median :2500  
#>  Mean   :0.315094   Mean   :2500  
#>  3rd Qu.:0.501009   3rd Qu.:3750  
#>  Max.   :0.999289   Max.   :5000

Now we run adapt_glm on this dataset. adapt_glm takes a conditional logistic-Gamma GLM as the working model by default. Specifically, it models the p-values as \[H_i \mid x_i \sim \pi(x_i), \quad \mathrm{logit}(\pi(x_i))= \phi(x_i)^{T}\beta\] \[-\log p_i \mid H_i, x_i\sim \left\{\begin{array}{ll} \mathrm{Exp(1)} & H_i = 0\\ \mathrm{Exp(\mu(x))} & H_i = 1\end{array}\right., \quad \frac{1}{\mu(x_i)} = \phi(x_i)^{T}\gamma\] where \(\phi(x)\) is a featurization of \(x\). In this example, we use the spline bases, given by ns function from splines package. For illustration, we choose our candidate models as the above GLMs with \(\phi(x)\) being the spline bases with equal-spaced knots and the number of knots ranging from 6-10. We use BIC to select the best model at the initial stage and use the selected model for the following model fitting.

# prepare the inputs of AdaPT
# need "splines" package to construct the formula for glm
library("splines")
pvals <- as.numeric(estrogen$pvals)
x <- data.frame(x = as.numeric(estrogen$ord))
formulas <- paste0("ns(x, df = ", 6:10, ")")
formulas
#> [1] "ns(x, df = 6)"  "ns(x, df = 7)"  "ns(x, df = 8)"  "ns(x, df = 9)" 
#> [5] "ns(x, df = 10)"

adapt_glm function provides several user-friendly tools to monitor the progress. For model selection, a progress bar will, by default, be shown in the console that indicates how much proportion of models have been fitted. This can be turned off by setting verbose$ms = FALSE. Similarly for model fitting, a progress bar can be shown in the console, though not by default, by setting verbose$fit = TRUE. Also, by default, the progress of the main process will be shown in the console that indicates (1) which target FDR level has been achieved; (2) FDPhat for each target FDR level; (3) number of rejections for each target FDR level.

# run AdaPT
res_glm <- adapt_glm(x = x, pvals = pvals, pi_formulas = formulas, mu_formulas = formulas)
#> Model selection starts!
#> Shrink the set of candidate models if it is too time-consuming.
#> 
  |                                                        
  |                                                  |   0%
  |                                                        
  |==========                                        |  20%
  |                                                        
  |====================                              |  40%
  |                                                        
  |==============================                    |  60%
  |                                                        
  |========================================          |  80%
  |                                                        
  |==================================================| 100%
#> alpha = 0.29: FDPhat 0.2899, Number of Rej. 3474
#> alpha = 0.28: FDPhat 0.28, Number of Rej. 3325
#> alpha = 0.27: FDPhat 0.2698, Number of Rej. 3084
#> alpha = 0.26: FDPhat 0.2598, Number of Rej. 2925
#> alpha = 0.25: FDPhat 0.2498, Number of Rej. 2882
#> alpha = 0.24: FDPhat 0.2399, Number of Rej. 2834
#> alpha = 0.23: FDPhat 0.23, Number of Rej. 2761
#> alpha = 0.22: FDPhat 0.2198, Number of Rej. 2703
#> alpha = 0.21: FDPhat 0.2096, Number of Rej. 2595
#> alpha = 0.2: FDPhat 0.1998, Number of Rej. 2547
#> alpha = 0.19: FDPhat 0.1897, Number of Rej. 2398
#> alpha = 0.18: FDPhat 0.1797, Number of Rej. 2287
#> alpha = 0.17: FDPhat 0.1699, Number of Rej. 2231
#> alpha = 0.16: FDPhat 0.1599, Number of Rej. 2139
#> alpha = 0.15: FDPhat 0.15, Number of Rej. 2060
#> alpha = 0.14: FDPhat 0.1399, Number of Rej. 1966
#> alpha = 0.13: FDPhat 0.1297, Number of Rej. 1889
#> alpha = 0.12: FDPhat 0.1195, Number of Rej. 1824
#> alpha = 0.11: FDPhat 0.1097, Number of Rej. 1723
#> alpha = 0.1: FDPhat 0.0999, Number of Rej. 1582
#> alpha = 0.09: FDPhat 0.0895, Number of Rej. 1430
#> alpha = 0.08: FDPhat 0.0798, Number of Rej. 1240
#> alpha = 0.07: FDPhat 0.0697, Number of Rej. 1105
#> alpha = 0.06: FDPhat 0.0594, Number of Rej. 960
#> alpha = 0.05: FDPhat 0.0495, Number of Rej. 909
#> alpha = 0.04: FDPhat 0.0389, Number of Rej. 796
#> alpha = 0.03: FDPhat 0.0297, Number of Rej. 539
#> alpha = 0.02: FDPhat 0.0179, Number of Rej. 224

plot_1d_thresh gives the plot for the rejection threshold as a function of x (must be univariate without repeated value) for given \(\alpha\). We display the plots for \(\alpha \in \{0.3, 0.25, 0.2, 0.15, 0.1, 0.05\}\).

plot_1d_lfdr gives the plot for the estimated local FDR as a function of x (must be univariate without repeated value) for given \(\alpha\). We display the plots for \(\alpha \in \{0.3, 0.25, 0.2, 0.15, 0.1, 0.05\}\). It is clear that the estimated local FDR almost remains the same, indicating that the information loss caused by partial masking is small.

par(mfrow = c(2, 3))
for (alpha in seq(0.3, 0.05, -0.05)){
  nrejs <- res_glm$nrejs[floor(alpha * 100)]
  title <- paste0("alpha = ", alpha, ", nrejs = ", nrejs)
  plot_1d_lfdr(res_glm, x, pvals, alpha, title, disp_ymax = 0.25, xlab = "order")
}

Using adapt_gam

adapt_gam has the exactly the same setting as adapt_glm except that a generalized additive model (GAM) is fitted instead of a generalized linear model. We refer the readers to (Wood 2006) for more details on GAMs.

We illustrate adapt_gam using a simulation dataset, under similar settings in Section 5.2 of (Lei and Fithian 2016). In this case, \(x_{i}\) is a two dimensional vector generated from an equi-spaced \(20\times 20\) grid in the area \([-100, 100]\times [-100, 100]\). The p-values are generated from a one-sided normal test, i.e. \[p_i = 1 - \Phi(z_i), \quad z_i\sim N(\mu_i, 1), \quad \mu_i = \left\{\begin{array}{cc}0 & (i\in \mathcal{H}_{0})\\ 2 & (i\not\in \mathcal{H}_{0})\end{array}\right.\] We set the hypotheses in a circle in the center of the grid with radius 40. The following code generates the p-values:

# Generate a 2-dim x
n <- 400
x1 <- x2 <- seq(-100, 100, length.out = 20)
x <- expand.grid(x1, x2)
colnames(x) <- c("x1", "x2")

# Generate p-values (one-sided z test)
H0 <- apply(x, 1, function(coord){sum(coord^2) < 40^2})
mu <- ifelse(H0, 2, 0)
set.seed(0)
zvals <- rnorm(n) + mu
pvals <- 1 - pnorm(zvals)

Now we apply adapt_gam on this dataset. adapt_gam is built on mgcv package which provides a more advanced version of gam in stats package. Users need to install mgcv package first.

# install.packages("mgcv")
library("mgcv")
#> Loading required package: nlme
#> 
#> Attaching package: 'nlme'
#> The following object is masked from 'package:dplyr':
#> 
#>     collapse
#> This is mgcv 1.8-24. For overview type 'help("mgcv-package")'.

By default, adapt_gam takes a logistic-Gamma GAM as the working model and the only modelling input is the (list of) partial formula(s) as specified in adapt_glm. Here we take advantage of automatic knots selection of function s, based on generalized cross-validation criterion, in mgcv function to get rid of manual model selection; See ?s for details. In particular, we consider a two-dimensional spline basis for \(x_1\) and \(x_2\).

formula <- "s(x1, x2)"
res_gam <- adapt_gam(x = x, pvals = pvals, pi_formulas = formula, mu_formulas = formula)
#> alpha = 0.83: FDPhat 0.8299, Number of Rej. 194
#> alpha = 0.82: FDPhat 0.8196, Number of Rej. 194
#> alpha = 0.81: FDPhat 0.8093, Number of Rej. 194
#> alpha = 0.8: FDPhat 0.7979, Number of Rej. 193
#> alpha = 0.79: FDPhat 0.7895, Number of Rej. 190
#> alpha = 0.78: FDPhat 0.7789, Number of Rej. 190
#> alpha = 0.77: FDPhat 0.7684, Number of Rej. 190
#> alpha = 0.76: FDPhat 0.7579, Number of Rej. 190
#> alpha = 0.75: FDPhat 0.75, Number of Rej. 172
#> alpha = 0.74: FDPhat 0.7349, Number of Rej. 166
#> alpha = 0.73: FDPhat 0.7273, Number of Rej. 165
#> alpha = 0.72: FDPhat 0.7195, Number of Rej. 164
#> alpha = 0.71: FDPhat 0.7055, Number of Rej. 163
#> alpha = 0.7: FDPhat 0.6987, Number of Rej. 156
#> alpha = 0.69: FDPhat 0.6842, Number of Rej. 152
#> alpha = 0.68: FDPhat 0.6776, Number of Rej. 152
#> alpha = 0.67: FDPhat 0.6667, Number of Rej. 150
#> alpha = 0.66: FDPhat 0.66, Number of Rej. 150
#> alpha = 0.65: FDPhat 0.6467, Number of Rej. 150
#> alpha = 0.64: FDPhat 0.64, Number of Rej. 150
#> alpha = 0.63: FDPhat 0.627, Number of Rej. 126
#> alpha = 0.62: FDPhat 0.6129, Number of Rej. 124
#> alpha = 0.61: FDPhat 0.6048, Number of Rej. 124
#> alpha = 0.6: FDPhat 0.5968, Number of Rej. 124
#> alpha = 0.59: FDPhat 0.5854, Number of Rej. 123
#> alpha = 0.58: FDPhat 0.5772, Number of Rej. 123
#> alpha = 0.57: FDPhat 0.5691, Number of Rej. 123
#> alpha = 0.56: FDPhat 0.5574, Number of Rej. 122
#> alpha = 0.55: FDPhat 0.5492, Number of Rej. 122
#> alpha = 0.54: FDPhat 0.5372, Number of Rej. 121
#> alpha = 0.53: FDPhat 0.5254, Number of Rej. 118
#> alpha = 0.52: FDPhat 0.5133, Number of Rej. 113
#> alpha = 0.51: FDPhat 0.5046, Number of Rej. 109
#> alpha = 0.5: FDPhat 0.4948, Number of Rej. 97
#> alpha = 0.49: FDPhat 0.4845, Number of Rej. 97
#> alpha = 0.48: FDPhat 0.4787, Number of Rej. 94
#> alpha = 0.47: FDPhat 0.4674, Number of Rej. 92
#> alpha = 0.46: FDPhat 0.4565, Number of Rej. 92
#> alpha = 0.45: FDPhat 0.4396, Number of Rej. 91
#> alpha = 0.44: FDPhat 0.4396, Number of Rej. 91
#> alpha = 0.43: FDPhat 0.4286, Number of Rej. 91
#> alpha = 0.42: FDPhat 0.4176, Number of Rej. 91
#> alpha = 0.41: FDPhat 0.4066, Number of Rej. 91
#> alpha = 0.4: FDPhat 0.3924, Number of Rej. 79
#> alpha = 0.39: FDPhat 0.3846, Number of Rej. 78
#> alpha = 0.38: FDPhat 0.3718, Number of Rej. 78
#> alpha = 0.37: FDPhat 0.3611, Number of Rej. 72
#> alpha = 0.36: FDPhat 0.3472, Number of Rej. 72
#> alpha = 0.35: FDPhat 0.3472, Number of Rej. 72
#> alpha = 0.34: FDPhat 0.338, Number of Rej. 71
#> alpha = 0.33: FDPhat 0.3286, Number of Rej. 70
#> alpha = 0.32: FDPhat 0.3188, Number of Rej. 69
#> alpha = 0.31: FDPhat 0.3043, Number of Rej. 69
#> alpha = 0.3: FDPhat 0.2899, Number of Rej. 69
#> alpha = 0.29: FDPhat 0.2899, Number of Rej. 69
#> alpha = 0.28: FDPhat 0.2754, Number of Rej. 69
#> alpha = 0.27: FDPhat 0.2647, Number of Rej. 68
#> alpha = 0.26: FDPhat 0.2576, Number of Rej. 66
#> alpha = 0.25: FDPhat 0.2462, Number of Rej. 65
#> alpha = 0.24: FDPhat 0.2381, Number of Rej. 63
#> alpha = 0.23: FDPhat 0.2241, Number of Rej. 58
#> alpha = 0.22: FDPhat 0.2069, Number of Rej. 58
#> alpha = 0.21: FDPhat 0.2069, Number of Rej. 58
#> alpha = 0.2: FDPhat 0.193, Number of Rej. 57
#> alpha = 0.19: FDPhat 0.1786, Number of Rej. 56
#> alpha = 0.18: FDPhat 0.1786, Number of Rej. 56
#> alpha = 0.17: FDPhat 0.1667, Number of Rej. 54
#> alpha = 0.16: FDPhat 0.1481, Number of Rej. 54
#> alpha = 0.15: FDPhat 0.1481, Number of Rej. 54
#> alpha = 0.14: FDPhat 0.1296, Number of Rej. 54
#> alpha = 0.13: FDPhat 0.1296, Number of Rej. 54
#> alpha = 0.12: FDPhat 0.1154, Number of Rej. 52
#> alpha = 0.11: FDPhat 0.098, Number of Rej. 51
#> alpha = 0.1: FDPhat 0.098, Number of Rej. 51
#> alpha = 0.09: FDPhat 0.0833, Number of Rej. 48
#> alpha = 0.08: FDPhat 0.0789, Number of Rej. 38
#> alpha = 0.07: FDPhat 0.0571, Number of Rej. 35
#> alpha = 0.06: FDPhat 0.0571, Number of Rej. 35

We visualize the results below. The left figure gives the truth where the darker pixels correspond to non-nulls and the right figure gives the rejection set by adapt_gam for \(\alpha = 0.1\).

par(mfrow = c(1, 2), mar = c(4, 4, 2, 2))
# Truth
plot(x[, 1], x[, 2], type = "p", xlab = "", ylab = "", col = ifelse(H0, "#000000", "#A9A9A9"), cex = 2, pch = 15)
# Rejection set
rej <- pvals <= res_gam$s[, 10]
plot(x[, 1], x[, 2], type = "p", xlab = "", ylab = "", col = ifelse(rej, "#800000", "#FFB6C1"), cex = 2, pch = 15)

Similar to the 1d case, plot_2d_thresh gives the plot for the rejection threshold as a function of x (must be bivariate without repeated value) for given \(\alpha\). In particular we set \(\alpha = 0.1\).

plot_2d_thresh(res_gam, x, pvals, 0.1, "P-Value Threshold", xlab = "", ylab = "")

Unlike the 1d case, visualizing the level surfaces of local FDR estimates is not an easy task due to the extra dimension. plot_2d_lfdr gives the plot for the local FDR estimates when all p-values are equal to a user-specified level targetp for given \(\alpha\).

par(mfrow = c(2, 2), mar = c(4, 4, 2, 2))
for (targetp in c(0.001, 0.005, 0.01, 0.05)){
  title <- paste0("Local FDR Estimates (p = ", targetp, ")")
  plot_2d_lfdr(res_gam, x, pvals, 0.1, title, targetp, xlab = "", ylab = "")
}

Using adapt_glmnet

adapt_glmnet is appropriate for the cases where the covariates are high-dimensional. Instead of fitting GLMs and GAMs, adapt_glmnet fits a penalized GLM in each step. Unlike adapt_glm and adapt_gam, the covariate \(x\) should be of class matrix, instead of data.frame, and adapt_glmnet always take the full matrix \(x\) as the covariates. As a result, adapt_glmnet has a simpler interface than adapt_glm and adapt_gam in that it does not need inputs like formula. We refer the readers to (Hastie and Qian 2014) for more details on glmnet.

We illustrate adapt_glmnet using a simulation dataset, under similar settings in Section 5.2 of (Lei and Fithian 2016). In this case, we generate 300 i.i.d. \(x_{i}\)’s where each of them is a \(50\)-dimensional vector with each entry generated from an uniform distribution on \([0, 1]\).

set.seed(0)
m <- 50
n <- 500
x <- matrix(runif(n * m), n, m)

We consider the p-values from a logistic Gamma GLM model with \[\pi_{1}(x_i) = \frac{\exp(-3 + 2x_{i1} + 2x_{i2})}{1 + \exp(-3 + 2x_{i1} + 2x_{i2})}, \quad \mu(x_i) = \max\{1, 2x_{i1} + 2x_{i2}\}\]

inv_logit <- function(x) {exp(x) / (1 + exp(x))}
pi <- inv_logit(x[, 1] * 2 + x[, 2] * 2 - 3)
mu <- pmax(1, x[, 1] * 2 + x[, 2] * 2)

Now we generate the null set and the p-values as follows:

H0 <- as.logical(ifelse(runif(n) < pi, 1, 0))
y <- ifelse(H0, rexp(n, 1/mu), rexp(n, 1))
pvals <- exp(-y)

To save computation time, we start adapt_glmnet with \(s_{0} = (0.15, 0.15, \ldots, 0.15)\), instead of the default value \(s_{0} = (0.45, 0.45, \ldots, 0.45)\), and set nfits = 5 to reduce the number of fitting in the process (see ?adapt for more details).

res <- adapt_glmnet(x, pvals, s0 = rep(0.15, n), nfits = 5)
#> alpha = 0.59: FDPhat 0.5847, Number of Rej. 118
#> alpha = 0.58: FDPhat 0.5763, Number of Rej. 118
#> alpha = 0.57: FDPhat 0.5641, Number of Rej. 117
#> alpha = 0.56: FDPhat 0.5556, Number of Rej. 117
#> alpha = 0.55: FDPhat 0.547, Number of Rej. 117
#> alpha = 0.54: FDPhat 0.5345, Number of Rej. 116
#> alpha = 0.53: FDPhat 0.5263, Number of Rej. 114
#> alpha = 0.52: FDPhat 0.5179, Number of Rej. 112
#> alpha = 0.51: FDPhat 0.5, Number of Rej. 96
#> alpha = 0.5: FDPhat 0.5, Number of Rej. 96
#> alpha = 0.49: FDPhat 0.4896, Number of Rej. 96
#> alpha = 0.48: FDPhat 0.4737, Number of Rej. 95
#> alpha = 0.47: FDPhat 0.4632, Number of Rej. 95
#> alpha = 0.46: FDPhat 0.4574, Number of Rej. 94
#> alpha = 0.45: FDPhat 0.4457, Number of Rej. 92
#> alpha = 0.44: FDPhat 0.4396, Number of Rej. 91
#> alpha = 0.43: FDPhat 0.4286, Number of Rej. 91
#> alpha = 0.42: FDPhat 0.4167, Number of Rej. 84
#> alpha = 0.41: FDPhat 0.4074, Number of Rej. 81
#> alpha = 0.4: FDPhat 0.3947, Number of Rej. 76
#> alpha = 0.39: FDPhat 0.3889, Number of Rej. 54
#> alpha = 0.38: FDPhat 0.3774, Number of Rej. 53
#> alpha = 0.37: FDPhat 0.3684, Number of Rej. 38
#> alpha = 0.36: FDPhat 0.3514, Number of Rej. 37
#> alpha = 0.35: FDPhat 0.3125, Number of Rej. 16
#> alpha = 0.34: FDPhat 0.3125, Number of Rej. 16
#> alpha = 0.33: FDPhat 0.3125, Number of Rej. 16
#> alpha = 0.32: FDPhat 0.3125, Number of Rej. 16
#> alpha = 0.31: FDPhat 0.25, Number of Rej. 16
#> alpha = 0.3: FDPhat 0.25, Number of Rej. 16
#> alpha = 0.29: FDPhat 0.25, Number of Rej. 16
#> alpha = 0.28: FDPhat 0.25, Number of Rej. 16
#> alpha = 0.27: FDPhat 0.25, Number of Rej. 16
#> alpha = 0.26: FDPhat 0.25, Number of Rej. 16
#> alpha = 0.25: FDPhat 0.25, Number of Rej. 16
#> alpha = 0.24: FDPhat 0.2, Number of Rej. 15
#> alpha = 0.23: FDPhat 0.2, Number of Rej. 15
#> alpha = 0.22: FDPhat 0.2, Number of Rej. 15
#> alpha = 0.21: FDPhat 0.2, Number of Rej. 15
#> alpha = 0.2: FDPhat 0.2, Number of Rej. 15
#> alpha = 0.19: FDPhat 0.1429, Number of Rej. 14
#> alpha = 0.18: FDPhat 0.1429, Number of Rej. 14
#> alpha = 0.17: FDPhat 0.1429, Number of Rej. 14
#> alpha = 0.16: FDPhat 0.1429, Number of Rej. 14
#> alpha = 0.15: FDPhat 0.1429, Number of Rej. 14
#> alpha = 0.14: FDPhat 0.0769, Number of Rej. 13
#> alpha = 0.13: FDPhat 0.0769, Number of Rej. 13
#> alpha = 0.12: FDPhat 0.0769, Number of Rej. 13
#> alpha = 0.11: FDPhat 0.0769, Number of Rej. 13
#> alpha = 0.1: FDPhat 0.0769, Number of Rej. 13
#> alpha = 0.09: FDPhat 0.0769, Number of Rej. 13
#> alpha = 0.08: FDPhat 0.0769, Number of Rej. 13

Generic Interface of adaptMT Package

The data-adaptive/human-in-the-loop part is embedded in the EM sub-pipeline:

There are two main user-specified options: - exponential family to fit E-steps, characterized by input dist, of type exp_family; - functions to fit \(\pi_{1}(x)\) and \(\mu(x)\), including those for initialization, characterized by input models, of type adapt_model.

We introduce each of them in the following subsections. For a complete list of other inputs and outputs, see ?adapt.

exp_family objects for fitting E-steps

An exp_family object is a list containing all required information for an exponential family, with mean-parameter \(\mu\) and natural parameter \(\eta = \eta(\mu)\) \[h(p; \mu) = \exp\left\{(\eta(\mu) - \eta(\mu^{*}))g(p) - (A(\mu) - A(\mu^{*}))\right\}.\] The above formulation exploits an over-parametrization by introducing \(\mu^{*}\) to guarantee that \(U([0, 1])\) belongs to the family and \(h(p; \mu^{*})\) gives its density. Here by mean-parameter we mean by convention that \[E_{\mu}[g(p)] = \mu.\]

adaptMT package provides a generic function exp_family to construct an exp_family object from scratch. The required arguments include: (1) g as a function; (2) ginv as the inverse function of g; (3) eta as a function; (4) mustar as a numeric; (5) A as a function.

The current version of adaptMT package provides two instances: beta_family() and inv_gaussian_family(). For beta_family(), \[g(p) = -\log(p), \,\,\,\, g^{-1}(x) = e^{-x}, \,\,\,\, \eta(\mu) = -\frac{1}{\mu}, \,\,\,\, \mu^{*} = 1, \,\,\,\, A(\mu) = \log(\mu).\] For inv_gaussian_family(), \[g(p) = \Phi^{-1}(1 - p), \,\,\,\, g^{-1}(x) = 1 - \Phi(x), \,\,\,\, \eta(\mu) = \mu, \,\,\,\, \mu^{*} = 0, \,\,\,\, A(\mu) = \frac{1}{2}\mu^{2}.\] We found that beta_family() consistently yields better results. Even in the case that the p-values are generated from a normal model as inv_gaussian_family() specifies, beta_family() still has comparable performance. For this reason, we take dist = beta_family() as the default input in adapt and all its wrappers.

adaptMT package provides a generic function to generate an ‘exp_family’ object based on these inputs. For instance, the following code generates beta_family(),

# g function
g <- function(x){
    tmp <- -log(x)
    pmax(pmin(tmp, -log(10^-15)), -log(1-10^-15))
}
# inverse function of g
ginv <- function(x){
    tmp <- exp(-x)
    pmax(pmin(tmp, 1 - 10^-15), 10^-15)
}
# eta function (mapping from the mean parameter to the natural parameter)
eta <- function(mu){-1/mu}
# mu* (the mean parameter corresponding to U([0, 1]))
mustar <- 1
# A function (the partition function)
A <- log
# Set a name; optional
name <- "beta"
# Set the default family as Gamma() (for glm, gam, glmnet, etc.); optional
family <- Gamma()
# Generate beta_family()
beta_family <- gen_exp_family(g, ginv, eta, mustar, A, name, family)

adapt_model objects for fitting M-steps

An adapt_model object is a list containing the fitting functions (and their arguments) in M-steps. It has three entries: name, algo and args. name gives the name of the model, which is optional; algo is a list of four functions pifun, mufun, pifun_init and mufun_init, for fitting \(\pi_{1}(x)\) and \(\mu(x)\) and initializing \(\pi_{1}(x)\) and \(\mu(x)\), respectively; args is a list of four lists piargs, muargs, piargs_init and muargs_init that give the other arguments passed to pifun, mufun, pifun_init and mufun_init. adaptMT package provides a generic function that takes the aforementioned elements as input to generate an adapt_model object, that can be passed into adapt or its variants.

Design principles and requirements for pifun and mufun

Let \(P^{(r-1)}[\cdot \mid D_{t}]\) denotes the posterior probability at step \(t\) after \((r-1)\) iteration in the EM algorithm. Here \(D_{t}\) denotes the partially-masked dataset, namely \((x_i, \tilde{p}_{t, i})_{i\in [n]}\). Let \[\hat{H}_{i}^{r} = P^{(r-1)} [H_{i} = 1\mid D_{t}], \quad \hat{b}_{i}^{r} = P^{(r-1)}[p_{i}\le 0.5 \mid D_{t}]\] To fit \(\pi_{1}(x)\) in the \(r\)-th iteration, one can construct a pseudo-dataset with \[y_{\pi}^{r} = (1, \ldots, 1, 0, \ldots, 0)\in R^{2n},\] with weights \[w_{\pi}^{r} = \left(\hat{H}_{1}^{r}, \ldots, \hat{H}_{n}^{r}, 1 - \hat{H}_{1}^{t}, \ldots, 1 - \hat{H}_{n}^{r}\right).\] Let \(xx\) be the covariate matrix after replicated for twice, namely rbind(x, x). Then one can apply any classification algorithm on \((xx, y_{\pi}^{r})\) with weights \(w_{\pi}^{r}\) and estimate \(\hat{\pi}_{1}^{r}(x)\) by the fitted probability of the first \(n\) units.

Similarly, to fit \(\mu(x)\) in the \(r\)-th iteration, one can construct a pseudo-dataset with outcomes \[y_{\mu}^{r} = (p_{1},\ldots, p_{n}, 1-p_{1}, \ldots, 1-p_{n}),\] with weights \[w_{\mu}^{r} = \left(\hat{b}_{1}^{r}\cdot \hat{H}_{1}^{r}, \ldots, \hat{b}_{n}^{r}\cdot \hat{H}_{n}^{r}, (1-\hat{b}_{1}^{r})\cdot \hat{H}_{1}^{r}, \ldots, (1-\hat{b}_{n}^{r})\cdot \hat{H}_{n}^{r}\right).\] Then one can apply any algorithm on \((xx, y_{\mu}^{r})\) with weights \(w_{\mu}^{r}\) and estimate \(\hat{\mu}(x)\) by the fitted values of the first \(n\) units. Note that this is a more general version that Algorithm 2 in our paper. It is straightforward to show that for generalized linear models, this is equivalent to Algorithm 2. Also, as mentioned in Appendix A.3 in the supplementary material, we found that it yields a more stable performance by replacing the weights to \[w_{\mu}^{r} = \left(\hat{b}_{1}^{r}, \ldots, \hat{b}_{n}^{r}, 1-\hat{b}_{1}^{r}, \ldots, 1-\hat{b}_{n}^{r}\right).\]

Note that \(\hat{H}_{i}^{r}\) and \(\hat{b}_{i}^{r}\) are generated within the algorithm so the user do not need to take care of them. Rather, the user only need to specifies the functions pifun and mufun as if \(\hat{H}_{i}^{r}\) and \(\hat{b}_{i}^{r}\) are known. More specifically, both pifun and mufun take \(x\) and some output that has been calculated in the algorithm (\(y_{\pi}^{r}\) or \(y_{\mu}^{r}\)) with some weights that also has been calculated in the algorithm (\(w_{\pi}^{r}\) or \(w_{\mu}^{r}\)). They both output a sequence of estimates (\(\pi_{1}(x)\) or \(\mu(x)\)) of length \(n\), as well as some other information (e.g. degree of freedom, for performing model selection).

In adaptMT package, two types of inputs are accepted for pifun and mufun: list(formula = , data = , weights = , ...) or list(x = , y = , weights = , ...). We refer to them as formula-data input type and x-y input type. Both of them should have ouput in the form of list(fitv = , info = ) where fitv gives the fitted values and info is a list that gives other information, including degree-of-freedom (info$df), variable importance (info$vi), etc.. If model selection is performed, the entry info$df must be given.

formula-data input format

This input type is adapted to a majority of fitting functions in R where the model is specified as a combination of a source data (data) and a formula (formula). In pifun and mufun, the user only needs to provide a partial formula without a response name. For instance, if the covariate x (in the input of adapt) is a data.frame that has a single column named “x” and one wants to fit \(\pi_{1}(x)\) using a logistic-Gamma GLM with 8 knots, the formula should be specified as "ns(x, df = 8)" or "~ ns(x, df = 8)" or ~ ns (x, df = 8). Then adapt automatically construct a data.frame in the form of data.frame(x = xx, ** = …), where ** is a response name that is automatically generated by the algorithm which is guaranteed not to be conflicted with any variable names in x, and … is the outcome \(y_{\pi}^{r}\) or \(y_{\mu}^{r}\). For technical reasons (unpleasant feature of formula objects in R), the input formula in pifun or mufun is always assumed to be a character. However, this technical issue can be simply avoid by adding a line at the beginning of pifun or mufun. See the example below.

For instance, in logistic-Gamma GLM, pifun is a logistic regression (of \(y_{\pi}^{r}\) on \(xx\) with weights \(w_{\pi}^{r}\)) and mufun is a Gamma GLM (or \(y_{\mu}^{r}\) on \(xx\) with weights \(w_{\mu}^{r}\)). The exemplary code is given as follows:

pifun_glm <- function(formula, data, weights, ...){
  formula <- as.formula(formula) # Transform the argument `formula` from character to formula
  fit <- stats::glm(formula, data, weights = weights,
                    family = quasibinomial(), ...) 
  # Here quasibinomial() family gets rid of meaningless warning messages
  fitv <- predict(fit, type = "response") # fitted values for pi1(x)
  info <- list(df = fit$rank) # degree-of-freedom being the rank of the design matrix
  return(list(fitv = fitv, info = info))
}

mufun_glm <- function(formula, data, weights, ...){
  formula <- as.formula(formula) # Transform the argument `formula` from character to formula
  fit <- stats::glm(formula, data, weights = weights,
                family = Gamma(), ...)
  fitv <- predict(fit, type = "response") # fitted values for mu(x)
  info <- list(df = fit$rank) # degree-of-freedom being the rank of the design matrix
  return(list(fitv = fitv, info = info))
}

Note that the above code is a over-simplified version of the implementation in adaptMT package because of the subtlety in fitting GLM using glm function in R stats package with inverse or log link functions (see ?glm and ?family for details), for which the initialization might fail.

x-y input format

This input type is adapted to a fraction of methods, especially for modern machine learning algorithms, where the model is specified by the outcome variable (y) and a covariate matrix (x). For instance, glmnet package falls in this category, though the glmnetUtils package provides a formula interface for it.

To illustrate, we consider the \(L_1\)-penalized logistic-Gamma GLM by fitting the same aforementioned regressions for \(\pi_{1}(x)\) and \(\mu(x)\) but with a \(L_1\) penalty and with the penalty level selected by cross-validation. Note that glmnet package does not support Gamma GLM, we use HDtweedie package instead; See ?HDtweedie for details. The exemplary code is given as follows:

pifun_glmnet <- function(x, y, weights, ...){
  fit <- glmnet::cv.glmnet(x, y, weights = weights, family = "binomial", ...) 
  fitv <- predict(fit, newx = x, s = "lambda.min", type = "response") # fitted values for pi1(x)
  beta <- coef(fit, s = "lambda.min")
  info <- list(df = sum(beta != 0)) # degree-of-freedom being the number of non-zero coefficients
  return(list(fitv = fitv, info = info))
}

mufun_glmnet <- function(x, y, weights, ...){
  fit <- HDtweedie::cv.HDtweedie(x, y, p = 2, weights = weights, ...)
  fitv <- predict(fit, newx = x, s = "lambda.min", type = "response") # fitted values for mu(x)
  beta <- coef(fit, s = "lambda.min")
  info <- list(df = sum(beta != 0)) # degree-of-freedom being the number of non-zero coefficients
  return(list(fitv = fitv, info = info))
}

Again the above code is a over-simplified for the purpose of illustration. We deal with the initialization and other numerical issues much more carefully in the real implementation.

Design principles and requirements for pifun_init and mufun_init

pifun_init and mufun_init give the functions to initialize \(\pi_{1}(x)\) and \(\mu(x)\), respectively. Both of them are required to take at least three inputs: x, pvals and s. They are required output list(fitv = , info = ), with the same meaning as in pifun and mufun.

As described in (Lei and Fithian 2016), the initialization of \(\pi_{1}(x)\) is done by defining \[J_{i} = 1 - \frac{I(s_{0}(x_{i}) \le p_{i}\le 1 - s_{0}(x_{i}))}{1 - 2s_{0}(x_i)}\] and regress \(J_{i}\) on \(x_{i}\) with the same formula as in pifun. For mufun_init we simply impute the masked pvalues ad-hocly by \(\min\{p_i, 1 - p_i\}\) and run mufun on the imputed data. The following chunk gives the exemplary code for adapt_glm. Note that the code is a over-simplified version of the implementation in adaptMT package.

pifun_init_glm <- function(x, pvals, s, 
                       formula, ...){
    # The input `formula` is a half-formula of type character, e.g. ~ ns(x, df = 8)
    J <- ifelse(
        pvals < s | pvals > 1 - s, 1,
        2 * s / (2 * s - 1)
        )

    # Complete the formula by adding the response variable J
    complete_formula <- as.formula(paste("J", formula)) 
    data <- data.frame(x, J = J)
    mod <- stats::glm(complete_formula, family = gaussian(), data = data)
    
    fitv <- as.numeric(predict(mod, type = "response"))
    info <- list(df = fit$rank)
    
    return(list(fitv = fitv, info = info))
}

mufun_init_glm <- function(x, pvals, s, ...){
    phat <- ifelse(
        pvals < s | pvals > 1 - s,
        pmin(pvals, 1 - pvals),
        pvals
        )
    yhat <- dist$g(phat)

    # Complete the formula by adding the response variable yhat
    complete_formula <- as.formula(paste("yhat", formula)) 
    data <- data.frame(x, yhat = yhat)
    mod <- stats::glm(complete_formula, family = Gamma(), data = data)
    
    fitv <- as.numeric(predict(mod, type = "response"))
    info <- list(df = fit$rank)
    
    return(list(fitv = fitv, info = info))
}

Constructing adapt_model objects by gen_adapt_model

adaptMT package provides a generic function gen_adapt_model to construct adapt_model objects. The code to generate an adapt_model object characterizing a Logistic-Gamma GLM on the spline bases of a univariate covariate \(x\) with \(5\) equi-quantile knots to fit \(\pi_{1}(x)\) and \(\mu(x)\).

library("splines") 
piargs <- muargs <- list(formula = "~ ns(x, df = 5)")
adapt_model_glm_example <- gen_adapt_model(
  pifun = pifun_glm, mufun = mufun_glm, 
  pifun_init = pifun_init_glm, mufun_init = mufun_init_glm,
  piargs = piargs, muargs = muargs,
  piargs_init = piargs, muargs_init = muargs
)

adaptMT package also provides convenient ways to generate adapt_models for GLMs, GAMs and glmnets by just specifying piargs and muargs.

adapt_model_glm_example2 <- gen_adapt_model(
  name = "glm",
  piargs = piargs, muargs = muargs
)

This yields the same adapt_model object that is implicitly used in adapt_glm. We show the equivalence on a toy dataset below.

data(estrogen)
pvals <- as.numeric(estrogen$pvals)
x <- data.frame(x = as.numeric(estrogen$ord_high))
dist <- beta_family()
inds <- (x$x <= 1000)
pvals <- pvals[inds]
x <- x[inds,,drop = FALSE]

res1 <- adapt_glm(x = x, pvals = pvals, pi_formulas = "~ ns(x, df = 5)", mu_formulas = "~ ns(x, df = 5)")
#> alpha = 0.09: FDPhat 0.0891, Number of Rej. 842
#> alpha = 0.08: FDPhat 0.0791, Number of Rej. 834
#> alpha = 0.07: FDPhat 0.0699, Number of Rej. 815
#> alpha = 0.06: FDPhat 0.059, Number of Rej. 797
#> alpha = 0.05: FDPhat 0.0499, Number of Rej. 741
#> alpha = 0.04: FDPhat 0.0397, Number of Rej. 705
#> alpha = 0.03: FDPhat 0.0299, Number of Rej. 669
#> alpha = 0.02: FDPhat 0.0193, Number of Rej. 569
res2 <- adapt(x = x, pvals = pvals, models = adapt_model_glm_example2)
#> alpha = 0.09: FDPhat 0.0891, Number of Rej. 842
#> alpha = 0.08: FDPhat 0.0791, Number of Rej. 834
#> alpha = 0.07: FDPhat 0.0699, Number of Rej. 815
#> alpha = 0.06: FDPhat 0.059, Number of Rej. 797
#> alpha = 0.05: FDPhat 0.0499, Number of Rej. 741
#> alpha = 0.04: FDPhat 0.0397, Number of Rej. 705
#> alpha = 0.03: FDPhat 0.0299, Number of Rej. 669
#> alpha = 0.02: FDPhat 0.0193, Number of Rej. 569
identical(res1$s, res2$s)
#> [1] TRUE

Bioliography

Hastie, Trevor, and Junyang Qian. 2014. “Glmnet Vignette.” Technical report, Stanford.

Lei, Lihua, and William Fithian. 2016. “AdaPT: An Interactive Procedure for Multiple Testing with Side Information.” arXiv Preprint arXiv:1609.06035.

Wood, Simon N. 2006. Generalized Additive Models: An Introduction with R. Chapman; Hall/CRC.