Association testing by combining several matching thresholds

Computes association test p-values from a generalized linear model at each matching threshold considered, then combines them across all thresholds into a single p-value through Fisher's method using perturbation resampling.
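For reference, Fisher's method combines K p-values through the statistic T = -2 * sum(log(p_k)). The sketch below shows that statistic together with its classical chi-square calibration, which is valid only when the p-values are independent. P-values computed on the same data at several thresholds are correlated, which is presumably why the combination here relies on perturbation resampling. The helper name `fisher_stat` and the input p-values are illustrative assumptions, not part of the package.

```r
# Fisher's combination statistic for a vector of p-values
fisher_stat <- function(pvals) {
  -2 * sum(log(pvals))
}

# Illustrative per-threshold p-values (made-up numbers, one per threshold)
pvals <- c(0.04, 0.10, 0.03, 0.20, 0.08)

T_obs <- fisher_stat(pvals)

# Classical calibration, valid only for independent p-values:
# T follows a chi-square distribution with 2 * K degrees of freedom
p_indep <- stats::pchisq(T_obs, df = 2 * length(pvals), lower.tail = FALSE)
p_indep
```

With correlated p-values the chi-square reference distribution is no longer appropriate, so `p_indep` should be read only as a baseline for comparison.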

Usage

atlas(
  match_prob,
  y,
  x,
  covar = NULL,
  thresholds = seq(from = 0.1, to = 0.9, by = 0.2),
  nb_perturb = 200,
  dist_family = c("gaussian", "binomial"),
  impute_strategy = c("weighted average", "best")
)

Arguments

match_prob: matching probabilities matrix (e.g. obtained through recordLink) of dimensions n1 x n2.
y: response variable of length n1. It should be consistent with dist_family: binary for 'binomial', continuous for 'gaussian'.
x: a matrix or a data.frame of predictors of dimensions n2 x p. An intercept is automatically added within the function.
covar: a matrix or a data.frame, of dimensions n3 x p, of variables to adjust for in the test. Default is NULL, in which case there is no adjustment.
thresholds: a vector (possibly of length 1) containing the different thresholds to use to call a match. Default is seq(from = 0.1, to = 0.9, by = 0.2).
nb_perturb: the number of perturbations used for the p-value combination. Default is 200.
dist_family: a character string indicating the distribution family for the glm. Currently, only 'gaussian' and 'binomial' are supported. Default is 'gaussian'.
impute_strategy: a character string indicating which strategy to use to impute x from the matching probabilities match_prob. Either "best" (in which case the most probable match above the threshold is imputed) or "weighted average" (in which case a weighted mean is imputed for each individual who has at least one match with a posterior probability above the threshold). Default is "weighted average".
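The two imputation strategies can be illustrated on a toy example. This is a minimal sketch of one reading of the descriptions above, not the package's internal code: the helper `impute_x` is hypothetical, and it assumes that the weights are the matching probabilities above the threshold (rescaled to sum to 1) and that individuals without any match above the threshold are left as NA.

```r
# Toy inputs: 3 individuals with a response (rows) and 4 candidate records (columns)
match_prob <- matrix(c(0.70, 0.20, 0.05, 0.05,
                       0.10, 0.10, 0.10, 0.10,
                       0.45, 0.45, 0.05, 0.05),
                     nrow = 3, byrow = TRUE)
x <- matrix(c(1, 2, 3, 4), ncol = 1)  # one predictor, n2 = 4 records
threshold <- 0.3

# Hypothetical helper: impute x for each row of match_prob
impute_x <- function(match_prob, x, threshold,
                     strategy = c("weighted average", "best")) {
  strategy <- match.arg(strategy)
  x <- as.matrix(x)
  out <- matrix(NA_real_, nrow = nrow(match_prob), ncol = ncol(x))
  for (i in seq_len(nrow(match_prob))) {
    above <- which(match_prob[i, ] > threshold)
    if (length(above) == 0) next  # assumption: no match called, keep NA
    if (strategy == "best") {
      # single most probable record among those above the threshold
      j <- above[which.max(match_prob[i, above])]
      out[i, ] <- x[j, ]
    } else {
      # assumption: weights are the above-threshold probabilities, rescaled
      w <- match_prob[i, above] / sum(match_prob[i, above])
      out[i, ] <- colSums(x[above, , drop = FALSE] * w)
    }
  }
  out
}

x_best <- impute_x(match_prob, x, threshold, strategy = "best")
x_wavg <- impute_x(match_prob, x, threshold, strategy = "weighted average")
```

In this toy example the third individual has two equally probable candidates: "best" keeps the first one, whereas "weighted average" returns the mean of both.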

Value

a list containing the following:

influencefn_pvals: a matrix of p-values obtained from influence function perturbations, with the covariates as columns and the thresholds as rows, and an additional row at the top for the combination.
wald_pvals: a matrix containing the p-values obtained from the Wald test, with the covariates as columns and the thresholds as rows.
ptbed_pvals: a list containing, for each covariate, a matrix with the nb_perturb perturbed p-values, with the different thresholds as rows.
theta_impute: a matrix of the estimated coefficients from the glm when imputing the weighted average, with the covariates as columns and the thresholds as rows.
sd_theta: a matrix of the estimated SD (from the influence function) of the coefficients from the glm when imputing the weighted average, with the covariates as columns and the thresholds as rows.
ptbed_theta_impute: a list containing, for each covariate, a matrix with the nb_perturb perturbed estimated coefficients from the glm when imputing the weighted average, with the different thresholds as rows.
impute_strategy: a character string indicating which imputation strategy was used (either "weighted average" or "best").
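The general idea of calibrating a Fisher combination by resampling can be sketched on simulated numbers. The code below computes Fisher's statistic on observed per-threshold p-values, computes it again on each column of a thresholds-by-replicates matrix of null p-values (laid out like one element of ptbed_pvals), and reports the share of replicates that are at least as extreme. Every input is simulated and the helper `combine_by_resampling` is hypothetical; the package's internal computation may differ.

```r
set.seed(1)
n_thresholds <- 5
nb_perturb <- 200

# Simulated null z-scores, correlated across thresholds via a shared component
# (0.8^2 + 0.6^2 = 1, so each z-score has unit variance)
shared <- stats::rnorm(nb_perturb)
z_null <- sapply(seq_len(nb_perturb), function(b) {
  0.8 * shared[b] + 0.6 * stats::rnorm(n_thresholds)
})
# Two-sided null p-values: thresholds as rows, replicates as columns
p_null <- 2 * stats::pnorm(-abs(z_null))

# Illustrative observed p-values, one per threshold (made-up numbers)
p_obs <- c(0.04, 0.10, 0.03, 0.20, 0.08)

# Hypothetical helper: empirical p-value of Fisher's statistic
combine_by_resampling <- function(p_obs, p_null) {
  t_obs <- -2 * sum(log(p_obs))
  t_null <- -2 * colSums(log(p_null))
  # add-one correction keeps the estimate away from exactly 0
  (1 + sum(t_null >= t_obs)) / (1 + ncol(p_null))
}

p_comb <- combine_by_resampling(p_obs, p_null)
p_comb
```

Because the reference distribution is built from replicates that share the same correlation across thresholds, no independence assumption is needed, unlike the chi-square calibration of Fisher's statistic.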

References

Zhang HG, Hejblum BP, Weber G, Palmer N, Churchill S, Szolovits P, Murphy S, Liao KP, Kohane I and Cai T, ATLAS: An automated association test using probabilistically linked health records with application to genetic studies, JAMIA, in press (2021). doi:10.1093/jamia/ocab187.

Examples

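# Number of simulation replicates (kept at 1 here; increase, e.g. to 5000,
# for a meaningful estimate of the test size)
n_sims <- 1

# One replicate: simulate x, match_prob and y independently of each other,
# then return the influence-function p-values from atlas()
mysim <- function(i){
  # predictors: n2 = 99 records, p = 2 variables
  x <- matrix(stats::rnorm(n = 99 * 2), nrow = 99, ncol = 2)
  # matching probabilities: n1 = 103 rows, n2 = 99 columns
  match_prob <- matrix(stats::rbeta(n = 103 * 99, 1, 2), nrow = 103, ncol = 99)

  # Gaussian alternative:
  #y <- rnorm(n=103, mean = 1, sd = 0.5)
  #return(atlas(match_prob, y, x, dist_family="gaussian")$influencefn_pvals)
  y <- stats::rbinom(n = 103, size = 1, prob = 0.5)
  atlas(match_prob, y, x, dist_family = "binomial")$influencefn_pvals
}
# Parallel alternative:
#res <- pbapply::pblapply(1:n_sims, mysim, cl = parallel::detectCores()-1)
res <- lapply(seq_len(n_sims), mysim)

# Proportion of replicates with p < 0.05, per threshold (rows) and per
# coefficient (columns); the last two columns of the p-value matrix are left out
keep <- seq_len(ncol(res[[1]]) - 2)
size <- sapply(keep, function(i){
  rowMeans(sapply(res, function(m){m[, i] < 0.05}), na.rm = TRUE)
})
rownames(size) <- rownames(res[[1]])
colnames(size) <- colnames(res[[1]])[keep]
size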
#>                  (Intercept) x_impute1 x_impute2
#> Combined p-value           0         0         0
#> 0.1                        0         0         0
#> 0.3                        0         0         0
#> 0.5                        0         0         0
#> 0.7                        0         0         0
#> 0.9                        0         0         0