Probabilistic Patient Record Linkage

Usage

recordLink(
  data1,
  data2,
  dates1 = NULL,
  dates2 = NULL,
  eps_plus,
  eps_minus,
  aggreg_2ways = "mean",
  min_prev = 0.01,
  data1_cont2diff = NULL,
  data2_cont2diff = NULL,
  d_max,
  use_diff = TRUE
)

Arguments

data1: either a binary (1 or 0 values only) matrix or binary data frame of dimension n1 x K whose rownames are the observation identifiers.
data2: either a binary (1 or 0 values only) matrix or a binary data frame of dimension n2 x K whose rownames are the observation identifiers. Columns should be in the same order as in data1.
dates1: matrix or data frame of dimension n1 x K containing the concatenated date intervals for each corresponding diagnosis code in data1. Default is NULL, in which case dates are not used. See details.
dates2: matrix or data frame of dimension n2 x K containing the concatenated date intervals for each corresponding diagnosis code in data2. Default is NULL, in which case dates are not used. See details.
eps_plus: discrepancy rate between data1 and data2
eps_minus: discrepancy rate between data2 and data1
aggreg_2ways: a character string indicating how to merge the two posterior probability matrices obtained for each of the 2 databases. Five possibilities are currently implemented: "maxnorm", "max", "min", "mean" and "prod". Default is "mean".
min_prev: minimum prevalence for the variables used in matching. Default is 1%.
data1_cont2diff: either a matrix or data frame of continuous features, such as age, for which the similarity measure uses the difference with data2_cont2diff, whose rownames are the observation identifiers. Default is NULL.
data2_cont2diff: either a matrix or data frame of continuous features, such as age, for which the similarity measure uses the difference with data1_cont2diff, whose rownames are the observation identifiers. Default is NULL.
d_max: a numeric vector of length K giving the minimum difference from which it is considered a discrepancy.
use_diff: logical flag indicating whether continuous differentiable variables should be used in the matching. Default is TRUE.
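
To make the aggreg_2ways options more concrete, the sketch below combines two toy posterior probability matrices. It assumes that "mean", "max", "min" and "prod" denote the element-wise mean, maximum, minimum and product of the two matrices; this is a reading of the option names, not a description of the package internals. "maxnorm" is left out because its normalization is not described here. The objects p12, p21 and aggreg_sketch are hypothetical.

```r
# Toy posterior matrices (hypothetical values), both of dimension n1 x n2
p12 <- matrix(c(0.90, 0.10,
                0.20, 0.80), nrow = 2, byrow = TRUE)
p21 <- matrix(c(0.70, 0.30,
                0.40, 0.60), nrow = 2, byrow = TRUE)

# Hypothetical helper: element-wise merge of two matrices
# (assumed meaning of the option names, not the package's code)
aggreg_sketch <- function(a, b, how = c("mean", "max", "min", "prod")) {
  how <- match.arg(how)
  switch(how,
         mean = (a + b) / 2,
         max  = pmax(a, b),
         min  = pmin(a, b),
         prod = a * b)
}

aggreg_sketch(p12, p21, "mean")
aggreg_sketch(p12, p21, "prod")
```

Under this reading, "min" and "prod" are the more conservative choices, since a pair only scores highly when both matrices agree.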

Value

A matrix of dimension n1 x n2 containing the posterior probability of a match for each of the n1*n2 pairs of observations.

Details

Dates: when dates1 and dates2 are supplied, claiming an agreement on a diagnosis code between data1 and data2 requires that at least one date interval matches across dates1 and dates2, in addition to that same code being recorded in both.
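
Prevalence filter: the effect of min_prev can be pictured as a column filter on the binary data. The sketch below assumes that prevalence means the proportion of observations coded 1 in a column and that columns below the threshold are dropped; how the function combines the prevalences of data1 and data2 is not specified here, so only a single hypothetical matrix x is shown.

```r
set.seed(1)
# Hypothetical binary matrix: 100 patients x 4 codes,
# where the last code is never observed
x <- cbind(matrix(rbinom(300, size = 1, prob = 0.2), nrow = 100),
           rep(0, 100))
colnames(x) <- paste0("code", 1:4)

min_prev <- 0.01
prev <- colMeans(x)  # proportion of patients carrying each code
x_kept <- x[, prev >= min_prev, drop = FALSE]
colnames(x_kept)
```

Very rare codes carry little information for distinguishing patients at this sample size, which is the motivation assumed here for filtering them out.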

References

Hejblum BP, Weber G, Liao KP, Palmer N, Churchill S, Szolovits P, Murphy S, Kohane I and Cai T, Probabilistic Record Linkage of De-Identified Research Datasets Using Diagnosis Codes, Scientific Data, 6:180298 (2019). doi:10.1038/sdata.2018.298.

Examples

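set.seed(123)
ncodes <- 500  # number of diagnosis codes (columns)
npat <- 200    # total number of simulated patients
# code-specific prevalences
incid <- abs(rnorm(n=ncodes, 0.15, 0.07))
# simulate the binary patient-by-code matrix (one row per patient)
bin_codes <- rbinom(n=npat*ncodes, size=1,  prob=rep(incid, npat))
bin_codes_mat <- matrix(bin_codes, ncol=ncodes, byrow = TRUE)
# data1 holds patients 1 to 120; data2 holds patients 1 to 20 and 121 to 200,
# so the first 20 patients appear in both datasets
data1_ex <- bin_codes_mat[1:(npat/2+npat/10),]
data2_ex <- bin_codes_mat[c(1:(npat/10), (npat/2+npat/10 + 1):npat), ]
rownames(data1_ex) <- paste0("ID", 1:(npat/2+npat/10), "_data1")
rownames(data2_ex) <- paste0("ID", c(1:(npat/10), (npat/2+npat/10 + 1):npat), "_data2")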

if(interactive()){
res <- recordLink(data1 = data1_ex, data2 = data2_ex, 
                 use_diff = FALSE, eps_minus = 0.01, eps_plus = 0.01)
round(res[c(1:3, 19:23), c(1:3, 19:23)], 3)
}
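
Once a posterior probability matrix is available, one simple way to read it is to keep, for each row, the best-scoring column and retain the pair only if its probability exceeds a cutoff. The sketch below does this on a toy matrix standing in for the output of recordLink; res_toy, the threshold of 0.9 and the links data frame are all hypothetical choices for illustration, not recommendations from the package.

```r
# Toy stand-in for the matrix returned by recordLink (hypothetical values)
res_toy <- matrix(c(0.98, 0.01, 0.02,
                    0.03, 0.05, 0.95,
                    0.10, 0.20, 0.15),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(paste0("ID", 1:3, "_data1"),
                                  paste0("ID", 1:3, "_data2")))

threshold <- 0.9  # arbitrary cutoff chosen for illustration

# Column with the highest posterior probability in each row
best_col <- apply(res_toy, 1, which.max)
best_prob <- res_toy[cbind(seq_len(nrow(res_toy)), best_col)]

links <- data.frame(id1 = rownames(res_toy),
                    id2 = colnames(res_toy)[best_col],
                    posterior = best_prob)

# Keep only the pairs whose best posterior probability passes the cutoff
linked <- links[links$posterior >= threshold, ]
linked
```

This row-wise rule does not prevent two rows from being linked to the same column; enforcing one-to-one links would need an additional step.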