Probabilistic Patient Record Linkage
Usage
recordLink(
data1,
data2,
dates1 = NULL,
dates2 = NULL,
eps_plus,
eps_minus,
aggreg_2ways = "mean",
min_prev = 0.01,
data1_cont2diff = NULL,
data2_cont2diff = NULL,
d_max,
use_diff = TRUE
)
Arguments
- data1
either a binary (1 or 0 values only) matrix or a binary data frame of dimension n1 x K whose rownames are the observation identifiers.
- data2
either a binary (1 or 0 values only) matrix or a binary data frame of dimension n2 x K whose rownames are the observation identifiers. Columns should be in the same order as in data1.
- dates1
matrix or data frame of dimension n1 x K including the concatenated date intervals for each corresponding diagnosis code in data1. Default is NULL, in which case dates are not used. See Details.
- dates2
matrix or data frame of dimension n2 x K including the concatenated date intervals for each corresponding diagnosis code in data2. Default is NULL, in which case dates are not used. See Details.
- eps_plus
discrepancy rate between data1 and data2
- eps_minus
discrepancy rate between data2 and data1
- aggreg_2ways
a character string indicating how to merge the two posterior probability matrices obtained for each of the 2 databases. Five possibilities are currently implemented: "maxnorm", "max", "min", "mean" and "prod". Default is "mean".
- min_prev
minimum prevalence for the variables used in matching. Default is 1%.
- data1_cont2diff
either a matrix or data frame of continuous features, such as age, for which the similarity measure uses the difference with data2_cont2diff, whose rownames are . Default is NULL.
- data2_cont2diff
either a matrix or data frame of continuous features, such as age, for which the similarity measure uses the difference with data1_cont2diff, whose rownames are . Default is NULL.
- d_max
a numeric vector of length K giving the minimum difference from which it is considered a discrepancy.
- use_diff
logical flag indicating whether continuous differentiable variables should be used in the
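To make min_prev and d_max more concrete, here is a minimal sketch on toy data. It does not call recordLink and is not the package's internal code: the objects toy1, age1, age2 and d_max_age are hypothetical, and it assumes that prevalence is the proportion of records carrying a code and that a difference equal to the threshold already counts as a discrepancy.

```r
# Toy binary data: 4 patients x 3 diagnosis codes (hypothetical values)
toy1 <- matrix(c(1, 0, 0,
                 1, 1, 0,
                 0, 1, 0,
                 1, 0, 0),
               ncol = 3, byrow = TRUE,
               dimnames = list(paste0("p", 1:4),
                               c("codeA", "codeB", "codeC")))

# Prevalence of each code: proportion of patients with the code recorded
prev <- colMeans(toy1)

# Assumption: codes whose prevalence is at least min_prev are kept
min_prev <- 0.01
toy1_kept <- toy1[, prev >= min_prev, drop = FALSE]
colnames(toy1_kept)  # codeC is never recorded, so it is dropped

# d_max-style rule for one continuous feature (age in years, hypothetical)
age1 <- c(34, 51, 67)
age2 <- c(35, 58, 67)
d_max_age <- 5

# Assumption: an absolute difference of at least d_max is a discrepancy
discrepant <- abs(age1 - age2) >= d_max_age
discrepant
```

Whether the package computes prevalence on data1, data2, or both is not stated here, so the filter above should be read only as an illustration of the thresholding idea.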
Details
Dates: when dates1 and dates2 are supplied, claiming an agreement on a diagnosis code
between data1 and data2 requires that at least one date interval matches across dates1 and dates2,
in addition to having that very same code recorded in both.
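The rule above can be sketched as a small predicate. This is an illustration only: the helper agree_on_code is hypothetical, and the encoding of date intervals as character labels such as "2011-Q3" is an assumption, since the exact format expected in dates1 and dates2 is not described here.

```r
# Hypothetical helper: agreement on one code for one pair of records.
# has1, has2: 0/1 indicators that the code is recorded in each record.
# intervals1, intervals2: character vectors of interval labels
# (assumed encoding, not necessarily the one used by the package).
agree_on_code <- function(has1, has2, intervals1, intervals2) {
  both_recorded <- has1 == 1 && has2 == 1
  shared_interval <- length(intersect(intervals1, intervals2)) > 0
  both_recorded && shared_interval
}

# Code present in both records, one interval in common: agreement
a1 <- agree_on_code(1, 1, c("2010-Q1", "2011-Q3"), c("2011-Q3"))

# Code present in both records, no interval in common: no agreement
a2 <- agree_on_code(1, 1, c("2010-Q1"), c("2012-Q2"))

# Code missing from the second record: no agreement
a3 <- agree_on_code(1, 0, c("2010-Q1"), character(0))

c(a1, a2, a3)
```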
References
Hejblum BP, Weber G, Liao KP, Palmer N, Churchill S, Szolovits P, Murphy S, Kohane I and Cai T, Probabilistic Record Linkage of De-Identified Research Datasets Using Diagnosis Codes, Scientific Data, 6:180298 (2019). doi:10.1038/sdata.2018.298.
Examples
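set.seed(123)
ncodes <- 500  # number of diagnosis codes
npat <- 200    # number of simulated patients

# simulate a code-specific incidence, then one binary record per patient
incid <- abs(rnorm(n = ncodes, mean = 0.15, sd = 0.07))
bin_codes <- rbinom(n = npat * ncodes, size = 1, prob = rep(incid, npat))
bin_codes_mat <- matrix(bin_codes, ncol = ncodes, byrow = TRUE)

# data1 holds patients 1 to 120; data2 holds patients 1 to 20 and 121 to 200,
# so the first 20 patients appear in both datasets
data1_ex <- bin_codes_mat[1:(npat/2 + npat/10), ]
data2_ex <- bin_codes_mat[c(1:(npat/10), (npat/2 + npat/10 + 1):npat), ]
rownames(data1_ex) <- paste0("ID", 1:(npat/2 + npat/10), "_data1")
rownames(data2_ex) <- paste0("ID", c(1:(npat/10), (npat/2 + npat/10 + 1):npat), "_data2")

if (interactive()) {
  res <- recordLink(data1 = data1_ex, data2 = data2_ex,
                    use_diff = FALSE, eps_minus = 0.01, eps_plus = 0.01)
  # inspect a few pairs around the boundary between shared and unshared patients
  round(res[c(1:3, 19:23), c(1:3, 19:23)], 3)
}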
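As a possible follow-up to the example, the sketch below shows one way to turn a matrix of pairwise matching probabilities into a list of candidate links. It works on a small hand-made matrix post rather than on res, and it assumes that rows index records of data1 and columns index records of data2; the 0.9 cut-off is an arbitrary illustrative choice, not a recommendation from the package.

```r
# Hand-made 3 x 3 matrix of matching probabilities (hypothetical values)
post <- matrix(c(0.98, 0.01, 0.00,
                 0.02, 0.03, 0.01,
                 0.00, 0.91, 0.05),
               nrow = 3, byrow = TRUE,
               dimnames = list(paste0("ID", 1:3, "_data1"),
                               paste0("ID", 1:3, "_data2")))

# For each data1 record, find the data2 record with the highest probability
best_col <- apply(post, 1, which.max)
best_prob <- post[cbind(seq_len(nrow(post)), best_col)]

links <- data.frame(data1_id = rownames(post),
                    data2_id = colnames(post)[best_col],
                    posterior = best_prob,
                    stringsAsFactors = FALSE)

# Keep only links above an arbitrary illustrative threshold
confirmed <- links[links$posterior >= 0.9, ]
confirmed
```

Taking the row-wise maximum does not prevent two data1 records from pointing to the same data2 record, so a one-to-one assignment step may be needed in practice.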