Probabilistic Patient Record Linkage
Usage
recordLink(
data1,
data2,
dates1 = NULL,
dates2 = NULL,
eps_plus,
eps_minus,
aggreg_2ways = "mean",
min_prev = 0.01,
data1_cont2diff = NULL,
data2_cont2diff = NULL,
d_max,
use_diff = TRUE
)
Arguments
- data1
either a binary (1 or 0 values only) matrix or binary data frame of dimension n1 x K whose rownames are the observation identifiers.
- data2
either a binary (1 or 0 values only) matrix or binary data frame of dimension n2 x K whose rownames are the observation identifiers. Columns should be in the same order as in data1.
- dates1
matrix or data frame of dimension n1 x K including the concatenated date intervals for each corresponding diagnosis code in data1. Default is NULL, in which case dates are not used.
- dates2
matrix or data frame of dimension n2 x K including the concatenated date intervals for each corresponding diagnosis code in data2. Default is NULL, in which case dates are not used. See Details.
- eps_plus
discrepancy rate between data1 and data2.
- eps_minus
discrepancy rate between data2 and data1.
- aggreg_2ways
a character string indicating how to merge the two posterior probability matrices obtained for each of the 2 databases. Five possibilities are currently implemented: "maxnorm", "max", "min", "mean" and "prod". Default is "mean".
- min_prev
minimum prevalence for the variables used in matching. Default is 0.01 (1%).
- data1_cont2diff
either a matrix or data frame of continuous features, such as age, for which the similarity measure uses the difference with data2_cont2diff, whose rownames are the observation identifiers. Default is NULL.
- data2_cont2diff
either a matrix or data frame of continuous features, such as age, for which the similarity measure uses the difference with data1_cont2diff, whose rownames are the observation identifiers. Default is NULL.
- d_max
a numeric vector of length K giving the minimum difference from which it is considered a discrepancy.
- use_diff
logical flag indicating whether continuous differentiable variables should be used in the matching. Default is TRUE.
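The input requirements listed above (binary values, identical column order, identifiers stored as rownames) can be checked before calling recordLink. The following is a minimal sketch: check_link_inputs, toy1 and toy2 are hypothetical names introduced here for illustration and are not part of the package. The column prevalences computed at the end are only a rough way to anticipate which variables might fall under min_prev; how the package itself computes prevalence is not documented on this page.

```r
# Hypothetical helper (not part of the package): sanity-check two inputs
# against the requirements described in the Arguments section.
check_link_inputs <- function(data1, data2) {
  m1 <- as.matrix(data1)
  m2 <- as.matrix(data2)
  stopifnot(
    all(m1 %in% c(0, 1)),      # binary values only
    all(m2 %in% c(0, 1)),
    ncol(m1) == ncol(m2),      # same number of variables K
    !is.null(rownames(m1)),    # observation identifiers as rownames
    !is.null(rownames(m2))
  )
  # When both inputs carry column names, they should be in the same order
  if (!is.null(colnames(m1)) && !is.null(colnames(m2))) {
    stopifnot(identical(colnames(m1), colnames(m2)))
  }
  invisible(TRUE)
}

# Toy data: 2 and 3 observations, K = 3 codes
toy1 <- matrix(c(1, 0, 1,
                 0, 0, 1), nrow = 2, byrow = TRUE,
               dimnames = list(c("a1", "a2"), c("c1", "c2", "c3")))
toy2 <- matrix(c(1, 0, 0,
                 0, 1, 1,
                 1, 0, 1), nrow = 3, byrow = TRUE,
               dimnames = list(c("b1", "b2", "b3"), c("c1", "c2", "c3")))
check_link_inputs(toy1, toy2)

# Per-column prevalence in each toy dataset, to compare against min_prev
prev1 <- colMeans(toy1)
prev2 <- colMeans(toy2)
```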
Details
Dates:
the use of dates1 and dates2 requires that at least one date interval matches across dates1 and dates2 for claiming an agreement on a diagnosis code between data1 and data2, in addition to having that very same code recorded in both.
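The agreement rule described above can be sketched for a single diagnosis code and a single pair of records. This is an illustration only: code_agrees is a hypothetical helper, and the storage format (interval labels concatenated with ";" in one string per cell) is an assumption made for the example, not necessarily the format the package expects.

```r
# Hypothetical helper: agreement on one diagnosis code for one pair of records.
# Assumes intervals are stored as a single ";"-separated string per cell
# (an illustrative format only).
code_agrees <- function(code1, code2, dates1, dates2, sep = ";") {
  # The code must be recorded in both records
  if (code1 != 1 || code2 != 1) return(FALSE)
  int1 <- strsplit(dates1, sep, fixed = TRUE)[[1]]
  int2 <- strsplit(dates2, sep, fixed = TRUE)[[1]]
  # At least one date interval must be shared
  length(intersect(int1, int2)) > 0
}

a_shared   <- code_agrees(1, 1, "2010-Q1;2010-Q3", "2010-Q3;2011-Q2")  # one shared interval
a_disjoint <- code_agrees(1, 1, "2010-Q1", "2011-Q2")                  # code in both, no shared interval
a_missing  <- code_agrees(1, 0, "2010-Q1", "")                         # code absent from one record
```

Under these assumptions only the first call counts as an agreement; the other two fail on the date condition and the code condition respectively.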
References
Hejblum BP, Weber G, Liao KP, Palmer N, Churchill S, Szolovits P, Murphy S, Kohane I and Cai T, Probabilistic Record Linkage of De-Identified Research Datasets Using Diagnosis Codes, Scientific Data, 6:180298 (2019). doi:10.1038/sdata.2018.298.
Examples
# Simulate binary diagnosis codes for 200 patients and 500 codes
set.seed(123)
ncodes <- 500
npat <- 200
incid <- abs(rnorm(n = ncodes, mean = 0.15, sd = 0.07))
bin_codes <- rbinom(n = npat * ncodes, size = 1, prob = rep(incid, npat))
bin_codes_mat <- matrix(bin_codes, ncol = ncodes, byrow = TRUE)

# Split into two datasets that share their first npat/10 patients
data1_ex <- bin_codes_mat[1:(npat/2 + npat/10), ]
data2_ex <- bin_codes_mat[c(1:(npat/10), (npat/2 + npat/10 + 1):npat), ]
rownames(data1_ex) <- paste0("ID", 1:(npat/2 + npat/10), "_data1")
rownames(data2_ex) <- paste0("ID", c(1:(npat/10), (npat/2 + npat/10 + 1):npat), "_data2")

# Link the two datasets and inspect a subset of the result
if (interactive()) {
  res <- recordLink(data1 = data1_ex, data2 = data2_ex,
                    use_diff = FALSE, eps_minus = 0.01, eps_plus = 0.01)
  round(res[c(1:3, 19:23), c(1:3, 19:23)], 3)
}
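The example above indexes res by rows and columns, which suggests a matrix with one row per observation of data1 and one column per observation of data2. Assuming its entries are posterior matching probabilities, candidate links could be extracted as sketched below. The matrix post is toy data standing in for the output of recordLink, and the 0.9 threshold is an arbitrary value chosen for illustration.

```r
# Toy posterior matrix standing in for the output of recordLink()
# (assumed layout: rows = data1 observations, columns = data2 observations)
post <- matrix(c(0.97, 0.01, 0.02,
                 0.03, 0.10, 0.05,
                 0.00, 0.92, 0.01),
               nrow = 3, byrow = TRUE,
               dimnames = list(paste0("ID", 1:3, "_data1"),
                               paste0("ID", 1:3, "_data2")))

threshold <- 0.9  # arbitrary cut-off, for illustration only

# For each data1 observation, find the most probable data2 candidate
best_col <- apply(post, 1, which.max)
best_prob <- post[cbind(seq_len(nrow(post)), best_col)]

# Keep only the pairs whose probability exceeds the threshold
links <- data.frame(
  id1 = rownames(post),
  id2 = colnames(post)[best_col],
  prob = best_prob,
  stringsAsFactors = FALSE
)[best_prob > threshold, ]
links
```

This keeps at most one candidate per data1 observation and does not prevent the same data2 observation from being selected twice; a one-to-one assignment would need an additional step.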