calculate_weights.Rd
Calculate weights for comparison variables based on \(m\) and \(u\) probabilities estimated from a verified dataset.
calculate_weights(
data,
variables,
compare_type = "stringdist",
suffixes = c("_1", "_2"),
non_negative = FALSE
)
data: data.frame. Verified data. Should contain every variable you want to calculate weights for, from both datasets, with the same base name and a dataset-specific suffix.
variables: character vector. Names of the variables you want to calculate weights for.
compare_type: character vector. One of 'stringdist' (for string variables); 'ratio' or 'difference' (for numerics); 'indicator' (0-1 dummy indicating whether the two values are the same); 'in' (0-1 dummy indicating whether the value from data1 is in data2); or 'substr' (numeric indicating how many digits are the same). Default is 'stringdist'.
suffixes: character vector. Suffixes of the variables that indicate which dataset they come from. Default is c('_1', '_2'), as shown in the usage above (base R merge uses c('.x', '.y')).
non_negative: logical. If TRUE, negative weights are not allowed. Default is FALSE.
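As a rough illustration of the naming convention the arguments describe, the sketch below builds a small verified data.frame and checks that every variable appears once per suffix. The data, the column names, and the reading that each row is one verified pair are assumptions made for the example, not taken from the package.

```r
# Hypothetical verified data: each row is assumed to be one verified pair,
# with the same variable from each dataset distinguished by its suffix.
verified <- data.frame(
  name_1 = c("acme corp", "beta llc", "gamma inc"),
  name_2 = c("acme corporation", "beta llc", "gamma inc."),
  zip_1  = c(10001, 60601, 94105),
  zip_2  = c(10001, 60601, 94107),
  stringsAsFactors = FALSE
)

variables <- c("name", "zip")
suffixes  <- c("_1", "_2")

# Every variable/suffix combination should exist as a column.
expected_cols <- as.vector(outer(variables, suffixes, paste0))
missing_cols  <- setdiff(expected_cols, names(verified))

if (length(missing_cols) > 0) {
  stop("Missing columns: ", paste(missing_cols, collapse = ", "))
}
```

A data.frame that passes this check has the shape the data, variables, and suffixes arguments describe.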
list containing the m probabilities, the u probabilities, the w weights, and settings, the list required as input for the score_settings argument of merge_plus when using the calculated weights.
This function uses the classic record linkage methodology first developed by Fellegi and Sunter.
See Record Linkage. \(m\) is the
probability that a variable agrees for a pair of observations that is a true match, while \(u\) is the probability
that it agrees for a pair of observations that is not a true match. calculate_weights
computes a preliminary weight for each variable by computing
$$w = \log_2 \left(\frac{m}{u}\right),$$
then rescaling these weights so that they sum to 1. Thus, variables with higher \(m\)
and lower \(u\) probabilities get higher weights, which is consistent with
the definitions. These weights can then be passed into the score_settings
argument of merge_plus or tier_match, or into the wgts argument of multivar_match.
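The weight formula can be sketched in a few lines of base R. This is only an illustration of the arithmetic described above, not the package's internal code; the \(m\) and \(u\) values and the variable names are made up for the example.

```r
# Hypothetical m and u probabilities for three comparison variables.
m <- c(name = 0.95, zip = 0.90, state = 0.99)
u <- c(name = 0.01, zip = 0.05, state = 0.20)

# Preliminary weight for each variable: log base 2 of the m/u ratio.
w_raw <- log2(m / u)

# Rescale the preliminary weights so that they sum to 1.
w <- w_raw / sum(w_raw)

print(round(w, 3))
```

In this toy input, name has the largest ratio of \(m\) to \(u\) and so receives the largest share of the total weight, while state, whose \(u\) is relatively high, receives the smallest.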