Calculate weights for computing matchscore — calculate

Calculate weights for comparison variables based on $m$ and $u$ probabilities estimated from a verified dataset.

calculate_weights(
  data,
  variables,
  compare_type = "stringdist",
  suffixes = c("_1", "_2"),
  non_negative = FALSE
)

Arguments

data: data.frame. Verified data. Must contain every variable you want to calculate weights for, from both datasets, using the same base name with a dataset-specific suffix.
variables: character vector. Names of the variables you want to calculate weights for.
compare_type: character vector. One of 'stringdist' (for string variables), 'ratio' or 'difference' (for numerics), 'indicator' (0-1 dummy indicating whether the two values are the same), 'in' (0-1 dummy indicating whether data1 is in data2), or 'substr' (numeric indicating how many digits are the same).
suffixes: character vector. Suffixes of the variables that indicate which dataset they come from. Default is c('_1', '_2'); note that this differs from the default for base R merge, c('.x', '.y').
non_negative: logical. Should weights be restricted to non-negative values? The default, FALSE, allows negative weights.
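
The naming convention that ties data, variables, and suffixes together can be sketched in base R. The column names and values below are invented for illustration. The commented-out call assumes that calculate_weights is available in the session and that compare_type takes one entry per variable; neither is confirmed by this page.

```r
# Hypothetical verified pairs: each row is a pair already confirmed as a match.
verified <- data.frame(
  name_1    = c("Acme Corp", "Globex Inc", "Initech"),
  name_2    = c("ACME Corporation", "Globex", "Initech LLC"),
  country_1 = c("US", "US", "CA"),
  country_2 = c("US", "US", "CA"),
  stringsAsFactors = FALSE
)

variables <- c("name", "country")
suffixes  <- c("_1", "_2")

# Every variable must appear once per suffix, e.g. name_1 and name_2.
# outer() pastes each variable with each suffix; as.vector() flattens the result.
expected_cols <- as.vector(outer(variables, suffixes, paste0))
missing_cols  <- setdiff(expected_cols, names(verified))
stopifnot(length(missing_cols) == 0)

# Not run: assumes calculate_weights is loaded, and that compare_type
# is given once per variable (an assumption, not stated on this page).
# weights <- calculate_weights(
#   data = verified,
#   variables = variables,
#   compare_type = c("stringdist", "indicator"),
#   suffixes = suffixes
# )
```

Checking the column names up front makes a misnamed or missing suffix easy to spot before any weights are computed.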

Value

list with m probabilities, u probabilities, w weights, and settings. The settings element is the list required as input for score_settings in merge_plus when using the calculated weights.

Details

This function uses the classic record linkage methodology first developed by Fellegi and Sunter. See Record Linkage. $m$ is the probability that a given link between observations is a true match, while $u$ is the probability that an unlinked pair of observations is a true match. calculate_weights computes a preliminary weight for each variable as $$w = \log_2 \left(\frac{m}{u}\right),$$ then rescales these weights so that they sum to 1. Thus, variables with higher $m$ and lower $u$ probabilities receive higher weights, which follows from the definitions. These weights can then be passed into the score_settings argument of merge_plus or tier_match, or into the wgts argument of multivar_match.
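
The weight formula can be illustrated with a minimal base R sketch. The $m$ and $u$ values are hypothetical, and dividing by the total is one assumed way of making the weights sum to 1; the function's internal implementation may differ.

```r
# Hypothetical m and u probabilities for two variables (illustrative values only).
m <- c(name = 0.8, country = 0.4)
u <- c(name = 0.1, country = 0.2)

# Preliminary weights: w = log2(m / u).
w_raw <- log2(m / u)

# Rescale so the weights sum to 1 (assumed normalization: divide by the total).
w <- w_raw / sum(w_raw)

print(round(w, 2))
```

Here the ratios $m/u$ are 8 and 2, so the preliminary weights are 3 and 1, and the rescaled weights are 0.75 for name and 0.25 for country.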