Calculate weights for comparison variables based on \(m\) and \(u\) probabilities estimated from a verified dataset.

calculate_weights(
  data,
  variables,
  compare_type = "stringdist",
  suffixes = c("_1", "_2"),
  non_negative = FALSE
)

Arguments

data

data.frame. Verified data. Should contain every variable you want to calculate weights for, once from each dataset, with the same base name and a dataset-specific suffix.

variables

character vector. Names of the variables to calculate weights for.

compare_type

character vector. One of 'stringdist' (for string variables), 'ratio' or 'difference' (for numerics), 'indicator' (0-1 dummy indicating whether the two values are the same), 'in' (0-1 dummy indicating whether data1 is in data2), or 'substr' (numeric indicating how many digits are the same).

suffixes

character vector. Suffixes of the variables that indicate which dataset they come from. Default is c("_1", "_2").

non_negative

logical. Should the weights be constrained to be non-negative? Default is FALSE, which allows negative weights.
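
As a rough illustration of the column layout that data, variables, and suffixes describe, here is a minimal base-R sketch. The data frame `verified` and its column names are hypothetical. The sketch only shows how the expected column names follow from the variable names and suffixes. It does not call calculate_weights, and it does not show any other columns the function may require.

```r
# Hypothetical verified pairs: each variable appears once per source dataset,
# distinguished only by its suffix.
verified <- data.frame(
  name_1 = c("acme corp", "beta llc"),
  name_2 = c("acme corporation", "beta llc"),
  zip_1  = c(10001, 60601),
  zip_2  = c(10001, 60611),
  stringsAsFactors = FALSE
)

variables <- c("name", "zip")
suffixes  <- c("_1", "_2")

# Column names implied by this naming convention:
# every variable name combined with every suffix.
expected_cols <- as.vector(outer(variables, suffixes, paste0))

# Columns that are expected but absent from the data (empty if all are present).
missing_cols <- setdiff(expected_cols, names(verified))
```

Checking `missing_cols` before computing weights is one way to catch a mismatch between the suffixes in the data and the suffixes argument.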

Value

list with m probabilities, u probabilities, w weights, and settings. The settings element is the list required as input for the score_settings argument of merge_plus when using the calculated weights.

Details

This function uses the classic record linkage methodology first developed by Fellegi and Sunter. See Record Linkage. \(m\) is the probability that a given link between observations is a true match, while \(u\) is the probability that an unlinked pair of observations is a true match. calculate_weights computes a preliminary weight for each variable as $$w = \log_2 \left( \frac{m}{u} \right),$$ then rescales these weights so that they sum to 1. Variables with a higher \(m\) and a lower \(u\) therefore receive larger weights, as the definitions imply. These weights can then be passed into the score_settings argument of merge_plus or tier_match, or into the wgts argument of multivar_match.
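
The weight formula can be sketched in a few lines of base R. The \(m\) and \(u\) values below are made up for illustration. The sketch assumes that "sum to 1" means dividing each preliminary weight by the total, which this page does not spell out. It is not the package's internal code, and it ignores the non_negative option.

```r
# Hypothetical m and u probabilities for three comparison variables.
m <- c(name = 0.95, zip = 0.90, state = 0.80)
u <- c(name = 0.05, zip = 0.10, state = 0.40)

# Preliminary weight per variable: log base 2 of the m/u ratio.
w_raw <- log2(m / u)

# Assumed normalization: divide by the total so the weights sum to 1.
w <- w_raw / sum(w_raw)
```

In this example `name` has the largest ratio of \(m\) to \(u\), so it ends up with the largest share of the total weight, and `state` with the smallest.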