multivar_match computes a multivar_score for each pair of observations across data1 and data2 using several variables, then merges the two datasets by picking the highest-scoring pair for each observation in data1.
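The idea of scoring every pair and keeping the best one can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: the toy data, the column names, the exact-equality comparisons, and the weights 0.75 and 0.25 are all hypothetical.

```r
# Toy inputs; all column names and values are hypothetical.
data1 <- data.frame(id_1 = 1:2,
                    name = c("acme corp", "bolt inc"),
                    zip  = c("10001", "60601"),
                    stringsAsFactors = FALSE)
data2 <- data.frame(id_2 = 1:3,
                    name = c("acme corp", "bolt inc", "acme co"),
                    zip  = c("10001", "60601", "10001"),
                    stringsAsFactors = FALSE)

# Build every (data1 row, data2 row) pair; by = NULL gives a Cartesian product.
pairs <- merge(data1, data2, by = NULL, suffixes = c("_1", "_2"))

# Compare each variable (here: a simple 0/1 equality indicator).
pairs$name_compare <- as.numeric(pairs$name_1 == pairs$name_2)
pairs$zip_compare  <- as.numeric(pairs$zip_1 == pairs$zip_2)

# Combine the comparisons into one score using assumed weights.
wgts <- c(name = 0.75, zip = 0.25)
pairs$multivar_score <- wgts[["name"]] * pairs$name_compare +
  wgts[["zip"]] * pairs$zip_compare

# Keep the highest-scoring candidate for each row of data1.
pairs <- pairs[order(pairs$id_1, -pairs$multivar_score), ]
best  <- pairs[!duplicated(pairs$id_1), c("id_1", "id_2", "multivar_score")]
print(best)
```

In this sketch, sorting by score within each id_1 and dropping duplicates plays the role that top = 1 plays in the function signature; a minimum-score filter would be the analogue of threshold.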

multivar_match(
  data1,
  data2,
  by = NULL,
  by.x = NULL,
  by.y = NULL,
  unique_key_1,
  unique_key_2,
  logit = NULL,
  missing = FALSE,
  wgts = NULL,
  compare_type = "diff",
  blocks = NULL,
  blocks.x = NULL,
  blocks.y = NULL,
  nthread = 1,
  top = 1,
  threshold = NULL,
  suffixes = c("_1", "_2")
)

Arguments

data1

data.frame. First to-merge dataset.

data2

data.frame. Second to-merge dataset.

by

character vector. Variables to merge on (common across data1 and data2). See merge.

by.x

character string. Variable to merge on in data1. See merge

by.y

character string. Variable to merge on in data2. See merge

unique_key_1

character vector. Primary key of data1 that uniquely identifies each row (can be multiple fields)

unique_key_2

character vector. Primary key of data2 that uniquely identifies each row (can be multiple fields)

logit

a glm or lm model fitted by a logit regression on a verified dataset. See details.

missing

logical (TRUE/FALSE). Whether to treat missing (NA) values as their own binary column for each column in by. See details.

wgts

Instead of a logit model, you can supply weights used to calculate multivar_score. These can be the weights returned by calculate_weights.

compare_type

a vector with the same length as "by" that describes how to compare the variables. Options are "in", "indicator", "substr", "difference", "ratio", "stringdist", and "wgt_jaccard_dist". See the Multivar Matching Vignette for details.

blocks

variable present in both data sets to "block" on before computing scores. multivar_scores will only be computed for observations that share a block. See details.

blocks.x

name of blocking variables in data1. Cannot supply both blocks and blocks.x.

blocks.y

name of blocking variables in data2. Cannot supply both blocks and blocks.y.

nthread

integer. Number of cores to use when computing all combinations. See parallel::makeCluster()

top

integer. Number of matches to return for each observation.

threshold

numeric. Minimum score for a match to be included in the result.

suffixes

see merge

Value

a data.table containing the resulting matches, including columns from both data sets.

Details

The best way to understand this function is to see the vignette 'Multivar_matching'.

There are two ways of performing this match: either with or without a pre-trained logit. To use a logit, you must have a verified set of matches. The names of the variables in this set must match the names of the variables in the data you pass into multivar_match. Without a pre-trained logit, you must have a set of weights for each variable that you want in the comparison. These can either be made up ahead of time, or you can use a verified set of matches and calculate_weights.
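As a rough illustration of the logit route, the sketch below fits a logistic regression on a small verified set with stats::glm and uses the predicted probability as the score for new candidate pairs. It is a minimal sketch under assumptions, not the package's internal code: the verified data, the column names name_compare and zip_compare, and the match indicator are all hypothetical. Note that the candidate pairs use the same variable names as the verified set, as required above.

```r
# Hypothetical verified set: one row per hand-checked pair, with 0/1
# comparison columns and a 0/1 indicator of whether the pair is a true match.
verified <- data.frame(
  name_compare = rep(c(1, 1, 0, 0), times = c(5, 4, 3, 5)),
  zip_compare  = rep(c(1, 0, 1, 0), times = c(5, 4, 3, 5)),
  match        = c(1, 1, 1, 1, 0,   # name and zip agree
                   1, 1, 0, 0,      # only name agrees
                   1, 0, 0,         # only zip agrees
                   1, 0, 0, 0, 0)   # neither agrees
)

# Fit the logit on the verified pairs.
fit <- glm(match ~ name_compare + zip_compare,
           family = binomial(), data = verified)

# New candidate pairs must use the same variable names as the verified set.
candidates <- data.frame(name_compare = c(1, 1, 0),
                         zip_compare  = c(1, 0, 0))

# The predicted match probability serves as the score for each pair.
candidates$multivar_score <- predict(fit, newdata = candidates,
                                     type = "response")
print(candidates)
```

Without a verified set, the alternative is to fix the weights by hand and take a weighted sum of the comparison columns.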