multivar_match.Rd
multivar_match
Computes a multivar_score for each pair of observations across data1 (x) and
data2 (y) using several variables, then executes a merge by picking the
highest-scoring pair for each observation in x.
multivar_match(
data1,
data2,
by = NULL,
by.x = NULL,
by.y = NULL,
unique_key_1,
unique_key_2,
logit = NULL,
missing = FALSE,
wgts = NULL,
compare_type = "diff",
blocks = NULL,
blocks.x = NULL,
blocks.y = NULL,
nthread = 1,
top = 1,
threshold = NULL,
suffixes = c("_1", "_2")
)
data1: data.frame. First dataset to merge.
data2: data.frame. Second dataset to merge.
by: character vector. Variables to merge on (common to data1 and data2). See merge.
by.x: character string. Variable to merge on in data1. See merge.
by.y: character string. Variable to merge on in data2. See merge.
unique_key_1: character vector. Primary key of data1 that uniquely identifies each row (can be multiple fields).
unique_key_2: character vector. Primary key of data2 that uniquely identifies each row (can be multiple fields).
logit: a glm or lm model fitted by a logit regression on a verified dataset. See details.
missing: logical. Whether to treat missing (NA) observations as their own binary column for each column in by. See details.
wgts: weights used to calculate multivar_score, supplied instead of a model in logit. Can be the weights returned by calculate_weights.
compare_type: a vector with the same length as "by" that describes how to compare the variables. Options are "in", "indicator", "substr", "difference", "ratio", "stringdist", and "wgt_jaccard_dist". See the Multivar Matching Vignette for details.
blocks: variable present in both datasets to "block" on before computing scores. A multivar_score is only computed for pairs of observations that share a block. See details.
blocks.x: name of the blocking variables in x. Cannot be supplied together with blocks.
blocks.y: name of the blocking variables in y. Cannot be supplied together with blocks.
nthread: integer. Number of cores to use when computing all combinations. See parallel::makeCluster().
top: integer. Number of matches to return for each observation.
threshold: numeric. Minimum score for a match to be included in the result.
suffixes: see merge.
A data.table containing the resulting matches, including columns from both datasets.
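To make the interaction of blocks, threshold, and top concrete, here is a minimal base R sketch. It is not the function's implementation: the toy data, the column names, and the stand-in score are all assumptions chosen for illustration, and the real multivar_score comes from a logit model or a set of weights.

```r
# Hypothetical toy data; column names are illustrative only.
data1 <- data.frame(id_1 = 1:3,
                    state = c("NY", "NY", "CA"),
                    amount = c(100, 250, 80),
                    stringsAsFactors = FALSE)
data2 <- data.frame(id_2 = 11:14,
                    state = c("NY", "NY", "CA", "TX"),
                    amount = c(105, 240, 300, 80),
                    stringsAsFactors = FALSE)

# Blocking: an inner join on the block variable means only rows that
# share a state are ever paired, so id_2 = 14 (TX) is never scored.
pairs <- merge(data1, data2, by = "state", suffixes = c("_1", "_2"))

# Stand-in score in (0, 1]: closer amounts score higher (an assumption,
# not the package's scoring rule).
pairs$score <- 1 / (1 + abs(pairs$amount_1 - pairs$amount_2))

# threshold: drop candidate pairs scoring below the cutoff.
threshold <- 0.05
kept <- pairs[pairs$score >= threshold, ]

# top: keep the n best-scoring candidates for each row of data1.
top <- 1
kept <- kept[order(kept$id_1, -kept$score), ]
rank_in_group <- ave(kept$score, kept$id_1, FUN = seq_along)
result <- kept[rank_in_group <= top, c("id_1", "id_2", "score")]
print(result)
```

In this sketch, a row of data1 whose best candidate falls below the threshold simply drops out of the result, which is one reason to choose the threshold with the score scale in mind.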
The best way to understand this function is to see the vignette 'Multivar_matching'.
There are two ways of performing this match: either with or without a pre-trained logit.
To use a logit, you must have a verified set of matches. The names of the variables
in this set must match the names of the variables in the data you pass into multivar_match.
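As a rough illustration of the pre-trained logit route, the sketch below fits a logit with base R's glm on a hypothetical verified set and scores new candidate pairs. The data frame, its column names (name_sim, amount_diff, match), and the values are invented for the example; the exact layout that multivar_match expects is described in the vignette.

```r
# Hypothetical verified set: one row per candidate pair, with two
# comparison metrics and a 0/1 label marking true matches.
verified <- data.frame(
  name_sim    = c(0.95, 0.90, 0.85, 0.40, 0.25, 0.85, 0.20, 0.35, 0.10, 0.30),
  amount_diff = c(0.02, 0.10, 0.40, 0.05, 0.55, 0.10, 0.60, 0.30, 0.80, 0.50),
  match       = c(1, 1, 1, 1, 1, 0, 0, 0, 0, 0)
)

# Fit the logit: a glm with a binomial family and logit link.
fit <- glm(match ~ name_sim + amount_diff,
           data = verified,
           family = binomial(link = "logit"))

# New candidate pairs must use the same variable names as the verified
# set, otherwise predict() cannot find the predictors.
new_pairs <- data.frame(name_sim = c(0.92, 0.15),
                        amount_diff = c(0.05, 0.70))

# Predicted match probability, usable as a score between 0 and 1.
new_pairs$score <- predict(fit, newdata = new_pairs, type = "response")
print(new_pairs)
```

The requirement that variable names agree between the verified set and the data to be matched shows up here as the shared column names between verified and new_pairs.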
Without a pre-trained logit, you must have a set of weights for each variable that you
want in the comparison. These can either be made up ahead of time, or you can
use a verified set of matches and calculate_weights.
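The weights route can be pictured as a weighted sum of per-variable comparison metrics. The following base R sketch uses made-up weights and two simple comparisons (exact equality for a name, a ratio for an amount); the data, column names, and the choice of metrics are assumptions for illustration and do not reproduce how multivar_match or calculate_weights work internally.

```r
# Hypothetical toy data; column names are illustrative only.
data1 <- data.frame(id_1 = 1:2,
                    name = c("acme corp", "beta llc"),
                    amount = c(100, 250),
                    stringsAsFactors = FALSE)
data2 <- data.frame(id_2 = 11:13,
                    name = c("acme corp", "beta inc", "gamma co"),
                    amount = c(110, 250, 100),
                    stringsAsFactors = FALSE)

# Made-up weights, one per comparison variable.
wgts <- c(name = 0.6, amount = 0.4)

# All candidate pairs (no blocking): by = NULL gives a cross join, and
# the shared column names receive the suffixes.
pairs <- merge(data1, data2, by = NULL, suffixes = c("_1", "_2"))

# Per-variable comparison metrics, each scaled to [0, 1].
name_equal   <- as.numeric(pairs$name_1 == pairs$name_2)
amount_ratio <- pmin(pairs$amount_1, pairs$amount_2) /
  pmax(pairs$amount_1, pairs$amount_2)

# Score: weighted sum of the metrics.
pairs$score <- wgts[["name"]] * name_equal + wgts[["amount"]] * amount_ratio

# Keep the highest-scoring candidate for each row of data1.
pairs <- pairs[order(pairs$id_1, -pairs$score), ]
best <- pairs[!duplicated(pairs$id_1), c("id_1", "id_2", "score")]
print(best)
```

Hand-picked weights like these encode a judgment about which variables matter most; deriving them from a verified set replaces that judgment with evidence.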