Perform an iterative match by tier

Constructs a tier match by running merge_plus sequentially on the same data, once per tier, with different parameters for each tier. Matched observations can be removed from the data after each tier (see takeout).

tier_match(
  data1,
  data2,
  by = NULL,
  by.x = NULL,
  by.y = NULL,
  suffixes = c("_1", "_2"),
  check_merge = TRUE,
  unique_key_1,
  unique_key_2,
  tiers = list(),
  takeout = "both",
  match_type = "exact",
  clean = FALSE,
  clean_settings = build_clean_settings(),
  score_settings = NULL,
  filter = NULL,
  filter.args = list(),
  evaluate = match_evaluate,
  evaluate.args = list(),
  allow.cartesian = TRUE,
  fuzzy_settings = build_fuzzy_settings(),
  multivar_settings = build_multivar_settings(),
  verbose = FALSE
)

Arguments

data1: data.frame. First to-merge dataset.
data2: data.frame. Second to-merge dataset.
by: character string. Variables to merge on (common across data1 and data2). See merge
by.x: character string. Variable to merge on in data1. See merge
by.y: character string. Variable to merge on in data2. See merge
suffixes: see merge
check_merge: logical. Checks that your unique_keys are indeed unique, and prevents merge from running if merge would result in data.frames larger than 5 million rows
unique_key_1: character vector. Primary key of data1 that uniquely identifies each row (can be multiple fields)
unique_key_2: character vector. Primary key of data2 that uniquely identifies each row (can be multiple fields)
tiers: list(). tiers is a list of lists, where each inner list holds the parameters for creating one tier. All arguments to tier_match listed after this argument can be supplied either directly to tier_match or indirectly via tiers.
takeout: character vector, either 'data1', 'data2', 'both', or 'neither'. Removes observations after each tier from the selected dataset.
match_type: string. If 'exact', match is exact; if 'fuzzy', match is fuzzy; if 'multivar', the multivar match is used (see multivar_settings).
clean: logical. Whether or not to clean strings prior to the match.
clean_settings: list. Settings for string cleaning. See clean_strings and build_clean_settings.
score_settings: list. Settings for post-hoc match scoring. See build_score_settings.
filter: function or numeric. Filters the merged data1-data2 dataset. If a function, it should take a data.frame (data1 and data2 merged by name1 and name2) as its first argument and return a trimmed version of that data.frame (fewer rows). Think of this function as applying conditions to matches beyond the match by name. If numeric, drops all observations with a matchscore lower than or equal to filter.
filter.args: list. Arguments passed to filter, if a function
evaluate: function. Evaluates merge_plus output. See match_evaluate.
evaluate.args: list. Arguments passed to function specified by evaluate
allow.cartesian: logical. Whether or not to allow many-to-many matches; see data.table::merge()
fuzzy_settings: additional arguments for amatch, to be used if match_type = 'fuzzy'. Suggested defaults provided. (see amatch, method='jw')
multivar_settings: list of settings to go to the multivar match if match_type == 'multivar'. See multivar-match.
verbose: logical. Whether or not to print tier names and the time taken to match each tier as the matching happens.
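
The arguments above might be combined as in the following sketch. It assumes the fedmatch package is installed and that it ships the example datasets corp_data1 and corp_data2 with columns Company, Name, unique_key_1 and unique_key_2; those names are assumptions about the bundled data, not something stated on this page. Each tier overrides only match_type, while by.x, by.y and takeout are given once and apply to every tier.

```r
library(fedmatch)

# Example datasets assumed to ship with fedmatch (see the lead-in above).
d1 <- data.table::data.table(fedmatch::corp_data1)
d2 <- data.table::data.table(fedmatch::corp_data2)

# A list of lists: each inner list holds the tier_match arguments
# that are specific to that tier.
tier_list <- list(
  a = list(match_type = "exact"),
  b = list(match_type = "fuzzy")
)

# Arguments supplied directly here are shared by all tiers.
result <- tier_match(
  d1, d2,
  by.x = "Company", by.y = "Name",
  unique_key_1 = "unique_key_1", unique_key_2 = "unique_key_2",
  tiers = tier_list,
  takeout = "both",
  verbose = TRUE
)

# Inspect the top-level structure of the returned list.
str(result, max.level = 1)
```

With takeout = "both", observations matched in tier a are removed from both datasets before tier b runs, so the fuzzy tier only sees rows that the exact tier left unmatched.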

Value

list with matches, data1 and data2 minus matches, and match evaluation

Details

See the tier match vignette to get a clear understanding of the tier_match syntax.
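
The idea of matching by tier with sequential removal can be illustrated in base R alone. This is only a conceptual sketch, not the package's implementation: the toy data, the column names (id1, id2, name) and the two key-preparation functions are all hypothetical, and base merge stands in for merge_plus.

```r
# Toy inputs; column names and values are hypothetical.
data1 <- data.frame(
  id1 = 1:3,
  name = c("Acme Corp", "Beta LLC", "Gamma Inc"),
  stringsAsFactors = FALSE
)
data2 <- data.frame(
  id2 = 1:3,
  name = c("Acme Corp", "beta llc", "Delta Co"),
  stringsAsFactors = FALSE
)

# Each tier prepares the merge key differently: strictest tier first.
tiers <- list(
  exact     = function(x) x,
  lowercase = function(x) tolower(x)
)

remaining1 <- data1
remaining2 <- data2
matches <- list()

for (tier_name in names(tiers)) {
  prep <- tiers[[tier_name]]

  # Build the tier-specific key on whatever rows are still unmatched.
  left <- remaining1
  left$key <- prep(left$name)
  right <- remaining2
  right$key <- prep(right$name)

  m <- merge(left, right, by = "key", suffixes = c("_1", "_2"))
  m$tier <- rep(tier_name, nrow(m))
  matches[[tier_name]] <- m

  # Analogue of takeout = "both": drop matched rows from both datasets
  # so that later tiers only see unmatched observations.
  remaining1 <- remaining1[!remaining1$id1 %in% m$id1, ]
  remaining2 <- remaining2[!remaining2$id2 %in% m$id2, ]
}

all_matches <- do.call(rbind, matches)
print(all_matches[, c("id1", "id2", "tier")])
```

Ordering tiers from strictest to loosest means each pair is recorded at the strictest tier in which it matches, which is why the removal step sits inside the loop rather than after it.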
