Constructs a tiered match by running merge_plus sequentially on the same data, with different parameters for each tier. Matched observations can optionally be removed after each tier (see takeout).

tier_match(
  data1,
  data2,
  by = NULL,
  by.x = NULL,
  by.y = NULL,
  suffixes = c("_1", "_2"),
  check_merge = TRUE,
  unique_key_1,
  unique_key_2,
  tiers = list(),
  takeout = "both",
  match_type = "exact",
  clean = FALSE,
  clean_settings = build_clean_settings(),
  score_settings = NULL,
  filter = NULL,
  filter.args = list(),
  evaluate = match_evaluate,
  evaluate.args = list(),
  allow.cartesian = TRUE,
  fuzzy_settings = build_fuzzy_settings(),
  multivar_settings = build_multivar_settings(),
  verbose = FALSE
)

Arguments

data1

data.frame. First to-merge dataset.

data2

data.frame. Second to-merge dataset.

by

character string. Variables to merge on (common to data1 and data2). See merge.

by.x

character string. Variable to merge on in data1. See merge.

by.y

character string. Variable to merge on in data2. See merge.

suffixes

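character vector of length 2. Suffixes appended to non-merge columns that share a name in data1 and data2. See merge.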

check_merge

logical. If TRUE, checks that unique_key_1 and unique_key_2 are indeed unique, and prevents the merge from running if it would produce a data.frame larger than 5 million rows.

unique_key_1

character vector. Primary key of data1 that uniquely identifies each row (can be multiple fields)

unique_key_2

character vector. Primary key of data2 that uniquely identifies each row (can be multiple fields)

tiers

list(). tiers is a list of lists, where each inner list holds the parameters for creating one tier. All arguments to tier_match listed after this argument can be supplied either directly to tier_match or indirectly via tiers.
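
As a minimal sketch of the list-of-lists shape described above: each inner list carries the tier_match arguments for one tier. The tier names (exact_raw, exact_clean, fuzzy) are hypothetical labels chosen for this illustration, and the particular combination of settings is an assumption, not a recommended configuration.

```r
# Each named element is one tier; its entries are tier_match() arguments
# that apply to that tier. Tier names here are hypothetical.
tier_list <- list(
  exact_raw   = list(match_type = "exact"),
  exact_clean = list(match_type = "exact", clean = TRUE),
  fuzzy       = list(match_type = "fuzzy")
)

# Inspect which match type each tier would use.
tier_types <- vapply(tier_list, function(tier) tier$match_type, character(1))
print(tier_types)
```

A list built this way would then be passed as tiers = tier_list.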

takeout

character string, one of 'data1', 'data2', 'both', or 'neither'. After each tier, removes matched observations from the selected dataset(s), so that later tiers only see what remains.
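
The idea behind takeout can be sketched by hand with base R's merge. This is only an illustration of sequential removal under takeout = 'both', not how tier_match is implemented; the data, the column names (id_1, id_2, name, key), and the two hand-rolled tiers are all made up for the example.

```r
# Toy data; all column names and values are made up for illustration.
data1 <- data.frame(id_1 = 1:3, name = c("acme", "Beta Co", "gamma"),
                    stringsAsFactors = FALSE)
data2 <- data.frame(id_2 = 1:3, name = c("acme", "beta co", "delta"),
                    stringsAsFactors = FALSE)

# Tier 1: exact match on the raw name.
tier1 <- merge(data1, data2, by = "name")

# Equivalent of takeout = "both": drop matched rows from both datasets
# before the next tier runs.
left1 <- data1[!data1$id_1 %in% tier1$id_1, ]
left2 <- data2[!data2$id_2 %in% tier1$id_2, ]

# Tier 2: exact match on a lower-cased name, using only the leftovers.
left1$key <- tolower(left1$name)
left2$key <- tolower(left2$name)
tier2 <- merge(left1, left2, by = "key", suffixes = c("_1", "_2"))

# Stack the key pairs found by each tier.
matched_ids <- rbind(tier1[, c("id_1", "id_2")], tier2[, c("id_1", "id_2")])
print(matched_ids)
```

With takeout = 'neither', the second tier would instead run on the full data1 and data2, so the same pair could be found by more than one tier.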

match_type

string. If 'exact', match is exact; if 'fuzzy', match is fuzzy; if 'multivar', the multivar match is used (see multivar_settings).

clean

logical. Whether to clean strings prior to the match.

clean_settings

list. Settings for string cleaning. See clean_strings and build_clean_settings.

score_settings

list. Settings for post-hoc match scoring. See build_score_settings.

filter

function or numeric. Filters a merged data1-data2 dataset. If a function, it should take a data.frame (data1 and data2 merged by name1 and name2) as its first argument and return a trimmed version of that data.frame (fewer rows). Think of this function as applying conditions to matches beyond the match by name. If numeric, drops all observations with a matchscore lower than or equal to filter.
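
A function-valued filter might look like the following sketch. The function name same_state_filter, its ignore_case argument, and the columns state_1 and state_2 are all hypothetical; the candidates data.frame merely stands in for the merged dataset that filter would receive.

```r
# Hypothetical filter: keep only candidate matches whose state agrees.
# The first argument is the merged data.frame, as filter requires.
same_state_filter <- function(merged, ignore_case = TRUE) {
  s1 <- merged$state_1
  s2 <- merged$state_2
  if (ignore_case) {
    s1 <- toupper(s1)
    s2 <- toupper(s2)
  }
  keep <- !is.na(s1) & !is.na(s2) & s1 == s2
  # Return the same columns with fewer rows.
  merged[keep, , drop = FALSE]
}

# Stand-in for a merged dataset, made up for illustration.
candidates <- data.frame(
  name    = c("acme", "beta", "gamma"),
  state_1 = c("NY", "ca", "TX"),
  state_2 = c("NY", "CA", "WA"),
  stringsAsFactors = FALSE
)
filtered <- same_state_filter(candidates, ignore_case = TRUE)
print(filtered)
```

Such a function would be supplied as filter = same_state_filter, with any extra arguments given through filter.args, for example filter.args = list(ignore_case = TRUE).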

filter.args

list. Arguments passed to filter, if a function

evaluate

function. Function used to evaluate merge_plus output. See match_evaluate.

evaluate.args

list. Arguments passed to function specified by evaluate

allow.cartesian

logical. Whether to allow many-to-many matches. See data.table::merge().

fuzzy_settings

list. Additional arguments for amatch, used if match_type = 'fuzzy'. Suggested defaults are provided by build_fuzzy_settings (see amatch, method = 'jw').

multivar_settings

list. Settings passed to the multivar match if match_type = 'multivar'. See multivar-match.

verbose

logical. Whether to print tier names and the time taken to match each tier as the matching happens.

Value

list containing the matches, data1 and data2 minus the matches, and the match evaluation.
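
A call that produces such a list might look like the sketch below. It assumes the fedmatch package is installed and that its bundled example datasets corp_data1 and corp_data2 carry the columns Company, Name, unique_key_1, and unique_key_2; the tier names are hypothetical. The element names of the returned list are not assumed here, so the sketch only lists them.

```r
# Two tiers: exact first, then fuzzy. Tier names are hypothetical.
tier_list <- list(
  exact_tier = list(match_type = "exact"),
  fuzzy_tier = list(match_type = "fuzzy")
)

# Only run the match if fedmatch is available (assumption: it provides
# tier_match() and the example datasets corp_data1 / corp_data2).
if (requireNamespace("fedmatch", quietly = TRUE)) {
  result <- fedmatch::tier_match(
    fedmatch::corp_data1, fedmatch::corp_data2,
    by.x = "Company", by.y = "Name",
    unique_key_1 = "unique_key_1", unique_key_2 = "unique_key_2",
    tiers = tier_list,
    takeout = "neither",
    verbose = TRUE
  )
  # List the top-level elements of the returned list.
  print(names(result))
}
```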

Details

See the tier match vignette to get a clear understanding of the tier_match syntax.

See also

merge_plus, clean_strings