Use string distances to match on names — fuzzy

Use the stringdist package to perform a fuzzy match on two datasets.

fuzzy_match(
  data1,
  data2,
  by = NULL,
  by.x = NULL,
  by.y = NULL,
  suffixes,
  unique_key_1,
  unique_key_2,
  fuzzy_settings = list(method = "jw", p = 0.1, maxDist = 0.05, matchNA = FALSE,
    nthread = getOption("sd_num_thread"))
)

Arguments

data1: data.frame. First to-merge dataset.
data2: data.frame. Second to-merge dataset.
by: character string. Variables to merge on (common to data1 and data2). See merge.
by.x: character string. Variable to merge on in data1. See merge.
by.y: character string. Variable to merge on in data2. See merge.
suffixes: character vector of length 2. Suffixes to add to like-named variables after the merge. See merge.
unique_key_1: character vector. Primary key of data1 that uniquely identifies each row (can be multiple fields)
unique_key_2: character vector. Primary key of data2 that uniquely identifies each row (can be multiple fields)
fuzzy_settings: list of arguments to pass to the fuzzy matching function. See amatch.
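
As a rough illustration of how these arguments fit together, the sketch below builds two small data frames and calls fuzzy_match on them. The data, column names (id_1, id_2, name), and the loosened maxDist are hypothetical and chosen only for illustration. The call is guarded so the snippet still runs when the package that provides fuzzy_match is not loaded.

```r
# Hypothetical example data; each data set has its own unique key column.
data1 <- data.frame(id_1 = 1:3,
                    name = c("Acme Corp", "Globex Inc", "Initech"),
                    stringsAsFactors = FALSE)
data2 <- data.frame(id_2 = c(10L, 20L, 30L),
                    name = c("Initech.", "Acme Corp", "Umbrella LLC"),
                    stringsAsFactors = FALSE)

# Documented defaults, with maxDist loosened and nthread fixed at 1
# (both are illustrative choices, not recommendations).
my_settings <- list(method = "jw", p = 0.1, maxDist = 0.1,
                    matchNA = FALSE, nthread = 1)

result <- NULL
# Only call fuzzy_match if it is available in the current session.
if (exists("fuzzy_match", mode = "function")) {
  result <- fuzzy_match(
    data1, data2,
    by = "name",
    suffixes = c("_1", "_2"),
    unique_key_1 = "id_1",
    unique_key_2 = "id_2",
    fuzzy_settings = my_settings
  )
}
```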

Value

A data.table containing the merged data set, including all columns from both input data sets.

Details

amatch, from the stringdist package, computes string distances between every pair of strings in two vectors, then picks the closest string pair for each observation in the dataset. fuzzy_match uses this to perform a string-distance-based match between two datasets. Because every pair of strings is compared, the process can take a long time on large datasets. For quicker matches, try adjusting the nthread argument in fuzzy_settings. The default fuzzy_settings are sensible starting points for company name matching, but changing them, particularly method, p, and maxDist, can greatly change how the match performs.
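
The matching step described above can be sketched directly with stringdist::amatch. This is a minimal illustration under stated assumptions, not the internals of fuzzy_match: the example data and the way the matched rows are assembled are hypothetical, and the settings copy the documented defaults except for nthread, which is fixed at 1.

```r
library(stringdist)

# Hypothetical example data.
data1 <- data.frame(id_1 = 1:3,
                    name = c("Acme Corp", "Globex Inc", "Initech"),
                    stringsAsFactors = FALSE)
data2 <- data.frame(id_2 = c(10L, 20L, 30L),
                    name = c("Initech.", "Acme Corp", "Umbrella LLC"),
                    stringsAsFactors = FALSE)

# Documented fuzzy_settings defaults, with nthread fixed at 1 for this sketch.
settings <- list(method = "jw", p = 0.1, maxDist = 0.05,
                 matchNA = FALSE, nthread = 1)

# For each name in data1, amatch returns the index of the closest name in
# data2, or NA when no candidate is within maxDist.
idx <- do.call(amatch, c(list(x = data1$name, table = data2$name), settings))

# Keep only the rows of data1 that found a match and pair them with the
# corresponding rows of data2.
keep <- !is.na(idx)
matched <- data.frame(
  id_1   = data1$id_1[keep],
  name_1 = data1$name[keep],
  id_2   = data2$id_2[idx[keep]],
  name_2 = data2$name[idx[keep]],
  stringsAsFactors = FALSE
)
```

In this sketch, "Initech" and "Initech." differ only by a trailing period, so they fall within the default maxDist, while "Globex Inc" has no sufficiently close candidate and is dropped. Raising maxDist admits looser matches at the cost of more false positives.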