fuzzy_match.Rd
Use the stringdist package to perform a fuzzy match on two datasets.
data.frame. First to-merge dataset.
data.frame. Second to-merge dataset.
character string. Variables to merge on (common across data 1 and data 2). See merge
character string. Variable to merge on in data1. See merge
character string. Variable to merge on in data2. See merge
character vector of length 2. Suffixes to add to like-named variables after the merge. See merge
character vector. Primary key of data1 that uniquely identifies each row (can be multiple fields)
character vector. Primary key of data2 that uniquely identifies each row (can be multiple fields)
list of arguments to pass to the fuzzy matching function. See amatch.
a data.table, the resultant merged data set, including all columns from both data sets.
The stringdist function amatch computes string distances between every pair of strings in two vectors, then picks the closest string pair for each observation in the dataset. This is used by fuzzy_match to perform a string distance-based match between two datasets. This process can take quite a long time; for quicker matches, try adjusting the nthread argument in fuzzy_settings.
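The closest-pair lookup described above can be sketched with stringdist directly. This is a minimal illustration, not the internals of fuzzy_match: the company names are made up, and the method, p, maxDist and nthread values are assumptions chosen for the example rather than the package defaults.

```r
library(stringdist)

# Hypothetical company names, for illustration only
names1 <- tolower(c("Acme Corp", "Globex Inc", "Initech LLC"))
names2 <- tolower(c("Initech", "ACME Corporation", "Globex Incorporated"))

# Distance between every pair of strings: one row per element of names1,
# one column per element of names2 (Jaro-Winkler distance, assumed settings)
dist_mat <- stringdistmatrix(names1, names2, method = "jw", p = 0.1)

# For each row, the column holding the smallest distance
closest <- apply(dist_mat, 1, which.min)

# amatch() performs the same lookup in one call and returns NA when no
# candidate lies within maxDist; nthread sets the number of threads used
idx <- amatch(names1, names2, method = "jw", p = 0.1,
              maxDist = 0.5, nthread = 2)

# Pair each name with its closest counterpart
data.frame(name1 = names1, name2 = names2[idx])
```

With a generous maxDist, as here, amatch and the row-wise minimum of the distance matrix pick the same pairs. A tighter maxDist would turn the weaker matches into NA.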
The default fuzzy_settings are sensible starting points for company name matching,
but adjusting these can greatly change how the match performs.
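To see how much the settings matter, the sketch below passes two hypothetical settings lists to amatch via do.call. The lists strict and loose and the helper match_with are illustrative names, not part of the package, and the values shown are assumptions rather than the actual fuzzy_settings defaults.

```r
library(stringdist)

# Hypothetical names, already lower-cased
x   <- c("acme corp", "initech llc")
tbl <- c("acme corporation", "initech")

# Two illustrative settings lists that differ only in the distance cutoff
strict <- list(method = "jw", p = 0.1, maxDist = 0.05)
loose  <- list(method = "jw", p = 0.1, maxDist = 0.15)

# Hypothetical helper: forward a settings list to amatch()
match_with <- function(settings) {
  do.call(amatch, c(list(x = x, table = tbl), settings))
}

strict_idx <- match_with(strict)  # cutoff too tight: no match is accepted
loose_idx  <- match_with(loose)   # both names find their counterpart
```

Both pairs have a Jaro-Winkler distance between 0.05 and 0.15, so the strict cutoff rejects them while the loose one accepts them.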