Fuzzy-matching.Rmd
library(fedmatch)
library(data.table)
Fuzzy matching is an essential part of the matching process. After
trying all the name cleaning that you can with
clean_strings
, you have gotten the ‘low hanging fruit’ of
your match, and now you need to move on to non-exact matches.
merge_plus
has a built-in setting for this called ‘fuzzy’
matching. It lets you match on strings that are similar, but not exactly
the same.
The basic idea behind fuzzy matching is to compute a numerical ‘distance’ between every potential string comparison, and then for each string in data set 1, pick the ‘closest’ string in data set 2. One can also specify a threshold such that every match is of a certain quality. The concept of ‘distance’ can be defined in several methods, explained below. Each method will define a similarity \(s\), \(0 \leq s \leq 1\) , such that higher values will mean the strings are more similar, and a corresponding distance \(d = 1 - s\).
The most common method of matching strings with fedmatch
uses the Jaro-Winkler string distance. The Jaro similarity is defined
as
\[s_j = \begin{cases} 0 & \text{if } m = 0 \\ \frac{1}{3} \left(\frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - t}{m} \right) & \text{otherwise} \end{cases}\]
where \(|s_i|\) is the length of the \(i\)th string, \(m\) is the number of characters that match, and \(t\) is the number of transpositions (the number of letters to be interchanged to place them in the proper order.) For example, if \(s_1\) is “abcd” and \(s_2\) is “acbd”, then the number of transpositions is 1, the matching characters are 4, and the similarity is \(\frac{1}{3}\left( 1 + 1 + \frac{3}{4} \right) \approx .917\).
Winkler’s modification to this similarity is to weight the start of the strings more heavily. Thus, if the strings share a common prefix, the similarity will be higher. The Jaro-Winkler similarity is
\[s_w = s_j + l p (1-s_j),\]
where \(l\) is the shared prefix
length (up to four characters, so \(0 \leq l
\leq 4\) ), and \(p\) is just
user-defined constant, \(0 < p \leq
.25\), that defines how heavily to weight the shared prefix.
\(p=.1\) is the default used by
fedmatch
.
Another string distance metric is called the Jaccard Similarity. If \(A\) is the set of letters in one string, and \(B\) is the set of letters in another string, then the Jaccard similarity \(J\) is defined as
\[J = \frac{A \bigcap B}{A \bigcup B}.\]
Thus, if the two strings share all of the same letters, then the distance is 0, or the similarity is 1.
We take one step further and innovate to create a new measure of string distance, which we call the Weighted Jaccard distance. Rather than looking at the sets of letters in each string, we look at the set of words in each string. Then, we construct a corpus of all of the words in the two columns to be matched and compute the log inverse frequency for each word:
\[w_i = \log \left( \frac{N}{n_i} \right), \] where \(N\) is the total number of unique words in the sample, and \(n_i\) is the number of occurrences of word \(i\) in the sample.
For example, the word ‘Corporation’ appears frequently in the names of companies in common data sets, and so its log inverse frequency would be low. But, the word ‘Apple’ probably only appears in a few company names out of the many thousands in the data, so its log inverse frequency score would be high.
If these logged inverse frequencies are called \(w\), then the Weighted Jaccard similarity is computed as
\[ J_i = \frac{ \sum_j w_j, j \in A \bigcap B}{ \sum_k w_k, k \in A \bigcup B} \]
Thus, if the two strings share many words that are uniquely
identifying in the corpus, the Weighted Jaccard similarity will be
higher. We have found this method of matching is able to pick up many
more matches when used in conjunction with the Jaro-Winkler distance. To
our knowledge, such a Weighted Jaccard method has never been used
before, and is one of the key innovations that makes
fedmatch
unique.
Because fedmatch uses stringdist::amatch
for all other
options, one can use any of the similarity metrics listed in the
documentation for that function in fedmatch.
As shown in the ‘Intro to fedmatch’ vignette, the basic syntax of
using fuzzy match is just the same as exact matching, but changing the
‘match_type’ argument in merge_plus
:
fuzzy_result <- merge_plus(data1 = corp_data1,
data2 = corp_data2,
by.x = "Company",
by.y = "Name", match_type = "fuzzy",
unique_key_1 = "unique_key_1",
unique_key_2 = "unique_key_2")
print(fuzzy_result$matches)
#> unique_key_2 unique_key_1 Country State SIC Revenue Company
#> 1: 1 1 USA OH 3300 485 Walmart
#> 2: 2 2 USA 2222 223 Bershire Hataway
#> 3: 6 6 USA MA NA 184 UnitedHealth Group
#> Name country state_code SIC_code earnings tier
#> 1: Walmart USA OH 3380 490,000 all
#> 2: Bershire Hathaway USA NE 2220 220,000 all
#> 3: UnitedHealth Group USA MA 1130 180,000 all
Note that ‘Bershire Hathaway’ matched to ‘Bershire Hataway,’ which
makes sense - the names are only off by one character. All of the
settings of fuzzy matching are controlled in the ‘fuzzy_settings’
argument of merge_plus
, and these settings can be easily
modified with the function build_fuzzy_settings
.
build_fuzzy_settings
returns a list that is properly
formatted for merge_plus
.The options and their defaults
are:
method
, default “jw”, determines the string distance
metric to use.p
, default 0.1, is a numeric vector of length 1, that
determines the factor \(p\) in the
Jaro-Winkler string distance. Only used if
method == "jw"
.FALSE
, is a boolean to determine
whether to count NAs as a match (see stringdist::amatch
for
details.)The behavior of fuzzy_match
is such that you can get
different results from swapping data1 and data2. This is because
fuzzy_match
answers the question: “for each observation in
data1, which observation is closest in data2 (while being closest than
the maxDist)?” This can result in duplicate observations from data2, as
shown in the simple example below:
dummy_data1 <- data.table(id1 = 1:2,
name = "abd")
dummy_data2 <- data.table(id2 = 1,
name = "abc")
result1 <- fedmatch::merge_plus(
data1 = dummy_data1,
match_type = "fuzzy",
data2 = dummy_data2, by.x = "name", by.y = "name",
unique_key_1 = "id1", unique_key_2 = "id2",
suffixes = c("_1", "_2"), fuzzy_settings = build_fuzzy_settings(maxDist = .5))
print(result1$matches)
#> id2 id1 name_1 name_2 tier
#> 1: 1 1 abd abc all
#> 2: 1 2 abd abc all
Note how we have two observations of the id2 ‘1’, because each of these observations is the closest in the data to the observations in data1 (and they are closer together than the maxDist specified). Now, if we flip the datasets:
result1 <- fedmatch::merge_plus(
data1 = dummy_data2,
match_type = "fuzzy",
data2 = dummy_data1, by.x = "name", by.y = "name",
unique_key_1 = "id2", unique_key_2 = "id1",
suffixes = c("_1", "_2"), fuzzy_settings = build_fuzzy_settings(maxDist = .5))
print(result1$matches)
#> id1 id2 name_1 name_2 tier
#> 1: 1 1 abc abd all
We only get 1 observation, because in the case of ties,
fuzzy_match
will simply pick the first observation.
wgt_jaccard_match <- merge_plus(data1 = corp_data1,
data2 = corp_data2,
by.x = "Company",
by.y = "Name", match_type = "fuzzy",
fuzzy_settings = build_fuzzy_settings(method = "wgt_jaccard", nthread = 2,
maxDist = .5),
unique_key_1 = "unique_key_1",
unique_key_2 = "unique_key_2")
print(wgt_jaccard_match)
#> $matches
#> unique_key_2 unique_key_1 Country State SIC Revenue Company
#> 1: 1 1 USA OH 3300 485 Walmart
#> 2: 4 4 USA TX 2222 205 Exxon Mobile
#> 3: 6 6 USA MA NA 184 UnitedHealth Group
#> 4: 10 10 USA MI NA 151 Ford Motor Company
#> Name country state_code SIC_code earnings tier
#> 1: Walmart USA OH 3380 490,000 all
#> 2: Exxon Mobile Inc. USA TX 2222 210,000 all
#> 3: UnitedHealth Group USA MA 1130 180,000 all
#> 4: Ford Motor USA MI 2222 150,000 all
#>
#> $matches_filter
#> Null data.table (0 rows and 0 cols)
#>
#> $data1_nomatch
#> Company Country State SIC Revenue unique_key_1
#> 1: Bershire Hataway USA 2222 223 2
#> 2: Apple USA CA 3384 215 3
#> 3: McKesson Germany MA 222 192 5
#> 4: CVS Health USA RI 1112 177 7
#> 5: General Motors USA MI 2222 166 8
#> 6: AT&T USA TN 4000 163 9
#>
#> $data2_nomatch
#> Name country state_code SIC_code earnings unique_key_2
#> 1: Bershire Hathaway USA NE 2220 220,000 2
#> 2: Apple Computer USA CA NA 220,000 3
#> 3: McKesson Corp. MA 2222 190,000 5
#> 4: CVS RI 1122 180,000 7
#> 5: GM MI 2222 170,000 8
#> 6: AT & T USA TN 4000 160,000 9
#>
#> $match_evaluation
#> tier matches in_tier_unique_1 in_tier_unique_2 pct_matched_1 pct_matched_2
#> 1: all 4 4 4 0.4 0.4
#> new_unique_1 new_unique_2
#> 1: 4 4