Intro-to-fedmatch.Rmd
With an ever-expanding set of financial data providers, each covering
many of the same firms and entities, matching data sets that do not
share common identifiers has never been more
important.fedmatch
is a set of tools for performing record
linkage between two data sets. It allows for a variety of different
matching techniques, letting the user build a matching algorithm for
their specific application. Although fedmatch was designed with economic
data in mind (i.e. loans or companies), it is very flexible, so it can
be used for any matching problem. With fedmatch
, a
researcher or analyst can quickly go from having 0 matches between two
datasets to having many. With more time and care, they can use more
advanced techniques to pull out even more matches.
Fedmatch has many features, including:
This vignette will explain the basics of fedmatch, including the
merge_plus
function and the clean_strings
function. From there, other vignettes go further into the details of the
different matching types.
Before diving into the matching code, we’ll first go over the
clean_strings
function that can help standardize company
names across data sets.
A basic example of clean strings looks like this:
raw_names <- c("Hamlin, Hamlin, McGill", "Schweibert & Cokely ", "Wexler McGill, LLC",
"Davis and Main, Inc.")
clean_names <- clean_strings(raw_names)
clean_names
#> [1] "hamlin hamlin mcgill" "schweibert and cokely" "wexler mcgill llc"
#> [4] "davis and main inc"
Without any additional arguments, clean_strings
does the
following:
As described in the manual for clean_strings
, one can
specify further arguments to remove words or try different replacements.
Fedmatch comes with a set of words that are commonly used for this, but
you can use whatever you’d like. (You can also use
word_frequency
to look for common words in your data.)
fedmatch::corporate_words[1:5]
#> abbr long.names
#> 1: accep acceptance
#> 2: amer america
#> 3: assoc associates
#> 4: cl company listed
#> 5: cmnty community
scrubbed_names <- clean_strings(raw_names, common_words = fedmatch::corporate_words)
scrubbed_names
#> [1] "hamlin hamlin mcgill"
#> [2] "schweibert and cokely"
#> [3] "wexler mcgill limited liability corporation"
#> [4] "davis and main incorporated"
Through string cleaning, we can make it so that even if two different databases record names differently (e.g. “Hamlin Hamlin McGill INC” vs “Hamlin Hamlin McGill Incorporated”), we will still count these records as a match.
merge_plus
The workhorse of fedmatch
is merge_plus
.
merge_plus
is an extremely flexible function that can
perform several different types of matches: exact, fuzzy, and
multivar.
Here are the example datasets that come with fedmatch:
fedmatch::corp_data1
#> Company Country State SIC Revenue unique_key_1
#> 1: Walmart USA OH 3300 485 1
#> 2: Bershire Hataway USA 2222 223 2
#> 3: Apple USA CA 3384 215 3
#> 4: Exxon Mobile USA TX 2222 205 4
#> 5: McKesson Germany MA 222 192 5
#> 6: UnitedHealth Group USA MA NA 184 6
#> 7: CVS Health USA RI 1112 177 7
#> 8: General Motors USA MI 2222 166 8
#> 9: AT&T USA TN 4000 163 9
#> 10: Ford Motor Company USA MI NA 151 10
fedmatch::corp_data2
#> Name country state_code SIC_code earnings unique_key_2
#> 1: Walmart USA OH 3380 490,000 1
#> 2: Bershire Hathaway USA NE 2220 220,000 2
#> 3: Apple Computer USA CA NA 220,000 3
#> 4: Exxon Mobile Inc. USA TX 2222 210,000 4
#> 5: McKesson Corp. MA 2222 190,000 5
#> 6: UnitedHealth Group USA MA 1130 180,000 6
#> 7: CVS RI 1122 180,000 7
#> 8: GM MI 2222 170,000 8
#> 9: AT & T USA TN 4000 160,000 9
#> 10: Ford Motor USA MI 2222 150,000 10
The most basic way to use merge_plus
is by simply making
it equivalent to base::merge
.
basic_merge <- merge_plus(data1 = corp_data1,
data2 = corp_data2,
by.x = "Company",
by.y = "Name", match_type = "exact",
unique_key_1 = "unique_key_1",
unique_key_2 = "unique_key_2")
This code will run merge on the “Company” and “Name” variables, and
return cases where the two have an exact match. The only differences
between this and base::merge
are
merge_plus
requires data1 and data2 each to have a
“unique key” that can be used to identify an observation.merge_plus
returns a list.Let’s take a look at each of the elements of the list returned by
merge_plus
. These will always be the same, no matter which
match_type
you select in merge_plus
.
The first item is the matches themselves. This is a
data.table
with one row for each matching observation,
along with all variables present in each data set.
print(basic_merge$matches)
#> Company Country State SIC Revenue unique_key_1 country
#> 1: Walmart USA OH 3300 485 1 USA
#> 2: UnitedHealth Group USA MA NA 184 6 USA
#> state_code SIC_code earnings unique_key_2 Name tier
#> 1: OH 3380 490,000 1 Walmart all
#> 2: MA 1130 180,000 6 UnitedHealth Group all
The next item is matches_filter
. In this example, it’s
empty, because we didn’t supply the argument filter
. If we
did supply filter
(which can either be a function that
filters, or a numeric cutoff for a matchscore (more on this later)), we
would see a subsample of the matches dataset.
print(basic_merge$matches_filter)
#> Null data.table (0 rows and 0 cols)
Next in the list is data1_nomatch
and
data2_nomatch
, which return the rows that were not matched
from the datasets.
print(basic_merge$data1_nomatch)
#> Company Country State SIC Revenue unique_key_1
#> 1: Bershire Hataway USA 2222 223 2
#> 2: Apple USA CA 3384 215 3
#> 3: Exxon Mobile USA TX 2222 205 4
#> 4: McKesson Germany MA 222 192 5
#> 5: CVS Health USA RI 1112 177 7
#> 6: General Motors USA MI 2222 166 8
#> 7: AT&T USA TN 4000 163 9
#> 8: Ford Motor Company USA MI NA 151 10
print(basic_merge$data2_nomatch)
#> Name country state_code SIC_code earnings unique_key_2
#> 1: Bershire Hathaway USA NE 2220 220,000 2
#> 2: Apple Computer USA CA NA 220,000 3
#> 3: Exxon Mobile Inc. USA TX 2222 210,000 4
#> 4: McKesson Corp. MA 2222 190,000 5
#> 5: CVS RI 1122 180,000 7
#> 6: GM MI 2222 170,000 8
#> 7: AT & T USA TN 4000 160,000 9
#> 8: Ford Motor USA MI 2222 150,000 10
Finally, there is match_evaluation
, which is a
data.table that summarizes how well the match worked. It shows the
number of matches in each dataset broken down by tier (more on tiers
later), along with the percent matched.
print(basic_merge$match_evaluation)
#> tier matches in_tier_unique_1 in_tier_unique_2 pct_matched_1 pct_matched_2
#> 1: all 2 2 2 0.2 0.2
#> new_unique_1 new_unique_2
#> 1: 2 2
We can also use merge_plus
to perform “fuzzy” matches. A
fuzzy match uses a string distance algorithm to compute the distance
between one string and a set of other strings, then picks the closest
string that’s over a certain threshold. fedmatch
uses
stringdist::amatch
to execute these matches, and you can
read more about string distances in the stringdist
package
documentation.
Here is an example of how this is implemented in
merge_plus
.
fuzzy_result <- merge_plus(data1 = corp_data1,
data2 = corp_data2,
by.x = "Company",
by.y = "Name", match_type = "fuzzy",
unique_key_1 = "unique_key_1",
unique_key_2 = "unique_key_2")
print(fuzzy_result$matches)
#> unique_key_2 unique_key_1 Country State SIC Revenue Company
#> 1: 1 1 USA OH 3300 485 Walmart
#> 2: 2 2 USA 2222 223 Bershire Hataway
#> 3: 6 6 USA MA NA 184 UnitedHealth Group
#> Name country state_code SIC_code earnings tier
#> 1: Walmart USA OH 3380 490,000 all
#> 2: Bershire Hathaway USA NE 2220 220,000 all
#> 3: UnitedHealth Group USA MA 1130 180,000 all
We can see that we picked up an additional match here: “Bershire
Hataway” and “Bershire Hathaway.” These are off by 1 character, so the
exact match didn’t pick them up, but the fuzzy match did. We can also
tweak the fuzzy match settings with the argument
fuzzy_settings
. This is a list that will be passed to
stringdist::amatch
.
fuzzy_result <- merge_plus(data1 = corp_data1,
data2 = corp_data2,
by.x = "Company",
by.y = "Name", match_type = "fuzzy",
fuzzy_settings = build_fuzzy_settings(maxDist = .5),
unique_key_1 = "unique_key_1",
unique_key_2 = "unique_key_2")
print(fuzzy_result$matches)
#> unique_key_2 unique_key_1 Country State SIC Revenue Company
#> 1: 1 1 USA OH 3300 485 Walmart
#> 2: 2 2 USA 2222 223 Bershire Hataway
#> 3: 3 3 USA CA 3384 215 Apple
#> 4: 4 4 USA TX 2222 205 Exxon Mobile
#> 5: 5 5 Germany MA 222 192 McKesson
#> 6: 6 6 USA MA NA 184 UnitedHealth Group
#> 7: 7 7 USA RI 1112 177 CVS Health
#> 8: 9 9 USA TN 4000 163 AT&T
#> 9: 10 8 USA MI 2222 166 General Motors
#> 10: 10 10 USA MI NA 151 Ford Motor Company
#> Name country state_code SIC_code earnings tier
#> 1: Walmart USA OH 3380 490,000 all
#> 2: Bershire Hathaway USA NE 2220 220,000 all
#> 3: Apple Computer USA CA NA 220,000 all
#> 4: Exxon Mobile Inc. USA TX 2222 210,000 all
#> 5: McKesson Corp. MA 2222 190,000 all
#> 6: UnitedHealth Group USA MA 1130 180,000 all
#> 7: CVS RI 1122 180,000 all
#> 8: AT & T USA TN 4000 160,000 all
#> 9: Ford Motor USA MI 2222 150,000 all
#> 10: Ford Motor USA MI 2222 150,000 all
So, cranking up the maxDist
(maximum distance between
strings, a threshold for determining matches) gave us a bunch more
matches. Note that we return multiple matches per the same unique key
sometimes, for example Ford Motor got matched to General Motors and Ford
Motor Company. There are many tweaks that one can make via
fuzzy_settings
, and these change the match behavior
significantly. It is worth exploring various options to see which make
the most sense for your specific application. See the fuzzy matching
vignette for more details, including a new method of string comparison
that we call a “Weighted Jaccard” comparison.
The final setting for match_type
in
merge_plus
is a “multivariable match”, or “multivar” for
short. This match is complex, and may take some playing around with the
code to fully understand how it works. We’ll just go over the basic
usage here. The idea behind the multivariable match is to use several
variables from each dataset to execute a match, rather than just using
the name of an entity.
One way is to take the set of variables (say, company name, state, and earnings), compare them with some numeric metric, and then perform a linear combination of those metrics to arrive at a final score. Then, you can compare each observation in one dataset to each other observation in the other dataset to pick the match with the highest score.
The other way is similar, but instead of a linear combination of scores, you can use a logit model. In this method, you create a hand-verified match set between your two datasets, then use a logit model to estimate how much each variable contributes to determining a match.
Here’s an example of the first method, the linear combination:
# for simplicity's sake, rename columns in corp_data2
data.table::setnames(corp_data2, c("Name", "country"), c("Company", "Country"))
multivar_linear_result <- merge_plus(corp_data1, corp_data2,
match_type = "multivar",
by = c("Country", "Company"),
unique_key_1 = "unique_key_1",
suffixes = c("_1", "_2"),
unique_key_2 = "unique_key_2",
multivar_settings = build_multivar_settings(compare_type = c("indicator", "stringdist"),
wgts = c(.5, .5),
top = 1))
multivar_linear_result$matches
#> unique_key_1 Company_1 Country_1 State SIC Revenue
#> 1: 1 Walmart USA OH 3300 485
#> 2: 2 Bershire Hataway USA 2222 223
#> 3: 3 Apple USA CA 3384 215
#> 4: 4 Exxon Mobile USA TX 2222 205
#> 5: 5 McKesson Germany MA 222 192
#> 6: 6 UnitedHealth Group USA MA NA 184
#> 7: 7 CVS Health USA RI 1112 177
#> 8: 9 AT&T USA TN 4000 163
#> 9: 8 General Motors USA MI 2222 166
#> 10: 10 Ford Motor Company USA MI NA 151
#> Company_2 Country_2 state_code SIC_code earnings unique_key_2
#> 1: Walmart USA OH 3380 490,000 1
#> 2: Bershire Hathaway USA NE 2220 220,000 2
#> 3: Apple Computer USA CA NA 220,000 3
#> 4: Exxon Mobile Inc. USA TX 2222 210,000 4
#> 5: McKesson Corp. MA 2222 190,000 5
#> 6: UnitedHealth Group USA MA 1130 180,000 6
#> 7: UnitedHealth Group USA MA 1130 180,000 6
#> 8: AT & T USA TN 4000 160,000 9
#> 9: Ford Motor USA MI 2222 150,000 10
#> 10: Ford Motor USA MI 2222 150,000 10
#> Country_compare Company_compare multivar_score tier
#> 1: 1 1.0000000 1.0000000 all
#> 2: 1 0.9882353 0.9941176 all
#> 3: 1 0.8714286 0.9357143 all
#> 4: 1 0.9333333 0.9666667 all
#> 5: 0 0.9285714 0.4642857 all
#> 6: 1 1.0000000 1.0000000 all
#> 7: 1 0.5333333 0.7666667 all
#> 8: 1 0.9111111 0.9555556 all
#> 9: 1 0.7333333 0.8666667 all
#> 10: 1 0.9111111 0.9555556 all
To specify this type of match, we put in the match_type
as “multivar,” and then we specified how we wanted the match to run by
passing the list multivar_settings
. Each element of this
list is a separate argument to go into multivar_match
. The
compare_type
argument tells the multivar how to compare
each variable in the by
argument. Because “Country” is a
binary variable, we specify “indicator”, and because “Company” is a
string variable, we specify “stringdist.” You can see a full list of
options for comparison in the multivar_match
documentation,
or equivalently, the merge_plus
score_settings
documentation. (merge_plus
has an option to compute
matchscores post-hoc, as a method of evaluation.)
Next, here’s an example the second method, using a logit model. First, we’ll set up a fake training table. Normally, one would construct a human-verified match set. Here, I just create a table where the first half are matches, the second half are a mix of matches and not, and then the two comparison variables are biased to be more of a match in the first half of the sample. This is just a way to ensure that our logit model gives us positive coefficients, so that our example makes a little more sense.
set.seed(111)
training_table <- data.table::data.table(match = c(rep(1, 5e4), sample(c(0,1 ), 5e4, replace = TRUE)),
Company_compare = seq(1, 0.00001, -.00001),
Country_compare = c(rep(1, 5e4), sample(c(1, 0), 5e4, replace = TRUE)))
# training_table
logit_model <- glm(match ~ Company_compare + Country_compare, family = "binomial",
data = training_table)
summary(logit_model)
#>
#> Call:
#> glm(formula = match ~ Company_compare + Country_compare, family = "binomial",
#> data = training_table)
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -1.27701 0.01702 -75.04 <2e-16 ***
#> Company_compare 4.98075 0.04134 120.49 <2e-16 ***
#> Country_compare 0.56183 0.01821 30.85 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 112745 on 99999 degrees of freedom
#> Residual deviance: 83050 on 99997 degrees of freedom
#> AIC: 83056
#>
#> Number of Fisher Scoring iterations: 5
Then, we plug our logit model into the
multivar_settings
. The code will then use our trained logit
model on the variables we specified. Note that the name of the columns
in the training set must match the name of the variables in the match
datasets, with “_compare” at the end.
result <- merge_plus(corp_data1, corp_data2, by = c("Country", "Company"), unique_key_1 = "unique_key_1",
unique_key_2 = "unique_key_2",
match_type = "multivar",
multivar_settings = list(logit = logit_model, compare_type = c("indicator", "stringdist"),
wgts = NULL),
suffixes = c("_1", "_2"))
result$matches
#> unique_key_1 Company_1 Country_1 State SIC Revenue
#> 1: 1 Walmart USA OH 3300 485
#> 2: 2 Bershire Hataway USA 2222 223
#> 3: 3 Apple USA CA 3384 215
#> 4: 4 Exxon Mobile USA TX 2222 205
#> 5: 5 McKesson Germany MA 222 192
#> 6: 6 UnitedHealth Group USA MA NA 184
#> 7: 7 CVS Health USA RI 1112 177
#> 8: 9 AT&T USA TN 4000 163
#> 9: 8 General Motors USA MI 2222 166
#> 10: 10 Ford Motor Company USA MI NA 151
#> Company_2 Country_2 state_code SIC_code earnings unique_key_2
#> 1: Walmart USA OH 3380 490,000 1
#> 2: Bershire Hathaway USA NE 2220 220,000 2
#> 3: Apple Computer USA CA NA 220,000 3
#> 4: Exxon Mobile Inc. USA TX 2222 210,000 4
#> 5: McKesson Corp. MA 2222 190,000 5
#> 6: UnitedHealth Group USA MA 1130 180,000 6
#> 7: CVS RI 1122 180,000 7
#> 8: AT & T USA TN 4000 160,000 9
#> 9: Ford Motor USA MI 2222 150,000 10
#> 10: Ford Motor USA MI 2222 150,000 10
#> Country_compare Company_compare multivar_score tier
#> 1: 1 1.0000000 0.9861508 all
#> 2: 1 0.9882353 0.9853272 all
#> 3: 1 0.8714286 0.9740476 all
#> 4: 1 0.9333333 0.9808013 all
#> 5: 0 0.9285714 0.9660386 all
#> 6: 1 1.0000000 0.9861508 all
#> 7: 0 0.8366667 0.9473611 all
#> 8: 1 0.9111111 0.9786024 all
#> 9: 1 0.7333333 0.9496635 all
#> 10: 1 0.9111111 0.9786024 all
Note the last few columns in the data.table: we see the comparison
metrics, just like in the linear combination version of
multivar_match
. But, note that instead of computing a 50/50
linear combination like before, we are now computing a matchscore as the
fitted probability of a match based on our logit model. In this toy
example, the coefficients are a little strange because of the random
data we fed in. But, we see the behavior we’d expect: a higher company
name comparison and a country match gives us a higher matchscore.
We’ve covered the several different types of matching with
fedmatch
: exact matching, fuzzy matching, and multivar
matching. Each match is useful in its own right, and they become even
more useful when combined. That’s where the next step comes in: tier
matching with the function tier_match
. See the vignette for
tier match for more details.