Using-tier-match.Rmd
tier_match is the ultimate wrapper function in fedmatch. tier_match puts together all of the pieces from the package into one function, letting the user perform many matches in one call. The function is excellent both as an exploratory tool, while the user is still figuring out how they want to execute their matches, and as a final matching tool that can be used in production code.
‘Tiers’ of a match are useful because matches come in a hierarchy of quality. An exact name match between two companies is a higher-quality match than a fuzzy match, and fuzzy matches with different levels of cleaning can themselves differ in quality.
To use tier_match, you provide a core set of arguments to the function itself and pass a named list of tiers to the tiers argument. Each element of this list is itself a list that defines one tier and contains all of the arguments necessary for that tier. These arguments are passed to merge_plus one tier at a time, in sequence, and the matches from every tier are saved and combined.
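To make the sequencing concrete, here is a minimal base R sketch of the tier idea. It is not fedmatch's implementation: the helpers exact_tier, fuzzy_tier, and run_tiers, the key columns, and the toy data are all hypothetical, and the sketch assumes that matched rows are removed from both datasets between tiers.

```r
# Toy data; the column names and keys are made up for illustration.
d1 <- data.frame(key1 = 1:3,
                 Company = c("walmart", "Bershire Hataway", "apple"),
                 stringsAsFactors = FALSE)
d2 <- data.frame(key2 = 1:3,
                 Name = c("walmart", "Bershire Hathaway", "apple computer"),
                 stringsAsFactors = FALSE)

# Each hypothetical tier function returns the matched key pairs.
exact_tier <- function(x, y) {
  m <- merge(x, y, by.x = "Company", by.y = "Name")
  m[, c("key1", "key2")]
}

fuzzy_tier <- function(x, y, max_dist = 2) {
  if (nrow(x) == 0 || nrow(y) == 0) {
    return(data.frame(key1 = integer(0), key2 = integer(0)))
  }
  d <- adist(x$Company, y$Name)        # edit-distance matrix (base R)
  best <- apply(d, 1, which.min)       # closest candidate in y for each row of x
  keep <- d[cbind(seq_len(nrow(x)), best)] <= max_dist
  data.frame(key1 = x$key1[keep], key2 = y$key2[best[keep]])
}

# Run the tiers in order, label each match with its tier, and drop
# matched rows from both datasets before moving to the next tier.
run_tiers <- function(x, y, tiers) {
  out <- list()
  for (nm in names(tiers)) {
    m <- tiers[[nm]](x, y)
    if (nrow(m) > 0) {
      m$tier <- nm
      out[[nm]] <- m
      x <- x[!x$key1 %in% m$key1, ]
      y <- y[!y$key2 %in% m$key2, ]
    }
  }
  do.call(rbind, out)
}

matches <- run_tiers(d1, d2, list(a = exact_tier, b = fuzzy_tier))
matches
```

In this toy run the exact tier picks up the identical names first, and the fuzzy tier only sees what is left over, which is the reason for ordering tiers from strictest to loosest.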
library(fedmatch)

# Three tiers, run in order: exact, fuzzy, then multivar
tier_list <- list(
  a = build_tier(match_type = "exact"),
  b = build_tier(match_type = "fuzzy"),
  c = build_tier(match_type = "multivar", multivar_settings = build_multivar_settings(
    logit = NULL, missing = FALSE, wgts = 1,
    compare_type = "stringdist", blocks = NULL, blocks.x = NULL, blocks.y = NULL,
    top = 1, threshold = NULL
  ))
)
# tier_list
This list will perform three matches: ‘a’, an exact match; ‘b’, a fuzzy match; and ‘c’, a multivar match. We can get a bit fancier and add more settings to each, if we’d like. Remember that each element of each tier has to be an argument for merge_plus.
tier_list_v2 <- list(
  a = build_tier(match_type = "exact", clean = TRUE),
  b = build_tier(match_type = "fuzzy", clean = TRUE,
                 fuzzy_settings = build_fuzzy_settings(method = "wgt_jaccard",
                                                       maxDist = .7,
                                                       nthread = 1),
                 clean_settings = build_clean_settings(remove_words = TRUE)),
  c = build_tier(match_type = "multivar",
                 multivar_settings = build_multivar_settings(
                   logit = NULL, missing = FALSE, wgts = 1,
                   compare_type = "stringdist", blocks = NULL, blocks.x = NULL, blocks.y = NULL,
                   top = 1, threshold = NULL
                 ))
)
Let’s take a look at the rest of the syntax for tier_match:
result <- tier_match(corp_data1, corp_data2,
                     by.x = "Company", by.y = "Name",
                     unique_key_1 = "unique_key_1", unique_key_2 = "unique_key_2",
                     tiers = tier_list_v2, takeout = "neither", verbose = TRUE,
                     score_settings = build_score_settings(score_var_x = "Company",
                                                           score_var_y = "Name",
                                                           wgts = 1,
                                                           score_type = "stringdist")
)
#> Matching tier 'a'...
#> Time elapsed: 0.02 secs.
#> Matching tier 'b'...
#> Time elapsed: 0.01 secs.
#> Matching tier 'c'...
#> Time elapsed: 0.02 secs.
There are two types of arguments for tier_match: those that can be passed to merge_plus, and those that are unique to tier_match. If any of the merge_plus arguments are listed in tier_match directly (rather than in the tier list), those arguments are used in every tier. In this example, we are always matching on ‘Company’ and ‘Name’, so those are placed in the arguments for tier_match directly. The arguments unique to tier_match and their defaults are:
tiers is the tier list created by repeated calls to build_tier(). Required, no default.
takeout is a character string, one of “neither”, “both”, “data1”, or “data2”. It controls whether matched observations are removed between tiers and, if so, from which dataset.
verbose is a boolean. If TRUE, prints tier names and the time taken to match each tier.
The other arguments are all present in merge_plus; see the documentation there for details.
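The effect of the four takeout values can be illustrated with a small base R sketch. This follows the description above rather than fedmatch's internals: apply_takeout, the key columns, and the toy data are hypothetical, and the order of the choices (and hence the default) is a choice of this sketch only.

```r
# Hypothetical helper: drop already-matched rows between tiers,
# depending on the takeout setting.
apply_takeout <- function(x, y, matched,
                          takeout = c("neither", "both", "data1", "data2")) {
  takeout <- match.arg(takeout)
  if (takeout %in% c("both", "data1")) {
    x <- x[!x$key1 %in% matched$key1, , drop = FALSE]
  }
  if (takeout %in% c("both", "data2")) {
    y <- y[!y$key2 %in% matched$key2, , drop = FALSE]
  }
  list(x = x, y = y)
}

# Toy inputs: one pair (key1 = 1, key2 = 2) was matched in an earlier tier.
d1 <- data.frame(key1 = 1:3)
d2 <- data.frame(key2 = 1:3)
matched <- data.frame(key1 = 1L, key2 = 2L)

# Rows left for the next tier under each setting
sapply(c("neither", "both", "data1", "data2"), function(s) {
  left <- apply_takeout(d1, d2, matched, s)
  c(data1_rows = nrow(left$x), data2_rows = nrow(left$y))
})
```

With “neither”, every tier sees the full datasets, so the same pair can be matched again in a later tier. This is why the example in this vignette returns some pairs more than once.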
The result of tier_match is a list with 4 items: the matched dataset, the unmatched data from each of the two input datasets, and a match evaluation. Here’s what the matches look like:
result$matches[1:5]
#> Company Country State SIC Revenue unique_key_1 country state_code
#> 1: walmart USA OH 3300 485 1 USA OH
#> 2: walmart USA OH 3300 485 1 USA OH
#> 3: Walmart USA OH 3300 485 1 USA OH
#> 4: Bershire Hataway USA 2222 223 2 USA NE
#> 5: apple USA CA 3384 215 3 USA CA
#> SIC_code earnings unique_key_2 Name matchscore Company_score
#> 1: 3380 490,000 1 walmart 1.0000000 1.0000000
#> 2: 3380 490,000 1 walmart 1.0000000 1.0000000
#> 3: 3380 490,000 1 Walmart 1.0000000 1.0000000
#> 4: 2220 220,000 2 Bershire Hathaway 0.9882353 0.9882353
#> 5: NA 220,000 3 apple computer 0.8714286 0.8714286
#> tier Company_compare multivar_score
#> 1: a NA NA
#> 2: b NA NA
#> 3: c 1.0000000 1.0000000
#> 4: c 0.9882353 0.9882353
#> 5: b NA NA
As you can see, the matches dataset has a column called ‘tier’ that indicates which tier the match was from. It also includes any additional columns added by the matching process. In this example, we see ‘Company_score’, created from the post-hoc scoring; ‘wgt_jaccard_sim’, the Weighted Jaccard similarity, created when using the ‘wgt_jaccard’ setting of fuzzy_match (see the ‘Fuzzy-matching’ vignette for more details); and ‘Company_compare’, created from the multivar matching tier.
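Because this example uses takeout = "neither", the same pair of records can show up in more than one tier, as in the first three rows of the output above. One possible way to keep a single row per pair is sketched below in base R. The data frame is typed in by hand from the keys and tiers printed above, and the sketch assumes that alphabetical tier order reflects match quality (‘a’ best), which holds for this tier list but is not something fedmatch enforces.

```r
# Keys and tiers copied from the five rows printed above.
m <- data.frame(
  unique_key_1 = c(1, 1, 1, 2, 3),
  unique_key_2 = c(1, 1, 1, 2, 3),
  tier = c("a", "b", "c", "c", "b"),
  stringsAsFactors = FALSE
)

# Sort so the preferred tier comes first, then keep the first row per pair.
m <- m[order(m$tier), ]
best <- m[!duplicated(m[, c("unique_key_1", "unique_key_2")]), ]
best
```

With the real output, the same two steps could be applied to as.data.frame(result$matches).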
We also have a match evaluation, now filled out with more details broken down by tier:
result$match_evaluation
#> tier matches in_tier_unique_1 in_tier_unique_2 pct_matched_1 pct_matched_2
#> 1: a 2 2 2 0.2 0.2
#> 2: b 7 7 7 0.7 0.7
#> 3: c 10 10 9 1.0 0.9
#> 4: all 19 10 9 1.0 0.9
#> new_unique_1 new_unique_2
#> 1: 2 2
#> 2: 5 5
#> 3: 3 2
#> 4: NA NA
We can use this evaluation to figure out which tiers did the ‘best’ job matching, that is, which tiers contributed the most new unique matches.
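As a rough illustration of that comparison, the base R sketch below ranks the tiers by their new unique matches. The numbers are typed in by hand from the evaluation table printed above; ranking on new_unique_1 is just one reasonable choice.

```r
# Values copied from the match evaluation printed above.
ev <- data.frame(
  tier = c("a", "b", "c", "all"),
  matches = c(2, 7, 10, 19),
  new_unique_1 = c(2, 5, 3, NA),
  new_unique_2 = c(2, 5, 2, NA),
  stringsAsFactors = FALSE
)

# Drop the summary row, then sort tiers by new unique matches in dataset 1.
by_tier <- ev[ev$tier != "all", ]
ranked <- by_tier[order(-by_tier$new_unique_1), ]
ranked[, c("tier", "matches", "new_unique_1", "new_unique_2")]
```

Here tier ‘b’ ranks first on new unique matches even though tier ‘c’ returns more matches in total, because many of the matches in ‘c’ repeat pairs already found by earlier tiers.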