Multivar_matching.Rmd
library(fedmatch)
library(data.table)
While name matching is covered well by the tools
infedmatch::merge_plus()
and
fedmatch::fuzzy_match()
, sometimes it is useful to pull in
additional information besides names. These could include company
information fields like industry code, zip codes, countries, and the
like. Or, for loans, it could include loan amounts, origination dates,
or industry codes. Or, one could even want to use multiple different
names at the same time. This is where
fedmatch::multivar_match()
comes into play. It lets you
compare two records based on multiple fields. It does so by assigning a
numeric value from 0-1 for each field, and then computing a weighted sum
(or predicts a probability with a logit model), arriving at a final
multivar_score for a given pair of records. It computes these
multivar_scores for every possible combination of records, and picks the
best match for each
The basic syntax is as follows, using the example company data in fedmatch:
corp_data1_test <- copy(corp_data1)
data.table::setDT(corp_data1_test)
corp_data2_test <- copy(corp_data2)
data.table::setDT(corp_data2_test)
corp_data1_test[, id_1 := seq(1, .N)]
corp_data2_test[, id_2 := seq(1, .N)]
corp_data2_test[, Country := country]
corp_data2_test[, Company := Name]
result <- merge_plus(
data1 = corp_data1_test,
data2 = corp_data2_test,
match_type = "multivar",
by = c("Country", "Company"),
suffixes = c("_1", "_2"),
unique_key_1 = "id_1",
unique_key_2 = "id_2",
multivar_settings = build_multivar_settings(
compare_type = c("indicator", "stringdist"),
wgts = c(.5, .5), nthread = 1
))
result
#> $matches
#> unique_key_1 Company_1 Country_1 State SIC Revenue id_1
#> 1: 1 Walmart USA OH 3300 485 1
#> 2: 2 Bershire Hataway USA 2222 223 2
#> 3: 3 Apple USA CA 3384 215 3
#> 4: 4 Exxon Mobile USA TX 2222 205 4
#> 5: 5 McKesson Germany MA 222 192 5
#> 6: 6 UnitedHealth Group USA MA NA 184 6
#> 7: 7 CVS Health USA RI 1112 177 7
#> 8: 9 AT&T USA TN 4000 163 9
#> 9: 8 General Motors USA MI 2222 166 8
#> 10: 10 Ford Motor Company USA MI NA 151 10
#> Name country state_code SIC_code earnings unique_key_2 id_2
#> 1: Walmart USA OH 3380 490,000 1 1
#> 2: Bershire Hathaway USA NE 2220 220,000 2 2
#> 3: Apple Computer USA CA NA 220,000 3 3
#> 4: Exxon Mobile Inc. USA TX 2222 210,000 4 4
#> 5: McKesson Corp. MA 2222 190,000 5 5
#> 6: UnitedHealth Group USA MA 1130 180,000 6 6
#> 7: UnitedHealth Group USA MA 1130 180,000 6 6
#> 8: AT & T USA TN 4000 160,000 9 9
#> 9: Ford Motor USA MI 2222 150,000 10 10
#> 10: Ford Motor USA MI 2222 150,000 10 10
#> Country_2 Company_2 Country_compare Company_compare multivar_score
#> 1: USA Walmart 1 1.0000000 1.0000000
#> 2: USA Bershire Hathaway 1 0.9882353 0.9941176
#> 3: USA Apple Computer 1 0.8714286 0.9357143
#> 4: USA Exxon Mobile Inc. 1 0.9333333 0.9666667
#> 5: McKesson Corp. 0 0.9285714 0.4642857
#> 6: USA UnitedHealth Group 1 1.0000000 1.0000000
#> 7: USA UnitedHealth Group 1 0.5333333 0.7666667
#> 8: USA AT & T 1 0.9111111 0.9555556
#> 9: USA Ford Motor 1 0.7333333 0.8666667
#> 10: USA Ford Motor 1 0.9111111 0.9555556
#> tier
#> 1: all
#> 2: all
#> 3: all
#> 4: all
#> 5: all
#> 6: all
#> 7: all
#> 8: all
#> 9: all
#> 10: all
#>
#> $matches_filter
#> Null data.table (0 rows and 0 cols)
#>
#> $data1_nomatch
#> Empty data.table (0 rows and 7 cols): Company,Country,State,SIC,Revenue,unique_key_1...
#>
#> $data2_nomatch
#> Name country state_code SIC_code earnings unique_key_2 id_2 Country Company
#> 1: CVS RI 1122 180,000 7 7 CVS
#> 2: GM MI 2222 170,000 8 8 GM
#>
#> $match_evaluation
#> tier matches in_tier_unique_1 in_tier_unique_2 pct_matched_1 pct_matched_2
#> 1: all 10 10 8 1 0.8
#> new_unique_1 new_unique_2
#> 1: 10 8
Let’s go through the arguments:
merge_plus
, except ‘by’ can contain multiple elements,
unlike with merge_plus
.parallel::makeCluster()
base::predict()
multivar_match
returns a data.table with the columns
from both data_1
and data_2
, just like
base::merge()
. In addition, it returns three columns for
each variable in by
: two for the original columns in the
data, and one column with the suffix ‘compare’ that has the numerical
value for the given comparison. For example:
print(result$matches[, .(Company_1, Company_2, Company_compare)])
#> Company_1 Company_2 Company_compare
#> 1: Walmart Walmart 1.0000000
#> 2: Bershire Hataway Bershire Hathaway 0.9882353
#> 3: Apple Apple Computer 0.8714286
#> 4: Exxon Mobile Exxon Mobile Inc. 0.9333333
#> 5: McKesson McKesson Corp. 0.9285714
#> 6: UnitedHealth Group UnitedHealth Group 1.0000000
#> 7: CVS Health UnitedHealth Group 0.5333333
#> 8: AT&T AT & T 0.9111111
#> 9: General Motors Ford Motor 0.7333333
#> 10: Ford Motor Company Ford Motor 0.9111111
print(result$matches[, .(Country_1, Country_2, Country_compare)])
#> Country_1 Country_2 Country_compare
#> 1: USA USA 1
#> 2: USA USA 1
#> 3: USA USA 1
#> 4: USA USA 1
#> 5: Germany 0
#> 6: USA USA 1
#> 7: USA USA 1
#> 8: USA USA 1
#> 9: USA USA 1
#> 10: USA USA 1
Further, it adds one additional column for the overall multivar_score. In our example, we set the weights each equal to 0.5, so the multivar_score is simply the average of our two comparison variables:
print(result$matches[, .(Company_compare, Country_compare, multivar_score)])
#> Company_compare Country_compare multivar_score
#> 1: 1.0000000 1 1.0000000
#> 2: 0.9882353 1 0.9941176
#> 3: 0.8714286 1 0.9357143
#> 4: 0.9333333 1 0.9666667
#> 5: 0.9285714 0 0.4642857
#> 6: 1.0000000 1 1.0000000
#> 7: 0.5333333 1 0.7666667
#> 8: 0.9111111 1 0.9555556
#> 9: 0.7333333 1 0.8666667
#> 10: 0.9111111 1 0.9555556
If you’re not sure what the weights for each variable should be, you
can use the function fedmatch::calculate_weights()
in
conjunction with a validated match set to get an estimate. This uses the
methodology from Felligi and Sunter’s seminal paper A Theory for
Record Linkage. You can find more information on the Record Linkage
Wikipedia page. Essentially, fedmatch looks for which variables are
most likely to match when two records match, and uses this criteria to
weight the variables. Here is an example of how this would work:
set.seed(111)
fake_result_table <- data.table::data.table(
Company_1 = c("ABC Corp", "XYZ Corporation", "Apple Corp", "Banana Corp"),
Company_2 = c("ABC Corporation", "XYZ Inc", "Apple Incorporated", "Banana Stand"),
Country_1 = c("USA", "USA", "TUR", "USA"),
Country_2 = c("MEX", "USA", "TUR", "USA")
)
calculated_weights <- calculate_weights(fake_result_table, c("Company", "Country"),
compare_type = c("stringdist", "indicator"),
suffixes = c("_1", "_2"))
calculated_weights$w
#> Company Country
#> 0.7349053 0.2650947
In this dummy example, the names are giving us more information about
the match quality than the countries, and so the weights are
correspondingly higher. If you’d like more details on the exact
methodology, see the manual documentation for
calculate_weights
.
Rather than using a linear combination of the comparison variables,
we can instead train a logit model on a training set of matches and
non-matches. The explanatory variables are each of the comparison
variables, and the dependent variable is a binary 1/0 for if the two
records are a match or not. This data set needs to be constructed by
hand, and is similar to the type of data set used with
calculate_weights.
However, it must contain both matches
and non-matches for the logit model to be well-trained. Here is an
example of how this would look:
set.seed(111)
corp_data1_test <- copy(corp_data1)
data.table::setDT(corp_data1_test)
corp_data2_test <- copy(corp_data2)
data.table::setDT(corp_data2_test)
corp_data1_test[, id_1 := seq(1, .N)]
corp_data2_test[, id_2 := seq(1, .N)]
corp_data2_test[, Country := country]
corp_data2_test[, Company := Name]
set.seed(111)
fake_result_table <- data.table::data.table(
match = sample(c(1, 0, 1), 1e5, replace = TRUE),
Company_compare = runif(1e5),
Country_compare = sample(c(1, 0), 1e5, replace = TRUE)
)
logit_model <- glm(match ~ Company_compare + Country_compare,
family = "binomial",
data = fake_result_table
)
summary(logit_model)
#>
#> Call:
#> glm(formula = match ~ Company_compare + Country_compare, family = "binomial",
#> data = fake_result_table)
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 0.69243 0.01502 46.097 <2e-16 ***
#> Company_compare -0.02710 0.02324 -1.166 0.2436
#> Country_compare 0.03155 0.01342 2.351 0.0187 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 127258 on 99999 degrees of freedom
#> Residual deviance: 127251 on 99997 degrees of freedom
#> AIC: 127257
#>
#> Number of Fisher Scoring iterations: 4
result <- merge_plus(corp_data1_test, corp_data2_test,
match_type = "multivar",
multivar_settings = build_multivar_settings(logit = logit_model, compare_type = c("indicator", "stringdist"),
wgts = NULL, nthread = 1),
by = c("Country", "Company"), unique_key_1 = "id_1",
unique_key_2 = "id_2",
suffixes = c("_1", "_2")
)
result
#> $matches
#> unique_key_1 Company_1 Country_1 State SIC Revenue id_1
#> 1: 4 Exxon Mobile USA TX 2222 205 4
#> 2: 5 McKesson Germany MA 222 192 5
#> 3: 6 UnitedHealth Group USA MA NA 184 6
#> 4: 9 AT&T USA TN 4000 163 9
#> 5: 2 Bershire Hataway USA 2222 223 2
#> 6: 7 CVS Health USA RI 1112 177 7
#> 7: 10 Ford Motor Company USA MI NA 151 10
#> 8: 1 Walmart USA OH 3300 485 1
#> 9: 8 General Motors USA MI 2222 166 8
#> 10: 3 Apple USA CA 3384 215 3
#> Name country state_code SIC_code earnings unique_key_2 id_2
#> 1: Walmart USA OH 3380 490,000 1 1
#> 2: Walmart USA OH 3380 490,000 1 1
#> 3: Walmart USA OH 3380 490,000 1 1
#> 4: Walmart USA OH 3380 490,000 1 1
#> 5: Exxon Mobile Inc. USA TX 2222 210,000 4 4
#> 6: Exxon Mobile Inc. USA TX 2222 210,000 4 4
#> 7: UnitedHealth Group USA MA 1130 180,000 6 6
#> 8: AT & T USA TN 4000 160,000 9 9
#> 9: AT & T USA TN 4000 160,000 9 9
#> 10: Ford Motor USA MI 2222 150,000 10 10
#> Country_2 Company_2 Country_compare Company_compare multivar_score
#> 1: USA Walmart 1 0.0000000 0.6734819
#> 2: USA Walmart 0 0.0000000 0.6665064
#> 3: USA Walmart 1 0.3650794 0.6713026
#> 4: USA Walmart 1 0.0000000 0.6734819
#> 5: USA Exxon Mobile Inc. 1 0.2847222 0.6717829
#> 6: USA Exxon Mobile Inc. 1 0.3777778 0.6712266
#> 7: USA UnitedHealth Group 1 0.4497354 0.6707962
#> 8: USA AT & T 1 0.0000000 0.6734819
#> 9: USA AT & T 1 0.4126984 0.6710178
#> 10: USA Ford Motor 1 0.0000000 0.6734819
#> tier
#> 1: all
#> 2: all
#> 3: all
#> 4: all
#> 5: all
#> 6: all
#> 7: all
#> 8: all
#> 9: all
#> 10: all
#>
#> $matches_filter
#> Null data.table (0 rows and 0 cols)
#>
#> $data1_nomatch
#> Empty data.table (0 rows and 7 cols): Company,Country,State,SIC,Revenue,unique_key_1...
#>
#> $data2_nomatch
#> Name country state_code SIC_code earnings unique_key_2 id_2
#> 1: Bershire Hathaway USA NE 2220 220,000 2 2
#> 2: Apple Computer USA CA NA 220,000 3 3
#> 3: McKesson Corp. MA 2222 190,000 5 5
#> 4: CVS RI 1122 180,000 7 7
#> 5: GM MI 2222 170,000 8 8
#> Country Company
#> 1: USA Bershire Hathaway
#> 2: USA Apple Computer
#> 3: McKesson Corp.
#> 4: CVS
#> 5: GM
#>
#> $match_evaluation
#> tier matches in_tier_unique_1 in_tier_unique_2 pct_matched_1 pct_matched_2
#> 1: all 10 10 5 1 0.5
#> new_unique_1 new_unique_2
#> 1: 10 5
Instead of using a weighted sum of comparison scores, we use
base::predict
to arrive at a multivar_score that can be
interpreted as a match probability. A word of caution: the logit
parameters can give very non-intuitive results for the resultant
multivar_scores, because the logit function is of
course non-linear. For example, in one project we were working on, there
was a coefficient of 40 and a coefficient of 3. However, if the variable
with the coefficient of ‘3’ had a comparison score of 0, then the
highest the multivar_score could be was only around 0.6! This was very
unintuitive for us, and shows the complex nature of the logit model.
All that said, the logit model has the benefit of returning an easily-interpretable multivar_score: with a large enough (and clean enough) training set, the fitted multivar_score is simply the probability of a match. This is contrast to the multivar_scores returned by the linear combination: these numbers have no immediate meaning, and can vary dramatically for each project. In one project, a score of .6 might be very good, and in another, very bad. It is a good idea to try both the logit and linear methods and play around with each for your individual project.
Because this method of matching relies on comparing each row of a data set to each other row of a dataset, the time needed to perform a match grows very quickly as the size of each data set increases. There are several ways to help with this:
Here’s how the blocking works:
corp_data1_test <- copy(corp_data1)
data.table::setDT(corp_data1_test)
corp_data2_test <- copy(corp_data2)
data.table::setDT(corp_data2_test)
corp_data1_test[, id_1 := seq(1, .N)]
corp_data2_test[, id_2 := seq(1, .N)]
corp_data2_test[, Country := country]
corp_data2_test[, Company := Name]
result <- merge_plus(
data1 = corp_data1_test,
data2 = corp_data2_test,
match_type = "multivar",
by = c("Company"),
suffixes = c("_1", "_2"),
unique_key_1 = "id_1",
unique_key_2 = "id_2",
multivar_settings = build_multivar_settings(
compare_type = c( "stringdist"),
wgts = c(1), nthread = 1, blocks = "Country"
))
result$matches
#> unique_key_1 Country Company_1 State SIC Revenue id_1
#> 1: 1 USA Walmart OH 3300 485 1
#> 2: 2 USA Bershire Hataway 2222 223 2
#> 3: 3 USA Apple CA 3384 215 3
#> 4: 4 USA Exxon Mobile TX 2222 205 4
#> 5: 6 USA UnitedHealth Group MA NA 184 6
#> 6: 7 USA CVS Health RI 1112 177 7
#> 7: 9 USA AT&T TN 4000 163 9
#> 8: 8 USA General Motors MI 2222 166 8
#> 9: 10 USA Ford Motor Company MI NA 151 10
#> Name country state_code SIC_code earnings unique_key_2 id_2
#> 1: Walmart USA OH 3380 490,000 1 1
#> 2: Bershire Hathaway USA NE 2220 220,000 2 2
#> 3: Apple Computer USA CA NA 220,000 3 3
#> 4: Exxon Mobile Inc. USA TX 2222 210,000 4 4
#> 5: UnitedHealth Group USA MA 1130 180,000 6 6
#> 6: UnitedHealth Group USA MA 1130 180,000 6 6
#> 7: AT & T USA TN 4000 160,000 9 9
#> 8: Ford Motor USA MI 2222 150,000 10 10
#> 9: Ford Motor USA MI 2222 150,000 10 10
#> Company_2 Company_compare multivar_score tier
#> 1: Walmart 1.0000000 1.0000000 all
#> 2: Bershire Hathaway 0.9882353 0.9882353 all
#> 3: Apple Computer 0.8714286 0.8714286 all
#> 4: Exxon Mobile Inc. 0.9333333 0.9333333 all
#> 5: UnitedHealth Group 1.0000000 1.0000000 all
#> 6: UnitedHealth Group 0.5333333 0.5333333 all
#> 7: AT & T 0.9111111 0.9111111 all
#> 8: Ford Motor 0.7333333 0.7333333 all
#> 9: Ford Motor 0.9111111 0.9111111 all
For our dummy example data, this isn’t that useful (we just drop one observation with the Country being Germany). But, if there is a field in the data that is well-populated and trusted, it can be used to dramatically cut down on the number of observations.
Parallelization is implemented with the nthread
argument
and parallel::parLapply()
. The data is split into groups
equal to nthread
, and then each core does its comparison,
all at the same time. This of course can dramatically speed up the
process, but requires more memory and computing power. For details on
the parallel process, see parallel::parLapply
and
parallel::makeCluster
.