Using-clean-strings.Rmd
library(fedmatch)
clean_strings is the way to prepare strings for name matching, either on its own or within tier_match (see the Using-tier-match vignette). It takes several arguments that control how the strings are cleaned.
Here’s the example vector of company names we’ll be using:
name_vec <- corp_data1[, Company]
name_vec
#> [1] "Walmart" "Bershire Hataway" "Apple"
#> [4] "Exxon Mobile" "McKesson " "UnitedHealth Group"
#> [7] "CVS Health" "General Motors" "AT&T"
#> [10] "Ford Motor Company"
First, we can use the basic string cleaning defaults:
clean_strings(name_vec)
#> [1] "walmart" "bershire hataway" "apple"
#> [4] "exxon mobile" "mckesson" "unitedhealth group"
#> [7] "cvs health" "general motors" "atandt"
#> [10] "ford motor company"
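Without any additional arguments, clean_strings makes the changes visible in the output above: it converts every name to lowercase, replaces & with and, and trims the trailing whitespace from "McKesson ".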
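The effects visible in the default output above (lowercasing, & becoming and, trimmed whitespace) can be approximated in base R. This is only a rough sketch for intuition: basic_clean is a hypothetical helper, not fedmatch's implementation, and it covers just the behaviour shown in this vignette's output.

```r
# Hypothetical base-R approximation of the default effects shown above.
# Not fedmatch's implementation; it covers only the visible behaviour.
basic_clean <- function(x) {
  x <- tolower(x)                         # "AT&T" -> "at&t"
  x <- gsub("&", "and", x, fixed = TRUE)  # "at&t" -> "atandt"
  trimws(x)                               # "mckesson " -> "mckesson"
}

basic_clean(c("AT&T", "McKesson ", "Ford Motor Company"))
#> [1] "atandt"             "mckesson"           "ford motor company"
```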
Then, we have a few different options we can use.
sp_char_words is a data.frame with 2 columns: the first column holds the symbols to replace, and the second holds their replacements. fedmatch has a built-in set of symbols:
print(sp_char_words)
#> character replacement
#> 1: \\& and
#> 2: \\$ dollar
#> 3: \\% percent
#> 4: \\@ at
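But you can pass any data.frame you’d like, to make whatever replacements you’d like. The replacements apply anywhere in a string, and the table you pass is used instead of the built-in one, so in the output below & is no longer turned into and: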
new_sp_char <- data.table::data.table(character = c("o"), replacement = c("apple"))
clean_strings(name_vec, sp_char_words = new_sp_char)
#> [1] "walmart" "bershire hataway"
#> [3] "apple" "exxapplen mapplebile"
#> [5] "mckessapplen" "unitedhealth grappleup"
#> [7] "cvs health" "general mappletapplers"
#> [9] "at t" "fapplerd mappletappler capplempany"
common_words is similar, but it respects word boundaries (so, for example, ‘Corp’ inside a longer word is not replaced with ‘Corporation’). fedmatch has a built-in set of 54 words and their replacements, corporate_words:
print(corporate_words[1:5])
#> abbr long.names
#> 1: accep acceptance
#> 2: amer america
#> 3: assoc associates
#> 4: cl company listed
#> 5: cmnty community
But, you can use whatever words you’d like:
clean_strings(name_vec, common_words = data.table::data.table(word = c("general", "almart"),
replacement = c("bananas", "oranges")))
#> [1] "walmart" "bershire hataway" "apple"
#> [4] "exxon mobile" "mckesson" "unitedhealth group"
#> [7] "cvs health" "bananas motors" "atandt"
#> [10] "ford motor company"
(bananas motors sounds like a lovely place to work.) Note that the ‘almart’ in ‘walmart’ didn’t get replaced, because common_words respects word boundaries.
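To see the difference between substring replacement and whole-word replacement, here is a minimal base-R sketch. The helper replace_whole_word is hypothetical and only illustrates the idea of a word boundary (`\\b` in a regular expression); how clean_strings implements this internally is not shown in this vignette.

```r
# Hypothetical helper, not part of fedmatch: replace whole words only.
replace_whole_word <- function(x, word, replacement) {
  # \\b matches a word boundary, so "almart" inside "walmart" is skipped
  gsub(paste0("\\b", word, "\\b"), replacement, x, perl = TRUE)
}

x <- c("walmart", "general motors")

# Plain substring replacement changes the inside of "walmart"
gsub("almart", "oranges", x, fixed = TRUE)
#> [1] "woranges"       "general motors"

# Whole-word replacement leaves "walmart" alone
replace_whole_word(x, "almart", "oranges")
#> [1] "walmart"        "general motors"

replace_whole_word(x, "general", "bananas")
#> [1] "walmart"        "bananas motors"
```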
You can also use a related function, word_frequency, to look for the most common strings in your data:
word_frequency(sample(c("hi", "Hello", "bye "), 1e4, replace = TRUE))
#> Word Count
#> 1: hi 3380
#> 2: hello 3326
#> 3: bye 3294
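For intuition, a similar count can be sketched in base R with table(). The helper count_words below is hypothetical and assumes words are separated by whitespace; word_frequency itself may tokenize or clean the input differently.

```r
# Hypothetical base-R word counter; word_frequency may differ in detail.
count_words <- function(x) {
  # Lowercase, trim, and split each string on runs of whitespace
  words <- unlist(strsplit(trimws(tolower(x)), "\\s+"))
  # Tabulate the words and put the most frequent first
  counts <- sort(table(words), decreasing = TRUE)
  data.frame(Word = names(counts), Count = as.integer(counts))
}

count_words(c("Ford Motor Company", "General Motors", "ford motor credit"))
```

Here "ford" and "motor" each appear twice, which is the kind of signal that suggests a candidate for common_words.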
remove_words is a boolean that removes the words in ‘common_words’ instead of replacing them. remove_char is a character vector of characters to remove rather than replace; in the output below, each removed character inside a name leaves a space behind.
clean_strings(name_vec, sp_char_words = new_sp_char, remove_char = c("a", "c"))
#> [1] "w lm rt" "bershire h t w y"
#> [3] "pple" "exxapplen mapplebile"
#> [5] "m kessapplen" "unitedhe lth grappleup"
#> [7] "vs he lth" "gener l mappletapplers"
#> [9] "t t" "fapplerd mappletappler applemp ny"
clean_strings(name_vec, common_words = data.table::data.table(word = c("general", "company"),
replacement = c("bananas", "oranges")),
remove_words = TRUE)
#> [1] "walmart" "bershire hataway" "apple"
#> [4] "exxon mobile" "mckesson" "unitedhealth group"
#> [7] "cvs health" "motors" "atandt"
#> [10] "ford motor"
stem is a boolean that lets you stem words, using SnowballC::wordStem. Stemming a word means removing common suffixes:
clean_strings(c( "call", "calling", "called"), stem = TRUE)
#> [1] "call" "call" "call"
See the documentation for SnowballC::wordStem for details.
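If you want to see what the stemmer does to a word before cleaning a whole vector, you can call it directly. The sketch below assumes SnowballC is installed and uses wordStem's default language; which language setting clean_strings passes is not stated in this vignette, so results may differ slightly.

```r
# Call the stemmer directly, if SnowballC is available.
# Assumption: wordStem's default language is used here.
if (requireNamespace("SnowballC", quietly = TRUE)) {
  words <- c("call", "calling", "called")
  print(SnowballC::wordStem(words))
}
```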