This function finds the indices of elements in a character vector that partially match or are similar to a pattern. It can be used to find exact or slightly mistyped elements in a string vector.
str_find(string, pattern, precision = 2, partial = 0, verbose = FALSE)
string: Character vector with string elements.
pattern: String that should be matched against the elements of string.
precision: Maximum distance ("precision") between two string elements that is allowed for them to be treated as similar or equal. Smaller values mean less tolerance in matching.
partial: Activates similar matching (close-distance strings) for parts (substrings) of the string. The following values are accepted:
0 for no partial distance matching
1 for one-step matching: only substrings of the same length as pattern are extracted from string for matching
2 for two-step matching: substrings of the same length as pattern, as well as slightly longer substrings, are extracted from string for matching
The default value is 0. See 'Details' for more information.
verbose: Logical; if TRUE, a progress bar is displayed while the distance matrix is computed. The default is FALSE, so the bar is hidden.
A numeric vector with the index positions of the elements in string that
partially match or are similar to pattern. Returns -1 if no
match was found.
Computation Details
Fuzzy string matching is based on regular expressions, in particular
grep(pattern = "(<pattern>){~<precision>}", x = string). This
means that precision indicates the number of characters in pattern
that may differ in string for it still to be considered "matching". The higher
precision is, the more tolerant the search (i.e. it yields more
possible matches). Furthermore, the higher the value of partial,
the more matches may be found.
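The grep() call described above can be sketched as follows. This is only an illustration under stated assumptions, not the package's actual code: the helper name approx_grep_sketch is hypothetical, ignore.case = TRUE is an assumption, and the sketch relies on R's default regular expression engine accepting the approximate-matching syntax "(...){~n}".

```r
# Hypothetical helper for illustration only; not the implementation of str_find().
approx_grep_sketch <- function(string, pattern, precision = 2) {
  # Build an approximate regular expression such as "(stem){~1}".
  approx_pattern <- sprintf("(%s){~%d}", pattern, as.integer(precision))
  # Assumption: matching ignores case.
  idx <- grep(pattern = approx_pattern, x = string, ignore.case = TRUE)
  # Mirror the documented return value: -1 when nothing matches.
  if (length(idx) == 0) -1 else idx
}

# The pattern that is handed to grep():
sprintf("(%s){~%d}", "stem", 1L)
#> [1] "(stem){~1}"

# Elements that contain "stem" exactly are always among the matches.
approx_grep_sketch(c("System", "Systemic", "Old"), "stem", precision = 1)
```

Because the tolerance is part of the pattern itself, raising precision only ever widens the set of elements that can match.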
Partial Distance Matching
For partial = 1, substrings with the same number of characters as pattern
are extracted from string, starting at the beginning of string and
moving on until the end of string is reached. Each substring is matched against
pattern, and results with a maximum distance of precision
are considered as "matching". If partial = 2, the range
of the extracted substring is additionally increased by 2, i.e. substrings
that are two characters longer are extracted and matched as well.
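The sliding-window idea behind partial distance matching can be sketched like this. Note the assumptions: the helper names window_substrings and partial_match_sketch and the widen argument are hypothetical, and utils::adist() (Levenshtein distance) is swapped in as the distance measure, so results may differ from what str_find() returns.

```r
# Hypothetical helper: all substrings of `width` characters in one string `x`.
window_substrings <- function(x, width) {
  n <- nchar(x)
  # Strings no longer than the window are compared as a whole.
  if (n <= width) return(x)
  starts <- seq_len(n - width + 1)
  substring(x, starts, starts + width - 1)
}

# Hypothetical helper: indices of elements that have at least one window
# within `precision` edits of `pattern`. `widen` adds characters to the window.
partial_match_sketch <- function(string, pattern, precision = 2, widen = 0) {
  width <- nchar(pattern) + widen
  hit <- vapply(string, function(s) {
    windows <- window_substrings(s, width)
    # Assumption: Levenshtein distance via adist() stands in for the
    # regex-based matching used by the package.
    min(utils::adist(pattern, windows, ignore.case = TRUE)) <= precision
  }, logical(1), USE.NAMES = FALSE)
  idx <- which(hit)
  # Mirror the documented return value: -1 when nothing matches.
  if (length(idx) == 0) -1 else idx
}

window_substrings("System", 4)
#> [1] "Syst" "yste" "stem"

partial_match_sketch(c("System", "Systemic", "Old"), "stum", precision = 1)
#> [1] 1 2
```

Setting widen = 2 in this sketch corresponds to the wider range described for partial = 2.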
This function does not return the position of a matching string inside
another string, but the index of the element in the string vector where
a (partial) match with pattern was found. Thus, searching for "abc" in
a string "this is abc" will not return 9 (the start position of the substring),
but 1 (the element index, which is always 1 if string has only one element).
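The difference between a position inside a string and an element index can be seen with two base R functions. This is just an illustration using regexpr() and grep() with fixed = TRUE, not a call to str_find() itself.

```r
x <- "this is abc"

# Start position of "abc" inside the string.
as.integer(regexpr("abc", x, fixed = TRUE))
#> [1] 9

# Index of the vector element that contains "abc".
grep("abc", x, fixed = TRUE)
#> [1] 1
```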
string <- c("Hello", "Helo", "Hole", "Apple", "Ape", "New", "Old", "System", "Systemic")
str_find(string, "hel") # partial match
#> [1] 1 2 3 5 6 7
str_find(string, "stem") # partial match
#> [1] 8 9
str_find(string, "R") # no match
#> [1] -1
str_find(string, "saste") # similarity to "System"
#> [1] 8
# finds two indices, because partial matching now
# also applies to "Systemic"
str_find(string, "sytsme", partial = 1)
#> [1] 8 9
# finds a similar string inside a longer string
str_find("We are Sex Pistols!", "postils")
#> [1] 1