This function finds the element indices of partially matching or similar strings in a character vector. It can be used to find exact or slightly mistyped elements in a string vector.
str_find(string, pattern, precision = 2, partial = 0, verbose = FALSE)
string
Character vector with string elements.
pattern
String that should be matched against the elements of string.
precision
Maximum distance ("precision") between two string elements that is allowed for them to be treated as similar or equal. Smaller values mean less tolerance in matching.
partial
Activates similar matching (close-distance strings) for parts (substrings) of string. The following values are accepted:
0 for no partial distance matching
1 for one-step matching, which means that only substrings of the same length as pattern are extracted from string for matching
2 for two-step matching, which means that substrings of the same length as pattern as well as strings with a slightly wider range are extracted from string for matching
Default value is 0. See 'Details' for more information.
verbose
Logical; if TRUE, the progress bar is displayed when computing the distance matrix. Default is FALSE, hence the bar is hidden.
A numeric vector with the index positions of the elements in string that partially match or are similar to pattern. Returns -1 if no match was found.
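Because -1 is returned when nothing matches, using the result directly as a subscript would drop the first element of string instead of selecting nothing. A minimal sketch of a guard follows; the helper name safe_subset is hypothetical and not part of the documented function.

```r
# Hypothetical helper: subset `string` by a result vector that uses
# -1 as the "no match" marker described above.
safe_subset <- function(string, idx) {
  if (length(idx) == 1 && idx == -1) {
    character(0)   # no match: return an empty character vector
  } else {
    string[idx]    # regular positive indices
  }
}

string <- c("Hello", "Old")
no_hit <- safe_subset(string, -1)  # empty result instead of string[-1]
one_hit <- safe_subset(string, 2)  # selects the second element
```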
Computation Details
Fuzzy string matching is based on regular expressions, in particular grep(pattern = "(<pattern>){~<precision>}", x = string). This means that precision indicates the number of chars inside pattern that may differ in string for it still to be considered "matching". The higher precision is, the more tolerant the search is (i.e. it yields more possible matches). Furthermore, the higher the value for partial is, the more matches may be found.
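As a minimal sketch of the expression described above, the approximate-match pattern can be assembled with sprintf() and handed to grep(). The helper name fuzzy_pattern is hypothetical, and ignore.case = TRUE is an assumption inferred from the examples (where "hel" matches "Hello"); the function's actual internals may differ. The pattern is inserted verbatim, so regex metacharacters in it are not escaped in this sketch.

```r
# Hypothetical helper: build "(<pattern>){~<precision>}" as a string.
fuzzy_pattern <- function(pattern, precision = 2) {
  sprintf("(%s){~%d}", pattern, as.integer(precision))
}

string <- c("Hello", "Helo", "Hole", "Apple")
rx <- fuzzy_pattern("hel", precision = 2)  # "(hel){~2}"

# Assumption: case-insensitive matching, as the examples suggest.
idx <- grep(pattern = rx, x = string, ignore.case = TRUE)
```

Raising precision widens the tolerance in the braces, which is why larger values yield more candidate matches.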
Partial Distance Matching
For partial = 1, a substring of the same length as pattern (in characters) is extracted from string, starting at position 0 in string and moving on until the end of string is reached. Each substring is matched against pattern, and results with a maximum distance of precision are considered as "matching". If partial = 2, the range of the extracted substring is increased by 2, i.e. the extracted substring is two chars longer, and so on.
This function does not return the position of a matching string inside another string, but the element's index in the string vector where a (partial) match with pattern was found. Thus, searching for "abc" in a string "this is abc" will not return 9 (the start position of the substring), but 1 (the element index, which is always 1 if string only has one element).
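The windowing idea behind partial matching can be illustrated with base R. This is a sketch under stated assumptions, not the function's own code: the helper window_match and its extra argument are hypothetical, windows start at R's first character position, and utils::adist() (generalized Levenshtein distance) is swapped in for the regex-based distance, so results can differ from str_find().

```r
# Hypothetical sketch: slide a window over each element of `string`,
# compare every window with `pattern`, and report element indices.
window_match <- function(string, pattern, precision = 2, extra = 0) {
  width <- nchar(pattern) + extra  # extra = 2 mimics the wider range
  hits <- vapply(string, function(s) {
    n_starts <- max(nchar(s) - width + 1, 1)
    starts <- seq_len(n_starts)
    windows <- substring(s, starts, starts + width - 1)
    # Assumption: edit distance stands in for the regex distance.
    d <- utils::adist(windows, pattern, ignore.case = TRUE)
    any(d <= precision)
  }, logical(1), USE.NAMES = FALSE)
  idx <- which(hits)
  if (length(idx) == 0) -1 else idx  # -1 marks "no match"
}

res <- window_match(c("this is abc", "xyz"), "abc", precision = 0)
```

Here res holds the element index of "this is abc" within the input vector, not the character position at which "abc" starts.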
string <- c("Hello", "Helo", "Hole", "Apple", "Ape", "New", "Old", "System", "Systemic")
str_find(string, "hel") # partial match
#> [1] 1 2 3 5 6 7
str_find(string, "stem") # partial match
#> [1] 8 9
str_find(string, "R") # no match
#> [1] -1
str_find(string, "saste") # similarity to "System"
#> [1] 8
# finds two indices, because partial matching now
# also applies to "Systemic"
str_find(string, "sytsme", partial = 1)
#> [1] 8 9
# finds partial matching of similarity
str_find("We are Sex Pistols!", "postils")
#> [1] 1