This function groups elements of a string vector (character or string variable) according to the distance ('similarity') between elements. The more similar two string elements are, the higher the chance that they are combined into a group.
group_str(
strings,
precision = 2,
strict = FALSE,
trim.whitespace = TRUE,
remove.empty = TRUE,
verbose = FALSE,
maxdist
)
strings: Character vector with string elements.
precision: Maximum distance ("precision") between two string elements that is allowed for them to be treated as similar or equal. Smaller values mean less tolerance in matching.
strict: Logical; if TRUE, value matching is stricter. See 'Examples'.
trim.whitespace: Logical; if TRUE (default), leading and trailing white spaces are removed from string values.
remove.empty: Logical; if TRUE (default), empty string values are removed from the character vector strings.
verbose: Logical; if TRUE, a progress bar is displayed while computing the distance matrix. Default is FALSE, hence the bar is hidden.
maxdist: Deprecated. Please use precision instead.
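To make the notion of a distance between string elements concrete, here is a minimal sketch using base R's adist(), which computes the generalized Levenshtein (edit) distance. It assumes an edit-distance metric; this page does not state which metric group_str() uses internally. The threshold filter is purely illustrative and is not the function's actual grouping rule.

```r
# Base R only: utils::adist() computes generalized Levenshtein (edit) distance.
# Assumption: an edit-distance metric; the metric used by group_str() is not
# stated in this documentation.
x <- c("Hello", "Helo", "Hole")

# Pairwise distance matrix, labelled with the strings themselves
d <- utils::adist(x)
dimnames(d) <- list(x, x)
d

# Hypothetical threshold for illustration only (not group_str()'s grouping rule)
threshold <- 2

# Keep each pair once (upper triangle) when its distance is within the threshold
close_pairs <- which(d <= threshold & upper.tri(d), arr.ind = TRUE)
data.frame(
  a = x[close_pairs[, "row"]],
  b = x[close_pairs[, "col"]],
  distance = d[close_pairs]
)
```

A pair being within the threshold here does not imply that group_str() would combine it; the documented examples show how precision and strict change the resulting groups.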
A character vector where similar string elements (values) are recoded
into a new, single value. The return value is of the same length as
strings, i.e. grouped elements appear multiple times, so
the count for each grouped string is still available (see 'Examples').
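Because the return value has the same length as the input, it lines up element by element with the original vector. The sketch below illustrates this by pairing the two vectors in a lookup table. To stay runnable without the package, the grouped values are copied from the documented default output rather than computed; the name lookup is an illustrative choice.

```r
oldstring <- c("Hello", "Helo", "Hole", "Apple",
               "Ape", "New", "Old", "System", "Systemic")

# Grouped values copied from the documented output of group_str(oldstring),
# so this sketch does not call the package itself
newstring <- c("Hello, Helo", "Hello, Helo", "Hole", "Ape, Apple",
               "Ape, Apple", "New", "Old", "System, Systemic",
               "System, Systemic")

# Same length, so the vectors align one-to-one
stopifnot(length(newstring) == length(oldstring))

# Lookup table: which original value ended up in which group
lookup <- data.frame(original = oldstring, grouped = newstring,
                     stringsAsFactors = FALSE)
lookup

# Group sizes follow directly from the repeated group labels
table(lookup$grouped)
```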
oldstring <- c("Hello", "Helo", "Hole", "Apple",
"Ape", "New", "Old", "System", "Systemic")
newstring <- group_str(oldstring)
# see result
newstring
#> [1] "Hello, Helo" "Hello, Helo" "Hole" "Ape, Apple"
#> [5] "Ape, Apple" "New" "Old" "System, Systemic"
#> [9] "System, Systemic"
# count for each group
table(newstring)
#> newstring
#>       Ape, Apple      Hello, Helo             Hole              New
#>                2                2                1                1
#>              Old System, Systemic
#>                1                2
# print table to compare original and grouped string
frq(oldstring)
#> x <character>
#> # total N=9 valid N=9 mean=5.00 sd=2.74
#>
#> Value | N | Raw % | Valid % | Cum. %
#> ---------------------------------------
#> Ape | 1 | 11.11 | 11.11 | 11.11
#> Apple | 1 | 11.11 | 11.11 | 22.22
#> Hello | 1 | 11.11 | 11.11 | 33.33
#> Helo | 1 | 11.11 | 11.11 | 44.44
#> Hole | 1 | 11.11 | 11.11 | 55.56
#> New | 1 | 11.11 | 11.11 | 66.67
#> Old | 1 | 11.11 | 11.11 | 77.78
#> System | 1 | 11.11 | 11.11 | 88.89
#> Systemic | 1 | 11.11 | 11.11 | 100.00
#> <NA> | 0 | 0.00 | <NA> | <NA>
frq(newstring)
#> x <character>
#> # total N=9 valid N=9 mean=3.33 sd=2.00
#>
#> Value | N | Raw % | Valid % | Cum. %
#> -----------------------------------------------
#> Ape, Apple | 2 | 22.22 | 22.22 | 22.22
#> Hello, Helo | 2 | 22.22 | 22.22 | 44.44
#> Hole | 1 | 11.11 | 11.11 | 55.56
#> New | 1 | 11.11 | 11.11 | 66.67
#> Old | 1 | 11.11 | 11.11 | 77.78
#> System, Systemic | 2 | 22.22 | 22.22 | 100.00
#> <NA> | 0 | 0.00 | <NA> | <NA>
# larger groups
newstring <- group_str(oldstring, precision = 3)
frq(oldstring)
#> x <character>
#> # total N=9 valid N=9 mean=5.00 sd=2.74
#>
#> Value | N | Raw % | Valid % | Cum. %
#> ---------------------------------------
#> Ape | 1 | 11.11 | 11.11 | 11.11
#> Apple | 1 | 11.11 | 11.11 | 22.22
#> Hello | 1 | 11.11 | 11.11 | 33.33
#> Helo | 1 | 11.11 | 11.11 | 44.44
#> Hole | 1 | 11.11 | 11.11 | 55.56
#> New | 1 | 11.11 | 11.11 | 66.67
#> Old | 1 | 11.11 | 11.11 | 77.78
#> System | 1 | 11.11 | 11.11 | 88.89
#> Systemic | 1 | 11.11 | 11.11 | 100.00
#> <NA> | 0 | 0.00 | <NA> | <NA>
frq(newstring)
#> x <character>
#> # total N=9 valid N=9 mean=2.44 sd=1.13
#>
#> Value | N | Raw % | Valid % | Cum. %
#> ------------------------------------------------
#> Ape, Apple | 2 | 22.22 | 22.22 | 22.22
#> Hello, Helo, Hole | 3 | 33.33 | 33.33 | 55.56
#> New, Old | 2 | 22.22 | 22.22 | 77.78
#> System, Systemic | 2 | 22.22 | 22.22 | 100.00
#> <NA> | 0 | 0.00 | <NA> | <NA>
# be more strict with matching pairs
newstring <- group_str(oldstring, precision = 3, strict = TRUE)
frq(oldstring)
#> x <character>
#> # total N=9 valid N=9 mean=5.00 sd=2.74
#>
#> Value | N | Raw % | Valid % | Cum. %
#> ---------------------------------------
#> Ape | 1 | 11.11 | 11.11 | 11.11
#> Apple | 1 | 11.11 | 11.11 | 22.22
#> Hello | 1 | 11.11 | 11.11 | 33.33
#> Helo | 1 | 11.11 | 11.11 | 44.44
#> Hole | 1 | 11.11 | 11.11 | 55.56
#> New | 1 | 11.11 | 11.11 | 66.67
#> Old | 1 | 11.11 | 11.11 | 77.78
#> System | 1 | 11.11 | 11.11 | 88.89
#> Systemic | 1 | 11.11 | 11.11 | 100.00
#> <NA> | 0 | 0.00 | <NA> | <NA>
frq(newstring)
#> x <character>
#> # total N=9 valid N=9 mean=2.89 sd=1.54
#>
#> Value | N | Raw % | Valid % | Cum. %
#> -----------------------------------------------
#> Ape, Apple | 2 | 22.22 | 22.22 | 22.22
#> Hello, Helo | 2 | 22.22 | 22.22 | 44.44
#> Hole, Old | 2 | 22.22 | 22.22 | 66.67
#> New | 1 | 11.11 | 11.11 | 77.78
#> System, Systemic | 2 | 22.22 | 22.22 | 100.00
#> <NA> | 0 | 0.00 | <NA> | <NA>