This function groups elements of a string vector (character or string variable) according to the elements' distance ('similarity'). The more similar two string elements are, the higher the chance that they are combined into a group.
group_str(
  strings,
  precision = 2,
  strict = FALSE,
  trim.whitespace = TRUE,
  remove.empty = TRUE,
  verbose = FALSE,
  maxdist
)
Character vector with string elements.
Maximum distance ("precision") between two string elements at which they are still treated as similar or equal. Smaller values mean less tolerance in matching.
Logical; if TRUE, value matching is stricter. See 'Examples'.
Logical; if TRUE (default), leading and trailing white spaces will be removed from string values.
Logical; if TRUE (default), empty string values will be removed from the character vector strings.
Logical; if TRUE, a progress bar is displayed while the distance matrix is computed. Default is FALSE, hence the bar is hidden.
Deprecated. Please use precision instead.
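The preprocessing and distance arguments above can be illustrated with base R alone. The sketch below is only an approximation: it assumes that trimming and removing empty values behave like `trimws()` and `nzchar()`, and it uses the Levenshtein distance from `utils::adist()` as a stand-in, which is not necessarily the metric or matching rule that group_str() applies internally. The input vector `raw` and the variable `threshold` are made up for illustration.

```r
# Hypothetical input with stray whitespace and one empty element.
raw <- c(" Hello", "Helo ", "", "Hole", "System", "Systemic")

# Roughly what trim.whitespace = TRUE and remove.empty = TRUE imply
# (assumption: approximated here with base R functions).
clean <- trimws(raw)            # strip leading and trailing whitespace
clean <- clean[nzchar(clean)]   # drop empty strings

# Pairwise edit (Levenshtein) distances from base R's utils::adist().
# Assumption: a stand-in metric; group_str() may use a different one.
d <- utils::adist(clean)
dimnames(d) <- list(clean, clean)

# List each pair once whose distance does not exceed the threshold.
threshold <- 2
close_pairs <- which(d <= threshold & upper.tri(d), arr.ind = TRUE)
data.frame(
  a = clean[close_pairs[, "row"]],
  b = clean[close_pairs[, "col"]],
  distance = d[close_pairs]
)
```

Lowering `threshold` shrinks the set of candidate pairs, which mirrors the statement that smaller precision values mean less tolerance in matching.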
A character vector where similar string elements (values) are recoded into a new, single value. The return value has the same length as strings, i.e. grouped elements appear multiple times, so the count for each grouped string is still available (see 'Examples').
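To make the shape of the return value concrete, here is a minimal base R sketch that recodes similar strings into a shared label while keeping the input length. It is not the implementation of group_str(): the helper `group_by_distance()` and its `max_dist` argument are hypothetical, and the grouping is done with single-linkage hierarchical clustering on Levenshtein distances, a swapped-in technique chosen only because it is available in base R. Its groups will not necessarily match those produced by group_str().

```r
x <- c("Hello", "Helo", "Hole", "Apple", "Ape",
       "New", "Old", "System", "Systemic")

# Hypothetical helper (not part of any package): cluster strings by edit
# distance and return one combined label per input element.
group_by_distance <- function(strings, max_dist = 1) {
  d <- utils::adist(strings)                        # Levenshtein distances
  hc <- stats::hclust(stats::as.dist(d), method = "single")
  grp <- stats::cutree(hc, h = max_dist)            # cluster id per element
  # Build one label per cluster, e.g. "Hello, Helo".
  labels <- vapply(
    split(strings, grp),
    function(g) paste(sort(unique(g)), collapse = ", "),
    character(1)
  )
  # Map the labels back so the result has the same length as the input.
  unname(labels[as.character(grp)])
}

grouped <- group_by_distance(x, max_dist = 1)
length(grouped) == length(x)   # one label per original element
table(grouped)                 # counts per group remain available
```

Because every input element keeps its own position in the result, the recoded vector can be stored next to the original one, for instance as a new column in a data frame, and tabulated directly.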
oldstring <- c("Hello", "Helo", "Hole", "Apple",
"Ape", "New", "Old", "System", "Systemic")
newstring <- group_str(oldstring)
# see result
newstring
#> [1] "Hello, Helo" "Hello, Helo" "Hole" "Ape, Apple"
#> [5] "Ape, Apple" "New" "Old" "System, Systemic"
#> [9] "System, Systemic"
# count for each group
table(newstring)
#> newstring
#> Ape, Apple Hello, Helo Hole New
#> 2 2 1 1
#> Old System, Systemic
#> 1 2
# print table to compare original and grouped string
frq(oldstring)
#> x <character>
#> # total N=9 valid N=9 mean=5.00 sd=2.74
#>
#> Value | N | Raw % | Valid % | Cum. %
#> ---------------------------------------
#> Ape | 1 | 11.11 | 11.11 | 11.11
#> Apple | 1 | 11.11 | 11.11 | 22.22
#> Hello | 1 | 11.11 | 11.11 | 33.33
#> Helo | 1 | 11.11 | 11.11 | 44.44
#> Hole | 1 | 11.11 | 11.11 | 55.56
#> New | 1 | 11.11 | 11.11 | 66.67
#> Old | 1 | 11.11 | 11.11 | 77.78
#> System | 1 | 11.11 | 11.11 | 88.89
#> Systemic | 1 | 11.11 | 11.11 | 100.00
#> <NA> | 0 | 0.00 | <NA> | <NA>
frq(newstring)
#> x <character>
#> # total N=9 valid N=9 mean=3.33 sd=2.00
#>
#> Value | N | Raw % | Valid % | Cum. %
#> -----------------------------------------------
#> Ape, Apple | 2 | 22.22 | 22.22 | 22.22
#> Hello, Helo | 2 | 22.22 | 22.22 | 44.44
#> Hole | 1 | 11.11 | 11.11 | 55.56
#> New | 1 | 11.11 | 11.11 | 66.67
#> Old | 1 | 11.11 | 11.11 | 77.78
#> System, Systemic | 2 | 22.22 | 22.22 | 100.00
#> <NA> | 0 | 0.00 | <NA> | <NA>
# larger groups
newstring <- group_str(oldstring, precision = 3)
frq(oldstring)
#> x <character>
#> # total N=9 valid N=9 mean=5.00 sd=2.74
#>
#> Value | N | Raw % | Valid % | Cum. %
#> ---------------------------------------
#> Ape | 1 | 11.11 | 11.11 | 11.11
#> Apple | 1 | 11.11 | 11.11 | 22.22
#> Hello | 1 | 11.11 | 11.11 | 33.33
#> Helo | 1 | 11.11 | 11.11 | 44.44
#> Hole | 1 | 11.11 | 11.11 | 55.56
#> New | 1 | 11.11 | 11.11 | 66.67
#> Old | 1 | 11.11 | 11.11 | 77.78
#> System | 1 | 11.11 | 11.11 | 88.89
#> Systemic | 1 | 11.11 | 11.11 | 100.00
#> <NA> | 0 | 0.00 | <NA> | <NA>
frq(newstring)
#> x <character>
#> # total N=9 valid N=9 mean=2.44 sd=1.13
#>
#> Value | N | Raw % | Valid % | Cum. %
#> ------------------------------------------------
#> Ape, Apple | 2 | 22.22 | 22.22 | 22.22
#> Hello, Helo, Hole | 3 | 33.33 | 33.33 | 55.56
#> New, Old | 2 | 22.22 | 22.22 | 77.78
#> System, Systemic | 2 | 22.22 | 22.22 | 100.00
#> <NA> | 0 | 0.00 | <NA> | <NA>
# be more strict with matching pairs
newstring <- group_str(oldstring, precision = 3, strict = TRUE)
frq(oldstring)
#> x <character>
#> # total N=9 valid N=9 mean=5.00 sd=2.74
#>
#> Value | N | Raw % | Valid % | Cum. %
#> ---------------------------------------
#> Ape | 1 | 11.11 | 11.11 | 11.11
#> Apple | 1 | 11.11 | 11.11 | 22.22
#> Hello | 1 | 11.11 | 11.11 | 33.33
#> Helo | 1 | 11.11 | 11.11 | 44.44
#> Hole | 1 | 11.11 | 11.11 | 55.56
#> New | 1 | 11.11 | 11.11 | 66.67
#> Old | 1 | 11.11 | 11.11 | 77.78
#> System | 1 | 11.11 | 11.11 | 88.89
#> Systemic | 1 | 11.11 | 11.11 | 100.00
#> <NA> | 0 | 0.00 | <NA> | <NA>
frq(newstring)
#> x <character>
#> # total N=9 valid N=9 mean=2.89 sd=1.54
#>
#> Value | N | Raw % | Valid % | Cum. %
#> -----------------------------------------------
#> Ape, Apple | 2 | 22.22 | 22.22 | 22.22
#> Hello, Helo | 2 | 22.22 | 22.22 | 44.44
#> Hole, Old | 2 | 22.22 | 22.22 | 66.67
#> New | 1 | 11.11 | 11.11 | 77.78
#> System, Systemic | 2 | 22.22 | 22.22 | 100.00
#> <NA> | 0 | 0.00 | <NA> | <NA>