Recode variables

rec() recodes values of variables. Variables are selected by name, by column position, or with select helpers (see the documentation of the ... argument). rec_if() is a scoped variant of rec() that recodes only those variables for which predicate returns TRUE.

rec(
  x,
  ...,
  rec,
  as.num = TRUE,
  var.label = NULL,
  val.labels = NULL,
  append = TRUE,
  suffix = "_r",
  to.factor = !as.num
)

rec_if(
  x,
  predicate,
  rec,
  as.num = TRUE,
  var.label = NULL,
  val.labels = NULL,
  append = TRUE,
  suffix = "_r",
  to.factor = !as.num
)

Arguments

x

A vector or data frame.

...

Optional, unquoted names of variables that should be selected for further processing. Required if x is a data frame (and not a vector) and only selected variables from x should be processed. You may also use functions like : or tidyselect's select-helpers. See 'Examples' or package-vignette.

rec

String with recode pairs of old and new values. See 'Details' for examples. rec_pattern is a convenience function that creates recode strings for grouping variables.

as.num

Logical, if TRUE, return value will be numeric, not a factor.

var.label

Optional string, to set variable label attribute for the returned variable (see vignette Labelled Data and the sjlabelled-Package). If NULL (default), variable label attribute of x will be used (if present). If empty, variable label attributes will be removed.

val.labels

Optional character vector, to set value label attributes of recoded variable (see vignette Labelled Data and the sjlabelled-Package). If NULL (default), no value labels will be set. Value labels can also be directly defined in the rec-syntax, see 'Details'.

append

Logical, if TRUE (the default) and x is a data frame, x including the new variables as additional columns is returned; if FALSE, only the new variables are returned.

suffix

String value, will be appended to variable (column) names of x, if x is a data frame. If x is not a data frame, this argument will be ignored. The default value to suffix column names in a data frame depends on the function call:

recoded variables (rec()) will be suffixed with "_r"
recoded variables (recode_to()) will be suffixed with "_r0"
dichotomized variables (dicho()) will be suffixed with "_d"
grouped variables (split_var()) will be suffixed with "_g"
grouped variables (group_var()) will be suffixed with "_gr"
standardized variables (std()) will be suffixed with "_z"
centered variables (center()) will be suffixed with "_c"

If suffix = "" and append = TRUE, existing variables that have been recoded/transformed will be overwritten.

to.factor

Logical, the inverse of as.num (it defaults to !as.num). If TRUE, return value will be a factor, not numeric.
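The effect of as.num on the return type can be sketched as follows. This is a minimal example that assumes the sjmisc package is installed; the vector x is made up for illustration.

```r
library(sjmisc)

# hypothetical input vector, not part of the package data
x <- c(1, 2, 3, 4)

# default (as.num = TRUE): the recoded vector is numeric
num <- rec(x, rec = "1,2=1; 3,4=2")
is.numeric(num)

# as.num = FALSE: the recoded vector is returned as a factor
fac <- rec(x, rec = "1,2=1; 3,4=2", as.num = FALSE)
is.factor(fac)
```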

predicate

A predicate function to be applied to the columns. The variables for which predicate returns TRUE are selected.
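As a small sketch of how predicate selects columns, the following uses base R's is.numeric() as the predicate. It assumes sjmisc is installed; the data frame df and its column names are hypothetical.

```r
library(sjmisc)

# hypothetical data frame: one character column, one numeric column
df <- data.frame(
  id = c("a", "b", "c"),
  score = c(1, 2, 3),
  stringsAsFactors = FALSE
)

# is.numeric() returns TRUE only for `score`, so only `score` is recoded
out <- rec_if(df, predicate = is.numeric, rec = "1,2=0; 3=1", append = FALSE)
names(out)
out$score_r
```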

Value

x with recoded categories. If x is a data frame, for append = TRUE, x including the recoded variables as new columns is returned; if append = FALSE, only the recoded variables will be returned. If append = TRUE and suffix = "", recoded variables will replace (overwrite) existing variables.
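The interplay of append and suffix might look like this on a small data frame. This is a sketch assuming sjmisc is installed; df is made-up example data.

```r
library(sjmisc)

# hypothetical example data
df <- data.frame(a = c(1, 2, 3), b = c(3, 2, 1))

# default (append = TRUE, suffix = "_r"): original columns plus "a_r"
with_new <- rec(df, a, rec = "rev")
names(with_new)

# append = FALSE: only the recoded column is returned
only_new <- rec(df, a, rec = "rev", append = FALSE)
names(only_new)

# suffix = "" with append = TRUE: "a" is overwritten, no column is added
overwritten <- rec(df, a, rec = "rev", suffix = "")
ncol(overwritten)
```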

Details

The rec string has the following syntax:

recode pairs: each recode pair has to be separated by a ;, e.g. rec = "1=1; 2=4; 3=2; 4=3"
multiple values: multiple old values that should be recoded into a new single value may be separated with comma, e.g. "1,2=1; 3,4=2"
value range: a value range is indicated by a colon, e.g. "1:4=1; 5:8=2" (recodes all values from 1 to 4 into 1, and from 5 to 8 into 2)
value range for doubles: for double vectors (with fractional part), all values within the specified range are recoded; e.g. 1:2.5=1;2.6:3=2 recodes 1 to 2.5 into 1 and 2.6 to 3 into 2, but 2.55 would not be recoded (since it's not included in any of the specified ranges)
"min" and "max": minimum and maximum values are indicated by min (or lo) and max (or hi), e.g. "min:4=1; 5:max=2" (recodes all values from the minimum value of x to 4 into 1, and from 5 to the maximum value of x into 2)
"else": all other values, which have not been specified yet, are indicated by else, e.g. "3=1; 1=2; else=3" (recodes 3 into 1, 1 into 2 and all other values into 3)
"copy": the "else"-token can be combined with copy, indicating that all remaining, not yet recoded values should stay the same (are copied from the original value), e.g. "3=1; 1=2; else=copy" (recodes 3 into 1, 1 into 2 and all other values like 2, 4 or 5 etc. will not be recoded, but copied, see 'Examples')
NA's: NA values are allowed both as old and new value, e.g. "NA=1; 3:5=NA" (recodes all NA into 1, and all values from 3 to 5 into NA in the new variable)
"rev": "rev" is a special token that reverses the value order (see 'Examples')
direct value labelling: value labels for new values can be assigned inside the recode pattern by writing the value label in square brackets after defining the new value in a recode pair, e.g. "15:30=1 [young aged]; 31:55=2 [middle aged]; 56:max=3 [old aged]". See 'Examples'.
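Several of the tokens listed above can be tried on short vectors. The following is a minimal sketch that assumes sjmisc is installed; x and y are made-up inputs.

```r
library(sjmisc)

# hypothetical input with a missing value
x <- c(1, 2, 3, 4, 5, NA)

# multiple values, a value range and "else" in one recode string
grouped <- rec(x, rec = "1,2=1; 3:4=2; else=3")

# "min"/"max" tokens, plus NA as an old value
bounded <- rec(x, rec = "min:2=1; 3:max=2; NA=9")

# ranges on doubles: 2.55 lies in neither range and is not recoded
y <- c(1, 2.5, 2.55, 2.6, 3)
ranged <- rec(y, rec = "1:2.5=1; 2.6:3=2")

grouped
bounded
ranged
```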

Note

Please note the following behaviours of the function:

The "else"-token should always be the last argument in the rec-string.
Non-matching values will be set to NA, unless captured by the "else"-token.
Tagged NA values (see tagged_na) and their value labels will be preserved when copying NA values to the recoded vector with "else=copy".
Variable label attributes (see, for instance, get_label) are preserved (unless changed via var.label-argument), however, value label attributes are removed (except for "rev", where present value labels will be automatically reversed as well). Use val.labels-argument to add labels for recoded values.
If x is a data frame, all variables should have the same categories or value range (otherwise NAs are produced, see the second point above).
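The difference between unmatched values and "else=copy" can be sketched on a short vector. The example assumes sjmisc is installed; x is made up for illustration.

```r
library(sjmisc)

# hypothetical input vector
x <- c(1, 2, 3, 4)

# 3 and 4 match no recode pair and are therefore set to NA
dropped <- rec(x, rec = "1=10; 2=20")

# "else=copy" as the last pair keeps the unmatched values unchanged
kept <- rec(x, rec = "1=10; 2=20; else=copy")

dropped
kept
```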

Examples

data(efc)
table(efc$e42dep, useNA = "always")
#> 
#>    1    2    3    4 <NA> 
#>   66  225  306  304    7 

# replace NA with 5
table(rec(efc$e42dep, rec = "1=1;2=2;3=3;4=4;NA=5"), useNA = "always")
#> 
#>    1    2    3    4    5 <NA> 
#>   66  225  306  304    7    0 

# recode 1 to 2 into 1 and 3 to 4 into 2
table(rec(efc$e42dep, rec = "1,2=1; 3,4=2"), useNA = "always")
#> 
#>    1    2 <NA> 
#>  291  610    7 

# set value labels for the recoded values; the variable label is preserved automatically
library(dplyr)
efc %>%
  select(e42dep) %>%
  rec(rec = "1,2=1; 3,4=2",
      val.labels = c("low dependency", "high dependency")) %>%
  frq()
#> elder's dependency (e42dep) <numeric> 
#> # total N=908 valid N=901 mean=2.94 sd=0.94
#> 
#> Value |                Label |   N | Raw % | Valid % | Cum. %
#> -------------------------------------------------------------
#>     1 |          independent |  66 |  7.27 |    7.33 |   7.33
#>     2 |   slightly dependent | 225 | 24.78 |   24.97 |  32.30
#>     3 | moderately dependent | 306 | 33.70 |   33.96 |  66.26
#>     4 |   severely dependent | 304 | 33.48 |   33.74 | 100.00
#>  <NA> |                 <NA> |   7 |  0.77 |    <NA> |   <NA>
#> 
#> elder's dependency (e42dep_r) <numeric> 
#> # total N=908 valid N=901 mean=1.68 sd=0.47
#> 
#> Value |           Label |   N | Raw % | Valid % | Cum. %
#> --------------------------------------------------------
#>     1 |  low dependency | 291 | 32.05 |   32.30 |  32.30
#>     2 | high dependency | 610 | 67.18 |   67.70 | 100.00
#>  <NA> |            <NA> |   7 |  0.77 |    <NA> |   <NA>

# works with mutate
efc %>%
  select(e42dep, e17age) %>%
  mutate(dependency_rev = rec(e42dep, rec = "rev")) %>%
  head()
#>   e42dep e17age dependency_rev
#> 1      3     83              2
#> 2      3     88              2
#> 3      3     82              2
#> 4      4     67              1
#> 5      4     84              1
#> 6      4     85              1

# recode 1 to 3 into 1 and 4 into 2
table(rec(efc$e42dep, rec = "min:3=1; 4=2"), useNA = "always")
#> 
#>    1    2 <NA> 
#>  597  304    7 

# recode 2 to 1 and all others into 2
table(rec(efc$e42dep, rec = "2=1; else=2"), useNA = "always")
#> 
#>    1    2 <NA> 
#>  225  676    7 

# reverse value order
table(rec(efc$e42dep, rec = "rev"), useNA = "always")
#> 
#>    1    2    3    4 <NA> 
#>  304  306  225   66    7 

# recode only selected values, copy remaining
table(efc$e15relat)
#> 
#>   1   2   3   4   5   6   7   8 
#> 171 473  29  85  23  22   6  92 
table(rec(efc$e15relat, rec = "1,2,4=1; else=copy"))
#> 
#>   1   3   5   6   7   8 
#> 729  29  23  22   6  92 

# recode variables with same category in a data frame
head(efc[, 6:9])
#>   c82cop1 c83cop2 c84cop3 c85cop4
#> 1       3       2       2       2
#> 2       3       3       3       3
#> 3       2       2       1       4
#> 4       4       1       3       1
#> 5       3       2       1       2
#> 6       2       2       3       3
head(rec(efc[, 6:9], rec = "1=10;2=20;3=30;4=40"))
#>   c82cop1 c83cop2 c84cop3 c85cop4 c82cop1_r c83cop2_r c84cop3_r c85cop4_r
#> 1       3       2       2       2        30        20        20        20
#> 2       3       3       3       3        30        30        30        30
#> 3       2       2       1       4        20        20        10        40
#> 4       4       1       3       1        40        10        30        10
#> 5       3       2       1       2        30        20        10        20
#> 6       2       2       3       3        20        20        30        30

# recode multiple variables and set value labels via recode-syntax
dummy <- rec(
  efc, c160age, e17age,
  rec = "15:30=1 [young]; 31:55=2 [middle]; 56:max=3 [old]",
  append = FALSE
)
frq(dummy)
#> carer' age (c160age_r) <numeric> 
#> # total N=908 valid N=901 mean=2.40 sd=0.59
#> 
#> Value |  Label |   N | Raw % | Valid % | Cum. %
#> -----------------------------------------------
#>     1 |  young |  48 |  5.29 |    5.33 |   5.33
#>     2 | middle | 442 | 48.68 |   49.06 |  54.38
#>     3 |    old | 411 | 45.26 |   45.62 | 100.00
#>  <NA> |   <NA> |   7 |  0.77 |    <NA> |   <NA>
#> 
#> elder' age (e17age_r) <numeric> 
#> # total N=908 valid N=891 mean=3.00 sd=0.00
#> 
#> Value |  Label |   N | Raw % | Valid % | Cum. %
#> -----------------------------------------------
#>     1 |  young |   0 |  0.00 |       0 |      0
#>     2 | middle |   0 |  0.00 |       0 |      0
#>     3 |    old | 891 | 98.13 |     100 |    100
#>  <NA> |   <NA> |  17 |  1.87 |    <NA> |   <NA>

# recode variables with same value-range
lapply(
  rec(
    efc, c82cop1, c83cop2, c84cop3,
    rec = "1,2=1; NA=9; else=copy",
    append = FALSE
  ),
  table,
  useNA = "always"
)
#> $c82cop1_r
#> 
#>    1    3    4    9 <NA> 
#>  100  591  210    7    0 
#> 
#> $c83cop2_r
#> 
#>    1    3    4    9 <NA> 
#>  733  130   39    6    0 
#> 
#> $c84cop3_r
#> 
#>    1    3    4    9 <NA> 
#>  768   82   52    6    0 
#> 

# recode character vector
dummy <- c("M", "F", "F", "X")
rec(dummy, rec = "M=Male; F=Female; X=Refused")
#> [1] "Male"    "Female"  "Female"  "Refused"

# recode numeric to character
rec(efc$e42dep, rec = "1=first;2=2nd;3=third;else=hi") %>% head()
#> [1] "third" "third" "third" "hi"    "hi"    "hi"   

# recode non-numeric factors
data(iris)
table(rec(iris, Species, rec = "setosa=huhu; else=copy", append = FALSE))
#> Species_r
#>       huhu versicolor  virginica 
#>         50         50         50 

# recode floating points
table(rec(
  iris, Sepal.Length, rec = "lo:5=1;5.01:6.5=2;6.501:max=3", append = FALSE
))
#> Sepal.Length_r
#>  1  2  3 
#> 32 88 30 

# preserve tagged NAs
if (require("haven")) {
  x <- labelled(c(1:3, tagged_na("a", "c", "z"), 4:1),
                c("Agreement" = 1, "Disagreement" = 4, "First" = tagged_na("c"),
                  "Refused" = tagged_na("a"), "Not home" = tagged_na("z")))
  # get current value labels
  x
  # recode 2 into 5; Values of tagged NAs are preserved
  rec(x, rec = "2=5;else=copy")
}
#>  [1]  1  5  3 NA NA NA  4  3  5  1
#> attr(,"labels")
#>              1              4          First        Refused       Not home 
#>    "Agreement" "Disagreement"             NA             NA             NA 

# use select-helpers from dplyr-package
out <- rec(
  efc, contains("cop"), c161sex:c175empl,
  rec = "0,1=0; else=1",
  append = FALSE
)
head(out)
#>   c82cop1_r c83cop2_r c84cop3_r c85cop4_r c86cop5_r c87cop6_r c88cop7_r
#> 1         1         1         1         1         0         0         1
#> 2         1         1         1         1         1         0         1
#> 3         1         1         0         1         0         0         0
#> 4         1         0         1         0         0         0         0
#> 5         1         1         0         1         1         1         0
#> 6         1         1         1         1         1         1         1
#>   c89cop8_r c90cop9_r c161sex_r c172code_r c175empl_r
#> 1         1         1         1          1          0
#> 2         1         1         1          1          0
#> 3         1         1         0          0          0
#> 4         1         1         0          1          0
#> 5         1         1         1          1          0
#> 6         0         0         0          1          0

# recode only variables that have a value range from 1-4
p <- function(x) min(x, na.rm = TRUE) > 0 && max(x, na.rm = TRUE) < 5
out <- rec_if(efc, predicate = p, rec = "1:3=1;4=2;else=copy")
head(out)
#>   c12hour e15relat e16sex e17age e42dep c82cop1 c83cop2 c84cop3 c85cop4 c86cop5
#> 1      16        2      2     83      3       3       2       2       2       1
#> 2     148        2      2     88      3       3       3       3       3       4
#> 3      70        1      2     82      3       2       2       1       4       1
#> 4     168        1      2     67      4       4       1       3       1       1
#> 5     168        2      2     84      4       3       2       1       2       2
#> 6      16        2      2     85      4       2       2       3       3       3
#>   c87cop6 c88cop7 c89cop8 c90cop9 c160age c161sex c172code c175empl barthtot
#> 1       1       2       3       3      56       2        2        1       75
#> 2       1       3       2       2      54       2        2        1       75
#> 3       1       1       4       3      80       1        1        0       35
#> 4       1       1       2       4      69       1        2        0        0
#> 5       2       1       4       4      47       2        2        0       25
#> 6       2       2       1       1      56       1        2        1       60
#>   neg_c_7 pos_v_4 quol_5 resttotn tot_sc_e n4pstu nur_pst e16sex_r e42dep_r
#> 1      12      12     14        0        4      0      NA        1        1
#> 2      20      11     10        4        0      0      NA        1        1
#> 3      11      13      7        0        1      2       2        1        1
#> 4      10      15     12        2        0      3       3        1        2
#> 5      12      15     19        2        1      2       2        1        2
#> 6      19       9      8        1        3      2       2        1        2
#>   c82cop1_r c83cop2_r c84cop3_r c85cop4_r c86cop5_r c87cop6_r c88cop7_r
#> 1         1         1         1         1         1         1         1
#> 2         1         1         1         1         2         1         1
#> 3         1         1         1         2         1         1         1
#> 4         2         1         1         1         1         1         1
#> 5         1         1         1         1         1         1         1
#> 6         1         1         1         1         1         1         1
#>   c89cop8_r c90cop9_r c161sex_r c172code_r nur_pst_r
#> 1         1         1         1          1        NA
#> 2         1         1         1          1        NA
#> 3         2         1         1          1         1
#> 4         1         2         1          1         1
#> 5         2         2         1          1         1
#> 6         1         1         1          1         1
