vignettes/intro_sjlabelled.Rmd
This package provides functions to read and write data between R and other statistical software packages such as SPSS, SAS or Stata, and to work with labelled data. This includes easy ways to get and set label attributes, to convert labelled vectors into factors (and vice versa), and to deal with multiple declared missing values.
This vignette gives an overview of functions to work with labelled data.
Labelled data (or labelled vectors) is a common data structure in other statistical environments to store meta-information about variables, like variable names, value labels or multiple defined missing values.
Labelled data not only extends R’s capabilities to deal with proper
value and variable labels, but also makes it possible to represent
different types of missing values, as in other statistical software
packages. Regular R missing values cannot distinguish between multiple
declared missings the way ‘SPSS’ or ‘SAS’ do. However, the haven-package
introduced tagged_na values, which can do this. Tagged NA’s work exactly
like regular R missing values, except that they store one additional
byte of information: a tag, which is usually a letter (“a” to “z”) but
may also be a digit character (“0” to “9”). This makes it possible to
distinguish different kinds of missing values.
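The tagging mechanism described above can be tried directly with haven. The following is a minimal sketch, assuming the haven package is installed; the vector name v is made up for illustration.

```r
library(haven)

# a numeric vector with two tagged and one regular missing value
v <- c(1, 2, tagged_na("a"), tagged_na("z"), NA)

# all three are treated as ordinary missing values by base R
is.na(v)

# na_tag() extracts the tag; non-missing values and regular NAs give NA
na_tag(v)

# is_tagged_na() checks for tagged NAs, optionally for one specific tag
is_tagged_na(v)
is_tagged_na(v, "a")
```

Because the tag is stored inside the missing value itself, functions that are not aware of tags simply see NA.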
Functions of sjlabelled do not necessarily require vectors of class
labelled or haven_labelled. The labelled class, implemented by the
packages haven and labelled, may cause trouble with other packages, so
it is intended only as an intermediate data structure that should be
converted to common R classes. However, coercing a labelled vector to
other classes (like factor or numeric) typically means that meta
information like value and variable label attributes is lost. In fact,
vectors that are not of class labelled do not need to drop these
attributes. Functions like lm() simply copy these attributes to the data
that is included in the returned object. Packages like sjPlot support
labelled data for easily annotated data visualization. sjlabelled
supports working with labelled data and offers functions to benefit from
these features.
Note: Since version 2.0 of the haven-package, the class attribute
labelled has been renamed to haven_labelled to avoid conflicts with the
Hmisc-package.
The labelled-package is intended to support labelled / haven_labelled
metadata structures, thus the data structure of labelled vectors in
haven and labelled is the same.
Labelled data in this format stores information about value labels,
variable names and multiple defined missing values. However, variable
names are only part of this information if data was imported with one of
haven’s read-functions. Adding a variable label attribute is (at least
up to version 1.0.0) not possible via the labelled()-constructor method.
library(haven)
x <- labelled(
  c(1:3, tagged_na("a", "c", "z"), 4:1),
  c("Agreement" = 1, "Disagreement" = 4, "First" = tagged_na("c"),
    "Refused" = tagged_na("a"), "Not home" = tagged_na("z"))
)
print(x)
#> <labelled<double>[10]>
#> [1] 1 2 3 NA(a) NA(c) NA(z) 4 3 2 1
#>
#> Labels:
#> value label
#> 1 Agreement
#> 4 Disagreement
#> NA(c) First
#> NA(a) Refused
#> NA(z) Not home
A labelled vector can be either a numeric or a character vector.
Conversion to factors copies the value labels as factor levels, but
drops the label attributes and missing information:
is.na(x)
#> [1] FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE FALSE
as_factor(x)
#> [1] Agreement 2 3 Refused First
#> [6] Not home Disagreement 3 2 Agreement
#> Levels: Agreement 2 3 Disagreement Refused First Not home
is.na(as_factor(x))
#> [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
sjlabelled supports label attributes in haven-style (label and labels).
You’re not restricted to the labelled class for vectors when working
with sjlabelled and labelled data. Hence, you can have vectors of common
R classes and still use information like variable or value labels.
library(sjlabelled)
# sjlabelled-sample data, an atomic vector with label attributes
data(efc)
str(efc$e16sex)
#> num [1:908] 2 2 2 2 2 2 1 2 2 2 ...
#> - attr(*, "label")= chr "elder's gender"
#> - attr(*, "labels")= Named num [1:2] 1 2
#> ..- attr(*, "names")= chr [1:2] "male" "female"
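To illustrate that nothing beyond these two attributes is needed, here is a minimal sketch that builds a comparable vector by hand and reads it back with sjlabelled. The vector sex and its values are made up for illustration; only the attribute names label and labels come from the text above.

```r
library(sjlabelled)

# a plain numeric vector with haven-style attributes set via base R
sex <- c(2, 2, 1, 2, 1)
attr(sex, "label") <- "elder's gender"
attr(sex, "labels") <- c(male = 1, female = 2)

# still an ordinary numeric vector, no labelled class involved
class(sex)

# sjlabelled reads the variable label and the value labels
get_label(sex)
get_labels(sex)
```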
get_labels() is a generic function that returns the value labels of a
vector or data frame.
get_labels(efc$e42dep)
#> [1] "independent" "slightly dependent" "moderately dependent"
#> [4] "severely dependent"
You can prefix the value labels with the associated values or return
them as named vector with the values argument.
get_labels(efc$e42dep, values = "p")
#> [1] "[1] independent" "[2] slightly dependent"
#> [3] "[3] moderately dependent" "[4] severely dependent"
get_labels() also returns “labels” of factors, even if the factor has
no label attributes.
x <- factor(c("low", "mid", "low", "hi", "mid", "low"))
get_labels(x)
#> [1] "hi" "low" "mid"
To ensure that labels are only returned for vectors with a label
attribute, set attr.only = TRUE.
x <- factor(c("low", "mid", "low", "hi", "mid", "low"))
get_labels(x, attr.only = TRUE)
#> NULL
If a vector has a label attribute, only these labels are returned. Non-labelled values are excluded from the output by default…
# get labels, including tagged NA values
x <- labelled(
  c(1:3, tagged_na("a", "c", "z"), 4:1),
  c("Agreement" = 1, "Disagreement" = 4, "First" = tagged_na("c"),
    "Refused" = tagged_na("a"), "Not home" = tagged_na("z"))
)
get_labels(x)
#> [1] "Agreement" "Disagreement"
… however, you can add non-labelled values to the return value as
well, using the non.labelled argument.
get_labels(x, non.labelled = TRUE)
#> [1] "Agreement" "2" "3" "Disagreement"
Tagged missing values can also be included in the output by setting
drop.na = FALSE.
get_labels(x, values = "n", drop.na = FALSE)
#> 1 4 NA(c) NA(a) NA(z)
#> "Agreement" "Disagreement" "First" "Refused" "Not home"
The get_values() method returns the values for labelled values (i.e.
values that have an associated label). We still use the vector x from
the above examples.
print(x)
#> <labelled<double>[10]>
#> [1] 1 2 3 NA(a) NA(c) NA(z) 4 3 2 1
#>
#> Labels:
#> value label
#> 1 Agreement
#> 4 Disagreement
#> NA(c) First
#> NA(a) Refused
#> NA(z) Not home
get_values(x)
#> [1] "1" "4" "NA(a)" "NA(c)" "NA(z)"
With drop.na = TRUE you can omit those values from the return value
that are defined as missing.
get_values(x, drop.na = TRUE)
#> [1] 1 4
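Since get_values() and get_labels() describe the same labelled values, their results can be combined into a small lookup table. The following is a sketch using the efc sample data that ships with sjlabelled; the object name codes is made up for illustration.

```r
library(sjlabelled)
data(efc)

# one row per labelled value of the dependency variable
codes <- data.frame(
  value = get_values(efc$e42dep),
  label = get_labels(efc$e42dep)
)
codes
```

This pairing assumes that both functions are called with matching settings, so that neither includes missing or non-labelled values the other omits.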
With set_labels() you can add label attributes to any vector.
x <- sample(1:4, 20, replace = TRUE)
# return new labelled vector
x <- set_labels(x, labels = c("very low", "low", "mid", "hi"))
x
#> [1] 1 3 4 3 1 2 3 3 3 4 2 1 1 1 3 4 3 1 2 2
#> attr(,"labels")
#> very low low mid hi
#> 1 2 3 4
If more labels than values are given, only as many labels are used as there are values.
x <- c(2, 2, 3, 3, 2)
x <- set_labels(x, labels = c("a", "b", "c"))
#> More labels than values of "x". Using first 2 labels.
x
#> [1] 2 2 3 3 2
#> attr(,"labels")
#> a b
#> 2 3
However, you can force the use of all labels, even for values that are
not in the vector, using the force.labels argument.
x <- c(2, 2, 3, 3, 2)
x <- set_labels(
  x,
  labels = c("a", "b", "c"),
  force.labels = TRUE
)
x
#> [1] 2 2 3 3 2
#> attr(,"labels")
#> a b c
#> 1 2 3
For vectors with more unique values than labels, additional labels for non-labelled values are added.
x <- c(1, 2, 3, 2, 4, NA)
x <- set_labels(x, labels = c("yes", "maybe", "no"))
#> More values in "x" than length of "labels". Additional values were added to labels.
x
#> [1] 1 2 3 2 4 NA
#> attr(,"labels")
#> yes maybe no 4
#> 1 2 3 4
Set force.values = FALSE to add only those labels that have been
passed as argument.
x <- c(1, 2, 3, 2, 4, NA)
x <- set_labels(
  x,
  labels = c("yes", "maybe", "no"),
  force.values = FALSE
)
#> "x" has more values than "labels", hence not all values are labelled.
x
#> [1] 1 2 3 2 4 NA
#> attr(,"labels")
#> yes maybe no
#> 1 2 3
To add explicit labels for values (without adding more labels than
wanted and without dropping labels for values that do not appear in the
vector), use a named vector of labels as argument. The arguments
force.values and force.labels are ignored when using named vectors.
x <- c(1, 2, 3, 2, 4, 5)
x <- set_labels(
  x,
  labels = c("strongly agree" = 1,
             "totally disagree" = 4,
             "refused" = 5,
             "missing" = 9)
)
x
#> [1] 1 2 3 2 4 5
#> attr(,"labels")
#> strongly agree totally disagree refused missing
#> 1 4 5 9
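Because the value labels are stored as a named numeric vector in the labels attribute, they can also be used for lookups with base R. The following is a minimal sketch; the names agree and lab are made up for illustration.

```r
library(sjlabelled)

agree <- c(1, 2, 3, 2, 4, 5)
agree <- set_labels(
  agree,
  labels = c("strongly agree" = 1, "totally disagree" = 4,
             "refused" = 5, "missing" = 9)
)

# the value labels are a named numeric vector stored as an attribute
lab <- attr(agree, "labels")

# look up the label of each observation; unlabelled values give NA
names(lab)[match(agree, lab)]
```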
To set different value labels for each variable in a data frame,
provide the labels as a list. For each variable in the data frame,
provide a list element with value labels as character vector. Note that
the length of the list must be equal to the number of variables
(columns) in the data frame.
tmp <- data.frame(
  a = c(1, 2, 3),
  b = c(1, 2, 3),
  c = c(1, 2, 3)
)
labels <- list(
  c("one", "two", "three"),
  c("eins", "zwei", "drei"),
  c("un", "dos", "tres")
)
tmp <- set_labels(tmp, labels = labels)
str(tmp)
#> 'data.frame': 3 obs. of 3 variables:
#> $ a: num 1 2 3
#> ..- attr(*, "labels")= Named num [1:3] 1 2 3
#> .. ..- attr(*, "names")= chr [1:3] "one" "two" "three"
#> $ b: num 1 2 3
#> ..- attr(*, "labels")= Named num [1:3] 1 2 3
#> .. ..- attr(*, "names")= chr [1:3] "eins" "zwei" "drei"
#> $ c: num 1 2 3
#> ..- attr(*, "labels")= Named num [1:3] 1 2 3
#> .. ..- attr(*, "names")= chr [1:3] "un" "dos" "tres"
You can use set_labels() within a pipe-workflow with dplyr.
library(dplyr)
library(sjmisc) # for frq()
data(efc)
efc %>%
  select(c82cop1, c83cop2, c84cop3) %>%
  set_labels(labels = c("not often" = 1, "very often" = 4)) %>%
  frq()
#> do you feel you cope well as caregiver? (c82cop1) <numeric>
#> # total N=908 valid N=901 mean=3.12 sd=0.58
#>
#> Value | Label | N | Raw % | Valid % | Cum. %
#> ---------------------------------------------------
#> 1 | not often | 3 | 0.33 | 0.33 | 0.33
#> 2 | 2 | 97 | 10.68 | 10.77 | 11.10
#> 3 | 3 | 591 | 65.09 | 65.59 | 76.69
#> 4 | very often | 210 | 23.13 | 23.31 | 100.00
#> <NA> | <NA> | 7 | 0.77 | <NA> | <NA>
#>
#> do you find caregiving too demanding? (c83cop2) <numeric>
#> # total N=908 valid N=902 mean=2.02 sd=0.72
#>
#> Value | Label | N | Raw % | Valid % | Cum. %
#> ---------------------------------------------------
#> 1 | not often | 186 | 20.48 | 20.62 | 20.62
#> 2 | 2 | 547 | 60.24 | 60.64 | 81.26
#> 3 | 3 | 130 | 14.32 | 14.41 | 95.68
#> 4 | very often | 39 | 4.30 | 4.32 | 100.00
#> <NA> | <NA> | 6 | 0.66 | <NA> | <NA>
#>
#> does caregiving cause difficulties in your relationship with your friends? (c84cop3) <numeric>
#> # total N=908 valid N=902 mean=1.63 sd=0.87
#>
#> Value | Label | N | Raw % | Valid % | Cum. %
#> ---------------------------------------------------
#> 1 | not often | 516 | 56.83 | 57.21 | 57.21
#> 2 | 2 | 252 | 27.75 | 27.94 | 85.14
#> 3 | 3 | 82 | 9.03 | 9.09 | 94.24
#> 4 | very often | 52 | 5.73 | 5.76 | 100.00
#> <NA> | <NA> | 6 | 0.66 | <NA> | <NA>
The get_label()-method returns the variable label of a vector or all
variable labels from a data frame.
get_label(efc$e42dep)
#> [1] "elder's dependency"
get_label(efc, e42dep, e16sex, e15relat)
#> e42dep e16sex e15relat
#> "elder's dependency" "elder's gender" "relationship to elder"
If a vector has no variable label, NULL is returned. However,
get_label() also allows returning a default value instead of NULL in
case the vector has no label attribute. This is useful in combination
with deparse(substitute()) in function calls, so that, for instance, the
name of the vector can be used as default value if no variable labels
are present.
dummy <- c(1, 2, 3)
testit <- function(x) get_label(x, def.value = deparse(substitute(x)))
# returns name of vector, if it has no variable label
testit(dummy)
#> [1] "dummy"
If you want human-readable labels, you can use the case argument, which
passes the labels to a string parser in the snakecase-package.
data(iris)
# returns no labels, because iris-data is not labelled
get_label(iris)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> "" "" "" "" ""
# returns the column name as default labels, if data is not labelled
get_label(iris, def.value = colnames(iris))
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
# labels are parsed in a readable way
get_label(iris, def.value = colnames(iris), case = "parsed")
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> "Sepal Length" "Sepal Width" "Petal Length" "Petal Width" "Species"
The set_label() function adds the variable label attribute to a vector.
You can either return a new vector or label an existing vector:
x <- sample(1:4, 10, replace = TRUE)
# return new vector
x <- set_label(x, label = "Dummy-variable")
str(x)
#> int [1:10] 2 3 1 1 4 1 2 2 3 3
#> - attr(*, "label")= chr "Dummy-variable"
# label existing vector
set_label(x) <- "Another Dummy-variable"
str(x)
#> int [1:10] 2 3 1 1 4 1 2 2 3 3
#> - attr(*, "label")= chr "Another Dummy-variable"
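The label is an ordinary attribute, so base R operations that drop attributes, such as subsetting a plain vector, also drop the label. The following is a minimal sketch of how the label can be carried over by hand; the names score and high are made up for illustration.

```r
library(sjlabelled)

score <- set_label(c(1, 2, 3, 4), label = "Dummy-variable")

# base R subsetting returns a vector without the label attribute
high <- score[score > 2]
attr(high, "label")

# re-attach the label from the original vector
high <- set_label(high, label = get_label(score))
get_label(high)
```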
set_label() can also set variable labels for a data frame. In this
case, each label attribute additionally gets a names attribute with the
vector’s name. This makes it easier to see which label belongs to which
vector.
x <- data.frame(
  a = sample(1:4, 10, replace = TRUE),
  b = sample(1:4, 10, replace = TRUE),
  c = sample(1:4, 10, replace = TRUE)
)
x <- set_label(x, label = c("Variable A",
                            "Variable B",
                            "Variable C"))
str(x)
#> 'data.frame': 10 obs. of 3 variables:
#> $ a: int 4 4 2 3 1 2 3 3 4 1
#> ..- attr(*, "label")= Named chr "Variable A"
#> .. ..- attr(*, "names")= chr "a"
#> $ b: int 2 1 2 3 2 4 1 2 1 4
#> ..- attr(*, "label")= Named chr "Variable B"
#> .. ..- attr(*, "names")= chr "b"
#> $ c: int 4 2 3 1 2 2 3 2 1 1
#> ..- attr(*, "label")= Named chr "Variable C"
#> .. ..- attr(*, "names")= chr "c"
get_label(x)
#> a b c
#> "Variable A" "Variable B" "Variable C"
An alternative to set_label() is var_labels(), which also works within
pipe-workflows. var_labels() requires named vectors as arguments, whose
names match the column names of the input, and sets the associated
variable labels.
x <- data.frame(
  a = sample(1:4, 10, replace = TRUE),
  b = sample(1:4, 10, replace = TRUE),
  c = sample(1:4, 10, replace = TRUE)
)
library(magrittr) # for pipe
x %>%
  var_labels(
    a = "Variable A",
    b = "Variable B",
    c = "Variable C"
  ) %>%
  str()
#> 'data.frame': 10 obs. of 3 variables:
#> $ a: int 3 2 1 4 1 4 4 3 4 4
#> ..- attr(*, "label")= chr "Variable A"
#> $ b: int 4 3 2 3 2 3 2 3 2 4
#> ..- attr(*, "label")= chr "Variable B"
#> $ c: int 2 1 3 2 2 3 1 2 4 4
#> ..- attr(*, "label")= chr "Variable C"
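Once variable labels are set, get_label() returns them as a named character vector, which makes it easy to build a simple codebook. The following is a sketch with a made-up data frame d; the codebook layout is only one possible choice.

```r
library(sjlabelled)

d <- data.frame(a = 1:3, b = 4:6)
d <- var_labels(d, a = "Variable A", b = "Variable B")

# named character vector: names are column names, values are labels
lab <- get_label(d)

# a small codebook with one row per column
data.frame(variable = names(lab), label = unname(lab))
```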
set_na() converts values of a vector or of multiple vectors in a data
frame into NAs. With as.tag = TRUE, set_na() creates tagged NA values,
which means that these missing values get an information tag and a value
label (which is, by default, the former value that was converted to NA).
You can either return a new vector/data frame, or set NAs into an
existing vector/data frame.
x <- sample(1:8, 100, replace = TRUE)
# show value distribution
table(x)
#> x
#> 1 2 3 4 5 6 7 8
#> 16 13 12 16 15 11 11 6
# set value 1 and 8 as tagged missings
x <- set_na(x, na = c(1, 8), as.tag = TRUE)
x
#> [1] NA 4 NA 7 4 6 NA 2 2 6 7 4 2 4 6 6 5 7 3 4 NA 5 NA 5 6
#> [26] NA NA 5 2 NA 5 7 3 NA 4 7 7 NA 2 NA NA 3 NA 3 5 6 5 5 NA 5
#> [51] 3 5 2 7 4 3 4 2 NA 2 6 3 4 3 NA NA 2 4 5 NA 7 5 4 NA 3
#> [76] 7 NA 6 3 4 7 3 2 NA 6 NA 4 4 7 2 4 6 5 2 5 2 3 4 5 6
#> attr(,"labels")
#> 1 8
#> NA NA
# show value distribution, including missings
table(x, useNA = "always")
#> x
#> 2 3 4 5 6 7 <NA>
#> 13 12 16 15 11 11 22
# now let's see, which NA's were "1" and which were "8"
print_tagged_na(x)
#> [1] NA(1) 4 NA(1) 7 4 6 NA(1) 2 2 6 7 4
#> [13] 2 4 6 6 5 7 3 4 NA(1) 5 NA(1) 5
#> [25] 6 NA(1) NA(1) 5 2 NA(8) 5 7 3 NA(1) 4 7
#> [37] 7 NA(8) 2 NA(8) NA(1) 3 NA(1) 3 5 6 5 5
#> [49] NA(1) 5 3 5 2 7 4 3 4 2 NA(1) 2
#> [61] 6 3 4 3 NA(8) NA(1) 2 4 5 NA(8) 7 5
#> [73] 4 NA(1) 3 7 NA(1) 6 3 4 7 3 2 NA(8)
#> [85] 6 NA(1) 4 4 7 2 4 6 5 2 5 2
#> [97] 3 4 5 6
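The tags created by set_na() can also be read with haven's na_tag(), for instance to count missing values by their original value. The following is a minimal sketch on a short made-up vector v, assuming haven is installed.

```r
library(sjlabelled)
library(haven)

v <- c(1, 2, 3, 8, 2, 1)
v <- set_na(v, na = c(1, 8), as.tag = TRUE)

# every converted value is a regular missing value for base R
is.na(v)

# the tag records which original value each NA replaced
na_tag(v)

# count missing values by tag
table(na_tag(v))
```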
x <- factor(c("a", "b", "c"))
x
#> [1] a b c
#> Levels: a b c
# set NA into existing vector
x <- set_na(x, na = "b", as.tag = TRUE)
x
#> [1] a <NA> c
#> attr(,"labels")
#> b
#> NA
#> Levels: a c
The get_na() function returns all tagged NA values. We still use the
vector x from the previous example.
get_na(x)
#> b
#> NA
To see the tags of the NA values, set as.tag = TRUE.
get_na(x, as.tag = TRUE)
#> b
#> "NA(b)"
While set_na() allows you to replace values with (tagged) NA’s,
replace_na() (from package sjmisc) allows you to replace either all NA
values of a vector or specific tagged NA values with a non-NA value.
library(sjmisc) # for replace_na()
data(efc)
str(efc$c84cop3)
#> num [1:908] 2 3 1 3 1 3 4 2 3 1 ...
#> - attr(*, "label")= chr "does caregiving cause difficulties in your relationship with your friends?"
#> - attr(*, "labels")= Named num [1:4] 1 2 3 4
#> ..- attr(*, "names")= chr [1:4] "Never" "Sometimes" "Often" "Always"
efc$c84cop3 <- set_na(efc$c84cop3, na = c(2, 3), as.tag = TRUE)
get_na(efc$c84cop3, as.tag = TRUE)
#> Sometimes Often
#> "NA(2)" "NA(3)"
# this replaces all NA's with the value 2
dummy <- replace_na(efc$c84cop3, value = 2)
# labels of former tagged NA's are preserved
get_labels(dummy, drop.na = FALSE, values = "p")
#> [1] "[1] Never" "[4] Always" "[NA(2)] Sometimes"
#> [4] "[NA(3)] Often"
get_na(dummy, as.tag = TRUE)
#> Sometimes Often
#> "NA(2)" "NA(3)"
# No more NA values
frq(dummy)
#> does caregiving cause difficulties in your relationship with your friends? (x) <numeric>
#> # total N=908 valid N=908 mean=1.55 sd=0.77
#>
#> Value | Label | N | Raw % | Valid % | Cum. %
#> -----------------------------------------------
#> 1 | Never | 516 | 56.83 | 56.83 | 56.83
#> 2 | 2 | 340 | 37.44 | 37.44 | 94.27
#> 4 | Always | 52 | 5.73 | 5.73 | 100.00
#> <NA> | <NA> | 0 | 0.00 | <NA> | <NA>
# In this example, the tagged NA(2) is replaced with value 2
# the new value label for value 2 is "restored NA"
dummy <- replace_na(efc$c84cop3, value = 2, na.label = "restored NA", tagged.na = "2")
# Only one tagged NA remains
get_labels(dummy, drop.na = FALSE, values = "p")
#> [1] "[1] Never" "[2] restored NA" "[4] Always" "[NA(3)] Often"
get_na(dummy, as.tag = TRUE)
#> Often
#> "NA(3)"
# Some NA values remain
frq(dummy)
#> does caregiving cause difficulties in your relationship with your friends? (x) <numeric>
#> # total N=908 valid N=820 mean=1.50 sd=0.79
#>
#> Value | Label | N | Raw % | Valid % | Cum. %
#> ----------------------------------------------------
#> 1 | Never | 516 | 56.83 | 62.93 | 62.93
#> 2 | restored NA | 252 | 27.75 | 30.73 | 93.66
#> 4 | Always | 52 | 5.73 | 6.34 | 100.00
#> <NA> | <NA> | 88 | 9.69 | <NA> | <NA>
With replace_labels(), you can replace (change) value labels of
labelled values. This can also be used to change the labels of tagged
missing values. Make sure you know the missing tag, which can be
accessed via get_na().
str(efc$c82cop1)
#> num [1:908] 3 3 2 4 3 2 4 3 3 3 ...
#> - attr(*, "label")= chr "do you feel you cope well as caregiver?"
#> - attr(*, "labels")= Named num [1:4] 1 2 3 4
#> ..- attr(*, "names")= chr [1:4] "never" "sometimes" "often" "always"
efc$c82cop1 <- set_na(efc$c82cop1, na = c(2, 3), as.tag = TRUE)
get_na(efc$c82cop1, as.tag = TRUE)
#> sometimes often
#> "NA(2)" "NA(3)"
efc$c82cop1 <- replace_labels(efc$c82cop1, labels = c("new NA label" = tagged_na("2")))
#> tagged NA 'sometimes' was replaced with new value label.
get_na(efc$c82cop1, as.tag = TRUE)
#> new NA label often
#> "NA(2)" "NA(3)"