Title: | A Suite of Checks for Identification of Potential Errors in a Data Frame as Part of the Data Screening Process |
---|---|
Description: | Data screening is an important first step of any statistical analysis. dataMaid automatically generates a customizable data report with a thorough summary of the checks and the results that a human can use to identify possible errors. It provides an extendable suite of tests for common potential errors in a dataset. |
Authors: | Anne Helby Petersen [aut], Claus Thorn Ekstrøm [aut, cre] |
Maintainer: | Claus Thorn Ekstrøm <[email protected]> |
License: | GPL-2 |
Version: | 1.4.0 |
Built: | 2025-01-01 03:51:08 UTC |
Source: | https://github.com/ekstroem/dataMaid |
Produce an overview of all functions of class checkFunction
available in the workspace or imported from packages. This overview includes
the descriptions and a list of what classes the functions are each intended
to be called on.
allCheckFunctions()
An object of class functionSummary. This object has entries $name (the function names), $description (the function descriptions, as obtained from their description attributes) and $classes (the classes each function is intended to be called on, as obtained from their classes attributes).
checkFunction
allVisualFunctions
allSummaryFunctions
allCheckFunctions()
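The entries of the returned functionSummary object can be inspected one at a time. The following is a minimal sketch, assuming dataMaid is installed and attached; the names overview and isForDate are only illustrative and not part of the package.

```r
library(dataMaid)

# Collect the overview once and look at its documented entries
overview <- allCheckFunctions()

overview$name         # names of all available checkFunctions
overview$description  # their descriptions

# Use the documented classes() accessor on each function to find the
# checkFunctions that are intended for Date variables
isForDate <- vapply(overview$name,
                    function(nm) "Date" %in% classes(get(nm)),
                    logical(1))
overview$name[isForDate]
```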
dataMaid
Returns the names of the eight data classes for which dataMaid is implemented, namely "character", "Date", "factor", "integer", "labelled", "haven_labelled", "logical" and "numeric".
allClasses()
allClasses()
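One possible use of allClasses() is to see which columns of a data frame have a supported class before screening. This is a small sketch under the assumption that dataMaid is attached; the data frame df and the helper names are made up for illustration.

```r
library(dataMaid)

supported <- allClasses()

# A toy data frame with one unsupported (complex) column
df <- data.frame(a = 1:3, b = c("x", "y", "z"), stringsAsFactors = FALSE)
df$c <- complex(real = 1:3, imaginary = 0)

# TRUE for columns where at least one class is supported
isSupported <- sapply(df, function(col) any(class(col) %in% supported))
isSupported
```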
Produce an overview of all functions of class summaryFunction
available in the workspace or imported from packages. This overview includes
the descriptions and a list of what classes the functions are each intended
to be called on.
allSummaryFunctions()
An object of class functionSummary. This object has entries $name (the function names), $description (the function descriptions, as obtained from their description attributes) and $classes (the classes each function is intended to be called on, as obtained from their classes attributes).
summaryFunction
allVisualFunctions
allCheckFunctions
allSummaryFunctions()
Produce an overview of all functions of class visualFunction
available in the workspace or imported from packages. This overview includes
the descriptions and a list of what classes the functions are each intended
to be called on.
allVisualFunctions()
An object of class functionSummary. This object has entries $name (the function names), $description (the function descriptions, as obtained from their description attributes) and $classes (the classes each function is intended to be called on, as obtained from their classes attributes).
visualFunction
allCheckFunctions
allSummaryFunctions
allVisualFunctions()
A dataset with information about 200 paintings and their painters. Each observation in the dataset corresponds to a painting. A single artificial variable, namely an artist ID variable, has been included. Otherwise the information should be truthful.
artData
A data frame with 200 rows and 11 variables.
A unique ID used for cataloging the artists (fictional).
The name of the artist.
The number of middle names the artist has.
The title of the painting.
The approximate year in which the painting was made.
The current location of the painting.
The continent of the current location of the painting.
The width of the painting, in centimeters.
The height of the painting, in centimeters.
The media/materials of the painting.
The artistic movement(s) the painting belongs to.
Semi-artificial dataset constructed based on the Master Works of Art dataset available from Data Explorer.
data(artData)
plot
and
barplot
Plot the distribution of a variable, depending on its data class, using the base R
plotting functions. Note that basicVisual
is a visualFunction
, compatible with the
visualize
and makeDataReport
functions.
basicVisual(v, vnam, doEval = TRUE)
v |
The variable (vector) to be plotted. |
vnam |
The name of the variable which will appear as the title of the plot. |
doEval |
If TRUE, the plot itself is returned. Otherwise, the function returns a character string containing standalone R code for producing the plot. |
For character, factor, logical and (haven_)labelled variables, a barplot is produced. For numeric, integer or Date variables, basicVisual produces a histogram instead. Note that for integer and numeric variables, all non-finite (i.e. NA, NaN, Inf) values are removed prior to plotting. For character, factor, (haven_)labelled and logical variables, only NA values are removed.
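The difference between the two removal rules can be illustrated in base R. This sketch is not dataMaid's internal code; it only shows what "non-finite values are removed" and "only NA values are removed" mean for two toy vectors (v and w are made-up names).

```r
# Numeric variable: NA, NaN, Inf and -Inf are all non-finite
v <- c(2.5, NA, 1, NaN, Inf, 4, -Inf)
vFinite <- v[is.finite(v)]

# Character variable: only NA is dropped
w <- c("a", NA, "b", "a")
wNoNA <- w[!is.na(w)]

# Histogram counts and bar heights that a plot would then be based on
hist(vFinite, plot = FALSE)$counts
table(wNoNA)
```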
## Not run:
#Save a variable
myVar <- c(1:10)

#Plot a variable
basicVisual(myVar, "MyVar")

#Produce code for plotting a variable
basicVisual(myVar, "MyVar", doEval = FALSE)

## End(Not run)
importFrom stats na.omit
basicVisualCFLB(v, vnam, doEval = TRUE)
v |
The variable (vector) to be plotted. |
vnam |
The name of the variable which will appear as the title of the plot. |
doEval |
If TRUE, the plot itself is returned. Otherwise, the function returns a character string containing standalone R code for producing the plot. |
A dataset with information about the first 45 US presidents as well as a 46th
person, who is not a US president, and a duplicate of one of the 45 actual presidents.
The dataset was constructed to show the capabilities
of dataMaid
and therefore, it has been constructed to include errors and miscodings.
Each observation in the dataset corresponds to a person. The dataset uses the
non-standard class Name
which is simply an attribute that has been added to
two variables in order to show how dataMaid
handles non-supported classes. Note that the dataset
is an extended and more error-filled version of the dataset presidentData
which is
also included in the package.
bigPresidentData
A data frame with 47 rows and 15 variables.
A Name
type variable containing the last name of the president.
A Name
type variable containing the first name of the president.
A factor variable indicating the order of the presidents (with George Washington as number 1 and Donald Trump as number 45).
A Date variable with the birthday of the president.
A Date variable with the date of the president's death.
A character variable with the state in which the president was born.
A character variable with the party to which the president was associated.
A Date variable with the date of inauguration of the president.
A Date variable with the date at which the presidency ends.
A numeric variable indicating whether there was an assassination
attempt (1
) or not (0
) on the president.
A factor variable with the sex of the president.
A factor variable with the ethnicity of the president.
A numeric variable with the duration of the presidency, in years.
A character variable with the age at inauguration.
A complex
type variable with a fictional favorite number for
each president.
Artificial dataset constructed based on the US president dataset available from Data Explorer.
Petersen AH, Ekstrøm CT (2019). “dataMaid: Your Assistant for Documenting Supervised Data Quality Screening in R.” _Journal of Statistical Software_, *90*(6), 1-38. doi: 10.18637/jss.v090.i06 ( https://doi.org/10.18637/jss.v090.i06).
data(bigPresidentData)
A summaryFunction, intended to be called from summarize, which returns the central value of a variable. For numeric and integer variables, this is the median. For character, factor, (haven_)labelled, Date and logical variables, the central value is the mode (i.e. the value that occurs the largest number of times).
centralValue(v, ...)
v |
A variable (vector). |
... |
Extra arguments to be passed to class-specific functions. These include
|
Note that NA, NaN and Inf values are ignored for numeric and integer variables, while only NA values are ignored for factor, character, Date and (haven_)labelled variables. No values are ignored for logical variables.
An object of class summaryResult with the following entries: $feature (the mode/median), $result (the central value of v) and $value (identical to $result).
If the mode is returned and it is not uniquely determined, the first value qualifying as a mode is
returned, when the variable is sorted according to sort
.
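The two notions of central value, and the tie-breaking rule for the mode, can be sketched in base R. The helper firstMode below is hypothetical and is not the package's implementation; it relies on table() ordering the distinct values in sorted order and on which.max() returning the first maximum.

```r
# Hypothetical helper: the first mode in sorted order, ignoring NA
firstMode <- function(v) {
  v <- v[!is.na(v)]
  counts <- table(v)                 # distinct values, in sorted order
  names(counts)[which.max(counts)]   # first value with the highest count
}

# "a" and "b" both occur twice; "a" comes first when sorted
tieMode <- firstMode(c("b", "a", "a", "b"))

# For numeric variables the central value is the median
numMedian <- median(c(rep(1, 25), rep(2, 10), rep(3, 20)))
```

Here tieMode is "a" and numMedian is 2, since the 28th of the 55 sorted values is a 2.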
summaryFunction
, summarize
, summaryResult
,
allSummaryFunctions
#central value of a numeric variable:
centralValue(c(rep(1, 25), rep(2, 10), rep(3, 20)))

#central value of a character variable:
centralValue(as.character(c(rep(1, 20), rep(2, 10), rep(3, 20))))
Run a set of validation checks to check a variable vector or a full dataset for potential errors. Which checks are performed depends on the class of the variable and on user inputs.
check(v, nMax = 10, checks = setChecks(), ...)
v |
the vector or the dataset ( |
nMax |
If a check is supposed to identify problematic values,
this argument controls if all of these should be pasted onto the outputted
message, or if only the first |
checks |
A list of checks to use on each supported variable type. We recommend
using |
... |
Other arguments that are passed on to the checking functions.
These include general parameters controlling how the check results are
formatted (e.g. |
It should be noted that the default options for each variable type
are returned by calling e.g. defaultCharacterChecks()
,
defaultFactorChecks()
, defaultNumericChecks()
, etc. A complete
overview of all default options can be obtained by calling setChecks()
.
Moreover, all available checkFunction
s (including both locally defined
functions and functions imported from dataMaid
or other packages) can
be viewed by calling allCheckFunctions()
.
If v is a variable, a list of objects of class checkResult, which each summarizes the result of a checkFunction call performed on v. See checkResult for more details. If v is a data.frame, a list of lists of the form above is returned instead with one entry for each variable in v.
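The nested return structure for a data.frame can be explored as follows. This is a sketch that assumes dataMaid is attached and that the outer list is named after the variables of the data frame; the names res, speedRes and hasProblem are illustrative only.

```r
library(dataMaid)

data(cars)
res <- check(cars)

# One entry per variable in the data frame
length(res) == ncol(cars)

# Each entry is itself a list of check results for that variable
# (assumption: entries can be looked up by variable name)
speedRes <- res[["speed"]]

# Which checks flagged a problem for this variable?
hasProblem <- vapply(speedRes, function(r) isTRUE(r$problem), logical(1))
hasProblem
```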
Petersen AH, Ekstrøm CT (2019). “dataMaid: Your Assistant for Documenting Supervised Data Quality Screening in R.” _Journal of Statistical Software_, *90*(6), 1-38. doi: 10.18637/jss.v090.i06 ( https://doi.org/10.18637/jss.v090.i06).
setChecks
,
allCheckFunctions
checkResult
checkFunction
, defaultCharacterChecks
,
defaultFactorChecks
, defaultLabelledChecks
,
defaultHavenlabelledChecks
,
defaultNumericChecks
, defaultIntegerChecks
,
defaultLogicalChecks
, defaultDateChecks
x <- 1:5
check(x)

#Annoyingly coded missing as 99
y <- c(rnorm(100), rep(99, 10))
check(y)

#Check y for outliers and print 4 decimals for problematic variables
check(y, checks = setChecks(numeric = "identifyOutliers"), maxDecimals = 4)

#Change what checks are performed on a variable, now only identifyMissing is called
#for numeric variables
check(y, checks = setChecks(numeric = "identifyMissing"))

#Check a full data.frame at once
data(cars)
check(cars)

#Check a full data.frame at once, while changing the standard settings for
#several data classes at once. Here, we omit the check of miscoded missing values for factors
#and we only do this check for numeric variables:
check(cars, checks = setChecks(factor = defaultFactorChecks(remove = "identifyMissing"),
                               numeric = "identifyMissing"))
Convert a function, f
, into an S3
checkFunction
object. This adds f
to the
overview list returned by an allCheckFunctions()
call.
checkFunction(f, description = NULL, classes = NULL)
f |
A function. See details and examples below for the exact requirements of this function. |
description |
A character string describing the check
performed by |
classes |
The classes for which |
checkFunction represents the functions used in check and makeDataReport for performing error checks and quality control on variables in a dataset.
An example of defining a new checkFunction is given below. Note that the minimal requirements for such a function (in order for it to be compatible with check() and makeDataReport()) concern its input/output structure: It must accept at least two arguments, namely v (a vector variable) and .... Additional implemented arguments from check() and makeDataReport() include nMax and maxDecimals; see e.g. the pre-defined checkFunction identifyMissing for more details about how these arguments should be used.
The output must be a list with at least the two entries $problem
(a logical indicating whether a problem was found) and $message
(a character string message describing the problem). However, if the
result of a checkFunction
is furthermore appended with a
$problemValues
entry (including the values from the variable
that caused the problem, if relevant) and converted to a
checkResult
object, a print()
method also becomes
available for consistent formatting of checkFunction
results.
Note that all available checkFunctions are listed by the call allCheckFunctions() and we recommend looking into these functions, if more knowledge about checkFunctions is required.
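The minimal input/output contract described above can be sketched with a very small check. The function hasNegative is hypothetical and only meant as an illustration; the sketch assumes dataMaid is attached and that the function is defined in the global workspace so that check() can find it by name.

```r
library(dataMaid)

# Minimal contract: arguments v and ..., output with $problem and $message
hasNegative <- function(v, ...) {
  neg <- !is.na(v) & v < 0
  list(problem = any(neg),
       message = if (any(neg)) "Note: Negative values were found." else "")
}

# Register it as a checkFunction for numeric and integer variables
hasNegative <- checkFunction(hasNegative,
                             description = "Flag negative values (illustrative)",
                             classes = c("integer", "numeric"))

# Call it directly
direct <- hasNegative(c(-1, 2, NA))

# Or let check() call it by name as the only numeric check
viaCheck <- check(c(-1, 2, NA), checks = setChecks(numeric = "hasNegative"))
```

Because the output is a plain list rather than a checkResult, no dedicated print() method applies to it; wrapping the list in checkResult() would add one.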
A function of class checkFunction which has two attributes, namely classes and description.
allCheckFunctions
, check
, makeDataReport
,
messageGenerator
, checkResult
#Define a minimal requirement checkFunction that can be called
#from check() and makeDataReport(). This function checks whether all
#values in a variable are of equal length and that this
#length is then also larger than 10:
isID <- function(v, nMax = NULL, ...) {
  out <- list(problem = FALSE, message = "")
  #any() is needed because a variable can have more than one class
  if (any(class(v) %in% c("character", "factor", "labelled", "haven_labelled",
                          "numeric", "integer"))) {
    v <- as.character(v)
    lengths <- nchar(v)
    if (all(lengths > 10) && length(unique(lengths)) == 1) {
      out$problem <- TRUE
      out$message <- "Warning: This variable seems to contain ID codes!"
    }
  }
  out
}

#Convert it into a checkFunction
isID <- checkFunction(isID, description = "Identify ID variables (long, equal length values)",
                      classes = allClasses())

#Call isID
isID(c("12345678901", "23456789012", "34567890123", "45678901234"))

#isID now appears in a allCheckFunctions() call:
allCheckFunctions()

#Define a new checkFunction using messageGenerator() for generating
#the message and checkResult() for getting a printing method
#for its output. This function identifies values in a variable
#that include a colon, surrounded by alphanumeric characters. If
#at least one such value is found, the variable is flagged as
#having a problem:
identifyColons <- function(v, nMax = Inf, ...) {
  v <- unique(na.omit(v))
  problemMessage <- "Note: The following values include colons:"
  problem <- FALSE
  #[[:alnum:]] matches any letter or digit, in line with the description above
  problemValues <- v[sapply(gregexpr("[[:alnum:]]:[[:alnum:]]", v),
                            function(x) all(x != -1))]
  if (length(problemValues) > 0) {
    problem <- TRUE
  }
  problemStatus <- list(problem = problem, problemValues = problemValues)
  outMessage <- messageGenerator(problemStatus, problemMessage, nMax)
  checkResult(list(problem = problem, message = outMessage,
                   problemValues = problemValues))
}

#Make it a checkFunction:
identifyColons <- checkFunction(identifyColons,
                                description = "Identify non-suffixed nor -prefixed colons",
                                classes = c("character", "factor", "labelled", "haven_labelled"))

#Call it:
identifyColons(1:100)
identifyColons(c("a:b", 1:10, ":b", "a:b:c:d"))

#identifyColons now appears in a allCheckFunctions() call:
allCheckFunctions()

#Define a checkFunction that looks for negative values in numeric
#or integer variables:
identifyNeg <- function(v, nMax = Inf, maxDecimals = 2, ...) {
  problem <- FALSE
  problemValues <- printProblemValues <- NULL
  problemMessage <- "Note: The following negative values were found:"
  #Ignore missing values when looking for negative values
  negOcc <- unique(v[!is.na(v) & v < 0])
  if (length(negOcc) > 0) {
    problemValues <- negOcc
    printProblemValues <- round(negOcc, maxDecimals)
    problem <- TRUE
  }
  outMessage <- messageGenerator(list(problem = problem,
                                      problemValues = printProblemValues),
                                 problemMessage, nMax)
  checkResult(list(problem = problem, message = outMessage,
                   problemValues = problemValues))
}

#Make it a checkFunction
identifyNeg <- checkFunction(identifyNeg, "Identify negative values",
                             classes = c("integer", "numeric"))

#Call it:
identifyNeg(c(0:100))
identifyNeg(c(-20.1232323:20), nMax = 3, maxDecimals = 4)

#identifyNeg now appears in a allCheckFunctions() call:
allCheckFunctions()
Convert a list resulting from the checks performed in a
checkFunction
into a checkResult
object, thereby
supplying it with a print()
method.
checkResult(ls)
ls |
A list with entries |
An S3 object of class checkResult, identical to the inputted list, ls, except for its class attribute.
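The effect of the conversion can be seen on a small hand-made list. This is a sketch assuming dataMaid is attached; the list contents are invented for illustration.

```r
library(dataMaid)

# A plain list with the entries used by checkFunctions
raw <- list(problem = TRUE,
            message = "Note: The following suspicious values were found: 99.",
            problemValues = 99)

res <- checkResult(raw)

class(res)   # now carries the checkResult class
res          # printed using the print() method for checkResult objects

# Apart from the class attribute, nothing has changed
sameContent <- identical(unclass(res), raw)
```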
classes
If the object, x
, is itself of
class checkFunction
, summaryFunction
or visualFunction
, the contents of x
's
attribute classes
is returned. Otherwise, NULL
is
returned.
classes(x)
classes(x) <- value
x |
The object for which the |
value |
New value |
The classes for which x
is intended to be called,
given as a vector of characters.
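Both the extractor and the replacement form can be tried on a throwaway checkFunction. This is a sketch assuming dataMaid is attached; the function alwaysOK is hypothetical.

```r
library(dataMaid)

# A trivial, hypothetical checkFunction that never reports a problem
alwaysOK <- checkFunction(function(v, ...) list(problem = FALSE, message = ""),
                          description = "Never reports a problem (illustrative)",
                          classes = "character")

before <- classes(alwaysOK)

# Replace the classes attribute
classes(alwaysOK) <- c("character", "factor")
after <- classes(alwaysOK)

# Objects that are not check, summary or visual functions give NULL
classes(mean)
```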
#Extract the classes of the checkFunction identifyMissing
classes(identifyMissing)

#Extract the classes of the summaryFunction minMax
classes(minMax)

#Extract the classes of the visualFunction basicVisual
classes(basicVisual)
A summaryFunction
, intended to be called from
summarize
(and makeDataReport
), which counts the
number of missing (NA
) values in a variable.
countMissing(v, ...)
v |
A variable (vector). |
... |
Not in use. |
A summaryResult object with the following entries: $feature ("No. missing obs."), $result (the number and percentage of missing observations) and $value (the number of missing observations).
summarize
, allSummaryFunctions
,
summaryFunction
, summaryResult
countMissing(c(1:100, rep(NA, 10)))
Default options for which checks to perform on
character type variables in check
and makeDataReport
,
possibly user-modified by adding extra function names using add
or
removing default function names with remove
.
defaultCharacterChecks(remove = NULL, add = NULL)
remove |
Character vector of function names. Checks to remove from the returned vector |
add |
Character vector of function names. Checks to add to the returned vector |
A vector of function names.
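The remove and add arguments can be combined to tailor the default vector. A brief sketch, assuming dataMaid is attached and that "identifyMissing" is among the default character checks; "myCheck" is a hypothetical function name used only for illustration.

```r
library(dataMaid)

defaults <- defaultCharacterChecks()

# Drop one default check and append a user-defined one by name
custom <- defaultCharacterChecks(remove = "identifyMissing", add = "myCheck")

setdiff(defaults, custom)   # what was removed
setdiff(custom, defaults)   # what was added
```

The resulting vector is intended to be passed on through setChecks(character = custom), at which point a function called myCheck must actually exist.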
Default options for which summaries to apply on character type variables in summarize and makeDataReport, possibly user-modified by adding extra function names using add or removing default function names with remove.
defaultCharacterSummaries(remove = NULL, add = NULL)
remove |
Character vector of function names. Summaries to remove from the returned vector |
add |
Character vector of function names. Summaries to add to the returned vector |
A list of function names (as character strings).
variableType
, countMissing
, uniqueValues
,
centralValue
#remove "variableType" from the summaries:
defaultCharacterSummaries(remove = "variableType")
Default options for which checks to perform on
Date type variables in check
and makeDataReport
,
possibly user-modified by adding extra function names using add
or
removing default function names with remove
.
defaultDateChecks(remove = NULL, add = NULL)
remove |
Character vector of function names. Checks to remove from the returned vector |
add |
Character vector of function names. Checks to add to the returned vector |
A vector of function names.
Default options for which summaries to apply on Date type variables in summarize and makeDataReport, possibly user-modified by adding extra function names using add or removing default function names with remove.
defaultDateSummaries(remove = NULL, add = NULL)
remove |
Character vector of function names. Summaries to remove from the returned vector |
add |
Character vector of function names. Summaries to add to the returned vector |
A list of function names (as character strings).
variableType
, countMissing
, uniqueValues
,
centralValue
, minMax
, quartiles
defaultDateSummaries()
Default options for which checks to perform on
factor type variables in check
and makeDataReport
,
possibly user-modified by adding extra function names using add
or
removing default function names with remove
.
defaultFactorChecks(remove = NULL, add = NULL)
remove |
Character vector of function names. Checks to remove from the returned vector |
add |
Character vector of function names. Checks to add to the returned vector |
A vector of function names.
Default options for which summaries to apply on factor type variables in summarize and makeDataReport, possibly user-modified by adding extra function names using add or removing default function names with remove.
defaultFactorSummaries(remove = NULL, add = NULL)
remove |
Character vector of function names. Summaries to remove from the returned vector |
add |
Character vector of function names. Summaries to add to the returned vector |
A list of function names (as character strings).
variableType, countMissing
, uniqueValues
,
centralValue
#remove "countMissing" from the summaries:
defaultFactorSummaries(remove = "countMissing")
Default options for which checks to perform on
haven_labelled type variables in check
and makeDataReport
,
possibly user-modified by adding extra function names using add
or
removing default function names with remove
.
defaultHavenlabelledChecks(remove = NULL, add = NULL)
remove |
Character vector of function names. Checks to remove from the returned vector |
add |
Character vector of function names. Checks to add to the returned vector |
A vector of function names.
Default options for which summaries to apply on haven_labelled type variables in summarize and makeDataReport, possibly user-modified by adding extra function names using add or removing default function names with remove.
defaultHavenlabelledSummaries(remove = NULL, add = NULL)
remove |
Character vector of function names. Summaries to remove from the returned vector |
add |
Character vector of function names. Summaries to add to the returned vector |
A list of function names (as character strings).
variableType
,
countMissing
, uniqueValues
, centralValue
#remove "centralValue":
defaultHavenlabelledSummaries(remove = "centralValue")
Default options for which checks to perform on
integer type variables in check
and makeDataReport
,
possibly user-modified by adding extra function names using add
or
removing default function names with remove
.
defaultIntegerChecks(remove = NULL, add = NULL)
remove |
Character vector of function names. Checks to remove from the returned vector |
add |
Character vector of function names. Checks to add to the returned vector |
A vector of function names.
Default options for which summaries to apply on integer type variables in summarize and makeDataReport, possibly user-modified by adding extra function names using add or removing default function names with remove.
defaultIntegerSummaries(remove = NULL, add = NULL)
remove |
Character vector of function names. Summaries to remove from the returned vector |
add |
Character vector of function names. Summaries to add to the returned vector |
A list of function names (as character strings).
variableType
,
countMissing
, uniqueValues
,
centralValue
, quartiles
, minMax
#remove "countMissing":
defaultIntegerSummaries(remove = "countMissing")
Default options for which checks to perform on
labelled type variables in check
and makeDataReport
,
possibly user-modified by adding extra function names using add
or
removing default function names with remove
.
defaultLabelledChecks(remove = NULL, add = NULL)
remove |
Character vector of function names. Checks to remove from the returned vector |
add |
Character vector of function names. Checks to add to the returned vector |
A vector of function names.
Default options for which summaries to apply on labelled type variables in summarize and makeDataReport, possibly user-modified by adding extra function names using add or removing default function names with remove.
defaultLabelledSummaries(remove = NULL, add = NULL)
remove |
Character vector of function names. Summaries to remove from the returned vector |
add |
Character vector of function names. Summaries to add to the returned vector |
A list of function names (as character strings).
variableType
,
countMissing
, uniqueValues
, centralValue
#remove "centralValue":
defaultLabelledSummaries(remove = "centralValue")
Default options for which checks to perform on
logical type variables in check
and makeDataReport
,
possibly user-modified by adding extra function names using add
or
removing default function names with remove
.
defaultLogicalChecks(remove = NULL, add = NULL)
remove |
Character vector of function names. Checks to remove from the returned vector |
add |
Character vector of function names. Checks to add to the returned vector |
A vector of function names.
Default options for which summaries to apply on logical type variables in summarize and makeDataReport, possibly user-modified by adding extra function names using add or removing default function names with remove.
defaultLogicalSummaries(remove = NULL, add = NULL)
remove |
Character vector of function names. Summaries to remove from the returned vector |
add |
Character vector of function names. Summaries to add to the returned vector |
A list of function names (as character strings).
variableType
,
countMissing
, uniqueValues
, centralValue
#remove "uniqueValues":
defaultLogicalSummaries(remove = "uniqueValues")
Default options for which checks to perform on
numeric type variables in check
and makeDataReport
,
possibly user-modified by adding extra function names using add
or
removing default function names with remove
.
defaultNumericChecks(remove = NULL, add = NULL)
remove |
Character vector of function names. Checks to remove from the returned vector |
add |
Character vector of function names. Checks to add to the returned vector |
A vector of function names.
Default options for which summaries to apply on
numeric type variables in check
and makeDataReport
,
possibly user-modified by adding extra function names using add
or
removing default function names with remove
.
defaultNumericSummaries(remove = NULL, add = NULL)
remove |
Character vector of function names. Summaries to remove from the returned vector |
add |
Character vector of function names. Summaries to add to the returned vector |
A list of function names (as character strings).
variableType
,
countMissing
, uniqueValues
,
centralValue
, quartiles
, minMax
#remove "uniqueValues":
defaultNumericSummaries(remove = "uniqueValues")
description
If the object, x
, is itself of
class checkFunction
, summaryFunction
or visualFunction
, the contents of x
's
attribute description
is returned. Otherwise, NULL
is
returned.
description(x)
description(x) <- value
x |
The object for which the |
value |
New value |
A description of what x
does, given as
a character string.
#Extract the description of the checkFunction identifyMissing
description(identifyMissing)

#Extract the description of the summaryFunction minMax
description(minMax)

#Extract the description of the visualFunction basicVisual
description(basicVisual)
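The attribute-based mechanism described above can be mimicked in base R. This is a minimal sketch under the assumption that the description is stored as a plain attribute named "description"; getDescription, setDescription<- and toyCheck are hypothetical stand-ins, not dataMaid functions.

```r
# Hypothetical stand-in for description(): return the "description"
# attribute for the three function classes, and NULL otherwise.
getDescription <- function(x) {
  if (inherits(x, c("checkFunction", "summaryFunction", "visualFunction"))) {
    attr(x, "description")
  } else {
    NULL
  }
}

# Hypothetical stand-in for the replacement form description(x) <- value.
`setDescription<-` <- function(x, value) {
  attr(x, "description") <- value
  x
}

# A toy function tagged by hand with the class and the attribute.
toyCheck <- structure(function(v, ...) anyNA(v),
                      class = c("checkFunction", "function"),
                      description = "Flag vectors containing NA")

firstDescription <- getDescription(toyCheck)
setDescription(toyCheck) <- "Updated description"
print(getDescription(toyCheck))
print(getDescription(mean))   # NULL: mean is not one of the three classes
```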
An artificial dataset, intended for presenting the extended features of dataMaid
,
which is a toolset for identifying potential errors in a dataset.
exampleData
A data.frame
with 300 observations on the following 6 variables.
addresses
a factor with fictitious US addresses
binomial
a numeric vector with a binomial distributed variable
poisson
a numeric vector with a Poisson distributed variable
gauss
a numeric vector with a Gaussian distributed variable
zigauss
a numeric vector with a zero-inflated Gaussian distributed variable
bpinteraction
a factor with interactions between binomial and poisson values
Artificial data
## Not run: 
isID <- function(v, nMax = NULL, ...) {
  out <- list(problem = FALSE, message = "")
  #any() guards against objects with more than one class
  if (any(class(v) %in% c("character", "factor", "labelled", "numeric", "integer"))) {
    v <- as.character(v)
    lengths <- nchar(v)
    if (all(lengths > 10) & length(unique(lengths)) == 1) {
      out$problem <- TRUE
      out$message <- "Warning: This variable seems to contain ID codes!"
    }
  }
  out
}

countZeros <- function(v, ...) {
  res <- length(which(v == 0))
  summaryResult(list(feature = "No. zeros", result = res, value = res))
}
countZeros <- summaryFunction(countZeros, description = "Count number of zeros",
                              classes = allClasses())
summarize(toyData, numericSummaries = c(defaultNumericSummaries()))

mosaicVisual <- function(v, vnam, doEval) {
  thisCall <- call("mosaicplot", table(v), main = vnam, xlab = "")
  if (doEval) {
    return(eval(thisCall))
  } else return(deparse(thisCall))
}
mosaicVisual <- visualFunction(mosaicVisual,
                               description = "Mosaic plots using graphics",
                               classes = allClasses())

identifyColons <- function(v, nMax = Inf, ... ) {
  v <- unique(na.omit(v))
  problemMessage <- "Note: The following values include colons:"
  problem <- FALSE
  problemValues <- NULL

  problemValues <- v[sapply(gregexpr("[[:xdigit:]]:[[:xdigit:]]", v),
                            function(x) all(x != -1))]

  if (length(problemValues) > 0) {
    problem <- TRUE
  }

  problemStatus <- list(problem = problem, problemValues = problemValues)
  outMessage <- messageGenerator(problemStatus, problemMessage, nMax)

  checkResult(list(problem = problem, message = outMessage,
                   problemValues = problemValues))
}
identifyColons <- checkFunction(identifyColons,
                                description = "Identify non-suffixed nor -prefixed colons",
                                classes = c("character", "factor", "labelled"))

makeDataReport(exampleData, replace = TRUE,
               preChecks = c("isKey", "isEmpty", "isID"),
               allVisuals = "mosaicVisual",
               characterSummaries = c(defaultCharacterSummaries(), "countZeros"),
               factorSummaries = c(defaultFactorSummaries(), "countZeros"),
               labelledSummaries = c(defaultLabelledSummaries(), "countZeros"),
               numericSummaries = c(defaultNumericSummaries(), "countZeros"),
               integerSummaries = c(defaultIntegerSummaries(), "countZeros"),
               characterChecks = c(defaultCharacterChecks(), "identifyColons"),
               factorChecks = c(defaultFactorChecks(), "identifyColons"),
               labelledCheck = c(defaultLabelledChecks(), "identifyColons"))

## End(Not run)
A checkFunction
to be called from
check
that identifies values in a vector
that appear multiple times with different case settings.
identifyCaseIssues(v, nMax = 10)
v |
A character, factor, haven_labelled or labelled variable to check. |
nMax |
The maximum number of problematic values to report.
Default is |
A checkResult
with three entries:
$problem
(a logical indicating whether case issues were found),
$message
(a message describing which values in v
resulted
in case issues) and $problemValues
(the problematic values
in their original format). Note that only unique problematic values
are listed and they are presented in alphabetical order.
check
, allCheckFunctions
,
checkFunction
, checkResult
identifyCaseIssues(c("val", "b", "1", "1", "vAl", "VAL", "oh", "OH"))
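To make the idea of "the same value with different case settings" concrete, here is a minimal base R sketch. It is not the dataMaid implementation; findCaseIssues is a hypothetical name, and the grouping by lower-case form is an assumption about how such a check can be done.

```r
# Hypothetical sketch (not the dataMaid implementation): find values that
# occur in more than one case variant.
findCaseIssues <- function(v) {
  vals <- unique(as.character(v[!is.na(v)]))
  # Group the distinct values by their lower-case form
  groups <- split(vals, tolower(vals))
  # Keep only groups with at least two distinct spellings
  issues <- as.character(unlist(groups[lengths(groups) > 1], use.names = FALSE))
  sort(issues)
}

caseIssues <- findCaseIssues(c("val", "b", "1", "1", "vAl", "VAL", "oh", "OH"))
print(caseIssues)
```

The values "b" and "1" are not reported because each has only one spelling, even though "1" occurs twice.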
A checkFunction
to be called from check
that identifies values that
occur fewer than 6 times in factor, (haven_)labelled, or character variables (that is, loners).
identifyLoners(v, nMax = 10)
v |
A character, (haven_)labelled, or factor variable to check. |
nMax |
The maximum number of problematic values to report.
Default is |
For character, (haven_)labelled, and factor variables, identify values that only have a very low number of observations, as these categories might be problematic when conducting an analysis. Unused factor levels are not considered "loners". "Loners" are defined as values with 5 or fewer observations, reflecting the commonly used rule of thumb for performing chi-squared tests.
A checkResult
with three entries:
$problem
(a logical indicating whether loners were found),
$message
(a message describing which values in v
were loners) and
$problemValues
(the problematic values in their original format).
Note that only unique problematic values
are listed and they are presented in alphabetical order.
check
, allCheckFunctions
,
checkFunction
, checkResult
identifyLoners(c(rep(c("a", "b", "c"), 10), "d", "d"))
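The "5 or fewer observations" rule amounts to a frequency count with a threshold. The following base R sketch illustrates it; findLoners and its maxCount argument are hypothetical and this is not the dataMaid implementation.

```r
# Hypothetical sketch (not the dataMaid implementation): values observed
# at most `maxCount` times are reported as loners.
findLoners <- function(v, maxCount = 5) {
  # as.character() drops unused factor levels, so they are never counted
  counts <- table(as.character(v))
  names(counts)[as.vector(counts) <= maxCount]
}

loners <- findLoners(c(rep(c("a", "b", "c"), 10), "d", "d"))
print(loners)

# An unused factor level ("z") is not reported as a loner
unusedLevel <- findLoners(factor(rep("a", 10), levels = c("a", "z")))
```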
A checkFunction to be called from check
that identifies values that
appear to be miscoded missing values.
identifyMissing(v, nMax = 10, ...)
v |
A variable to check. |
nMax |
The maximum number of problematic values to report.
Default is |
... |
Not in use. |
identifyMissing
tries to identify common choices of missing values outside of the
R standard (NA
). These include special words (NaN and Inf, regardless of case),
one or more -9/9's (e.g. 999, "99", -9, "-99"), one or more -8/8's (e.g. -8, 888, -8888),
Stata style missing values (commencing with ".") and other character strings
("", " ", "-", "NA" miscoded as character). If the variable is numeric/integer or a
character/factor variable consisting only of numbers and with more than 11 different values,
the numeric miscoded missing values (999, 888, -99, -8 etc.) are
only recognized as miscoded missing if they are maximum or minimum, respectively, and the distance
between the second largest/smallest value and this maximum/minimum value is greater than one.
A checkResult
with three entries:
$problem
(a logical indicating whether miscoded missing values were found),
$message
(a message describing which values in v
were suspected to be
miscoded missing values), and $problemValues
(the problematic values
in their original format). Note that only unique problematic values
are listed and that they are presented in alphabetical order.
check
, allCheckFunctions
,
checkFunction
, checkResult
##data(testData)
##testData$miscodedMissingVar
##identifyMissing(testData$miscodedMissingVar)

#Identify miscoded numeric missing values
v1 <- c(1:15, 99)
v2 <- c(v1, 98)
v3 <- c(-999, v2, 9999)
identifyMissing(v1)
identifyMissing(v2)
identifyMissing(v3)
identifyMissing(factor(v3))
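The rule for numeric codes (a 9- or 8-pattern that is the maximum or minimum and lies more than one unit from its nearest neighbour) can be sketched in base R as follows. This is only an illustration of the rule as described in the text: suspectNumericMissing is a hypothetical name, the string-based codes and the threshold on the number of distinct values are ignored, and the actual dataMaid output may differ.

```r
# Hypothetical sketch of the numeric rule described above; it is not the
# dataMaid implementation and ignores the string-based codes.
suspectNumericMissing <- function(v) {
  u <- sort(unique(v[!is.na(v)]))
  n <- length(u)
  if (n < 2) return(numeric(0))
  # TRUE if |x| is written using only 9s or only 8s (e.g. 99, 888)
  isCode <- function(x) grepl("^(9+|8+)$", format(abs(x), scientific = FALSE))
  out <- numeric(0)
  # A negative code must be the minimum and lie more than 1 below its neighbour
  if (u[1] < 0 && isCode(u[1]) && u[2] - u[1] > 1) out <- c(out, u[1])
  # A positive code must be the maximum and lie more than 1 above its neighbour
  if (u[n] > 0 && isCode(u[n]) && u[n] - u[n - 1] > 1) out <- c(out, u[n])
  out
}

w1 <- c(1:15, 99)          # 99 is the maximum, far from 15
w2 <- c(w1, 98)            # 99 is now only 1 away from 98
w3 <- c(-999, w2, 9999)    # both extremes are isolated codes

print(suspectNumericMissing(w1))
print(suspectNumericMissing(w2))
print(suspectNumericMissing(w3))
```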
A checkFunction
to be called from
check
for identifying numeric variables that have
been misclassified as categorical.
identifyNums(v, nVals = 12, ...)
v |
A character, factor, or (haven_)labelled variable to check. |
nVals |
An integer determining how many unique values a variable must have
before it can potentially be determined to be a misclassified numeric variable.
The default is |
... |
Not in use. |
A categorical variable is suspected to be a misclassified
numeric variable if it has the following two properties: First,
it should consist exclusively of numbers (possibly including signs
and decimal points). Secondly, it must have at least nVals
unique values.
The default value of nVals
is 12, which means that
e.g. variables including answers on a scale from 0-10 will
not be recognized as misclassified numerics.
A checkResult
with three entries:
$problem
(a logical indicating whether the variable is suspected to be
a misclassified numeric variable), $message
(if a problem was found,
the following message: "Note: The variable consists exclusively of numbers and takes
a lot of different values. Is it perhaps a misclassified numeric variable?",
otherwise "") and $problemValues
(always NULL
).
check
, allCheckFunctions
,
checkFunction
, checkResult
#Positive and negative numbers, saved as characters
identifyNums(c(as.character(-9:9)))

#An ordinary character variable
identifyNums(c("a", "b", "c", "d", "e.f", "-a", 1:100))
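The two conditions (only numbers, and at least nVals unique values) can be written compactly in base R. This is a minimal sketch, not the dataMaid implementation; looksNumeric is a hypothetical name and the regular expression is one possible reading of "numbers, possibly including signs and decimal points".

```r
# Hypothetical sketch (not the dataMaid implementation) of the two conditions.
looksNumeric <- function(v, nVals = 12) {
  vals <- unique(as.character(v[!is.na(v)]))
  # Optional sign, then digits with an optional decimal part
  # (or a leading decimal point followed by digits)
  allNumbers <- all(grepl("^[+-]?([0-9]+\\.?[0-9]*|\\.[0-9]+)$", vals))
  allNumbers && length(vals) >= nVals
}

manyNumbers <- looksNumeric(as.character(-9:9))                 # 19 distinct numeric strings
withLetters <- looksNumeric(c("a", "b", "c", "d", "e.f", "-a", 1:100))
shortScale  <- looksNumeric(as.character(0:10))                 # only 11 distinct values
print(c(manyNumbers, withLetters, shortScale))
```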
A checkFunction to be called from check
that identifies outlier values
in a Date/numeric/integer variable.
identifyOutliers(v, nMax = 10, maxDecimals = 2)
v |
A Date, numeric or integer variable to check. |
nMax |
The maximum number of problematic values to report.
Default is |
maxDecimals |
A positive integer or |
Outliers are identified based on an outlier rule that is appropriate for asymmetric data. Outliers are observations outside the range
$$\left(Q1 - 1.5 \cdot e^{a \cdot MC} \cdot IQR,\; Q3 + 1.5 \cdot e^{b \cdot MC} \cdot IQR\right)$$
where Q1, Q3, and IQR are the first quartile, third quartile, and inter-quartile range, MC is the 'medcouple', a robust concept and estimator of skewness, and a and b are appropriate constants (-4 and 3). The medcouple is defined as a scaled median difference of the left and right halves of the distribution, and is hence not based on the third moment, unlike the classical skewness.
When the data are symmetric, the measure reduces to the
standard outlier rule also used in Tukey Boxplots (consistent with
the boxplot
function), i.e. as values that are
smaller than the 1st quartile minus 1.5 times the inter quartile range (IQR)
or greater than the third quartile plus 1.5 times the IQR.
For Date variables, the calculations are done on their raw numeric format (as
obtained by using unclass
), after which they are translated back to Dates.
Note that no rounding is performed for Dates, no matter the value of maxDecimals
.
A checkResult
with three entries:
$problem
(a logical indicating whether outliers were found),
$message
(a message describing which values are outliers) and
$problemValues
(the outlier values).
check
, allCheckFunctions
,
checkFunction
, checkResult
, mc
identifyOutliers(c(1:10, 200, 200, 700))
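The effect of the medcouple on the outlier range can be explored with a small sketch. It assumes the adjusted-boxplot form (Q1 - 1.5 * exp(a * MC) * IQR, Q3 + 1.5 * exp(b * MC) * IQR) with a = -4 and b = 3, and it does not compute the medcouple itself: the value mc must be supplied by the caller. The function adjustedRange is hypothetical and not part of dataMaid.

```r
# Hypothetical sketch: compute the adjusted outlier range for a given
# medcouple value. The medcouple is NOT estimated here; `mc` is an input.
adjustedRange <- function(v, mc, a = -4, b = 3) {
  q <- stats::quantile(v, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  c(lower = q[1] - 1.5 * exp(a * mc) * iqr,
    upper = q[2] + 1.5 * exp(b * mc) * iqr)
}

x <- c(1:10, 200, 200, 700)
symRange  <- adjustedRange(x, mc = 0)    # symmetric case: plain 1.5 * IQR fences
skewRange <- adjustedRange(x, mc = 0.5)  # right skew: both fences shift upwards
print(symRange)
print(skewRange)
```

With mc = 0 both exponential factors equal 1, which is the reduction to the standard rule mentioned in the text.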
A checkFunction to be called from check
that identifies outlier values
in a numeric/integer/Date variable by use of the Tukey Boxplot method (consistent with the
boxplot
function).
identifyOutliersTBStyle(v, nMax = 10, maxDecimals = 2)
v |
A numeric, integer or Date variable to check. |
nMax |
The maximum number of problematic values to report.
Default is |
maxDecimals |
A positive integer or |
Outliers are defined in the style of Tukey Boxplots (consistent with the
boxplot
function), i.e. as values that are smaller than the 1st quartile minus
1.5 times the inter quartile range (IQR) or greater than the third quartile plus 1.5 times the IQR.
For Date variables, the calculations are done on their raw numeric format (as
obtained by using unclass
), after which they are translated back to Dates.
Note that no rounding is performed for Dates, no matter the value of maxDecimals
.
A checkResult
with three entries:
$problem
(a logical indicating whether outliers were found),
$message
(a message describing which values are outliers) and
$problemValues
(the outlier values).
check
, allCheckFunctions
,
checkFunction
, checkResult
identifyOutliersTBStyle(c(1:10, 200, 200, 700))
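The handling of Date variables (compute on the unclassed numbers, then translate back) can be illustrated with a base R sketch of the Tukey rule. It assumes the conventional 1.5 * IQR fences and quartiles from quantile(), which may differ slightly from the hinges used by boxplot; tukeyOutliers is a hypothetical name and this is not the dataMaid implementation.

```r
# Hypothetical sketch (not the dataMaid implementation) of the Tukey rule
# with 1.5 * IQR fences, including the round trip for Date variables.
tukeyOutliers <- function(v) {
  isDate <- inherits(v, "Date")
  x <- unclass(v[!is.na(v)])          # Dates become days since 1970-01-01
  q <- stats::quantile(x, c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]
  out <- sort(unique(x[x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr]))
  # Translate back to Dates if the input was a Date variable
  if (isDate) as.Date(out, origin = "1970-01-01") else out
}

numOut <- tukeyOutliers(c(1:10, 200, 200, 700))
print(numOut)

dates <- as.Date("2020-01-01") + c(0:9, 4000)
dateOut <- tukeyOutliers(dates)
print(dateOut)
```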
A checkFunction to be called from check
that identifies prefixed and suffixed whitespace(s) in character,
(haven_)labelled or factor variables.
identifyWhitespace(v, nMax = 10)
v |
A character, (haven_)labelled or factor variable to check. |
nMax |
The maximum number of problematic values to report.
Default is |
A checkResult
with three entries:
$problem
(a logical indicating whether any whitespaces were
found), $message
(a message describing which values were prefixed
or suffixed with whitespace) and $problemValues
(the problematic
values). Note that only unique values are printed in the message, and that
they are sorted alphabetically.
check
, allCheckFunctions
,
checkFunction
, checkResult
identifyWhitespace(c("a", " b", "c", "d ", "e "))
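Detecting prefixed or suffixed whitespace is a one-line pattern match. The sketch below uses base R only; findWhitespace is a hypothetical name and the pattern is an assumption, not the dataMaid implementation.

```r
# Hypothetical sketch (not the dataMaid implementation): report the unique
# values that start or end with a whitespace character.
findWhitespace <- function(v) {
  vals <- unique(as.character(v[!is.na(v)]))
  sort(vals[grepl("^[[:space:]]|[[:space:]]$", vals)])
}

ws <- findWhitespace(c("a", " b", "c", "d ", "e "))
print(ws)
```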
A checkFunction
that checks if v
consists exclusively
of valid Danish civil registration (CPR) numbers, ignoring missing values. This
function is intended for use as a precheck in makeDataReport
, ensuring
that CPR numbers are not included in a dataMaid
output document.
isCPR(v, ...)
v |
A variable (vector) to check. This variable is allowed to have any class. |
... |
Not in use. |
A checkResult
with three entries:
$problem
(a logical indicating whether the variable consists
of CPR numbers), $message
(if a problem was found,
the following message: "Warning: The variable seems to consist of
Danish civil registration (CPR) numbers.",
otherwise "") and $problemValues
(always NULL
).
check
, allCheckFunctions
,
checkFunction
, checkResult
CPRs <- sapply(c("01011988", "02011987", "04052006", "01021990", "01021991",
                 "01021993", "01021994", "01021995", "01021996", "01021997",
                 "01021970", "01021971", "01021972", "01021973", "01021974"),
               dataMaid:::makeCPR)
nonCPRs <- c(1:10)
mixedCPRs <- c(CPRs, nonCPRs)

#identify problem
isCPR(CPRs)

#no problem as there are no CPRs
isCPR(nonCPRs)

#no problem because not ALL values are CPRs
isCPR(mixedCPRs)
A checkFunction
that checks if v
is a key, that is, if every observation has a unique value in v
and
v
is not a numeric/integer nor a Date variable. This
function is intended for use as a precheck in makeDataReport
.
isKey(v)
v |
A variable (vector) to check. All variable types are allowed. |
Note that numeric or integer variables are not considered candidates for keys, as truly continuous measurements will most likely result in unique values for each observation.
A checkResult
with three entries:
$problem
(a logical indicating whether v
is a key),
$message
(if a problem was found, the following message:
"The variable is a key (distinct values for each observation).",
otherwise "") and $problemValues
(always NULL
).
check
, allCheckFunctions
,
checkFunction
, checkResult
keyVar <- c("a", "b", "c", "d", "e", "f")
notKeyVar <- c("a", "a", "b", "c", "d", "e", "f")

isKey(keyVar)
isKey(notKeyVar)
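The key rule (all values distinct, and the variable is neither numeric/integer nor a Date) can be sketched in base R. The function looksLikeKey is hypothetical; this is an illustration of the rule as stated, not the dataMaid implementation.

```r
# Hypothetical sketch (not the dataMaid implementation) of the key rule.
looksLikeKey <- function(v) {
  # Numeric, integer and Date variables are never treated as keys
  if (is.numeric(v) || inherits(v, "Date")) return(FALSE)
  # A key has no repeated values
  anyDuplicated(v) == 0
}

isKeyChar    <- looksLikeKey(c("a", "b", "c", "d", "e", "f"))
isKeyRepeat  <- looksLikeKey(c("a", "a", "b", "c", "d", "e", "f"))
isKeyInteger <- looksLikeKey(1:6)   # distinct, but integer
print(c(isKeyChar, isKeyRepeat, isKeyInteger))
```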
A checkFunction
that checks if v
only
contains a single unique value, aside from missing values. This
function is intended for use as a precheck in makeDataReport
.
isSingular(v)
isEmpty(v)
v |
A variable (vector) to check. All variable types are allowed. |
A checkResult
with three entries:
$problem
(a logical indicating whether v
contains only one value),
$message
(if a problem was found, a message describing which single
value the variable takes and how many missing observations it contains, otherwise
""), and $problemValues
(always NULL
).
check
, allCheckFunctions
,
checkFunction
, checkResult
singularVar <- c(rep("a", 10), NA, NA)
notSingularVar <- c("a", "a", "b", "c", "d", "e", "f", NA, NA)

isSingular(singularVar)
isSingular(notSingularVar)
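The singularity rule, including the count of missing observations mentioned for the message, can be illustrated in base R. This is a minimal sketch with a hypothetical function singularInfo; it does not reproduce the dataMaid message format.

```r
# Hypothetical sketch (not the dataMaid implementation): does the variable
# take exactly one distinct non-missing value, and how many values are missing?
singularInfo <- function(v) {
  vals <- unique(v[!is.na(v)])
  list(singular = length(vals) == 1,
       value = if (length(vals) == 1) vals else NULL,
       nMissing = sum(is.na(v)))
}

s1 <- singularInfo(c(rep("a", 10), NA, NA))
s2 <- singularInfo(c("a", "a", "b", "c", "d", "e", "f", NA, NA))
print(s1$singular)
print(s2$singular)
```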
A checkFunction
that checks if v
has
one of the classes supported by dataMaid, namely character
,
factor
, numeric
, integer
, labelled
,
haven_labelled
,
logical
and Date
(including other classes that inherit
from any of these classes). A user-supplied list can be provided
in the treatXasY
argument, which will let the user decide
how unsupported classes should be treated. This
function is intended for use as a precheck in makeDataReport
.
isSupported(v)
v |
A variable (vector) to check. All variable types are allowed. |
A checkResult
with three entries:
$problem
(a logical indicating whether v
has an unsupported class),
$message
(if a problem was found, a message stating that the class of the
variable is not supported, otherwise
""), and $problemValues
(always NULL
).
check
, allCheckFunctions
,
checkFunction
, checkResult
integerVar <- 1:10 #supported
rawVar <- as.raw(1:10) #not supported

isSupported(integerVar)
isSupported(rawVar)
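Because classes that inherit from a supported class also count as supported, the test is naturally expressed with inherits. The following base R sketch uses the eight class names listed above; looksSupported is a hypothetical name and this is not the dataMaid implementation.

```r
# Hypothetical sketch (not the dataMaid implementation): a variable is
# supported if it inherits from at least one of the listed classes.
supportedClasses <- c("character", "Date", "factor", "integer",
                      "labelled", "haven_labelled", "logical", "numeric")

looksSupported <- function(v) inherits(v, supportedClasses)

supInteger <- looksSupported(1:10)            # integer
supRaw     <- looksSupported(as.raw(1:10))    # raw is not in the list
supOrdered <- looksSupported(factor(c("lo", "hi"), ordered = TRUE))  # inherits from factor
print(c(supInteger, supRaw, supOrdered))
```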
Make a data codebook that summarizes the contents of a dataset. The result is saved to an R markdown file which can be rendered into an easy-to-read codebook in pdf, html or word formats.
makeCodebook(data, vol = "", reportTitle = NULL, file = NULL, ...)
data |
The dataset to be checked. This dataset should be of class |
vol |
Extra text string or numeric that is appended on the end of the output
file name(s). For example, if the dataset is called "myData", no file argument is
supplied and |
reportTitle |
A text string. If supplied, this will be the printed title of the report. If left unspecified, the title includes the name of the supplied dataset. |
file |
The filename of the outputted rmarkdown (.Rmd) file.
If set to |
... |
Additional parameters passed to |
Petersen AH, Ekstrøm CT (2019). “dataMaid: Your Assistant for Documenting Supervised Data Quality Screening in R.” _Journal of Statistical Software_, *90*(6), 1-38. doi: 10.18637/jss.v090.i06 ( https://doi.org/10.18637/jss.v090.i06).
Make a data overview report that summarizes the contents of a dataset and flags potential problems. The potential problems are identified by running a set of class-specific validation checks, so that different checks are performed on different variable types. The checking steps can be customized according to user input and/or data type of the inputted variable. The checks are saved to an R markdown file which can be rendered into an easy-to-read data report in pdf, html or word formats. This report also includes summaries and visualizations of each variable in the dataset.
makeDataReport(
  data,
  output = NULL,
  render = TRUE,
  useVar = NULL,
  ordering = c("asIs", "alphabetical"),
  onlyProblematic = FALSE,
  labelled_as = c("factor"),
  mode = c("summarize", "visualize", "check"),
  smartNum = TRUE,
  preChecks = c("isKey", "isSingular", "isSupported"),
  file = NULL,
  replace = FALSE,
  vol = "",
  standAlone = TRUE,
  twoCol = TRUE,
  quiet = TRUE,
  openResult = TRUE,
  summaries = setSummaries(),
  visuals = setVisuals(),
  checks = setChecks(),
  listChecks = TRUE,
  maxProbVals = 10,
  maxDecimals = 2,
  addSummaryTable = TRUE,
  codebook = FALSE,
  reportTitle = NULL,
  treatXasY = NULL,
  includeVariableList = TRUE,
  ...
)
data |
The dataset to be checked. This dataset should be of class |
output |
Output format. Options are |
render |
Should the output file be rendered (defaults to |
useVar |
Variables to describe in the report.
If |
ordering |
Choose the ordering of the variables in the variable presentation. The options are "asIs" (ordering as in the dataset) and "alphabetical" (alphabetical order). |
onlyProblematic |
A logical. If |
labelled_as |
A string explaining the way to handle labelled and haven_labelled vectors.
Currently |
mode |
Vector of tasks to perform among the three categories "summarize", "visualize" and "check".
The default, |
smartNum |
If |
preChecks |
Vector of function names for check functions used in the pre-check stage. The pre-check stage consists of variable checks that should be performed before the summary/visualization/checking step. If any of these checks find problems, the variable will not be summarized nor visualized nor checked. |
file |
The filename of the outputted rmarkdown (.Rmd) file.
If set to |
replace |
If |
vol |
Extra text string or numeric that is appended on the end of the output
file name(s). For example, if the dataset is called "myData", no file argument is
supplied and |
standAlone |
A logical. If |
twoCol |
A logical. Should the results from the summarize and visualize
steps be presented in two columns? Defaults to |
quiet |
A logical. If |
openResult |
A logical. If |
summaries |
A list of summaries to use on each supported variable type. We recommend
using |
visuals |
A list of visual functions to use on each supported variable type. We recommend
using |
checks |
A list of checks to use on each supported variable type. We recommend
using |
listChecks |
A logical. Controls whether the checks that were used for each
possible variable type are summarized in the output. Defaults to |
maxProbVals |
A positive integer or |
maxDecimals |
A positive integer or |
addSummaryTable |
A logical. If |
codebook |
A logical. Defaults to |
reportTitle |
A text string. If supplied, this will be the printed title of the report. If left unspecified, the title includes the name of the supplied dataset. |
treatXasY |
A list that indicates how non-standard variable classes should be treated.
This parameter allows you to include variables that are not of class |
includeVariableList |
A logical indicating whether the results of the summarize/visualize/check-steps
should be added to the report. Defaults to |
... |
Other arguments that are passed on to the precheck, checking, summary and visualization functions. |
For each variable, a set of pre-check functions (controlled by the
preChecks
argument) are first run and then a battery of
functions is applied depending on the variable class. For each
variable type the summarize/visualize/check functions are applied
and the results are written to an R markdown file.
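The two-stage flow described above (pre-checks first, then class-specific functions only if no pre-check found a problem) can be sketched in base R. Everything in this sketch is hypothetical: processVariable, the toy pre-checks and the toy class checks only mimic the control flow and are not dataMaid functions.

```r
# Hypothetical sketch of the control flow only; none of these helpers are
# dataMaid functions. Each toy check returns TRUE if it finds a problem.
processVariable <- function(v, preChecks, classChecks) {
  # Stage 1: run all pre-checks
  preProblems <- vapply(preChecks, function(f) isTRUE(f(v)), logical(1))
  if (any(preProblems)) {
    return(list(skipped = TRUE,
                failedPreChecks = names(preChecks)[preProblems]))
  }
  # Stage 2: apply the functions registered for the variable's class
  cls <- class(v)[1]
  results <- lapply(classChecks[[cls]], function(f) f(v))
  list(skipped = FALSE, class = cls, results = results)
}

toyPreChecks <- list(
  isKeyLike = function(v) !is.numeric(v) && anyDuplicated(v) == 0,
  isSingularLike = function(v) length(unique(v[!is.na(v)])) == 1
)
toyClassChecks <- list(
  character = list(hasWhitespace = function(v)
    any(grepl("^[[:space:]]|[[:space:]]$", v))),
  integer = list(hasNegative = function(v) any(v < 0, na.rm = TRUE))
)

r1 <- processVariable(c("a", "b", "c"), toyPreChecks, toyClassChecks)
r2 <- processVariable(c(1L, 2L, 2L, -1L), toyPreChecks, toyClassChecks)
str(r1)
str(r2)
```

In the sketch, the character vector is skipped because it looks like a key, while the integer vector passes the pre-checks and is handed to the integer-specific check.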
The function does not return anything. Its side effect (the production of a data report) is the reason for running the function.
Petersen AH, Ekstrøm CT (2019). “dataMaid: Your Assistant for Documenting Supervised Data Quality Screening in R.” _Journal of Statistical Software_, *90*(6), 1-38. doi: 10.18637/jss.v090.i06 ( https://doi.org/10.18637/jss.v090.i06).
data(testData)
data(toyData)

check(toyData)

## Not run: 
DF <- data.frame(x = 1:15)
makeDataReport(DF)

## End(Not run)

## Not run: 
data(testData)
makeDataReport(testData)

## End(Not run)

# Overwrite any existing files generated by makeDataReport
## Not run: 
makeDataReport(testData, replace=TRUE)

## End(Not run)

# Change output format to Word/docx:
## Not run: 
makeDataReport(testData, replace=TRUE, output = "word")

## End(Not run)

# Only include problematic variables in the output document
## Not run: 
makeDataReport(testData, replace=TRUE, onlyProblematic=TRUE)

## End(Not run)

# Add user defined check-function to the checks performed on character variables:
# Here we add functionality to search for the string wally (ignoring case)
## Not run: 
wheresWally <- function(v, ...) {
  res <- grepl("wally", v, ignore.case=TRUE)
  problem <- any(res)
  message <- "Wally was found in these data"
  checkResult(list(problem = problem, message = message, problemValues = v[res]))
}

wheresWally <- checkFunction(wheresWally,
                             description = "Search for the string 'wally' ignoring case",
                             classes = c("character")
)

# Add the newly defined function to the list of checks used for characters.
# The default*Checks() functions take the extra function names in their add argument.
makeDataReport(testData,
               checks = setChecks(character = defaultCharacterChecks(add = "wheresWally")),
               replace=TRUE)

## End(Not run)

#Handle non-supported variable classes using treatXasY: treat raw as character and
#treat complex as numeric. We also add a list variable, but as lists are not
#handled through treatXasY, this variable will be caught in the preChecks and skipped:
## Not run: 
toyData$rawVar <- as.raw(c(1:14, 1))
toyData$compVar <- c(1:14, 1) + 2i
toyData$listVar <- as.list(c(1:14, 1))
makeDataReport(toyData, replace = TRUE,
               treatXasY = list(raw = "character", complex = "numeric"))

## End(Not run)
Helper function for producing output messages for
checkFunction
type functions.
messageGenerator(
  problemStatus,
  message = "Note that a check function found the following problematic values:",
  nMax = 10
)
problemStatus |
A list consisting of two entries:
|
message |
Optional, but recommended. A message describing what problem the
problem values are related to. If |
nMax |
Maximum number of problem values to be printed in the message. If the total
number of problem values exceeds nMax, the number of omitted problem
values is added to the message. Defaults to |
This function is a tool for building checkFunction
s for the
dataMaid
makeDataReport
function. checkFunction
s will often identify a number
of values in a variable that are somehow problematic. messageGenerator
takes
these values, pastes them together with a problem description and makes sure that the
formatting is appropriate for being rendered in a rmarkdown
document.
We recommend writing short and precise problem descriptions (see examples),
but if no message is supplied, the following message is generated:
"Note that a check function found the following problematic values: [problem values]".
A character string with a problem description.
check, checkFunction, makeDataReport
#Variables with/without underscores
noUSVar <- c(1:10)
USVar <- c("_a", "n_b", "b_", "_", 1:10)

#Define a checkFunction using messageGenerator with a manual
#problem description:
identifyUnderscores <- function(v, nMax = Inf) {
  v <- as.character(v)
  underscorePlaces <- regexpr("_", v) > 0
  problemValues <- unique(v[underscorePlaces])
  problem <- any(underscorePlaces)
  message <- messageGenerator(list(problemValues = problemValues, problem = problem),
                              "The following values contain underscores:",
                              nMax = nMax)
  checkResult(list(problem = problem, message = message,
                   problemValues = problemValues))
}

identifyUnderscores(noUSVar) #no problem
identifyUnderscores(USVar) #problems

#Only print the first two problem values in the message:
identifyUnderscores(USVar, nMax = 2)

#Define same function, but without a manual problem description in
#the messageGenerator-call:
identifyUnderscores2 <- function(v, nMax = Inf) {
  v <- as.character(v)
  underscorePlaces <- regexpr("_", v) > 0
  problemValues <- unique(v[underscorePlaces])
  problem <- any(underscorePlaces)
  message <- messageGenerator(list(problemValues = problemValues, problem = problem),
                              nMax = nMax)
  checkResult(list(problem = problem, message = message,
                   problemValues = problemValues))
}

identifyUnderscores2(noUSVar) #no problem
identifyUnderscores2(USVar) #problems
A summaryFunction, intended to be called from summarize, which returns the minimum and maximum values of a variable. NA, NaN and Inf values are removed prior to the computations.
minMax(v, maxDecimals = 2)
v |
A variable (vector) of type numeric or integer. |
maxDecimals |
A positive integer or |
An object of class summaryResult with the following entries: $feature ("Min. and max."), $result (the minimum and maximum of v), and $value (minimum and maximum in their original format).
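The computation described above can be approximated in base R as follows. This is a minimal sketch, not the dataMaid implementation: the helper name sketchMinMax, the "; " separator in the result string and the reading of maxDecimals as a number of decimals used for rounding are all assumptions made for this example.

```r
# Hypothetical stand-in for minMax(), for illustration only.
sketchMinMax <- function(v, maxDecimals = 2) {
  # is.finite() is FALSE for NA, NaN, Inf and -Inf, so all are dropped
  v <- v[is.finite(v)]
  rng <- range(v)
  list(feature = "Min. and max.",
       # Rounded, printable version of the result
       result = paste(round(rng, maxDecimals), collapse = "; "),
       # Unrounded values in their original format
       value = rng)
}

sketchMinMax(c(3.14159, NA, Inf, -2, NaN))
```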
summaryFunction, summarize, summaryResult, allSummaryFunctions
minMax(c(1:100))
A dataset with information about the first 45 US presidents as well as a 46th person, who is not a US president. The dataset was constructed to show the capabilities of dataMaid and therefore, it has been constructed to include errors and miscodings. Each observation in the dataset corresponds to a person. The dataset uses the non-standard class Name which is simply an attribute that has been added to two variables in order to show how dataMaid handles non-supported classes.
presidentData
A data frame with 46 rows and 11 variables.
A Name type variable containing the last name of the president.
A Name type variable containing the first name of the president.
A factor variable indicating the order of the presidents (with George Washington as number 1 and Donald Trump as number 45).
A Date variable with the birthday of the president.
A character variable with the state in which the president was born.
A numeric variable indicating whether there was an assassination attempt (1) or not (0) on the president.
A factor variable with the sex of the president.
A factor variable with the ethnicity of the president.
A numeric variable with the duration of the presidency, in years.
A character variable with the age at inauguration.
A complex type variable with a fictional favorite number for each president.
Artificial dataset constructed based on the US president dataset available from Data Explorer.
Petersen AH, Ekstrøm CT (2019). “dataMaid: Your Assistant for Documenting Supervised Data Quality Screening in R.” _Journal of Statistical Software_, *90*(6), 1-38. doi: 10.18637/jss.v090.i06 (https://doi.org/10.18637/jss.v090.i06).
data(presidentData)
A summaryFunction, intended to be called from summarize, which calculates the 1st and 3rd quartiles of a variable. NA, NaN and Inf values are removed prior to the computations.
quartiles(v, maxDecimals = 2)
v |
A variable (vector) of type numeric or integer. |
maxDecimals |
A positive integer or |
The quartiles are computed using the quantile function from stats, using type 7 quantiles for integer and numeric variables and type 1 quantiles for Date variables.
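To make the difference between the two quantile types concrete, the sketch below calls stats::quantile directly on a small made-up numeric vector and a made-up Date vector. Type 7 interpolates between order statistics, while type 1 always returns an observed value, which is why it is a natural fit for dates. The data are invented for this example and the calls are not taken from the dataMaid source.

```r
# Numeric data: type 7 (the default in quantile()) interpolates linearly
x <- c(1, 2, 3, 4, 10)
q7 <- quantile(x, probs = c(0.25, 0.75), type = 7)
q7

# Date data: type 1 (inverse of the empirical distribution function)
# returns values that actually occur in the data
d <- as.Date("2020-01-01") + c(0, 1, 2, 3, 9)
q1 <- quantile(d, probs = c(0.25, 0.75), type = 1)
q1
```

For these inputs the numeric quartiles are 2 and 4, and the Date quartiles are the second and fourth observed dates.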
An object of class summaryResult with the following entries: $feature ("1st and 3rd quartiles"), $result (the 1st and 3rd quartiles of v) and $value (the quartiles in their original format).
summaryFunction, summarize, summaryResult, allSummaryFunctions
quartiles(c(1:100))
quartiles(rnorm(1000), maxDecimals = 4)
A summaryFunction, intended to be called from summarize, which returns the reference level of a factor variable, i.e. the first category as returned by levels(v). This level will serve as the reference category and get absorbed into the intercept for most standard model fitting procedures and therefore, it may be convenient to know.
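The following base R sketch shows what "reference level" means in practice. It uses a made-up factor and relies only on levels(), relevel() and model.matrix() from base R and stats; it does not call refCat itself.

```r
f <- factor(c("b", "a", "c", "a"))

# The reference level is the first element of levels()
ref <- levels(f)[1]
ref

# With the default treatment contrasts, the reference level gets no
# column of its own in the design matrix: it is absorbed into the intercept
designCols <- colnames(model.matrix(~ f))
designCols

# relevel() changes which level acts as the reference
f2 <- relevel(f, ref = "c")
levels(f2)[1]
```

Here the design matrix has the columns "(Intercept)", "fb" and "fc", with no separate column for the reference level "a".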
refCat(v, ...)
v |
A variable (vector) of type factor. |
... |
Not in use. |
An object of class summaryResult with the following entries: $feature ("Reference level"), $result (the reference level of v), and $value (identical to $result).
summaryFunction, summarize, summaryResult, allSummaryFunctions
refCat(factor(letters))
Render an Rmarkdown (.Rmd) file, file, to the output format specified in its preamble. If no output format is specified, it will be rendered to html.
render(file, quiet)
file |
A character string path to the file that is to be rendered. This file must be of type Rmarkdown (.Rmd) |
quiet |
A logical. Should messages during rendering be suppressed? |
This function is merely a simplified version (in terms of possible arguments) of the rendering function from the rmarkdown package. We therefore refer to that function for more details: render. We have included this simplified version in dataMaid in order to help new R users with rendering their output documents as generated by makeDataReport.
This function is a tool for easily specifying the checks argument of makeDataReport. Note that all available check function options can be inspected by calling allCheckFunctions().
setChecks( character = defaultCharacterChecks(), factor = defaultFactorChecks(), labelled = defaultLabelledChecks(), haven_labelled = defaultHavenlabelledChecks(), numeric = defaultNumericChecks(), integer = defaultIntegerChecks(), logical = defaultLogicalChecks(), Date = defaultDateChecks(), all = NULL )
character |
A character vector of function names to be used as checks for character
variables. The default options are available by calling |
factor |
A character vector of function names to be used as checks for factor
variables. The default options are available by calling |
labelled |
A character vector of function names to be used as checks for labelled
variables. The default options are available by calling |
haven_labelled |
A character vector of function names to be used as checks for haven_labelled
variables. The default options are available by calling |
numeric |
A character vector of function names to be used as checks for numeric
variables. The default options are available by calling |
integer |
A character vector of function names to be used as checks for integer
variables. The default options are available by calling |
logical |
A character vector of function names to be used as checks for logical
variables. The default options are available by calling |
Date |
A character vector of function names to be used as checks for Date
variables. The default options are available by calling |
all |
A character vector of function names to be used as checks for all variables. Note that this overrules the choices made for specific variable types by using the other arguments. |
A list with one entry for each data class supported by makeDataReport. Each entry then contains a character vector of function names that are to be called as checks for that variable type.
makeDataReport, allCheckFunctions, defaultCharacterChecks, defaultFactorChecks, defaultLabelledChecks, defaultHavenlabelledChecks, defaultNumericChecks, defaultIntegerChecks, defaultLogicalChecks, defaultDateChecks
#Only identify missing values for character, factor and labelled variables:
setChecks(character = "identifyMissing", factor = "identifyMissing",
  labelled = "identifyMissing")

#Used in a call to makeDataReport():
## Not run: 
data(toyData)
makeDataReport(toyData,
  checks = setChecks(character = "identifyMissing", factor = "identifyMissing",
    labelled = "identifyMissing"),
  replace = TRUE)
## End(Not run)
This function is a tool for easily specifying the summaries argument of makeDataReport. Note that all available summary function options can be inspected by calling allSummaryFunctions().
setSummaries( character = defaultCharacterSummaries(), factor = defaultFactorSummaries(), labelled = defaultLabelledSummaries(), haven_labelled = defaultHavenlabelledSummaries(), numeric = defaultNumericSummaries(), integer = defaultIntegerSummaries(), logical = defaultLogicalSummaries(), Date = defaultDateSummaries(), all = NULL )
character |
A character vector of function names to be used as summaries for character
variables. The default options are available by calling |
factor |
A character vector of function names to be used as summaries for factor
variables. The default options are available by calling |
labelled |
A character vector of function names to be used as summaries for labelled
variables. The default options are available by calling |
haven_labelled |
A character vector of function names to be used as summaries for haven_labelled
variables. The default options are available by calling |
numeric |
A character vector of function names to be used as summaries for numeric
variables. The default options are available by calling |
integer |
A character vector of function names to be used as summaries for integer
variables. The default options are available by calling |
logical |
A character vector of function names to be used as summaries for logical
variables. The default options are available by calling |
Date |
A character vector of function names to be used as summaries for Date
variables. The default options are available by calling |
all |
A character vector of function names to be used as summaries for all variables. Note that this overrules the choices made for specific variable types by using the other arguments. |
A list with one entry for each data class supported by makeDataReport. Each entry then contains a character vector of function names that are to be called as summaries for that variable type.
makeDataReport, allSummaryFunctions, defaultCharacterSummaries, defaultFactorSummaries, defaultLabelledSummaries, defaultHavenlabelledSummaries, defaultNumericSummaries, defaultIntegerSummaries, defaultLogicalSummaries, defaultDateSummaries
#Don't include central value (median/mode) summary for numerical and integer
#variables:
setSummaries(numeric = defaultNumericSummaries(remove = "centralValue"),
  integer = defaultIntegerSummaries(remove = "centralValue"))

#Used in a call to makeDataReport(). The result is passed by name as the
#summaries argument:
## Not run: 
data(toyData)
makeDataReport(toyData,
  summaries = setSummaries(numeric = defaultNumericSummaries(remove = "centralValue"),
    integer = defaultIntegerSummaries(remove = "centralValue")),
  replace = TRUE)
## End(Not run)
This function is a tool for easily specifying the visuals argument of makeDataReport. Note that only a single visual function can be provided for each variable type. If more than one is supplied, only the first one is used. The default is to use a single visual function for all variable types (as specified in the argument all), but class-specific choices of visual functions can also be used. Note that class-specific arguments overwrite the contents of all. Note that all available visual function options can be inspected by calling allVisualFunctions().
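The precedence rules above (one function per class, first element only, class-specific choices win over all) can be sketched in base R as follows. The function sketchSetVisuals is a hypothetical re-implementation written for this example and is not how dataMaid is coded. The class names are the eight data classes listed for allClasses().

```r
# Hypothetical re-implementation of the precedence rule, for illustration only.
sketchSetVisuals <- function(..., all = "standardVisual") {
  classes <- c("character", "Date", "factor", "integer", "labelled",
               "haven_labelled", "logical", "numeric")
  # Start from the common choice given in 'all' (first element only)
  out <- setNames(as.list(rep(all[1], length(classes))), classes)
  # Class-specific, non-NULL choices overwrite it (again, first element only)
  specific <- list(...)
  for (cl in intersect(names(specific), classes)) {
    if (!is.null(specific[[cl]])) out[[cl]] <- specific[[cl]][1]
  }
  out
}

vis <- sketchSetVisuals(numeric = c("basicVisual", "tableVisual"))
vis$numeric
vis$factor
```

In this sketch, numeric variables get "basicVisual" (the second supplied name is ignored) while all other classes keep the value of all.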
setVisuals( character = NULL, factor = NULL, labelled = NULL, haven_labelled = NULL, numeric = NULL, integer = NULL, logical = NULL, Date = NULL, all = "standardVisual" )
character |
A function name (character string) to be used as the visual function for character
variables. If |
factor |
A function name (character string) to be used as the visual function for factor
variables. If |
labelled |
A function name (character string) to be used as the visual function for labelled
variables. If |
haven_labelled |
A function name (character string) to be used as the visual function for haven_labelled
variables. If |
numeric |
A function name (character string) to be used as the visual function for numeric
variables. If |
integer |
A function name (character string) to be used as the visual function for integer
variables. If |
logical |
A function name (character string) to be used as the visual function for logical
variables. If |
Date |
A function name (character string) to be used as the visual function for Date
variables. If |
all |
A function name (character string) to be used as the visual function for all variables. |
A list with one entry for each data class supported by makeDataReport. Each entry then contains a character string with a function name that is to be called as the visual function for that variable type.
makeDataReport, allVisualFunctions
#Set visual type to basicVisual for all variable types:
setVisuals(all = "basicVisual")

#Used in a call to makeDataReport():
## Not run: 
data(toyData)
makeDataReport(toyData, visuals = setVisuals(all = "basicVisual"),
  replace = TRUE)
## End(Not run)
Plot the distribution of a variable, depending on its data class, by use of ggplot2. Note that standardVisual is a visualFunction, compatible with the visualize and makeDataReport functions.
standardVisual(v, vnam, doEval = TRUE)
v |
The variable (vector) to be plotted. |
vnam |
The name of the variable which will appear as the title of the plot. |
doEval |
If TRUE, the plot itself is returned. Otherwise, the function returns a character string containing standalone R code for producing the plot. |
For character, factor, logical and (haven_)labelled variables, a barplot is produced. For numeric, integer or Date variables, standardVisual produces a histogram instead. Note that for integer and numeric variables, all non-finite (i.e. NA, NaN, Inf) values are removed prior to plotting. For character, Date, factor, (haven_)labelled and logical variables, only NA values are removed.
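The two filtering rules can be expressed in a few lines of base R. The helper dropForPlot is an assumption made for this illustration and does not exist in dataMaid; it only mirrors the rule stated above.

```r
# Hypothetical helper: remove the values that would not be plotted.
dropForPlot <- function(v) {
  if (is.numeric(v)) {
    # numeric and integer: drop NA, NaN, Inf and -Inf
    v[is.finite(v)]
  } else {
    # character, Date, factor, labelled, logical: drop NA only
    v[!is.na(v)]
  }
}

dropForPlot(c(1, NA, Inf, NaN, 2))
dropForPlot(c("a", NA, "b"))
```

Note that is.numeric() returns FALSE for Date and factor vectors, so these fall into the second branch.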
## Not run: 
#Save a variable
myVar <- c(1:10)

#Plot a variable
standardVisual(myVar, "MyVar")

#Produce code for plotting a variable
standardVisual(myVar, "MyVar", doEval = FALSE)
## End(Not run)
Generic shell function that produces a summary of a variable (or for each variable in an entire dataset), given a number of summary functions and depending on its data class.
summarize(v, reportstyleOutput = FALSE, summaries = setSummaries(), ...)
v |
The variable (vector) or dataset (data.frame) to be summarized. |
reportstyleOutput |
Logical indicating whether the output should be formatted for inclusion in the report (escaped matrix) or not. Defaults to not. |
summaries |
A list of summaries to use on each supported variable type. We recommend
using |
... |
Additional argument passed to data class specific methods. |
Summary functions are supplied using their names (in character strings) in the class-specific arguments of setSummaries(), e.g. character = c("countMissing", "uniqueValues") for character variables and similarly for the remaining 7 data classes (factor, Date, labelled, haven_labelled, numeric, integer, logical). Note that an overview of all available summaryFunctions can be obtained by calling allSummaryFunctions.
The default choices of summaryFunctions are available in data class specific functions, e.g. defaultCharacterSummaries() and defaultNumericSummaries(). A complete overview of all default options can be obtained by calling setSummaries().
A user defined summary function can be supplied using its function name. Note however that it should take a vector as argument and return a list of the form list(feature="Feature name", result="The result"). More details on how to construct valid summary functions are found in summaryFunction.
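Because summary functions are supplied by name, each name has to be resolved to a function and called on the variable. The base R sketch below illustrates that idea with two made-up summary functions that return a list of the form list(feature = ..., result = ...). The names nMissing, nDistinct and applyByName are hypothetical, and the dispatch shown here is not the dataMaid implementation.

```r
# Two hypothetical summary functions with the required return structure
nMissing <- function(v, ...) {
  list(feature = "No. missing", result = sum(is.na(v)))
}
nDistinct <- function(v, ...) {
  list(feature = "No. distinct", result = length(unique(v[!is.na(v)])))
}

# Call every function named in fnNames on the variable v
applyByName <- function(v, fnNames) {
  lapply(fnNames, function(f) do.call(f, list(v)))
}

res <- applyByName(c("a", "b", NA, "a"), c("nMissing", "nDistinct"))
res[[1]]$result
res[[2]]$result
```

For this input there is one missing value and there are two distinct non-missing values.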
The return value depends on the value of reportstyleOutput.
If reportstyleOutput = FALSE (the default): If v is a variable, a list of summaryResult objects, one summaryResult for each summary function called on v. If v is a dataset, then summarize() returns a list of lists of summaryResult objects instead; one list for each variable in v.
If reportstyleOutput = TRUE:
If v is a single variable: A matrix with two columns, feature and result, and one row for each summary function that was called. Character strings in this matrix are escaped such that they are ready for Rmarkdown rendering.
If v is a full dataset: A list of matrices as described above, one for each variable in the dataset.
Petersen AH, Ekstrøm CT (2019). “dataMaid: Your Assistant for Documenting Supervised Data Quality Screening in R.” _Journal of Statistical Software_, *90*(6), 1-38. doi: 10.18637/jss.v090.i06 (https://doi.org/10.18637/jss.v090.i06).
setSummaries, summaryFunction, allSummaryFunctions, summaryResult, defaultCharacterSummaries, defaultFactorSummaries, defaultLabelledSummaries, defaultHavenlabelledSummaries, defaultNumericSummaries, defaultIntegerSummaries, defaultLogicalSummaries
#Default summary for a character vector:
charV <- c("a", "b", "c", "a", "a", NA, "b", "0")
summarize(charV)

#Inspect default character summary functions:
defaultCharacterSummaries()

#Define a new summary function and add it to the summary for character vectors:
countZeros <- function(v, ...) {
  res <- length(which(v == 0))
  summaryResult(list(feature="No. zeros", result = res, value = res))
}
summarize(charV,
  summaries = setSummaries(character = defaultCharacterSummaries(add = "countZeros")))

#Does nothing, as intV is not affected by characterSummaries
intV <- c(0:10)
summarize(intV,
  summaries = setSummaries(character = defaultCharacterSummaries(add = "countZeros")))

#But supplying the argument for integer variables changes the summary:
summarize(intV, summaries = setSummaries(integer = "countZeros"))

#Summarize a full dataset:
data(cars)
summarize(cars)

#Summarize a variable and obtain report-style output (formatted for markdown)
summarize(charV, reportstyleOutput = TRUE)
Convert a function, f, into an S3 summaryFunction object. This adds f to the overview list returned by an allSummaryFunctions() call.
summaryFunction(f, description, classes = NULL)
f |
A function. See details and examples below for the exact requirements of this function. |
description |
A character string describing the summary
returned by |
classes |
The classes for which |
summaryFunction represents the functions used in summarize and makeDataReport for summarizing the features of variables in a dataset.
An example of defining a new summaryFunction is given below. Note that the minimal requirement for such a function (in order for it to be compatible with summarize() and makeDataReport()) is the following input/output-structure: It must input at least two arguments, namely v (a vector variable) and ... (the dots argument). Additional implemented arguments from summarize() and makeDataReport() include maxDecimals, see e.g. the pre-defined summaryFunction minMax for more details about how this argument should be used.
The output must be a list with at least the two entries $feature (a short character string describing what was summarized) and $result (a value or a character string with the result of the summarization). However, if the result of a summaryFunction is furthermore converted to a summaryResult object, a print() method also becomes available for consistent formatting of summaryFunction results.
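As a minimal base R sketch of the input/output structure described above, the function below accepts v, maxDecimals and ..., and returns a list with $feature and $result. The name meanSketch and the use of maxDecimals for rounding are assumptions made for this example. Converting the result with summaryResult() and registering the function with summaryFunction() are left out so that the sketch runs without dataMaid.

```r
# Hypothetical summary function following the required structure
meanSketch <- function(v, maxDecimals = 2, ...) {
  # Drop NA, NaN and infinite values before summarizing
  v <- v[is.finite(v)]
  list(feature = "Mean", result = round(mean(v), maxDecimals))
}

out <- meanSketch(c(1, 2, 4, NA))
out$result

# Check the minimal output structure: both required entries are present
hasRequired <- all(c("feature", "result") %in% names(out))
hasRequired
```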
Note that all available summaryFunctions are listed by the call allSummaryFunctions() and we recommend looking into these functions, if more knowledge about summaryFunctions is required.
A function of class summaryFunction which has two attributes, namely classes and description.
allSummaryFunctions, summarize, makeDataReport, checkResult
#Define a valid summaryFunction that can be called from summarize()
#and makeDataReport(). This function counts how many zero entries a given
#variable has:
countZeros <- function(v, ...) {
  res <- length(which(v == 0))
  summaryResult(list(feature = "No. zeros", result = res, value = res))
}

#Convert it to a summaryFunction object. We don't count zeros for
#logical variables, as they have a different meaning here (FALSE):
countZeros <- summaryFunction(countZeros,
  description = "Count number of zeros",
  classes = setdiff(allClasses(), "logical"))

#Call it directly:
countZeros(c(0, 0, 0, 1:100))

#Call it via summarize(), using the summaries argument:
data(cars)
summarize(cars,
  summaries = setSummaries(numeric = defaultNumericSummaries(add = "countZeros")))

#Note that countZeros now appears in an allSummaryFunctions() call:
allSummaryFunctions()
Convert a list resulting from the summaries performed in a summaryFunction into a summaryResult object, thereby supplying it with a print() method.
summaryResult(ls)
ls |
A list with entries |
An S3 object of class summaryResult, identical to the inputted list, ls, except for its class attribute.
Produce a table of the distribution of a categorical (character, labelled, haven_labelled or factor) variable. Note that tableVisual is a visualFunction, compatible with the visualize and makeDataReport functions.
tableVisual(v, vnam, doEval = TRUE)
v |
The variable (vector) to be plotted. |
vnam |
The name of the variable. |
doEval |
If TRUE, the table itself is returned. Otherwise, the function returns a character string containing standalone R code for producing the table. |
visualize, basicVisual, standardVisual
## Not run: 
#Save a variable
myVar <- c("red", "blue", "red", "red", NA)

#Plot a variable
tableVisual(myVar, "MyVar")

#Produce code for plotting a variable
tableVisual(myVar, "MyVar", doEval = FALSE)
## End(Not run)
A dataset of constructed data used as test bed when using dataMaid for identifying potential errors in a dataset.
testData
A data frame with 15 rows and 14 variables.
A character vector with a single missing observation.
A factor vector with a miscoded missing observation, 999.
A numeric vector.
An integer vector.
A logical vector with three missing observations.
A character vector with unique codes for each observation.
A numeric vector where all entries are identical.
A numeric vector with a possible outlier (100).
A numeric vector that takes only two different values.
A character vector with levels in the format of Danish CPR numbers (social security numbers).
A character vector with levels in the format of Danish CPR numbers (social security numbers) with unique levels for each observation.
A character vector with levels corresponding to various miscoded (non-NA) missing codes.
A misclassified factor variable, where every level is a number and many (12) different levels are in use.
A Date vector.
A labelled vector with two missing observations.
Artificial data
data(testData)
An artificial dataset, intended for presenting the key features of dataMaid, which is a toolset for identifying potential errors in a dataset.
toyData
A data.frame with 15 rows and 6 variables.
A factor variable with two levels ("red" and "blue") and a few (correctly coded) missing observations. This represents the colour of a pill.
A numeric variable with one obvious outlier value (82), two miscoded missing values (999 and NaN) and a few correctly coded missing values. The number of previous events.
A factor variable where two of the levels ("other" and "OTHER") are the same word with different case settings. Moreover, the variable includes a Stata-style miscoded missing value ("."). Used to represent geographical regions or treatment centers.
A numeric variable (random draws from a standard normal distribution). Representing a change in a measured variable.
A factor variable with unique codes for each observation (a character string with a number between 1 and 15), i.e. a key variable.
A factor variable that has the same level ("Irrelevant") for all observations, i.e. an empty variable. The latest song played on Spotify.
Artificial data
Petersen AH, Ekstrøm CT (2019). “dataMaid: Your Assistant for Documenting Supervised Data Quality Screening in R.” _Journal of Statistical Software_, *90*(6), 1-38. doi: 10.18637/jss.v090.i06 (https://doi.org/10.18637/jss.v090.i06).
data(toyData)
A summaryFunction type function, intended to be called from summarize, which counts the number of unique (excluding NAs) values in a variable.
uniqueValues(v, ...)
v |
A variable (vector). |
... |
Not in use. |
An object of class summaryResult with the following entries: $feature ("No. unique values") and $result (the number of unique values in v).
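Counting unique non-missing values can be sketched in base R as shown below. This only illustrates the idea and is not the dataMaid implementation. In particular, is.na() is also TRUE for NaN in base R, so this sketch excludes NaN along with NA; how uniqueValues itself treats NaN is not stated in this manual.

```r
v <- c(1:3, rep(NA, 10), Inf, NaN)

# Drop missing values (is.na() is TRUE for both NA and NaN),
# then count the distinct values that remain: 1, 2, 3 and Inf
nUnique <- length(unique(v[!is.na(v)]))
nUnique
```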
summaryFunction, summarize, summaryResult, allSummaryFunctions
uniqueValues(c(1:3, rep(NA, 10), Inf, NaN))
A summaryFunction type function, intended to be called from summarize, which finds the original class of a variable. This is just the class for all objects but those of class smartNum.
variableType(v, ...)
v |
A variable (vector). |
... |
Not in use. |
An object of class summaryResult with the following entries: $feature ("Variable type"), $result (the (original) class of v) and $value (identical to $result).
#For standard variables:
varX <- c(rep(c(1,2,3), each=10))
class(varX)
variableType(varX)

#For smartNum variables:
smartX <- dataMaid:::smartNum(varX)
class(smartX)
variableType(smartX)
Convert a function, f, into an S3 visualFunction object. This adds f to the overview list returned by an allVisualFunctions() call.
visualFunction(f, description, classes = NULL)
f |
A function. See details and examples below for the exact requirements of this function. |
description |
A character string describing the visualization
returned by |
classes |
The classes for which |
visualFunction
represents the functions used in
visualize
and makeDataReport
for plotting the
distributions of the variables in a dataset.
An example of defining a new visualFunction
is given below.
Note that the minimal requirement for such a function (in order for it to be compatible with visualize() and makeDataReport()) is the following input/output structure: it must take exactly three arguments, namely v (a vector variable), vnam (a character string with the name of the variable) and doEval (a logical). The last argument controls whether the function produces a plot in the graphics device (if doEval = TRUE) or instead returns a character string containing R code for generating such a plot. In the latter setting, the code must be stand-alone, that is, it cannot depend on objects available in an environment. In practice, this typically implies that the data variable is included in the code snippet.
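The stand-alone requirement can be checked with a small base R sketch. The function barVisual below is a hypothetical example (it is not part of dataMaid and is not registered with visualFunction() here). It follows the v, vnam, doEval structure and embeds the tabulated data in the returned code, so the code still runs after the original variable has been removed.

```r
# Hypothetical example function following the (v, vnam, doEval) structure.
barVisual <- function(v, vnam, doEval) {
  # table(v) is evaluated here, so the resulting call carries the data itself
  thisCall <- call("barplot", table(v), main = vnam)
  if (doEval) {
    return(eval(thisCall))
  }
  # deparse() may split long calls over several strings; collapse them into one
  paste(deparse(thisCall), collapse = "\n")
}

x <- c("a", "b", "b", "c", "c", "c")
plotCode <- barVisual(x, "x", doEval = FALSE)

# Remove the data: stand-alone code must not need it
rm(x)

pdf(NULL)  # null graphics device, so no window is opened and no file is written
eval(parse(text = plotCode))
invisible(dev.off())
```

If the generated code referred to x by name instead of containing the table, the eval() call would fail once x is removed.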
It is not strictly necessary to implement the doEval = TRUE
setting
for the visualFunction
to be compatible with makeDataReport
,
but we recommend doing it anyway such that the function can also be used
interactively.
Note that all available visualFunctions are listed by the call allVisualFunctions(), and we recommend looking into these functions if more knowledge about visualFunctions is required.
A function of class visualFunction which has two attributes, namely classes and description.
allVisualFunctions
, visualize
,
makeDataReport
#Defining a new visualFunction:
mosaicVisual <- function(v, vnam, doEval) {
  thisCall <- call("mosaicplot", table(v), main = vnam, xlab = "")
  if (doEval) {
    return(eval(thisCall))
  } else return(deparse(thisCall))
}
mosaicVisual <- visualFunction(mosaicVisual,
  description = "Mosaicplots from graphics", classes = allClasses())

#mosaicVisual is now included in a allVisualFunctions() call:
allVisualFunctions()

#Create a mosaic plot:
ABCvar <- c(rep("a", 10), rep("b", 20), rep("c", 5))
mosaicVisual(ABCvar, "ABCvar", TRUE)

#Create a character string with the code for a mosaic plot:
mosaicVisual(ABCvar, "ABCVar", FALSE)

#Extract or set description of a visualFunction:
description(mosaicVisual)
description(mosaicVisual) <- "A cubist version of a pie chart"
description(mosaicVisual)
Generic shell function that calls a plotting function in order to produce a marginal distribution plot for a variable (or for each variable in a dataset). What type of plot is made might depend on the data class of the variable.
visualize(v, vnam = NULL, visuals = setVisuals(), doEval = TRUE, ...)
v |
The variable (vector) or dataset (data.frame) which is to be plotted. |
vnam |
The name of the variable. This name might be printed on the plots, depending on the
choice of plotting function. If not supplied, it will default to the name of v. |
visuals |
A list of visual functions to use on each supported variable type. We recommend
using setVisuals to create this list. |
doEval |
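A logical. If TRUE, the plot is produced in the graphics device; otherwise, a character string with R code for generating the plot is returned. |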
... |
Additional arguments used for class-specific choices of visual functions (see details). |
Visual functions can be supplied using their names (in character strings) using
setVisuals
. Note that only a single visual function is allowed for each variable class.
The default visual settings can be inspected by calling setVisuals()
.
An overview of all available visualFunction
s can be obtained by calling
allVisualFunctions
.
A user defined visual function can be supplied using its function name. Details on how
to construct valid visual functions are found in visualFunction
.
Petersen AH, Ekstrøm CT (2019). “dataMaid: Your Assistant for Documenting Supervised Data Quality Screening in R.” _Journal of Statistical Software_, *90*(6), 1-38. doi: 10.18637/jss.v090.i06 ( https://doi.org/10.18637/jss.v090.i06).
setVisuals
, allVisualFunctions
,
standardVisual
, basicVisual
#Standard use: Return standalone code for plotting a function:
visualize(c(1:10), "Variable 1", doEval = FALSE)

#Define a new visualization function and call it using visualize either
#using allVisual or a class specific argument:
mosaicVisual <- function(v, vnam, doEval) {
  thisCall <- call("mosaicplot", table(v), main = vnam, xlab = "")
  if (doEval) {
    return(eval(thisCall))
  } else return(deparse(thisCall))
}
mosaicVisual <- visualFunction(mosaicVisual,
  description = "Mosaicplots from graphics", classes = allClasses())

#Inspect all options for visualFunctions:
allVisualFunctions()

## Not run: 
#set mosaicVisual for all variable types:
visualize(c("1", "1", "1", "2", "2", "a"), "My variable",
  visuals = setVisuals(all = "mosaicVisual"))

#set mosaicVisual only for character variables:
visualize(c("1", "1", "1", "2", "2", "a"), "My variable",
  visuals = setVisuals(character = "mosaicVisual"))

#this will use standardVisual, as our variable is not numeric:
visualize(c("1", "1", "1", "2", "2", "a"), "My variable",
  visuals = setVisuals(numeric = "mosaicVisual"))

## End(Not run)

#return code for a mosaic plot
visualize(c("1", "1", "1", "2", "2", "a"), "My variable",
  allVisuals = "mosaicVisual", doEval=FALSE)

## Not run: 
#Produce multiple plots easily by calling visualize on a full dataset:
data(testData)
testData2 <- testData[, c("charVar", "factorVar", "numVar", "intVar")]
visualize(testData2)

#When using visualize on a dataset, datatype specific arguments have no
#influence:
visualize(testData2, setVisuals(character = "basicVisual",
  factor = "basicVisual"))

#But we can still use the "all" argument in setVisuals:
visualize(testData2, visuals = setVisuals(all = "basicVisual"))

## End(Not run)
Find out if the whoami package binaries are installed (git + whoami)
whoami_available()
A logical that is TRUE if both whoami and git can be found.
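A check of this kind could be written in base R roughly as follows. This is a sketch under assumptions, not the package's own implementation of whoami_available(). The helper name toolsAvailable is hypothetical, and it assumes that "found" means the whoami R package is installed and a git executable is on the PATH.

```r
# Hypothetical helper; the real whoami_available() may test different things.
toolsAvailable <- function() {
  # Is the whoami R package installed (without attaching it)?
  hasWhoami <- requireNamespace("whoami", quietly = TRUE)
  # Sys.which() returns "" when the executable is not found on the PATH
  hasGit <- nzchar(Sys.which("git"))
  isTRUE(hasWhoami) && isTRUE(unname(hasGit))
}

toolsAvailable()
```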