---
title: "collapse's Handling of R Objects"
subtitle: "A Quick View Behind the Scenes of Class-Agnostic R Programming"
author: "Sebastian Krantz"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:

vignette: >
  %\VignetteIndexEntry{collapse's Handling of R Objects}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

This much-requested vignette provides some details about how *collapse* deals with various R objects. It is principally a digest of cumulative details provided in the [NEWS](https://sebkrantz.github.io/collapse/news/index.html) for various releases since v1.4.0.

## Overview

*collapse* is a class-agnostic programming framework that can deal with a broad range of R objects. It provides explicit support for base R classes and data types (*integer*, *double*, *character*, *logical*, *list*, *data.frame*, *matrix*, *factor*, *Date*, *POSIXct*, *ts*), as well as *data.table*, *tibble*, *grouped_df*, *xts*, *zoo*, *pseries*, *pdata.frame*, *units*, and *sf* (no geometric operations).

It also introduces [*GRP_df*](https://sebkrantz.github.io/collapse/reference/GRP.html) as a more performant and class-agnostic grouped data frame, and [*indexed_series* and *indexed_frame*](https://sebkrantz.github.io/collapse/reference/indexing.html) classes as modern class-agnostic successors of *pseries* and *pdata.frame*. These objects inherit the classes they succeed and are handled through `.pseries`, `.pdata.frame`, and `.grouped_df` methods, which also support the original (*plm* / *dplyr*) implementations (more details at the end).

All other objects are handled internally at the C or R level using general principles extended by specific considerations for some of the above classes. I start by summarizing the general principles, which enable the usage of *collapse* with further classes it was not designed for.


## General Principles

In general, in *collapse*, attributes and classes of R objects are preserved in statistical and data manipulation operations unless their preservation involves a **high risk** of yielding something wrong/useless. Risky operations are those that change the dimensions or internal data type (`typeof()`) of an R object.

To *collapse*'s R and C code, there exist 3 principal types of objects: atomic vectors, matrices, and lists - which are often assumed to be data frames. Most data manipulation functions in *collapse* like `fmutate()` only support lists, whereas statistical functions - notably the S3 generic [*Fast Statistical Functions*](https://sebkrantz.github.io/collapse/reference/fast-statistical-functions.html) like `fmean()` - generally support all 3 types of objects.

S3 generic functions initially dispatch to `.default`, `.matrix`, `.data.frame`, and (hidden) `.list` methods. The `.list` method generally dispatches to the `.data.frame` method. These basic methods, and other non-generic functions in *collapse*, then decide how exactly to handle the object based on the statistical operation performed and attribute handling principles mostly implemented in C.

The simplest case arises when an operation preserves the dimensions of the object, such as `fscale(x)` or `fmutate(data, across(a:c, log))`. In this case, all attributes of `x / data` are preserved^[Preservation implies a shallow copy of the attribute lists from the original object to the result object. A shallow copy is memory-efficient and means we are copying the list containing the attributes in memory, but not the attributes themselves. Whenever I talk about copying attributes, I mean a shallow copy, not a deep copy. You can perform shallow copies with [helper functions](https://sebkrantz.github.io/collapse/reference/small-helpers.html) `copyAttrib()` or `copyMostAttrib()`, and directly set attribute lists using `setAttrib()` or `setattrib()`.].

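
As a minimal sketch of this simplest case (the vector, its `"label"` and the `"source"` attribute are made up for illustration), scaling a labelled vector or a data frame carrying an extra attribute should leave those attributes in place:

```r
library(collapse)

# Illustrative data: a plain double vector with a "label" attribute
x <- c(2.5, 4.1, 3.3, 5.8)
attr(x, "label") <- "Fuel consumption"

# fscale() returns a vector of the same length, so attributes are carried over
z <- fscale(x)
attr(z, "label")

# Same idea for a data frame with a custom (hypothetical) attribute
d <- mtcars
attr(d, "source") <- "Motor Trend"
attr(fscale(d), "source")
```

Nothing here is specific to `fscale()`; it merely stands in for any operation that keeps the dimensions of its input.
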
Another simple case for matrices and lists arises when a statistical operation reduces them to a single dimension such as `fmean(x)`, where, under the `drop = TRUE` default of [*Fast Statistical Functions*](https://sebkrantz.github.io/collapse/reference/fast-statistical-functions.html), all attributes apart from (column-)names are dropped and a (named) vector of means is returned.

For atomic vectors, a statistical operation like `fmean(x)` will preserve the attributes (except for *ts* objects), as the object could have useful properties such as labels or units.

More complex cases involve changing the dimensions (number of rows or columns) of an object. If the number of rows is preserved, e.g. `fmutate(data, a_b = a / b)` or `flag(x, -1:1)`, only the (column-)names attribute of the object is modified. If the number of rows is reduced, e.g. `fmean(x, g)`, all attributes are also retained under suitable modifications of the (row-)names attribute. However, if `x` is a matrix, attributes other than row- or column-names are only retained if `!is.object(x)`, that is, if the matrix does not have a 'class' attribute. For atomic vectors, attributes are retained if `!inherits(x, "ts")`, as aggregating a time series will break the class. This also applies to columns in a data frame being aggregated.

When data is transformed using statistics as provided by the [`TRA()` function](https://sebkrantz.github.io/collapse/reference/TRA.html), e.g. `TRA(x, STATS, operation, groups)`, and the like-named argument to the [*Fast Statistical Functions*](https://sebkrantz.github.io/collapse/reference/fast-statistical-functions.html), operations that simply modify the input (`x`) in a statistical sense (`"replace_NA"`, `"-"`, `"-+"`, `"/"`, `"+"`, `"*"`, `"%%"`, `"-%%"`) just copy the attributes to the transformed object. Operations `"replace_fill"` and `"replace"` are trickier, since here `x` is replaced with `STATS`, which could be of a different class or data type.
The following rules apply: (1) the result has the same data type as `STATS`; (2) if `is.object(STATS)`, the attributes of `STATS` are preserved; (3) otherwise the attributes of `x` are preserved unless `is.object(x) && typeof(x) != typeof(STATS)`; (4) an exception to this rule is made if `x` is a factor and an integer replacement is supplied as `STATS`, e.g. `fnobs(factor, group, TRA = "replace_fill")`. In that case, the attributes of `x` are copied except for the 'class' and 'levels' attributes. These rules were devised considering the possibility that `x` may have important information attached to it which should be preserved in data transformations, such as a `"label"` attribute.

Another rather complex case arises when manipulating data with *collapse* using base R functions, e.g. `BY(mtcars$mpg, mtcars$cyl, mad)` or `mtcars |> fgroup_by(cyl, vs, am) |> fsummarise(mad_mpg = mad(mpg))`. In this case, *collapse* internally uses the base R functions `lapply()` and `unlist()`, following efficient splitting with `gsplit()` (which preserves all attributes). Concretely, the result is computed as `y = unlist(lapply(gsplit(x, g), FUN, ...), FALSE, FALSE)`, where in the examples `x` is `mtcars$mpg`, `g` is the grouping variable(s), `FUN = mad`, and `y` is `mad(x)` in each group. To follow its policy of attribute preservation as closely as possible, *collapse* then calls an internal function `y_final = copyMostAttributes(y, x)`, which copies the attributes of `x` to `y` if both are deemed compatible^[Concretely, attributes are copied `if (typeof(x) == typeof(y) && (identical(class(x), class(y)) || typeof(y) != "integer" || inherits(x, c("IDate", "ITime"))) && !(length(x) != length(y) && inherits(x, "ts")))`. The first part of the condition is easy: if `x` and `y` are of different data types we do not copy attributes.
The second condition states that to copy attributes we also need to ensure that `x` and `y` are either of the same class, or `y` is not integer, or `x` is an integer-based date or time (= classes provided by *data.table*). The main reason for this clause is to guard against cases where we are counting something on an integer-based variable such as a factor, e.g. `BY(factor, group, function(x) length(unique(x)))`. The case where the result is also a factor, e.g. `BY(factor, group, function(x) x[1])`, is dealt with because `unlist()` preserves factors, so `identical(class(x), class(y))` is `TRUE`. The last part of the expression again guards against reducing the length of univariate time series and then copying the attributes.] ($\approx$ of the same data type). If they are deemed incompatible, `copyMostAttributes` still checks if `x` has a `"label"` attribute and copies that one to `y`.

So to summarize the general principles: *collapse* tries to preserve attributes in all cases except where doing so is likely to break something, bearing in mind the way the most commonly used R classes and objects behave. The operations most likely to break something are aggregating matrices which have a class (such as *mts*/*xts*) or univariate time series (*ts*), replacing data with another object, and applying an unknown function to a vector by groups and assembling the result with `unlist()`. In the latter cases, particular attention is paid to integer vectors and factors, as we often count something generating integers, and malformed factors need to be avoided.

The following section provides some further details for some *collapse* functions and supported classes.

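
The principles above can be tried out on a labelled vector. The following is a sketch only, with made-up labels, and the comments restate what the rules in this section lead one to expect rather than documented output:

```r
library(collapse)

x <- mtcars$mpg
attr(x, "label") <- "Miles per gallon"   # illustrative label
g <- mtcars$cyl

# Grouped aggregation of an atomic vector: attributes are retained
m <- fmean(x, g)
attr(m, "label")

# Base R function applied by groups: the result is double like x,
# so the attributes of x are deemed compatible and copied
b <- BY(x, g, mad)
attr(b, "label")

# Counting on a factor yields plain integers: class and levels are not
# copied, only the "label" attribute is carried over
f <- factor(mtcars$gear)
attr(f, "label") <- "Number of gears"    # illustrative label
n <- BY(f, g, function(v) length(unique(v)))
is.factor(n)
attr(n, "label")
```

The last case is the one the integer clause is meant to protect against: copying the `"levels"` of `f` onto a vector of counts would produce a malformed factor.
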

## Specific Functions and Classes

#### Object Conversions

[Quick conversion functions](https://sebkrantz.github.io/collapse/reference/quick-conversion.html) `qDF()`, `qDT()`, `qTBL()` and `qM()` (to create data.frame's, *data.table*'s, *tibble*'s and matrices from arbitrary R objects) by default (`keep.attr = FALSE`) perform very strict conversions, where all attributes non-essential to the class are dropped from the input object. This is to ensure that, following conversion, objects behave exactly the way users expect. This is different from the behavior of functions like `as.data.frame()`, `as.data.table()`, `as_tibble()` or `as.matrix()`, e.g. `as.matrix(EuStockMarkets)` just returns `EuStockMarkets` whereas `qM(EuStockMarkets)` returns a plain matrix without time series attributes. This behavior can be changed by setting `keep.attr = TRUE`, i.e. `qM(EuStockMarkets, keep.attr = TRUE)`.

#### Selecting Columns by Data Type

Functions [`num_vars()`, `cat_vars()` (the opposite of `num_vars()`), `char_vars()` etc.](https://sebkrantz.github.io/collapse/reference/select_replace_vars.html) are implemented in C to quickly select columns by data type without the need to check data frame columns by applying a function such as `is.numeric()` in R. For `is.numeric`, the C implementation is equivalent to `is_numeric_C <- function(x) typeof(x) %in% c("integer", "double") && (!is.object(x) || inherits(x, "ts") || inherits(x, "units") || inherits(x, "integer64"))`. This of course does not respect the behavior of other classes that define methods for `is.numeric`, e.g. if `is.numeric.foo <- function(x) TRUE` is defined, then for `y = structure(rnorm(100), class = "foo")`, `is.numeric(y)` is `TRUE` but `num_vars(data.frame(y))` returns an empty frame. Correct behavior in this case requires `get_vars(data.frame(y), is.numeric)`.
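
A small sketch of this difference, using the hypothetical class `"foo"` from above. The data frame and its column names are made up, and the classed column is added with `$<-` so that its class attribute is kept as is:

```r
library(collapse)

# Hypothetical class "foo" wrapping a double vector
is.numeric.foo <- function(x) TRUE
y <- structure(rnorm(100), class = "foo")

d <- data.frame(x = 1:100)
d$y <- y                         # classed column, stored without coercion

is.numeric(d$y)                  # R-level check
names(num_vars(d))               # C-level check: the classed column is skipped
names(get_vars(d, is.numeric))   # R-level selection: both columns are kept
```
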
A particular case to be aware of here is when using `collap()` with the `FUN` and `catFUN` arguments, where the C code (`is_numeric_C`) is used internally to decide whether a column is numeric or categorical. Thus numeric columns with a class attribute other than "ts", "units", and "integer64" are not recognized as numeric by `collap()`.

*collapse* also does not support statistical operations on complex data.

#### Parsing of Time-IDs

[*Time Series Functions*](https://sebkrantz.github.io/collapse/reference/time-series-panel-series.html) `flag`, `fdiff`, `fgrowth` and `psacf/pspacf/psccf` (and the operators `L/F/D/Dlog/G`) have a `t` argument to pass time-ids for fully identified temporal operations on time series and panel data. If `t` is a plain numeric vector or a factor, it is coerced to integer using `as.integer()`, and the integer steps are used as time steps. This is premised on the observation that the most common form of temporal identifier is a numeric variable denoting calendar years. If on the other hand `t` is a numeric time object such that `is.object(t) && is.numeric(unclass(t))` (e.g. Date, POSIXct, etc.), then it is passed through `timeid()`, which computes the greatest common divisor of the vector and generates an integer time-id in that way. Users are therefore advised to use appropriate classes to represent time steps, e.g. for monthly data `zoo::yearmon` would be appropriate. It is also possible to pass non-numeric `t`, such as character or list/data.frame. In such cases ordered grouping is applied to generate an integer time-id, but this should rather be avoided.

#### *xts*/*zoo* Time Series

*xts*/*zoo* time series are handled through `.zoo` methods to all relevant functions. These methods are simple and all follow this pattern: `FUN.zoo <- function(x, ...) if(is.matrix(x)) FUN.matrix(x, ...) else FUN.default(x, ...)`. Thus the general principles apply.
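
As a minimal sketch of the time-id rules from the previous subsection, which matter for time series objects as well, consider a plain numeric vector with a gap in the time variable (the data are made up for illustration):

```r
library(collapse)

# Annual observations with a gap: nothing was recorded for 2003
year <- c(2001, 2002, 2004, 2005)
x <- c(10, 20, 30, 40)

# Positional lag: the 2004 observation is paired with the 2002 value
lag_pos <- flag(x, 1)

# Identified lag: 'year' is coerced to integer time steps, so the
# 2004 observation has no predecessor and is set to NA
lag_id <- flag(x, 1, t = year)

cbind(year, x, lag_pos, lag_id)
```

The same call with a `Date` or `POSIXct` vector passed to `t` would go through `timeid()` instead of `as.integer()`.
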
Time series functions do not automatically use the index for indexed computations, partly for consistency with native methods where this is also not the case (e.g. `lag.xts` does not perform an indexed lag), and partly because, as outlined above, the index does not necessarily accurately reflect the time structure. Thus the user must exercise discretion to perform an indexed lag on *xts*/*zoo*. For example: `flag(xts_daily, 1:3, t = index(xts_daily))` or `flag(xts_monthly, 1:3, t = zoo::as.yearmon(index(xts_monthly)))`.

#### Support for *sf* and *units*

*collapse* internally supports *sf* by seeking to avoid undue destruction of *sf* objects through removal of the 'geometry' column in data manipulation operations. This is simply implemented through an additional check in the C programs used to subset columns of data: if the object is an *sf* data frame, the 'geometry' column is added to the column selection. Other functions like `funique()` or `roworder()` have internal facilities to avoid sorting or grouping on the 'geometry' column. Yet other functions like `descr()` and `qsu()` simply omit the geometry column in their statistical calculations. A short [vignette](https://sebkrantz.github.io/collapse/articles/collapse_and_sf.html) describes the integration of *collapse* and *sf* in a bit more detail. In summary: *collapse* supports *sf* by seeking to appropriately deal with the 'geometry' column. It cannot perform any geometrical operations. For example, after subsetting *sf* with `fsubset()`, the bounding box attribute of the geometry is unaltered and likely too large.

Regarding *units* objects, all relevant functions also have simple methods of the form `FUN.units <- function(x, ...) if(is.matrix(x)) copyMostAttrib(FUN.matrix(x, ...), x) else FUN.default(x, ...)`. According to the general principles, the default method preserves the units class, whereas the matrix method does not if `FUN` aggregates the data.
The use of `copyMostAttrib()`, which copies all attributes apart from `"dim"`, `"dimnames"`, and `"names"`, ensures that the returned objects are still *units* objects.

#### Support for *data.table*

*collapse* provides quite thorough support for *data.table*. The simplest level of support is that it avoids assigning descriptive (character) row names to *data.table*'s, e.g. `fmean(mtcars, mtcars$cyl)` has row-names corresponding to the groups but `fmean(qDT(mtcars), mtcars$cyl)` does not.

*collapse* further supports *data.table*'s reference semantics (`set*`, `:=`). To be able to add columns by reference (e.g. `DT[, new := 1]`), *data.table*'s are implemented as overallocated lists^[Notably, additional (hidden) column pointers are allocated to be able to add columns without taking a shallow copy of the *data.table*, and an `".internal.selfref"` attribute containing an external pointer is used to check if any shallow copy was made using base R commands like `<-`.]. *collapse* includes some C code copied from *data.table* to do the overallocation and generate the `".internal.selfref"` attribute, so that `qDT()` creates a valid and fully functional *data.table*. To enable seamless data manipulation combining *collapse* and *data.table*, all data manipulation functions in *collapse* call this C code at the end and return a valid (overallocated) *data.table*. However, because this overallocation comes at a computational cost of 2-3 microseconds, I have opted against also adding it to the `.data.frame` methods of statistical functions. Concretely, this means that `res <- DT |> fgroup_by(id) |> fsummarise(mu_a = fmean(a))` gives a fully functional *data.table*, i.e. `res[, new := 1]` works, but `res2 <- DT |> fgroup_by(id) |> fmean()` gives a non-overallocated *data.table* such that `res2[, new := 1]` will still work but issue a warning. In this case, `res2 <- DT |> fgroup_by(id) |> fmean() |> qDT()` can be used to avoid the warning.
This, to me, seems a reasonable trade-off between flexibility and performance. More details and examples are provided in the [*collapse* and *data.table* vignette](https://sebkrantz.github.io/collapse/articles/collapse_and_data.table.html).

#### Class-Agnostic Grouped and Indexed Data Frames

As indicated in the introductory remarks, *collapse* provides a fast [class-agnostic grouped data frame](https://sebkrantz.github.io/collapse/reference/GRP.html) created with `fgroup_by()`, and fast [class-agnostic indexed time series and panel data](https://sebkrantz.github.io/collapse/reference/indexing.html), created with `findex_by()`/`reindex()`. Class-agnostic means that the object that is grouped/indexed continues to behave as before except in *collapse* operations utilizing the 'groups'/'index_df' attributes.

The grouped data frame is implemented as follows: `fgroup_by()` saves the class of the input data, calls `GRP()` on the columns being grouped, and attaches the resulting 'GRP' object in a `"groups"` attribute. It then assigns a class attribute as follows:

```r
clx <- class(.X) # .X is the data frame being grouped, clx is its class
m <- match(c("GRP_df", "grouped_df", "data.frame"), clx, nomatch = 0L)
class(.X) <- c("GRP_df", if(length(mp <- m[m != 0L])) clx[-mp] else clx, "grouped_df", if(m[3L]) "data.frame")
```

In words: a class `"GRP_df"` is added in front, followed by the classes of the original object^[Removing `c("GRP_df", "grouped_df", "data.frame")` if present to avoid duplicate classes and allowing grouped data to be re-grouped.], followed by `"grouped_df"` and finally `"data.frame"`, if present. The first class `"GRP_df"` is for dealing appropriately with the object through methods for `print()` and subsetting (`[`, `[[`), e.g. `print.GRP_df` fetches the grouping object, prints `fungroup(.X)`^[Which reverses the changes of `fgroup_by()` so that the print method for the original object `.X` is called.], and then prints a summary of the grouping.
`[.GRP_df` works similarly: it saves the groups, calls `[` on `fungroup(.X)`, and attaches the groups again if the result is a list with the same number of rows. So *collapse* has no issues printing and handling grouped *data.table*'s, *tibbles*, *sf* data frames, etc.: they continue to behave as usual.

Now *collapse* has various functions with a `.grouped_df` method to deal with grouped data frames. For example `fmean.grouped_df`, in a nutshell, fetches the attached 'GRP' object using `GRP.grouped_df`, and calls `fmean.data.frame` on `fungroup(data)`, passing the 'GRP' object to the `g` argument for grouped computation. Here the general principles outlined above apply so that the resulting object has the same classes and attributes as the original one.

This architecture has an additional advantage: it allows `GRP.grouped_df` to examine the grouping object and check if it was created by *collapse* (class 'GRP') or by *dplyr*. If the latter is the case, an efficient C routine is called to convert the *dplyr* grouping object to a 'GRP' object so that all `.grouped_df` methods in *collapse* apply to data frames created with either `dplyr::group_by()` or `fgroup_by()`.

The *indexed_frame* works more or less in the same way. It inherits from *pdata.frame* so that `.pdata.frame` methods in *collapse* deal with both *indexed_frame*'s of arbitrary classes and *pdata.frame*'s created with *plm*. A notable difference to both *grouped_df* and *pdata.frame* is that *indexed_frame* is a deeply indexed data structure: each variable inside an *indexed_frame* is an *indexed_series* which contains in its *index_df* attribute an external pointer to the *index_df* attribute of the frame. Functions with *pseries* methods operating on *indexed_series* stored inside the frame (such as `with(data, flag(column))`) can fetch the index from this pointer. This allows worry-free application inside arbitrary data masking environments (`with`, `%$%`, `attach`, etc.)
and estimation commands (`glm`, `feols`, `lmrob` etc.) without duplication of the index in memory. As you may have guessed, *indexed_series* are also class-agnostic and inherit from *pseries*. Any vector or matrix of any class can become an *indexed_series*. Further levels of generality are that indexed series and frames allow one, two or more variables in the index to support both time series and complex panels, natively deal with irregularity in time^[This is done through the creation of a time-factor in the *index_df* attribute whose levels represent time steps i.e. the factor will have unused levels for gaps in time.], and provide a rich set of methods for subsetting and manipulation which also subset the *index_df* attribute, including internal methods for `fsubset()`, `funique()`, `roworder(v)` and `na_omit()`. So *indexed_frame* and *indexed_series* are rich and general structures permitting fully time-aware computations on nearly any R object. See [`?indexing`](https://sebkrantz.github.io/collapse/reference/indexing.html) for more information.

## Conclusion

*collapse* handles R objects in a preserving and fairly intelligent manner, allowing seamless compatibility with many common data classes in R, and statistical workflows that preserve attributes (labels, units, etc.) of the data. This is done by general principles and some specific considerations/exemptions mostly implemented in C - as detailed in this vignette.

The main benefits of this design are generality and execution speed: *collapse* has far fewer R-level method dispatches and function calls than other frameworks used to perform statistical or data manipulation operations, it behaves predictably, and may also work well with your simple new class.

The main disadvantage is that many of the general principles and exemptions are hard-coded in C and thus may not work with specific classes.
A prominent example where *collapse* simply fails is *lubridate*'s *interval* class ([#186](https://github.com/SebKrantz/collapse/issues/186), [#418](https://github.com/SebKrantz/collapse/issues/418)), which has a `"starts"` attribute of the same length as the data that is preserved but not subset in *collapse* operations.