This function will output a summary statistics variable table either to the console or as an HTML file that can be viewed continuously while working with data, or sent to file for use elsewhere. st() is the same thing but requires fewer key presses to type.
sumtable(
data,
vars = NA,
out = NA,
file = NA,
summ = NA,
summ.names = NA,
add.median = FALSE,
group = NA,
group.long = FALSE,
group.test = FALSE,
group.weights = NA,
group.weights.sd.type = "frequency",
col.breaks = NA,
digits = 2,
fixed.digits = FALSE,
numformat = formatfunc(digits = digits, big.mark = ""),
skip.format = c("notNA(x)", "propNA(x)", "countNA(x)", obs.function),
factor.percent = TRUE,
factor.counts = TRUE,
factor.numeric = FALSE,
logical.numeric = FALSE,
logical.labels = c("No", "Yes"),
labels = NA,
title = "Summary Statistics",
note = NA,
anchor = NA,
col.width = NA,
col.align = NA,
align = NA,
note.align = "l",
fit.page = "\\textwidth",
simple.kable = FALSE,
obs.function = NA,
opts = list()
)
st(
data,
vars = NA,
out = NA,
file = NA,
summ = NA,
summ.names = NA,
add.median = FALSE,
group = NA,
group.long = FALSE,
group.test = FALSE,
group.weights = NA,
group.weights.sd.type = "frequency",
col.breaks = NA,
digits = 2,
fixed.digits = FALSE,
numformat = formatfunc(digits = digits, big.mark = ""),
skip.format = c("notNA(x)", "propNA(x)", "countNA(x)", obs.function),
factor.percent = TRUE,
factor.counts = TRUE,
factor.numeric = FALSE,
logical.numeric = FALSE,
logical.labels = c("No", "Yes"),
labels = NA,
title = "Summary Statistics",
note = NA,
anchor = NA,
col.width = NA,
col.align = NA,
align = NA,
note.align = "l",
fit.page = "\\textwidth",
simple.kable = FALSE,
obs.function = NA,
opts = list()
)
data: Data set; accepts any format with column names.
vars: Character vector of column names to include, in the order you'd like them included. Defaults to all numeric, factor, and logical variables, plus any character variables with six or fewer unique values. You can include strings that aren't columns in the data (including blanks); these will create rows that are blank except for the string (left-aligned), for use as spacers or subtitles.
out: Determines where the completed table is sent. Set to "browser" to open the HTML file in a browser using browseURL(), or "viewer" to open it in the RStudio viewer using viewer(), if available. Use "htmlreturn" to return the HTML code to R, "latex" to return LaTeX code to R (use "latexdoc" to get a full buildable document rather than a fragment), "return" to return the completed summary table to R in data frame form, or "kable" to return it in knitr::kable() form. Combine out = "csv" with file to write to CSV (dropping most formatting). Defaults to "viewer" if RStudio is running, "browser" if it isn't, or a "kable" passed through kableExtra::kable_styling() defaults if it's an RMarkdown document being built with knitr.
file: Saves the completed summary table to a file at this filepath. May be combined with any value of out, although note that out = "return" and out = "kable" will still save the standard sumtable HTML file, as with out = "viewer" or out = "browser".
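As a rough illustration of the out and file options described above, the sketch below assumes the vtable package is installed and uses the built-in iris data. The temporary file path is only there for demonstration.

```r
# Assumes the vtable package is installed.
library(vtable)

# Return the summary table to R as a data frame instead of opening a viewer
tab <- st(iris, out = 'return')
print(tab)

# Write the same table to CSV; tempfile() is just a stand-in for a real path
csv.path <- tempfile(fileext = '.csv')
st(iris, out = 'csv', file = csv.path)
file.exists(csv.path)
```

Returning a data frame with out = 'return' is convenient when you want to post-process the table yourself before exporting it.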
summ: Character vector of summary statistics to include for numeric and logical variables, in the form 'function(x)'. Defaults to c('notNA(x)','mean(x)','sd(x)','min(x)','pctile(x)[25]','pctile(x)[75]','max(x)') if there's one column, or c('notNA(x)','mean(x)','sd(x)') if there's more than one. If all variables in a column are factors it defaults to c('sum(x)','mean(x)') for the factor dummies. If the table has multiple variable-columns and you want different statistics in each, include a list of character vectors instead. This option is flexible, and allows any summary statistic function that takes in a column and returns a single number. For example, summ=c('mean(x)','mean(log(x))') will provide the mean of each variable as well as the mean of the log of each variable. Keep in mind the special vtable package helper functions designed specifically for this option: propNA, countNA, and notNA, which report proportions and counts of NAs, or counts of not-NAs, in the vectors; nuniq, which reports the number of unique values; and pctile, which returns a vector of the 100 percentiles of the variable. NAs will be omitted from all calculations other than propNA(x) and countNA(x).
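A minimal sketch of a custom summ specification, assuming the vtable package is installed. The choice of variables and statistic names here is purely illustrative; vars is restricted to numeric columns so that log() is never applied to a factor.

```r
# Assumes the vtable package is installed.
library(vtable)

# Each string in summ is a function of x, where x stands in for the column.
# pctile(x)[50] picks the 50th percentile out of the vector pctile() returns.
tab <- st(iris,
          vars = c('Sepal.Length', 'Petal.Length'),
          summ = c('notNA(x)', 'mean(x)', 'mean(log(x))', 'pctile(x)[50]'),
          summ.names = c('N', 'Mean', 'Mean of log', 'Median'),
          out = 'return')
print(tab)
```

Supplying summ.names alongside summ avoids relying on the automatically generated column headers.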
summ.names: Character vector of names for the summary statistics included. If summ is at default, defaults to c('N','Mean','Std. Dev.','Min','Pctl. 25','Pctl. 75','Max') (or the appropriate shortened version with multiple columns) unless all variables in the column are factors, in which case it defaults to c('N','Percent'). If summ has been set but summ.names has not, defaults to summ with the (x)s removed and the first letter capitalized. If the table has multiple variable-columns and you want different statistics in each, include a list of character vectors instead.
add.median: Adds "median(x)" to the set of default summary statistics. Has no effect if summ is also specified.
group: Character variable with the name of a column in the data set that statistics are to be calculated over. Value labels will be used if found for numeric variables. Changes the default summ to c('mean(x)','sd(x)').
group.long: By default, if group is specified, each group will get its own set of columns. Set group.long = TRUE to instead basically just make a regular sumtable() for each group and stack them on top of each other. Good for when you have lots of groups. You can also set it to 'l', 'c', or 'r' to determine how the group names are aligned. Defaults to centered.
group.test: Set to TRUE to perform tests of whether each variable in the table varies over values of group. Only works with group.long = FALSE. Performs a joint F-test (using anova(lm)) for numeric variables, and a Chi-square test of independence (chisq.test) for categorical variables. If you want to adjust things like which tests are used, significance star levels, etc., see the help file for independence.test and pass in a named list of options for that function.
group.weights: THIS OPTION DOES NOT AUTOMATICALLY WEIGHT ALL CALCULATIONS. This is mostly to be used with group and group.long = FALSE, and while it's more flexible than that, you've gotta read this to figure out how else to use it. That's why I gave it the weird name. Set this to a vector of weights, or a string representing a column name with weights. If summ is not customized, this will replace 'mean(x)' and 'sd(x)' with the equivalent weighted versions 'weighted.mean(x, w = wts)' and 'weighted.sd(x, w = wts)' (with type = 'frequency' by default). It will also add weights to the default group.test tests. This will not add weights to any other calculations, or to any custom group.test tests (although you can always do that yourself by customizing summ and passing in weights with this argument; the weights can be referred to in your function as wts). This is generally intended for things like post-matching balance tables. If you specify a column name, that column will be removed from the rest of the table, so if you want it to be kept, specify this as a numeric vector instead. If you have a variable in your data called 'wts', that will mess up the use of this option; I recommend changing that.
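The following sketch shows one way group.weights might be used, assuming the vtable package is installed. The weights are made-up integer frequencies generated for illustration only and have no real meaning for the iris data.

```r
# Assumes the vtable package is installed.
library(vtable)

set.seed(1)
# Hypothetical frequency weights, one per row of iris (illustrative only)
w <- sample(1:5, nrow(iris), replace = TRUE)

# With group set and summ left at its default, the mean and SD columns
# are replaced by their weighted versions
balance <- st(iris, group = 'Species', group.weights = w, out = 'return')
print(balance)

# The weighted standard deviation helper can also be called directly
wsd <- weighted.sd(iris$Sepal.Length, w, type = 'frequency')
wsd
```

Passing the weights as a vector rather than a column name keeps every column of the data in the table.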
group.weights.sd.type: If group.weights is specified, this will determine the type of standard deviation to use in the weighted calculations. Options are 'frequency' (default), which is to be used when the weights represent frequencies, or 'precision', to be used when the weights represent reliability or precision of each measurement. See the weighted.sd function for more information.
col.breaks: Numeric vector indicating the variables (or number of elements of vars) after which to start a new column. So for example with a data set with six variables, c(3,5) would put the first three variables in the first column, the next two in the middle, and the last on the right. Cannot be combined with group unless group.long = TRUE.
digits: Number of digits after the decimal place to report. Set to a single number for consistent digits, or a vector the same length as summ for different digits for each calculation, or a list of vectors that match up to a multi-column summ. Defaults to 0 for the first calculation (N, usually) and 2 afterwards.
fixed.digits: Deprecated; currently only works if numformat = NA. FALSE will cut off trailing 0s when rounding. TRUE retains them. Defaults to FALSE.
numformat: A function that takes a numeric input and produces labeled output, which you might construct using the formatfunc function or the label_ functions from the scales package. Provide a single function to apply to all variables, or a list of functions the same length as the number of variables to format each variable differently. The formatting function will skip over notNA, countNA, and propNA calculations by default. Factor percentages will ignore this entirely; you can use NA to skip them when specifying a list. Alternately, you can specify strings giving the shorthand for the appropriate formatting: a string containing 'comma' will set big.mark = ',', 'decimal' will set big.mark = '.', decimal.mark = ',', 'percent' will do percentage formatting (with 1 = 100%), and 'A|B' will use 'A' as a prefix and 'B' as a suffix (specifying the suffix is optional, so numformat = '$' gives '$3'). Anything more complex than that will require that you pass a formatfunc or similar function. Specifying a character vector will respect your digits option if digits is a single value rather than a vector or list, but will otherwise use the defaults of those functions. You can mix together specifying your own functions and specifying character strings. At the moment there is no way to do different formatting for different columns of the same variable, other than skip.format. Set to NA to revert to the old way of formatting.
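As a hedged sketch of the two styles of numformat described above (a formatting function versus shorthand strings), assuming the vtable package is installed. The specific numbers passed to the formatter are arbitrary.

```r
# Assumes the vtable package is installed.
library(vtable)

# Build a reusable formatting function and try it on a few arbitrary numbers
fmt <- formatfunc(digits = 2, big.mark = ',')
formatted <- fmt(c(1234567.891, 0.25))
formatted

# Shorthand strings: 'comma' for every variable, with a '$' prefix
# for Sepal.Width only (a named element targets a single variable)
tab <- st(iris, numformat = c('comma', 'Sepal.Width' = '$'), out = 'return')
print(tab)
```

Testing a formatter on a handful of values before passing it to sumtable() makes it easier to see what each option actually does.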
skip.format: Set of functions in summ that are not subject to numformat. Does nothing if numformat is not specified.
factor.percent: Set to TRUE to show factor means as percentages instead of proportions, i.e. 50% with a column header of "Percent" rather than .5 with a column header of "Mean". Defaults to TRUE.
factor.counts: Set to TRUE to show a count of each factor level in the first column. Defaults to TRUE.
factor.numeric: By default, factor variable dummies basically ignore the summ argument and show count (or nothing) in the first column and percent or proportion in the second. Set this to TRUE to instead treat the dummies like numeric binary variables with values 0 and 1.
logical.numeric: By default, logical variables are treated as factors with TRUE = "Yes" and FALSE = "No". Set this to TRUE to instead treat them as numeric variables rather than factors, with TRUE = 1 and FALSE = 0.
logical.labels: When turning logicals into factors, use these labels for FALSE and TRUE, respectively, rather than "No" and "Yes".
labels: Variable labels. labels will accept four formats: (1) A vector of the same length as the number of variables in the data that will be included in the table (tricky to use if many are being dropped, also won't work for your group variable), in the same order as the variables in the data set, (2) A matrix or data frame with two columns and more than one row, where the first column contains variable names (in any order) and the second contains labels, (3) A matrix or data frame where the column names (in any order) contain variable names and the first row contains labels, or (4) TRUE to look in the data for variable labels set by the haven package, set_label() from sjlabelled, or label() from Hmisc.
title: Character variable with the title of the table. Set to NA to omit, which may be useful if trying to use the native Quarto cross-referencing.
note: Table note to go after the last row of the table. Will follow the significance star note if group.test = TRUE.
anchor: Character variable to be used to set an anchor link in HTML tables, or a label tag in LaTeX.
col.width: Vector of page-width percentages, on 0-100 scale, overriding default column widths in an HTML table. Must have a number of elements equal to the number of columns in the resulting table.
col.align: For HTML output, a character vector indicating the HTML text-align attributes to be used in the table (for example col.align = c('left','center','center')). Defaults to variable-name columns left-aligned and all others right-aligned (with a little extra padding between columns with col.breaks). If you want to get tricky, you can add a ";" afterwards and keep putting in whatever CSS attributes you want. They will be applied to the whole column.
align: For LaTeX output, string indicating the alignment of each column. Use standard LaTeX syntax (e.g. l|ccc). Defaults to left in the first column and right-aligned afterwards, with @{\hskip .2in} spacers if you have col.breaks. If col.width is specified, defaults to all p{} columns with widths set by col.width. If you want the columns aligned on a decimal point, see this explainer.
note.align: For LaTeX output, set the alignment for the multi-column table note. Usually "l", but if you have a long note in LaTeX you might want to set it with "p".
fit.page: For LaTeX output, uses a resizebox to force the table to a certain width. Set to NA to omit.
simple.kable: For out = 'kable', if you want the kable printed to console rather than HTML or PDF, then the multi-column headers and table notes won't work. Set simple.kable = TRUE to skip both.
obs.function: The function to use (and, potentially, format) to count the number of observations for the N column. This should take a vector and return a single number or string. Uses the same string formatting as summ. If not specified, will check if numformat is specified using formatfunc or a string. If not, this will be 'notNA(x)'. If it is, will be 'notNA(x)' with the big.mark argument set to match the first function listed in numformat.
opts: The same sumtable options as above, but in a named list format. Useful for applying the same set of options to multiple sumtables.
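A short sketch of reusing one set of options across several tables through opts, assuming the vtable package is installed. The name my.opts and the particular options chosen are illustrative only.

```r
# Assumes the vtable package is installed.
library(vtable)

# A named list of sumtable options, defined once (my.opts is a made-up name)
my.opts <- list(out = 'return', digits = 3, add.median = TRUE)

# The same options applied to two different data sets
iris.tab <- st(iris, opts = my.opts)
cars.tab <- st(mtcars, opts = my.opts)

print(iris.tab)
print(cars.tab)
```

Keeping shared settings in one list means a change to the formatting only has to be made in one place.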
There are many, many functions in R that will produce a summary statistics table for you. So why use sumtable()? sumtable() serves two main purposes:
(1) In the same spirit as vtable(), it makes it easy to view the summary statistics as you work, either in the Viewer pane or in a browser window.
(2) sumtable() is designed to have nice defaults and is not really intended for deep customization. It's got lots of options, sure, but they're only intended to go so far. So you can have a summary statistics table without much work.
Keeping with point (2), sumtable() is designed for use by people who want the kind of table that sumtable() produces, which is itself heavily influenced by the kinds of summary statistics tables you often see in economics papers. In that regard it is most similar to stargazer::stargazer(), except that it can handle tibbles, factor variables, and grouping, and can produce multicolumn tables, or to summarytools::dfSummary() or skimr::skim(), except that it is easier to export with nice formatting. If you want a lot of control over your summary statistics table, check out the packages gtsummary, arsenal, qwraps2, or Amisc, and about a million more.
If you would like to include a sumtable in an RMarkdown document, it should just work! If you leave out blank, it will default to a nicely-formatted knitr::kable(), although this will drop some formatting elements like multi-column cells (or do out="kable" to get an unformatted kable that you can format yourself). If you prefer the vtable package formatting, then use out="latex" if outputting to LaTeX or out="htmlreturn" for HTML, both with results="asis" in the code chunk. Alternately, in HTML, you can use the file option to write to file and use an <iframe> to include it.
# Examples are only run interactively because they open HTML pages in Viewer or a browser.
if (interactive()) {
data(iris)
# Sumtable handles both numeric and factor variables
st(iris)
# Output to LaTeX as well for easy integration
# with RMarkdown, or \input{} into your LaTeX docs
# (specify file too to save the result)
st(iris, out = 'latex')
# Summary statistics by group
iris$SL.above.median <- iris$Sepal.Length > median(iris$Sepal.Length)
st(iris, group = 'SL.above.median')
# Add a group test, or report by-group in "long" format
st(iris, group = 'SL.above.median', group.test = TRUE)
st(iris, group = 'SL.above.median', group.long = TRUE)
# Going all out! Adding variable labels with labels,
# spacers and variable "category" titles with vars,
# Changing the presentation of the factor variable,
# and putting the factor in its own column with col.breaks
var.labs <- data.frame(var = c('SL.above.median','Sepal.Length',
                               'Sepal.Width','Petal.Length',
                               'Petal.Width'),
                       labels = c('Above-median Sepal Length','Sepal Length',
                                  'Sepal Width','Petal Length',
                                  'Petal Width'))
st(iris,
   labels = var.labs,
   vars = c('Sepal Variables','SL.above.median','Sepal.Length','Sepal.Width',
            'Petal Variables','Petal.Length','Petal.Width',
            'Species'),
   factor.percent = FALSE,
   col.breaks = 7)
# Format the results
# use rep so there are enough observations to see the comma separators
irisrep = do.call('rbind', replicate(100, iris, simplify = FALSE))
# Comma separator for thousands, including for N.
st(irisrep, numformat = 'comma')
# Dollar formatting for Sepal.Width, decimal (1.000,00) formatting for the rest
st(iris, numformat = c('decimal','Sepal.Width' = '$'))
# Custom formatting throughout, note the big.mark = ',' will also be picked up by N
st(irisrep, numformat = formatfunc(digits = 2, nsmall = 2, big.mark = ','))
}