Lecture 8: Basics of ggplot2

Nick Huntington-Klein

28 February, 2023

What is ggplot2?

  • ggplot2 is a package in R, library(ggplot2) or library(tidyverse)
  • It is based on the grammar of graphics
  • Lots of major publications use it (538, The Economist, BBC, …)
  • Zillions of extensions for doing anything you like
  • (also runs in Python using plotnine, and the newest version of seaborn is based on teh Grammar of Graphics, so everything you learn here will also teach you plotting in Python)

Resources

Why use ggplot?

  • Huge amounts of power and flexibility
  • Manipulating data easy in R
  • Easy to edit
  • Handles multidimensional data effortlessly
  • Once you know how to generate graphs in a flexible coding system, it will be easier to do what you want in code than in a more canned and restricted piece of software

Tidy Tuesday

  • Every week, the R for Data Science community posts a data set you can use
  • An assignment for this class will be to participate in (some) Tidy Tuesday weeks
  • See the repository here
  • Now on to ggplot2!

The Grammar of Graphics

  • There’s a method here!
  • We want to describe graphics in the way we use language
  • Think about syntax and nouns and verbs
  • We can build a plot from the ground up, element by element
  • This also makes it easy to make changes or switch things out

Components of the Grammar of Graphics

Components of the Grammar of Graphics

  • Data: Identify the dimensions you want to visualize. Prepare data so that it contains the points you want to visualize, or calculate from
  • Aesthetics: Axes are based on the data dimensions. Also includes other “axes” like size, shape, color, etc.
  • Scale: Do we need to scale the potential values, use a specific scale to represent multiple values or a range?

Components of the Grammar of Graphics

  • Geometric objects (geom_s): How should the data be drawn on the graph? Lines? Points? Bars?
  • Statistics: Do we want to show calculations based on data rather than data itself?
  • Facets: Do we want to spread one of the “axes” across multiple graphs?
  • Coordinate system: Should it be cartesian or polar (or something else)? Flipped? Log scale?

Setting up a ggplot Graph

We can focus, at least to start, on:

  • Data
  • Aesthetic
  • Geometry

Data

  • Data is just a regular data set (data.frame or tibble) as you’d normally work with it in R.
  • ggplot2 is set up to work directly with data frames. References to variables will look in the dataset you give it

Aesthetic

  • The aesthetic determines the “axes” of our graph
  • Literally the x= and y= axes, but also other axes on which data is differentiated, like color=, size=, linetype=, fill=, etc.
  • aes(x=mpg,y=hp,color=transmission) puts the mpg variable on the x-axis, hp on the y-axis, and colors things differently by transmission
  • group= to separate groups for things like labeling without making them visibly different
  • Different aesthetic values for different types of graphs/geometries

For Example

library(tidyverse)
mtcars <- mtcars %>%
  mutate(Transmission = factor(am, labels = c('Automatic','Manual')),
         CarName = row.names(mtcars))
ggplot(mtcars, aes(x = mpg, y = hp, color = Transmission)) + 
  geom_point()

Common Aesthetics

See help() for a given geometry to see what it accepts. But common are:

  • x, y: position on the x and y axis
  • color, fill: color is for coloring things generally. For shapes, color is the outline and fill is the inside
  • label: text labels
  • size: guess what this does
  • alpha: Transparency
  • linetype, linewidth: for lines or outlines, the type/thickness of line (solid, dashed…)
  • xmin/xmax/xend (and same for y): For line segments or shaded ranges, where do they start/end?

Geometry

  • The geometry is what actually gets drawn. In the last slide, geom_point() drew points.
  • There are many different geometries. Let’s look at the ggplot2 cheat sheet
  • Common: geom_point, geom_line, geom_col, geom_text

Geometry: geom_line

data(economics)
ggplot(economics, aes(x = date, y = uempmed)) + 
  geom_line()

Geometries: geom_text

ggplot(mtcars, aes(x = mpg, y = hp, color = Transmission, label = CarName)) + 
  geom_text()

Explicit bar graphs like Excel: geom_col

data <- tibble(category = c('Apple','Banana','Carrot'),
               quality = c(6,4,3))
ggplot(data, aes(x = category,y=quality)) + geom_col()

Or explicit grouped charts: geom_col

data <- tibble(category = c('Apple','Banana','Carrot','Apple','Banana','Carrot'),
               person = c('Me','Me','Me','You','You','You'),
               quality = c(6,4,3,1,6,3))
ggplot(data, aes(x = person, y = quality, fill = category)) + geom_col(position = 'dodge')

Hm?

What was that position = 'dodge'?

  • Used when you have stacked elements with the same x aesthetic that you want to be side-by-side
  • With geom_col() or geom_bar() you can set position = 'dodge' to make the different-fill bars go side by side
  • Can apply position = position_dodge(.9) with other geometries to make them line up. For example…

Dodging

Calculations

  • Some geometries do calculations/summaries of your data before graphing
  • Common: geom_density/geom_histogram/geom_dotplot, geom_bar, geom_smooth

Demonstrating Distributions

ggplot(mtcars, aes(x = mpg)) + geom_histogram()

Demonstrating Distributions

ggplot(mtcars, aes(x = mpg)) + geom_dotplot()

Demonstrating Distributions

ggplot(mtcars, aes(x = mpg)) + geom_density()

Demonstrating Distributions

ggplot(mtcars, aes(y = mpg, x = factor(am))) + geom_violin()

Discrete Distributions: geom_bar

ggplot(mtcars, aes(x = Transmission)) + geom_bar()

Best-fit lines with geom_smooth

ggplot(mtcars, aes(x = mpg, y = hp)) + 
  geom_point() + geom_smooth()

Trendlines with geom_smooth

# Or make it straight, and separate by Transmission, and no bands
ggplot(mtcars, aes(x = mpg, y = hp, color = Transmission)) + 
  geom_point() + geom_smooth(method = 'lm', se = FALSE)

Imitation and Flattery

  • Uses data(iris)

Imitation and Flattery

  • Uses data(economics_long)
  • economics_long <- economics_long %>% filter(variable %in% c('psavert','uempmed')) keeps just the two variables we want

Imitation and Flattery

  • Uses data(economics)

Moving Forward

  • Everything we’ve done up to now isn’t all that different from any sort of graphing software
  • But what really makes ggplot2 shine is its consistency and internal logic
  • You’ll have full access to the scales of the graph, as well as to the components that make it up
  • And manipulating those things is fairly intuitive to do rather than head-banging work
  • That’s what we’ll work on next time!