# An introduction to ggplot2

February 2015, by Sean C. Anderson

This page will help you teach yourself how to rapidly explore data with the ggplot2 R package. It was written as a self-study component for the FISH554 class at the University of Washington.

Work your way through this page. When you want to check your work, or if you get stuck on an exercise, just click on the R Source button and you'll see the R source code that created a figure.

# Learning objectives

By the end you should be able to:

• Understand the basic grammar of ggplot2 (data, geoms, aesthetics, facets).
• Make quick exploratory plots of your multidimensional data.
• Have an idea about how to start customizing ggplot2 figures.
• Know how to find help on ggplot2 when you run into problems.

# ggplot2 for rapid data exploration

ggplot2 is an R package by Hadley Wickham and Winston Chang that implements Wilkinson's Grammar of Graphics. The emphasis of ggplot2 is on rapid exploration of data, and especially high-dimensional data. Think of base graphic functions as drawing with data (examples of base graphic functions are plot(), points(), and lines()). With base graphics, you have complete control over every pixel in a plot, but it can take a lot of time and code to produce a plot.

Although ggplot2 can be fully customized, it reaches a point of diminishing returns. I tend to use ggplot2 and base graphics for what they excel at: ggplot2 for rapid data exploration and base graphics for polished and fully-customized plots for publication.

The following figure shows figure creation time vs. data dimensions and customization level for base graphics (blue) and ggplot2 (red). Left panel: It's remarkably easy to plot high-dimensional data in ggplot2 with, for example, colours, sizes, shapes, and panels. Right panel: ggplot2 excels at rapid visual exploration of data, but has some limitations in how it can be customized. Base graphics are fully customizable but can take longer to set up. I try and exploit the grey shaded areas: I use ggplot2 for data exploration and once I've decided on a small number of key plots, I'll use base graphics to make fully-customized plots if needed. Some people get really good at customizing ggplot and stick with it for all their plots, but since you've already learned the ways of base graphics in FISH554 I think you'll benefit from the strategy in this figure:

Good graphical displays of data require rapid iteration and lots of exploration. If it takes you hours to code a plot in base graphics, you're unlikely to throw it out and explore other ways of visualizing the data, and you're unlikely to explore all the dimensions of the data.

# Basics of the grammar

Let's look at some illustrative ggplot2 code:

library("ggplot2")
theme_set(theme_bw()) # use the black and white theme throughout
# fake data:
d <- data.frame(x = c(1:8, 1:8), y = runif(16),
  group1 = rep(gl(2, 4, labels = c("a", "b")), 2),
  group2 = gl(2, 8))
head(d)
##   x         y group1 group2
## 1 1 0.8683116      a      1
## 2 2 0.1934542      a      1
## 3 3 0.1131743      a      1
## 4 4 0.9260514      a      1
## 5 5 0.9476787      b      1
## 6 6 0.2949107      b      1

ggplot(data = d) + geom_point(aes(x, y, colour = group1)) +
  facet_grid(~group2)

The basic format in this example is:

1. ggplot(): start an object and specify the data

2. geom_point(): we want a scatter plot; this is called a “geom”

3. aes(): specifies the “aesthetic” elements; a legend is automatically created

4. facet_grid(): specifies the “faceting” or panel layout

There are also statistics, scales, and annotation options, among others. At a minimum, you must specify the data, some aesthetics, and a geom. I will elaborate on these below. Yes, ggplot2 combines elements with + symbols! This may seem non-standard, although it has the advantage of allowing ggplot2 plots to be proper R objects, which can be modified, inspected, and re-used (I provide some examples at the end).
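To make the point about plots being ordinary R objects concrete, here is a minimal sketch that builds the same kind of plot one piece at a time. The data frame d_demo and the object names p_base, p_points, and p_panels are made up for illustration; layer_data() is a ggplot2 helper that returns the data a layer will draw.

```r
library("ggplot2")

# Hypothetical stand-in for the fake data `d` used above
set.seed(1)
d_demo <- data.frame(
  x = c(1:8, 1:8),
  y = runif(16),
  group1 = rep(gl(2, 4, labels = c("a", "b")), 2),
  group2 = gl(2, 8)
)

# Each `+` returns a new plot object; nothing is drawn until it is printed
p_base <- ggplot(data = d_demo)
p_points <- p_base + geom_point(aes(x, y, colour = group1))
p_panels <- p_points + facet_grid(~group2)

# Inspect the data that the point layer will draw (one row per point)
built <- layer_data(p_panels, 1)
nrow(built)
unique(built$PANEL)

# Printing the object is what actually draws the figure
print(p_panels)
```

Because each intermediate object is kept, you can go back to p_points and try a different faceting or theme without retyping the rest.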

There are two main plotting functions in ggplot2: qplot and ggplot. qplot is short for "quick plot" and is made to mimic the format of plot from base R. qplot requires less syntax for many common tasks but has limitations; it's essentially a wrapper for ggplot, and recent ggplot2 releases have deprecated it. The ggplot function itself isn't complicated and will work in all cases. I prefer to work with just the ggplot syntax and will focus on it here. I find it easier to master one function that can do everything.

We're going to work with morphological data from Galapagos finches, which is available from BIRDD: Beagle Investigation Return with Darwinian Data at http://bioquest.org/birdd/morph.php. It is originally from Sato et al. 2001 Mol. Biol. Evol. http://mbe.oxfordjournals.org/content/18/3/299.full.

I've saved a .csv version of this dataset for you. Download the data Morph_for_Sato.csv here and put it in your R working directory.

Before we get started, we're going to clean the data up a bit. I've removed some columns and made the column names lower case. I've also removed all but one island. You can do that with this code:

morph <- read.csv("Morph_for_Sato.csv")
names(morph) <- tolower(names(morph)) # make column names lowercase
morph <- subset(morph, islandid == "Flor_Chrl") # take only one island
morph <- morph[,c("taxonorig", "sex", "wingl", "beakh", "ubeakl")] # only keep these columns
names(morph)[1] <- "taxon"
morph <- data.frame(na.omit(morph)) # remove all rows with any NAs to make this simple
morph$taxon <- factor(morph$taxon) # remove extra remaining factor levels
morph$sex <- factor(morph$sex) # remove extra remaining factor levels
row.names(morph) <- NULL # tidy up the row names

Take a look at the data. There are columns for taxon, sex, wing length, beak height, and upper beak length:

head(morph)
str(morph)

# Geoms

geom refers to a geometric object. It determines the “shape” of the plot elements. Some common geoms:

| geom | Description |
|---|---|
| geom_point() | Points |
| geom_line() | Lines |
| geom_ribbon() | Ribbons, y range with continuous x values |
| geom_polygon() | Polygon, a filled path |
| geom_pointrange() | Vertical line with a point in the middle |
| geom_linerange() | An interval represented by a vertical line |
| geom_path() | Connect observations in original order |
| geom_histogram() | Histograms |
| geom_text() | Text annotations |
| geom_violin() | Violin plot (another name for a beanplot) |
| geom_map() | Polygons on a map |

Open the ggplot2 web documentation http://docs.ggplot2.org/ and keep it open to refer back to it throughout these exercises.

First, let's experiment with some geoms using the morph dataset that you downloaded and cleaned up above. I'll start by setting up a basic scatterplot of beak height vs. wing length:

library("ggplot2")
ggplot(morph, aes(wingl, beakh)) + geom_point(alpha = 0.4)

Because there's lots of overplotting, I've set the alpha (opacity) value to 40%. Alternatively, we could have added some jittering. We'll do that below.

Experiment with ggplot2 geoms. Try applying at least 3 different geoms to the morph dataset. For example, try showing the distribution of wing length with geom_histogram() and geom_density(). You could also try showing the distribution of wing length for male and female birds by using geom_violin(). After you've experimented for a while, click on R Source to see the code behind the plots below.
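If you want something to compare your attempts against, here is one possible sketch of the three suggested plots. So that it runs without the downloaded file, it uses morph_demo, a simulated stand-in that I made up with the same column names as morph; the values are not real finch measurements. Swap in morph to see the real thing.

```r
library("ggplot2")

# Simulated stand-in for `morph` (hypothetical values, same column names)
set.seed(42)
morph_demo <- data.frame(
  taxon = factor(rep(c("sp_a", "sp_b", "sp_c"), each = 40)),
  sex = factor(rep(c("F", "M"), times = 60)),
  wingl = rnorm(120, mean = rep(c(60, 68, 75), each = 40), sd = 3),
  beakh = rnorm(120, mean = rep(c(8, 11, 15), each = 40), sd = 1)
)

# Distribution of wing length as a histogram (bin width of 1 is an arbitrary choice)
p_hist <- ggplot(morph_demo, aes(wingl)) + geom_histogram(binwidth = 1)

# The same distribution as a smoothed density estimate
p_dens <- ggplot(morph_demo, aes(wingl)) + geom_density()

# Wing length for each sex shown as violins
p_violin <- ggplot(morph_demo, aes(sex, wingl)) + geom_violin()

print(p_hist)
print(p_dens)
print(p_violin)
```

Note that only the geom and the aesthetics change between the three plots; the data argument stays the same.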

# Aesthetics

Aesthetics refer to the attributes of the data you want to display. They map the data to an attribute (such as the size or shape of a symbol) and generate an appropriate legend. Aesthetics are specified with the aes() function.

As an example, the aesthetics available for geom_point() are: x, y, alpha, colour, fill, shape, and size. (Note that ggplot tries to accommodate the user who’s never “suffered” through base graphics before by using intuitive arguments like colour, size, and linetype, but ggplot will also accept arguments such as col, cex, and lty.) Read the help files to see the aesthetic options for the geom you’re using. They’re generally self-explanatory. Aesthetics can be specified within the ggplot() call or within a geom. If they’re specified within the ggplot() call then they apply to all geoms you specify.
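Here is a small sketch of that last point using the mpg dataset that ships with ggplot2. The object names p_global and p_local are my own, and I've picked geom_smooth() with a linear fit as the second geom purely for illustration.

```r
library("ggplot2")

# colour is set in ggplot(), so both geoms inherit it:
# the points are coloured by drv and there is one fitted line per drv
p_global <- ggplot(mpg, aes(cty, hwy, colour = drv)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# colour is set only inside geom_point(), so the smoother ignores it:
# the points are coloured by drv but there is a single fitted line
p_local <- ggplot(mpg, aes(cty, hwy)) +
  geom_point(aes(colour = drv)) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Count the groups that the smoothing layer (layer 2) was fitted to
length(unique(layer_data(p_global, 2)$group))
length(unique(layer_data(p_local, 2)$group))

print(p_global)
print(p_local)
```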

Note the important difference between specifying characteristics like colour and shape inside or outside the aes() function: those inside the aes() function are assigned the colour or shape automatically based on the data. If characteristics like colour or shape are defined outside the aes() function, then the characteristic is not mapped to data. Here’s an example:

ggplot(mpg, aes(cty, hwy)) + geom_point(aes(colour = class))
ggplot(mpg, aes(cty, hwy)) + geom_point(colour = "red")
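A common slip that follows from this distinction is putting a fixed colour inside aes(). The sketch below (object names p_mapped and p_fixed are mine) shows what happens: ggplot2 treats "red" as a data value with a single level, picks a colour from its default scale, and adds a legend.

```r
library("ggplot2")

# "red" inside aes() is treated as data: a one-level variable mapped to
# the default colour scale, so the points are not drawn in red
p_mapped <- ggplot(mpg, aes(cty, hwy)) + geom_point(aes(colour = "red"))

# "red" outside aes() is a fixed setting applied to every point
p_fixed <- ggplot(mpg, aes(cty, hwy)) + geom_point(colour = "red")

# Compare the colours that each layer will actually draw
unique(layer_data(p_mapped)$colour)
unique(layer_data(p_fixed)$colour)

print(p_mapped)
print(p_fixed)
```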

Let's play with mapping some of our data to aesthetics. I'll start with one example. I'm going to map the male/female value to a colour in our scatterplot of wing length and beak height. This time I'll use jittering instead of transparency to deal with overplotting:

ggplot(morph, aes(wingl, beakh)) +
  geom_point(aes(colour = sex),
    position = position_jitter(width = 0.3, height = 0))

Explore the morph dataset yourself by applying some aesthetics. You can see all the available aesthetics for a given geom by looking at the documentation. Either see the website or, for example, ?geom_point

Some suggestions:

• try the same scatterplot but show upper beak length (ubeakl) with size
• try the same scatterplot but show the taxon with colour
• try the same scatterplot but show the upper beak length with colour (note how ggplot treats ubeakl differently than taxon when it picks a colour scale)
• try the same scatterplot but show the sex with a different shape
• combine all these: colour for taxon, shape for sex, and size for upper beak length

This last version is a bit silly, but it illustrates how quickly you can explore multiple dimensions with ggplot2.
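One possible way to write the suggestions above is sketched here. As before, morph_demo is a simulated stand-in that I made up so the code runs on its own; replace it with morph for the real data. The names p_scatter, jit, and the p_* objects are hypothetical.

```r
library("ggplot2")

# Simulated stand-in for `morph` (hypothetical values, same column names)
set.seed(42)
morph_demo <- data.frame(
  taxon = factor(rep(c("sp_a", "sp_b", "sp_c"), each = 40)),
  sex = factor(rep(c("F", "M"), times = 60)),
  wingl = rnorm(120, mean = rep(c(60, 68, 75), each = 40), sd = 3),
  beakh = rnorm(120, mean = rep(c(8, 11, 15), each = 40), sd = 1),
  ubeakl = rnorm(120, mean = rep(c(9, 12, 16), each = 40), sd = 1)
)

# Shared starting point: x and y are set once and inherited by every geom
p_scatter <- ggplot(morph_demo, aes(wingl, beakh))
jit <- position_jitter(width = 0.3, height = 0)

p_size <- p_scatter + geom_point(aes(size = ubeakl), position = jit)
p_taxon <- p_scatter + geom_point(aes(colour = taxon), position = jit)
# ubeakl is numeric, so it gets a continuous colour gradient
p_ubeakl <- p_scatter + geom_point(aes(colour = ubeakl), position = jit)
p_shape <- p_scatter + geom_point(aes(shape = sex), position = jit)

# Everything at once: colour for taxon, shape for sex, size for ubeakl
p_all <- p_scatter +
  geom_point(aes(colour = taxon, shape = sex, size = ubeakl), position = jit)

print(p_all)
```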

# Facets (small multiples)

In ggplot2 parlance, small multiples are referred to as "facets". There are two kinds: facet_wrap() and facet_grid(). facet_wrap() plots the panels in the order of the factor levels. By default it fills a row from left to right and wraps to the next row when it reaches the end. You can specify the number of rows and columns with nrow and ncol. facet_grid() lays out the panels in a grid with an explicit x and y position. By default all x and y axes will be shared among panels. However, you could, for example, allow the y axes to vary with facet_wrap(scales = "free_y") or allow all axes to vary with facet_wrap(scales = "free").

To specify the data frame columns that are mapped to the rows and columns of facets, separate them with a tilde. For example: + facet_grid(row_name~column_name). Usually you'll only give a row or column to facet_wrap(), but try and see what happens if you give it both. See the help ?facet_wrap.
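Here is a short sketch of the facet syntax using the built-in mpg dataset. The choice of columns (class, drv, cyl) and the object names are just for illustration.

```r
library("ggplot2")

p_mpg <- ggplot(mpg, aes(cty, hwy)) + geom_point()

# One panel per vehicle class, wrapped into 3 columns
p_wrap <- p_mpg + facet_wrap(~class, ncol = 3)

# Same panels, but each one gets its own y axis range
p_free_y <- p_mpg + facet_wrap(~class, scales = "free_y")

# Explicit grid: drv defines the rows and cyl defines the columns
p_grid <- p_mpg + facet_grid(drv ~ cyl)

# Number of panels that contain data in the wrapped version
length(unique(layer_data(p_wrap)$PANEL))

print(p_wrap)
print(p_free_y)
print(p_grid)
```

Note that facet_grid() draws a panel for every row and column combination, even when a combination has no data, whereas facet_wrap() only draws panels for the combinations that occur.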

Try a scatterplot of beak height against wing length with a different panel for each taxon. Use facet_wrap():

In some cases, it's useful to let the x or y axes have different scales for each panel. Try giving each panel a different axis here using scales = "free" in your call to facet_wrap():

Now try using facet_grid() to explore the same scatterplot for each combination of sex and taxon. (Remove the scales = "free" code for simplicity.)

As another example, let's look at the distribution of wing length by sex with different panels for each taxon. Use a boxplot or violin plot to show the distributions.
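For reference, here is one way the four facet exercises above could be written. Again, morph_demo is a simulated stand-in that I made up so the code is self-contained; use morph in its place for the real data. All object names are hypothetical.

```r
library("ggplot2")

# Simulated stand-in for `morph` (hypothetical values, same column names)
set.seed(42)
morph_demo <- data.frame(
  taxon = factor(rep(c("sp_a", "sp_b", "sp_c"), each = 40)),
  sex = factor(rep(c("F", "M"), times = 60)),
  wingl = rnorm(120, mean = rep(c(60, 68, 75), each = 40), sd = 3),
  beakh = rnorm(120, mean = rep(c(8, 11, 15), each = 40), sd = 1)
)

p_scatter <- ggplot(morph_demo, aes(wingl, beakh)) + geom_point(alpha = 0.4)

# 1. One panel per taxon
p_by_taxon <- p_scatter + facet_wrap(~taxon)

# 2. One panel per taxon, each with its own x and y axis ranges
p_by_taxon_free <- p_scatter + facet_wrap(~taxon, scales = "free")

# 3. Rows for sex, columns for taxon
p_sex_taxon <- p_scatter + facet_grid(sex ~ taxon)

# 4. Wing length by sex as boxplots, one panel per taxon
p_box <- ggplot(morph_demo, aes(sex, wingl)) +
  geom_boxplot() +
  facet_wrap(~taxon)

print(p_by_taxon)
print(p_by_taxon_free)
print(p_sex_taxon)
print(p_box)
```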

# Customizing ggplot2

So far we've used the default settings. Let's try applying a theme and adjust some of the annotation of our plots.

A useful theme built into ggplot2 is theme_bw(). You’ll notice that I set it as the default in this document back when I first loaded ggplot2. You can specify it for a specific plot like this:

dsamp <- diamonds[sample(nrow(diamonds), 1000), ] # random subset of diamonds, used in the final example below
ggplot(mtcars, aes(wt, mpg)) + geom_point() + theme_bw()

Alternatively, the following plot shows the default theme. The grey is designed to match the typical darkness of a page filled with text to keep the plot from drawing too much attention.

ggplot(mtcars, aes(wt, mpg)) + geom_point() + theme_grey()

A powerful aspect of ggplot2 is that you can write your own themes. See the ggthemes package for some examples.
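Writing your own theme can be as simple as wrapping an existing theme and a few theme() overrides in a function. The sketch below is only an illustration: theme_demo is a hypothetical name, and the two settings I change (hiding minor grid lines, moving the legend to the bottom) are arbitrary choices.

```r
library("ggplot2")

# A hypothetical custom theme: start from theme_bw() and override a few elements
theme_demo <- function(base_size = 12) {
  theme_bw(base_size = base_size) +
    theme(
      panel.grid.minor = element_blank(), # hide the minor grid lines
      legend.position = "bottom"          # put any legend below the plot
    )
}

# Use it like any built-in theme
p_theme <- ggplot(mtcars, aes(wt, mpg, colour = factor(cyl))) +
  geom_point() +
  theme_demo()

print(p_theme)
```

You could also make it the default for a whole session with theme_set(theme_demo()).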

An Edward Tufte-like theme:

library("ggthemes")
ggplot(mtcars, aes(wt, mpg)) + geom_point() + geom_rangeframe() +
  theme_tufte()

Just what you wanted:

ggplot(dsamp, aes(carat, price, colour = cut)) + geom_point() +
  theme_excel() + scale_colour_excel()