February 2015, by Sean C. Anderson
This page will help you teach yourself how to rapidly explore data with the ggplot2 R package. It was written as a self-study component for the FISH554 class at the University of Washington.
Work your way through this page. When you want to check your work, or if you get stuck on an exercise, just click on the button R source
and you'll see the R source code that created a figure.
By the end you should be able to:
ggplot2 is an R package by Hadley Wickham and Winston Chang that implements Wilkinson's Grammar of Graphics. The emphasis of ggplot2 is on rapid exploration of data, and especially high-dimensional data. Think of base graphic functions as drawing with data (examples of base graphic functions are plot()
, points()
, and lines()
). With base graphics, you have complete control over every pixel in a plot, but it can take a lot of time and code to produce one.
Although ggplot2 can be fully customized, it reaches a point of diminishing returns. I tend to use ggplot2 and base graphics for what they excel at: ggplot2 for rapid data exploration and base graphics for polished and fully-customized plots for publication.
The following figure shows figure creation time vs. data dimensions and customization level for base graphics (blue) and ggplot2 (red). Left panel: It's remarkably easy to plot high-dimensional data in ggplot2 with, for example, colours, sizes, shapes, and panels. Right panel: ggplot2 excels at rapid visual exploration of data, but has some limitations in how it can be customized. Base graphics are fully customizable but can take longer to set up. I try and exploit the grey shaded areas: I use ggplot2 for data exploration and once I've decided on a small number of key plots, I'll use base graphics to make fully-customized plots if needed. Some people get really good at customizing ggplot and stick with it for all their plots, but since you've already learned the ways of base graphics in FISH554 I think you'll benefit from the strategy in this figure:
Good graphical displays of data require rapid iteration and lots of exploration. If it takes you hours to code a plot in base graphics, you're unlikely to throw it out and explore other ways of visualizing the data, and you're unlikely to explore all the dimensions of the data.
Let's look at some illustrative ggplot2 code:
library("ggplot2")
theme_set(theme_bw()) # use the black and white theme throughout
# fake data:
d <- data.frame(x = c(1:8, 1:8), y = runif(16),
  group1 = rep(gl(2, 4, labels = c("a", "b")), 2),
  group2 = gl(2, 8))
head(d)
## x y group1 group2
## 1 1 0.8683116 a 1
## 2 2 0.1934542 a 1
## 3 3 0.1131743 a 1
## 4 4 0.9260514 a 1
## 5 5 0.9476787 b 1
## 6 6 0.2949107 b 1
ggplot(data = d) + geom_point(aes(x, y, colour = group1)) +
  facet_grid(~group2)
The basic format in this example is:
ggplot()
: start an object and specify the data
geom_point()
: we want a scatter plot; this is called a “geom”
aes()
: specifies the “aesthetic” elements; a legend is automatically created
facet_grid()
: specifies the “faceting” or panel layout
There are also statistics, scales, and annotation options, among others. At a minimum, you must specify the data, some aesthetics, and a geom. I will elaborate on these below. Yes, ggplot2 combines elements with +
symbols! This may seem non-standard, although it has the advantage of allowing ggplot2 plots to be proper R objects, which can be modified, inspected, and re-used (I provide some examples at the end).
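As a small sketch of what it means for a plot to be an R object, the code below stores a plot in a variable, adds a layer to it afterwards, and counts the layers. The data frame d_demo and the names p and p2 are made up for this illustration.

```r
library(ggplot2)

# Made-up toy data for illustration only
d_demo <- data.frame(x = 1:8, y = (1:8)^2,
                     group1 = gl(2, 4, labels = c("a", "b")))

# Store the plot in an object instead of printing it right away
p <- ggplot(d_demo, aes(x, y)) + geom_point()

# Re-use the stored object: add a line layer coloured by group
p2 <- p + geom_line(aes(colour = group1))

length(p$layers)   # number of layers in the original plot
length(p2$layers)  # number of layers after adding geom_line()

p2  # printing the object draws the plot
```

Because p is left unchanged, you can branch off several variants of the same base plot without rewriting the shared parts.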
There are two main plotting functions in ggplot2: qplot
and ggplot
. qplot
is short for "quick plot" and is made to mimic the format of plot
from base R. qplot
requires less syntax for many common tasks, but has limitations — it's essentially a wrapper for ggplot
. The ggplot
function itself isn't complicated and will work in all cases. I prefer to work with just the ggplot
syntax and will focus on it here. I find it easier to master one function that can do everything.
We're going to work with morphological data from Galapagos finches, which is available from BIRDD: Beagle Investigation Return with Darwinian Data at http://bioquest.org/birdd/morph.php. It is originally from Sato et al. 2000 Mol. Biol. Evol. http://mbe.oxfordjournals.org/content/18/3/299.full.
I've saved a .csv
version of this dataset for you. Download the data Morph_for_Sato.csv
here and put it in your R working directory.
Before we get started, we're going to clean the data up a bit. I've removed some columns and made the column names lower case. I've also removed all but one island. You can do that with this code:
morph <- read.csv("Morph_for_Sato.csv")
names(morph) <- tolower(names(morph)) # make column names lowercase
morph <- subset(morph, islandid == "Flor_Chrl") # take only one island
morph <- morph[,c("taxonorig", "sex", "wingl", "beakh", "ubeakl")] # only keep these columns
names(morph)[1] <- "taxon"
morph <- data.frame(na.omit(morph)) # remove all rows with any NAs to make this simple
morph$taxon <- factor(morph$taxon) # remove extra remaining factor levels
morph$sex <- factor(morph$sex) # remove extra remaining factor levels
row.names(morph) <- NULL # tidy up the row names
Take a look at the data. There are columns for taxon, sex, wing length, beak height, and upper beak length:
head(morph)
str(morph)
geom
refers to a geometric object. It determines the “shape” of the plot elements. Some common geoms:
| geom | Description |
|---|---|
| geom_point() | Points |
| geom_line() | Lines |
| geom_ribbon() | Ribbons, y range with continuous x values |
| geom_polygon() | Polygon, a filled path |
| geom_pointrange() | Vertical line with a point in the middle |
| geom_linerange() | An interval represented by a vertical line |
| geom_path() | Connect observations in original order |
| geom_histogram() | Histograms |
| geom_text() | Text annotations |
| geom_violin() | Violin plot (another name for a beanplot) |
| geom_map() | Polygons on a map |
Open the ggplot2 web documentation http://docs.ggplot2.org/ and keep it open to refer back to it throughout these exercises.
First, let's experiment with some geoms using the morph
dataset that you downloaded and cleaned up above. I'll start by setting up a basic scatterplot of beak height vs. wing length:
Because there's lots of overplotting, I've set the alpha (opacity) value to 40%. Alternatively, we could have added some jittering. We'll do that below.
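The scatterplot described here can be sketched roughly as follows. The column names (wingl, beakh) come from the cleaning step above. morph_demo is a simulated stand-in I've made up so the block runs without the download; swap in morph to see the real data.

```r
library(ggplot2)

# Hypothetical stand-in for the cleaned morph data (simulated, not real measurements)
set.seed(1)
morph_demo <- data.frame(
  taxon  = factor(rep(c("taxon_a", "taxon_b", "taxon_c"), each = 20)),
  sex    = factor(rep(c("F", "M"), times = 30)),
  wingl  = round(rnorm(60, mean = 65, sd = 5)),  # rounded, so points overlap
  beakh  = rnorm(60, mean = 10, sd = 2),
  ubeakl = rnorm(60, mean = 15, sd = 2)
)

# Beak height vs. wing length; alpha = 0.4 makes each point 40% opaque
p <- ggplot(morph_demo, aes(wingl, beakh)) +
  geom_point(alpha = 0.4)
p
```

Here alpha is set outside aes(), so it applies to every point rather than being mapped to a column.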
Experiment with ggplot2 geoms. Try applying at least 3 different geoms to the morph
dataset. For example, try showing the distribution of wing length with geom_histogram()
and geom_density()
. You could also try showing the distribution of wing length for male and female birds by using geom_violin()
. After you've experimented for a while, click on R Source
to see the code behind the plots below.
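If you want something to compare against, here is one possible sketch of the three geoms suggested above. It is not necessarily the code behind the original figures. morph_demo is a simulated stand-in with the same column names as morph, and the binwidth of 2 is an arbitrary choice.

```r
library(ggplot2)

# Hypothetical stand-in for the cleaned morph data (simulated, not real measurements)
set.seed(1)
morph_demo <- data.frame(
  taxon = factor(rep(c("taxon_a", "taxon_b", "taxon_c"), each = 20)),
  sex   = factor(rep(c("F", "M"), times = 30)),
  wingl = rnorm(60, mean = 65, sd = 5),
  beakh = rnorm(60, mean = 10, sd = 2)
)

# Distribution of wing length as a histogram (binwidth chosen arbitrarily)
p_hist <- ggplot(morph_demo, aes(wingl)) + geom_histogram(binwidth = 2)

# The same distribution as a smoothed density estimate
p_dens <- ggplot(morph_demo, aes(wingl)) + geom_density()

# Wing length by sex as violin plots (discrete x, continuous y)
p_violin <- ggplot(morph_demo, aes(sex, wingl)) + geom_violin()

p_hist
p_dens
p_violin
```

Note that the histogram and density plots need only an x aesthetic, while the violin plot needs both x and y.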
Aesthetics refer to the attributes of the data you want to display. They map the data to an attribute (such as the size or shape of a symbol) and generate an appropriate legend. Aesthetics are specified with the aes()
function.
As an example, the aesthetics available for geom_point()
are: x
, y
, alpha
, colour
, fill
, shape
, and size
. (Note that ggplot tries to accommodate the user who’s never “suffered” through base graphics before by using intuitive arguments like colour
, size
, and linetype
, but ggplot will also accept arguments such as col
, cex
, and lty
.) Read the help files to see the aesthetic options for the geom you’re using. They’re generally self-explanatory. Aesthetics can be specified within the ggplot() call or within a geom. If they’re specified within ggplot(), they apply to all geoms you add.
Note the important difference between specifying characteristics like colour and shape inside or outside the aes()
function: those inside the aes()
function are assigned the colour or shape automatically based on the data. If characteristics like colour or shape are defined outside the aes()
function, then the characteristic is not mapped to data. Here’s an example:
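A minimal sketch of the difference, using a simulated stand-in (morph_demo, a made-up name) with the same columns as morph: the first plot maps colour to the sex column, and the second sets one fixed colour for every point.

```r
library(ggplot2)

# Hypothetical stand-in for the cleaned morph data (simulated, not real measurements)
set.seed(1)
morph_demo <- data.frame(
  sex   = factor(rep(c("F", "M"), times = 30)),
  wingl = rnorm(60, mean = 65, sd = 5),
  beakh = rnorm(60, mean = 10, sd = 2)
)

# Inside aes(): colour is mapped to the data, and a legend is created
p_mapped <- ggplot(morph_demo, aes(wingl, beakh)) +
  geom_point(aes(colour = sex))

# Outside aes(): colour is a fixed setting, so every point is red and there is no legend
p_fixed <- ggplot(morph_demo, aes(wingl, beakh)) +
  geom_point(colour = "red")

p_mapped
p_fixed
```

A common slip is writing aes(colour = "red"), which treats "red" as a one-level data value and lets ggplot2 pick the colour for it.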
Let's play with mapping some of our data to aesthetics. I'll start with one example. I'm going to map the male/female value to a colour in our scatterplot of wing length and beak height. This time I'll use jittering instead of transparency to deal with overplotting:
ggplot(morph, aes(wingl, beakh)) +
  geom_point(aes(colour = sex),
    position = position_jitter(width = 0.3, height = 0))
Explore the morph
dataset yourself by applying some aesthetics. You can see all the available aesthetics for a given geom by looking at the documentation: either see the website or run, for example, ?geom_point
Some suggestions:
- Show upper beak length (ubeakl) with size
- Show ubeakl, and then taxon, with colour (note how ggplot treats ubeakl differently than taxon when it picks a colour scale)

This last version is a bit silly, but it illustrates how quickly you can explore multiple dimensions with ggplot2.
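A few of these mappings can be sketched as below. This is one possible set of answers rather than the code behind the original figures, and morph_demo is a simulated stand-in with the same column names as morph.

```r
library(ggplot2)

# Hypothetical stand-in for the cleaned morph data (simulated, not real measurements)
set.seed(1)
morph_demo <- data.frame(
  taxon  = factor(rep(c("taxon_a", "taxon_b", "taxon_c"), each = 20)),
  sex    = factor(rep(c("F", "M"), times = 30)),
  wingl  = rnorm(60, mean = 65, sd = 5),
  beakh  = rnorm(60, mean = 10, sd = 2),
  ubeakl = rnorm(60, mean = 15, sd = 2)
)

base <- ggplot(morph_demo, aes(wingl, beakh))

# Upper beak length mapped to point size
p_size <- base + geom_point(aes(size = ubeakl), alpha = 0.4)

# A numeric column mapped to colour gets a continuous gradient
p_cont <- base + geom_point(aes(colour = ubeakl))

# A factor mapped to colour gets one distinct colour per level
p_disc <- base + geom_point(aes(colour = taxon))

p_size
p_cont
p_disc
```

Comparing the legends of p_cont and p_disc shows the continuous versus discrete colour scales side by side.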
In ggplot2 parlance, small multiples are referred to as "facets". There are two kinds: facet_wrap()
and facet_grid()
. facet_wrap()
plots the panels in the order of the factor levels. By default it fills a row at a time: when it gets to the end of a row it wraps to the next row. You can specify the number of rows and columns with nrow
and ncol
. facet_grid()
lays out the panels in a grid with an explicit x and y position. By default all x and y axes will be shared among panels. However, you could, for example, allow the y axes to vary with facet_wrap(scales = "free_y")
or allow all axes to vary with facet_wrap(scales = "free")
.
To specify the data frame columns that are mapped to the rows and columns of facets, separate them with a tilde. For example: + facet_grid(row_name~column_name)
. Usually you'll only give a row or column to facet_wrap()
, but try and see what happens if you give it both. See the help ?facet_wrap
.
Try a scatterplot of beak height against wing length with a different panel for each taxon. Use facet_wrap()
:
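One way this might look, as a sketch on simulated stand-in data (morph_demo is a made-up name; use morph for the real plot):

```r
library(ggplot2)

# Hypothetical stand-in for the cleaned morph data (simulated, not real measurements)
set.seed(1)
morph_demo <- data.frame(
  taxon = factor(rep(c("taxon_a", "taxon_b", "taxon_c"), each = 20)),
  sex   = factor(rep(c("F", "M"), times = 30)),
  wingl = rnorm(60, mean = 65, sd = 5),
  beakh = rnorm(60, mean = 10, sd = 2)
)

# One panel per taxon; the one-sided formula names the faceting column
p_wrap <- ggplot(morph_demo, aes(wingl, beakh)) +
  geom_point(alpha = 0.4) +
  facet_wrap(~taxon)
p_wrap
```

With the default shared scales, differences in size between taxa are easy to compare across panels.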
In some cases, it's useful to let the x or y axes have different scales for each panel. Try giving each panel a different axis here using scales = "free"
in your call to facet_wrap()
:
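A sketch of the free-scales version, again on a simulated stand-in (morph_demo). The group means are deliberately made different here so that the effect of scales = "free" is visible.

```r
library(ggplot2)

# Hypothetical stand-in data; each taxon is given a different mean wing length
set.seed(1)
morph_demo <- data.frame(
  taxon = factor(rep(c("taxon_a", "taxon_b", "taxon_c"), each = 20)),
  wingl = rnorm(60, mean = rep(c(55, 65, 80), each = 20), sd = 3),
  beakh = rnorm(60, mean = rep(c(8, 10, 14), each = 20), sd = 1)
)

# scales = "free" lets both the x and y axes vary from panel to panel
p_free <- ggplot(morph_demo, aes(wingl, beakh)) +
  geom_point(alpha = 0.4) +
  facet_wrap(~taxon, scales = "free")
p_free
```

Free scales show the pattern within each panel more clearly, at the cost of making comparisons between panels harder.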
Now try using facet_grid
to explore the same scatterplot for each combination of sex and taxon. (Remove the scales = "free"
code for simplicity.)
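A possible sketch, with sex on the rows and taxon on the columns. Which variable goes on which side of the tilde is my choice here, and morph_demo is a simulated stand-in for morph.

```r
library(ggplot2)

# Hypothetical stand-in for the cleaned morph data (simulated, not real measurements)
set.seed(1)
morph_demo <- data.frame(
  taxon = factor(rep(c("taxon_a", "taxon_b", "taxon_c"), each = 20)),
  sex   = factor(rep(c("F", "M"), times = 30)),
  wingl = rnorm(60, mean = 65, sd = 5),
  beakh = rnorm(60, mean = 10, sd = 2)
)

# Rows are given on the left of the tilde, columns on the right
p_grid <- ggplot(morph_demo, aes(wingl, beakh)) +
  geom_point(alpha = 0.4) +
  facet_grid(sex ~ taxon)
p_grid
```

This gives a 2 by 3 grid of panels, one for each sex and taxon combination in the stand-in data.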
As another example, let's look at the distribution of wing length by sex with a different panel for each taxon. Use a boxplot or violin plot to show the distributions.
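Here is one way to sketch this with boxplots; replacing geom_boxplot() with geom_violin() gives the violin version. As before, morph_demo is a simulated stand-in with the same column names as morph.

```r
library(ggplot2)

# Hypothetical stand-in for the cleaned morph data (simulated, not real measurements)
set.seed(1)
morph_demo <- data.frame(
  taxon = factor(rep(c("taxon_a", "taxon_b", "taxon_c"), each = 20)),
  sex   = factor(rep(c("F", "M"), times = 30)),
  wingl = rnorm(60, mean = 65, sd = 5)
)

# Sex on the x axis, wing length on the y axis, one panel per taxon
p_box <- ggplot(morph_demo, aes(sex, wingl)) +
  geom_boxplot() +
  facet_wrap(~taxon)
p_box
```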
So far we've used the default settings. Let's try applying a theme and adjust some of the annotation of our plots.
A useful theme built into ggplot2 is theme_bw()
. You’ll notice that I set it as the default in this document back when I first loaded ggplot2. You can specify it for a specific plot like this:
ggplot(mtcars, aes(wt, mpg)) + geom_point() + theme_bw()
Alternatively, the following plot shows the default theme. The grey is designed to match the typical darkness of a page filled with text to keep the plot from drawing too much attention.
A powerful aspect of ggplot2 is that you can write your own themes. See the ggthemes package for some examples.
An Edward Tufte-like theme:
Just what you wanted: