Appendix D — Joins

If you work with biodiversity data, it is likely that you will need to join two separate datasets at some point to analyse how spatial, temporal, or environmental factors influence species. This chapter provides a brief overview of several common types of joins in dplyr to help you get started.

For a comprehensive introduction to joins, check out the Joins chapter in R for Data Science.

D.0.1 Prerequisites

In this chapter, we will use starling occurrence data from September 2015 in the ALA.

# packages
library(galah)
library(dplyr)
library(here)
library(ggplot2)
galah_config(email = "your-email-here") # ALA-registered email

starlings <- galah_call() |>
  filter(doi == "https://doi.org /10.26197/ala.98d038d3-2058-4294-b683-fcb51a11f018") |>
  atlas_occurrences()

starlings_taxonomy <- galah_call() |>
  identify("Sturnidae") |>
  atlas_species()

Note: You don’t need to run this code block to read this chapter. It can, however, be useful to see the original download query. This code will download the latest data from the ALA, which you are welcome to use instead, though the data might not exactly reproduce results in this chapter.

library(galah)
galah_config(email = "your-email-here")

starlings <- galah_call() |>
  identify("Sturnidae") |>
  filter(year == 2015,
         month == 9) |>
  select(genus, 
         species, 
         scientificName, 
         cl22,
         year,
         month, 
         decimalLatitude,
         decimalLongitude) |> 
  atlas_occurrences()
1
We created a custom DOI for our download by using atlas_occurrences(mint_doi = TRUE).

D.1 Keys

Joining dataframes relies on setting a key—one or more columns that exist in a primary table that correspond to one or more columns in a secondary table. Two datasets that we intend to join are matched according to the designated key.

As a simple example, let’s say we want to add complete taxonomic information to our starlings dataframe, which contains occurrence records with some, but not all, levels of taxonomic information. starlings_taxonomy contains complete taxonomic information for Sturnidae.

Let’s join our starlings dataframe with starlings_taxonomy. The column genus in starlings appears to contain the same information in column genus in starlings_taxonomy.

We can use this genus column as a key to add the extra levels of taxonomic information to the table containing starling occurrence records1.

starlings |>
  left_join(starlings_taxonomy, 
            join_by(genus)) |>
  
  rmarkdown::paged_table() # paged output
Warning in left_join(starlings, starlings_taxonomy, join_by(genus)): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 1805 of `x` matches multiple rows in `y`.
ℹ Row 2 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.

Note that we received some warnings indicating that some rows had multiple matches. Make sure to read these warning messages carefully if you ever receive them because they might tell you that something unexpected happened during your join! In our case, the warnings are the result of several genera in starlings_taxonomy having the same genus name, which is something we already knew about and doesn’t worry us.

D.2 Basic types of joins

There are many types of joins that can help you in all kinds of situations! Join types generally fall within two categories:

  • Mutating joins combine variables from two tables (e.g., left_join(), right_join(), full_join())
  • Filtering joins combine variables, and additionally keep or remove rows that do not match the key column (e.g., semi_join(), anti_join())

Below are a few common examples of join types. Examples and animations are taken from Garrick Aden-Buie’s tidyexplain animations.

x <- tibble(id = c(1, 2, 3),
            x = c("x1", "x2", "x3"))

y <- tibble(id = c(1, 2, 4),
            y = c("y1", "y2", "y4"))

D.3 Spatial joins

Summarising where species or taxonomic groups occur by grouping them by spatial regions (e.g. state, council area, bioregion) can be useful. To do this, records or summary statistics need to be linked to their corresponding regions. This typically requires joining a spatial object with a dataframe, or joining two spatial objects.

As a simple example, let’s download a shapefile of Australian states and territories using the ozmaps package. The aus object contains the name of each state/territory (NAME) and its shape (geometry)2.

library(sf)
library(ozmaps)

aus <- ozmap_states |>
  st_transform(4326)

aus
1
This line transforms the CRS projection of our map to match the CRS projection of ALA data.
Simple feature collection with 9 features and 1 field
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 105.5507 ymin: -43.63203 xmax: 167.9969 ymax: -9.229287
Geodetic CRS:  WGS 84
# A tibble: 9 × 2
  NAME                                                                  geometry
* <chr>                                                       <MULTIPOLYGON [°]>
1 New South Wales              (((150.7016 -35.12286, 150.6611 -35.11782, 150.6…
2 Victoria                     (((146.6196 -38.70196, 146.6721 -38.70259, 146.6…
3 Queensland                   (((148.8473 -20.3457, 148.8722 -20.37575, 148.85…
4 South Australia              (((137.3481 -34.48242, 137.3749 -34.46885, 137.3…
5 Western Australia            (((126.3868 -14.01168, 126.3625 -13.98264, 126.3…
6 Tasmania                     (((147.8397 -40.29844, 147.8902 -40.30258, 147.8…
7 Northern Territory           (((136.3669 -13.84237, 136.3339 -13.83922, 136.3…
8 Australian Capital Territory (((149.2317 -35.222, 149.2346 -35.24047, 149.271…
9 Other Territories            (((167.9333 -29.05421, 167.9188 -29.0344, 167.93…

Our starlings data also contains the state/territory of each occurrence in column cl22. We can group by state/territory and summarise the number of occurrences to get an overall count by state/territory.

counts_by_state <- starlings |>
  group_by(cl22) |>
  count()

counts_by_state
# A tibble: 7 × 2
# Groups:   cl22 [7]
  cl22                             n
  <chr>                        <int>
1 Australian Capital Territory   482
2 New South Wales                779
3 Queensland                     669
4 South Australia                196
5 Tasmania                        91
6 Victoria                      1664
7 <NA>                            63

To prepare our data for mapping, we can join counts_by_state to aus using the state/territory name as our key.

aus_counts <- aus |>
  left_join(counts_by_state,
            join_by(NAME == cl22))

aus_counts
Simple feature collection with 9 features and 2 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 105.5507 ymin: -43.63203 xmax: 167.9969 ymax: -9.229287
Geodetic CRS:  WGS 84
# A tibble: 9 × 3
  NAME                                                            geometry     n
  <chr>                                                 <MULTIPOLYGON [°]> <int>
1 New South Wales              (((150.7016 -35.12286, 150.6611 -35.11782,…   779
2 Victoria                     (((146.6196 -38.70196, 146.6721 -38.70259,…  1664
3 Queensland                   (((148.8473 -20.3457, 148.8722 -20.37575, …   669
4 South Australia              (((137.3481 -34.48242, 137.3749 -34.46885,…   196
5 Western Australia            (((126.3868 -14.01168, 126.3625 -13.98264,…    NA
6 Tasmania                     (((147.8397 -40.29844, 147.8902 -40.30258,…    91
7 Northern Territory           (((136.3669 -13.84237, 136.3339 -13.83922,…    NA
8 Australian Capital Territory (((149.2317 -35.222, 149.2346 -35.24047, 1…   482
9 Other Territories            (((167.9333 -29.05421, 167.9188 -29.0344, …    NA

Now we can use these data to create a choropleth map3.

ggplot() +
  geom_sf(data = aus_counts,
          aes(fill = n)) + 
  guides(fill = guide_coloursteps(title = "Number of\nObservations")) +
  scale_fill_viridis_c(option = "G") +
  theme_void()

The sf package also has specialised functions for spatial joins like st_join(), which can be especially useful for joins using points (e.g., POINT) and shapes (e.g., POLYGON, MULTIPOLYGON). Below is a small example where we use the point location to join with the state/territory. Note that we lose the POINT location in favour of the state MULTIPOLYGON shape, held in the column geometry.

# convert record coordinates to sf POINT class
starlings_sf <- starlings |>
  tidyr::drop_na() |>
  st_as_sf(coords = c("decimalLongitude", "decimalLatitude"), 
           crs = 4326)

# join points to aus states that intersect spatially
states_with_species <- st_join(x = aus,
                               y = starlings_sf,
                               join = st_intersects,
                               left = FALSE)

states_with_species |>
  rmarkdown::paged_table() # paged output

  1. Using genus is also a better choice than species or scientificName in this instance because they appear to contain similar but not exactly the same information to starlings_taxonomy$species_name. Using these columns as a key for our join would result in missing information after joining due to mismatches.↩︎

  2. This shapefile contains state/territory outlines as vectors. See this section on vectors to learn more about what a vector is.↩︎

  3. For more advanced examples of making choropleth maps, check out the ALA Labs articles counting points in shapefiles and using multiple colour scales.↩︎