Appendix D — Joins

If you work with biodiversity data, it is likely that you will need to join two separate datasets at some point to analyse how spatial, temporal, or environmental factors influence species. This chapter provides a brief overview of several common types of joins in dplyr to help you get started.

For a comprehensive introduction to joins, check out the Joins chapter in R for Data Science.

D.0.1 Prerequisites

In this chapter, we will use starling occurrence data from September 2015 in the ALA.

# packages
library(galah)
library(dplyr)
library(here)
library(ggplot2)
galah_config(email = "your-email-here") # ALA-registered email

starlings <- galah_call() |>
  filter(doi == "https://doi.org /10.26197/ala.98d038d3-2058-4294-b683-fcb51a11f018") |>
  atlas_occurrences()

starlings_taxonomy <- galah_call() |>
  identify("Sturnidae") |>
  atlas_species()

Note: You don’t need to run this code block to read this chapter. It can, however, be useful to see the original download query. This code will download the latest data from the ALA, which you are welcome to use instead, though the data might not exactly reproduce results in this chapter.

library(galah)
galah_config(email = "your-email-here")

starlings <- galah_call() |>
  identify("Sturnidae") |>
  filter(year == 2015,
         month == 9) |>
  select(genus, 
         species, 
         scientificName, 
         cl22,
         year,
         month, 
         decimalLatitude,
         decimalLongitude) |> 
  atlas_occurrences()
1
We created a custom DOI for our download by using atlas_occurrences(mint_doi = TRUE).

D.1 Keys

Joining dataframes relies on setting a key—one or more columns that exist in a primary table that correspond to one or more columns in a secondary table. Two datasets that we intend to join are matched according to the designated key.

As a simple example, let’s say we want to add complete taxonomic information to our starlings dataframe, which contains occurrence records with some, but not all, levels of taxonomic information. starlings_taxonomy contains complete taxonomic information for Sturnidae.

Let’s join our starlings dataframe with starlings_taxonomy. The column genus in starlings appears to contain the same information in column genus in starlings_taxonomy.

We can use this genus column as a key to add the extra levels of taxonomic information to the table containing starling occurrence records1.

starlings |>
  left_join(starlings_taxonomy, 
            join_by(genus)) |>
  
  rmarkdown::paged_table() # paged output
Warning in left_join(starlings, starlings_taxonomy, join_by(genus)): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 1805 of `x` matches multiple rows in `y`.
ℹ Row 2 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.