2  Summarise

In the previous chapter, we learned how to get an overview of our data’s structure, including the number of rows, the columns present, and any missing data. In this chapter, we will focus on summarising ecological data across three key domains: taxonomic, spatial, and temporal. Summarising data can provide insight into the scope and variation in our dataset, and help in evaluating its suitability for our analysis.

Where possible, we will use the galah package to summarise data. galah can summarise data server-side before they are downloaded, enabling you to filter or summarise records without first needing them on your local computer. Where both options are available, we will demonstrate how to summarise with galah (prior to download) and with other suitable packages (after download).

2.0.1 Prerequisites

In this chapter, we will use occurrence records for Alcedinidae (Kingfishers) in 2022 from the ALA.

# packages
library(galah)
library(dplyr)
library(ggplot2)
library(tidyr)
library(janitor)
galah_config(email = "your-email-here") # ALA-registered email

birds <- galah_call() |>
  filter(doi == "https://doi.org/10.26197/ala.75b1f2a4-eed2-4eaa-8381-b32de8994c85") |>
  atlas_occurrences()

Note: You don’t need to run this code block to read this chapter. It can, however, be useful to see the original download query. This code will download the latest data from the ALA, which you are welcome to use instead, though the data might not exactly reproduce results in this chapter.

library(galah)
galah_config(email = "your-email-here")

birds <- galah_call() |>
  identify("alcedinidae") |>
  filter(year == 2022) |>
  select(group = "basic", 
         family, genus, species, cl22, eventDate, month) |>
  atlas_occurrences()
We created a custom DOI for our download using atlas_occurrences(mint_doi = TRUE).

2.1 Taxonomic

2.1.1 Counts

Prior to downloading data, it can be useful to see a taxonomic breakdown of the occurrence records that exist for our query. For example, with the Alcedinidae dataset, we can count the total number of occurrence records…

galah_call() |>
  identify("alcedinidae") |>
  filter(year == 2022) |>
  atlas_counts()
# A tibble: 1 × 1
   count
   <int>
1 142020

…or group by a taxonomic rank like genus…

galah_call() |>
  identify("alcedinidae") |>
  filter(year == 2022) |>
  group_by(genus) |>
  atlas_counts()
# A tibble: 5 × 2
  genus       count
  <chr>       <int>
1 Dacelo      93208
2 Todiramphus 41425
3 Ceyx         6014
4 Tanysiptera   903
5 Syma          349

…or species.

galah_call() |>
  identify("alcedinidae") |>
  filter(year == 2022) |>
  group_by(species) |>
  atlas_counts()
# A tibble: 10 × 2
   species                  count
   <chr>                    <int>
 1 Dacelo novaeguineae      85751
 2 Todiramphus sanctus      26226
 3 Todiramphus macleayii    10039
 4 Dacelo leachii            7451
 5 Ceyx azureus              5343
 6 Todiramphus pyrrhopygius  2365
 7 Tanysiptera sylvia         903
 8 Ceyx pusillus              671
 9 Syma torotoro              349
10 Todiramphus chloris         29

Our results show that the vast majority of records are of Dacelo novaeguineae (the Laughing Kookaburra).

You can get the same summaries after downloading the data locally using dplyr or janitor.
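For example, here is a minimal sketch of the same summaries with dplyr and janitor on the downloaded birds data (the species and genus columns exist because our query requested them; the exact counts depend on the data you downloaded):

```r
library(dplyr)
library(janitor)

# count records per species, most numerous first
birds |>
  count(species, sort = TRUE)

# janitor's tabyl() returns counts alongside proportions
birds |>
  tabyl(genus) |>
  arrange(desc(n))
```

tabyl() is convenient when you want percentages as well as raw counts in a single step.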

2.2 Spatial

2.2.1 Counts by region

It can be useful to summarise occurrence numbers by region. With galah, you can do this summarising prior to downloading any occurrence records.

For example, you might wish to summarise your data by state/territory. We can search for the correct field to use in galah, determining that field ID cl22 contains “Australian States and Territories” and seems to suit our needs best.

search_all(fields, "states")
# A tibble: 6 × 3
  id       description                            type  
  <chr>    <chr>                                  <chr> 
1 cl2013   ASGS Australian States and Territories fields
2 cl22     Australian States and Territories      fields
3 cl927    States including coastal waters        fields
4 cl10925  PSMA States (2016)                     fields
5 cl11174  States and Territories 2021            fields
6 cl110925 PSMA States - Abbreviated (2016)       fields

Now we can use the field ID cl22 to group our counts.

galah_call() |>
  identify("alcedinidae") |>
  filter(year == 2022) |>
  group_by(cl22) |>
  atlas_counts()
# A tibble: 8 × 2
  cl22                         count
  <chr>                        <int>
1 Queensland                   51030
2 New South Wales              39038
3 Victoria                     23941
4 Northern Territory            9872
5 Western Australia             7860
6 Australian Capital Territory  3518
7 South Australia               2504
8 Tasmania                      2502

We can also group our counts by state/territory and a taxonomic rank like genus.

galah_call() |>
  identify("alcedinidae") |>
  filter(year == 2022) |>
  group_by(cl22, genus) |>
  atlas_counts()
# A tibble: 26 × 3
   cl22            genus       count
   <chr>           <chr>       <int>
 1 Queensland      Dacelo      27383
 2 Queensland      Todiramphus 19693
 3 Queensland      Ceyx         2633
 4 Queensland      Tanysiptera   902
 5 Queensland      Syma          347
 6 New South Wales Dacelo      29910
 7 New South Wales Todiramphus  7634
 8 New South Wales Ceyx         1483
 9 Victoria        Dacelo      19655
10 Victoria        Todiramphus  3762
# ℹ 16 more rows

Our results show that we have the most records in Queensland and New South Wales.

You can get the same summaries after downloading the data locally with dplyr and janitor.
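As a sketch, the same regional summaries can be computed with dplyr on the downloaded birds data (the cl22 column is present because our query requested it):

```r
library(dplyr)

# records per state/territory, most numerous first
birds |>
  count(cl22, sort = TRUE)

# records per state/territory and genus
birds |>
  count(cl22, genus, sort = TRUE)
```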

2.2.2 Maps

We can use maps to visualise summaries of our data. To illustrate, we will use the sf package to handle spatial data, and the ozmaps package to get maps of Australia (as vector data).

library(sf)
library(ozmaps)

There are a few occurrence records in our birds dataset that are outside of Australia. For simplicity, we will filter our data to records within Australia’s land mass.

# filter records to within Australia
birds_filtered <- birds |>
  filter(decimalLongitude > 110,
         decimalLongitude < 155, 
         decimalLatitude > -45,
         decimalLatitude < -10)

Our first step is to get a map of Australia from the ozmaps package. We will transform its Coordinate Reference System (CRS)¹ projection to EPSG:4326 to match the CRS projection of ALA data².

# Get map of australia, and transform projection
aus <- ozmaps::ozmap_states |>
  st_transform(crs = st_crs(4326))

Then we can plot our occurrence points onto our map.
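A minimal sketch of that plot with ggplot2, layering our filtered occurrence points over the aus state polygons (the specific colours, transparency, and point size here are our own choices, not prescribed):

```r
ggplot() +
  geom_sf(data = aus,
          colour = "grey60",
          fill = "white") +          # state boundaries
  geom_point(data = birds_filtered,
             aes(x = decimalLongitude,
                 y = decimalLatitude),
             colour = "#428afe",
             alpha = 0.25,           # transparency reveals dense clusters
             size = 0.8) +
  theme_void()                       # drop axes for a clean map
```

Setting alpha below 1 makes areas with many overlapping records appear darker, which is a simple way to visualise record density.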

2.3 Temporal

2.3.1 Counts by time scales

Understanding the distribution of when observations are recorded can reveal seasonal trends among species. Checking this distribution can also help you determine whether you have enough data to infer patterns over different time spans—such as a week, month, year, decade, or even century—or whether your inferences about temporal trends are limited by the available data.

Year

For example, a useful first summary is the number of records in each year. You can compute this in galah prior to downloading data. We can search for the correct field to use in galah, determining that field ID year seems to suit our needs best.

search_all(fields, "year")
# A tibble: 8 × 3
  id                  description            type  
  <chr>               <chr>                  <chr> 
1 year                Year                   fields
2 raw_year            Year (unprocessed)     fields
3 endDayOfYear        End Day Of Year        fields
4 startDayOfYear      Start Day Of Year      fields
5 occurrenceYear      Date (by year)         fields
6 raw_endDayOfYear    <NA>                   fields
7 raw_startDayOfYear  <NA>                   fields
8 namePublishedInYear Name Published In Year fields

Now we can use the field ID year to group our counts, returning counts for years after 2016.

galah_call() |>
  identify("alcedinidae") |>
  filter(year > 2016) |>
  group_by(year) |>
  atlas_counts()
# A tibble: 8 × 2
  year   count
  <chr>  <int>
1 2023  172941
2 2022  142020
3 2021  129155
4 2020  109404
5 2018   96285
6 2019   94647
7 2017   79178
8 2024   62306

Alternatively, you can summarise dates with the lubridate package after downloading records.

We’ll convert our column eventDate to a date class in R. Then we can extract relevant date data…

# Using our pre-downloaded dataset
library(lubridate)

birds_date <- birds |>
  mutate(eventDate = date(eventDate), # convert to date
         year = year(eventDate),      # extract year
         month = month(eventDate,     # extract month
                       label = TRUE))

birds_date |>
  select(scientificName, eventDate, year, month)
# A tibble: 140,835 × 4
   scientificName                    eventDate   year month
   <chr>                             <date>     <dbl> <ord>
 1 Dacelo (Dacelo) novaeguineae      2022-04-19  2022 Apr  
 2 Dacelo (Dacelo) novaeguineae      2022-12-25  2022 Dec  
 3 Dacelo (Dacelo) novaeguineae      2022-10-27  2022 Oct  
 4 Dacelo (Dacelo) novaeguineae      2022-01-23  2022 Jan  
 5 Dacelo (Dacelo) novaeguineae      2022-11-09  2022 Nov  
 6 Todiramphus (Todiramphus) sanctus 2022-02-05  2022 Feb  
 7 Todiramphus (Todiramphus) sanctus 2022-11-24  2022 Nov  
 8 Dacelo (Dacelo) novaeguineae      2022-10-01  2022 Oct  
 9 Dacelo (Dacelo) novaeguineae      2022-03-21  2022 Mar  
10 Dacelo (Dacelo) novaeguineae      2022-08-14  2022 Aug  
# ℹ 140,825 more rows

…and summarise using dplyr or janitor.
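For instance, a quick sketch of summarising the extracted date columns with dplyr and janitor (using the month column we created above; counts will vary with your download):

```r
library(dplyr)
library(janitor)

# records per month
birds_date |>
  count(month)

# or with janitor, including proportions
birds_date |>
  tabyl(month)
```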

Line plots

Another way to summarise temporal data is using line plots to visualise trends at different time scales over one or more years.

There are a few records that seem to be from 2021 despite downloading data for 2022³. For simplicity, we’ll filter them out.

# filter dataset to 2022 only
birds_day <- birds_date |>
  filter(year(eventDate) == 2022) |>
  mutate(day = yday(eventDate))

Now we can group our records by each day of the year, and summarise the record count for each day…

birds_day <- birds_day |>
  group_by(day) |>
  summarise(count = n()) 
birds_day
# A tibble: 365 × 2
     day count
   <dbl> <int>
 1     1   892
 2     2   746
 3     3   692
 4     4   595
 5     5   461
 6     6   441
 7     7   398
 8     8   572
 9     9   683
10    10   434
# ℹ 355 more rows

…which we can visualise as a line plot. There are huge fluctuations in our daily count data (from near zero to nearly 1000 observations), so to make the plot easier to read, we can use a log10 scale.

ggplot(birds_day, aes(x = day, y = count)) +
  geom_line() +  # Add lines
  geom_point() + # Add points
  labs(x = "Day", y = "Count (log10)") +
  scale_x_continuous(breaks = seq(1, 365, by = 30)) +
  scale_y_log10() +  # Set logarithmic scale for y-axis
  theme_minimal()  # Set a minimal theme

Number of observations per day (2022)

The same method above can be used to group record counts by week⁴.

Code
birds_week <- birds_date |>
  filter(year(eventDate) == 2022) |>
  mutate(
    week = week(eventDate)) |>
  group_by(week) |>
  summarise(count = n()) 

ggplot(birds_week, aes(x = week, y = count)) +
  geom_line() +  # Add lines
  geom_point() + # Add points
  labs(x = "Week", y = "Count") +
  scale_x_continuous(breaks = seq(1, 52, by = 4)) + 
  theme_minimal()  # Set a minimal theme

Number of observations per week (2022)

Our temporal plots show that record counts generally drop in the earlier months of the year, then rise again in the later months.

2.4 Summary

In this chapter we have provided a few ways to summarise your data taxonomically, spatially, and temporally. We hope that these code chunks will help you in summarising your own data. Summarising and visualising data are among the most useful ways to spot errors for data cleaning. As such, we suggest using these tools often through the course of your analysis.

In the next part of this book, we will tackle these issues to clean your dataset.


  1. The Coordinate Reference System (CRS) determines how to display our shape of Australia, which exists on a spherical globe (the Earth), onto a flat surface (our map).↩︎

  2. Data from the ALA use EPSG:4326 (also known as “WGS84”) as the Coordinate Reference System. Transforming our map to the same projection of our data ensures the points are plotted in their actual locations on the map.↩︎

  3. This is due to timezone conversion when the ALA standardises its data. There are several timezones across Australia, so although these points might have been in 2022, once converted they fell outside of 2022!↩︎

  4. Notice, though, that we’ve omitted the log scale because grouping by week shows less variation in counts than grouping by day (above).↩︎