2  Summarise

In the previous chapter, we learned how to get an overview of our data’s structure, including the number of rows, the columns present, and any missing data. In this chapter, we will focus on summarising ecological data across three key domains: taxonomic, spatial, and temporal. Summarising data can provide insight into the scope of, and variation within, our dataset, and help us evaluate its suitability for our analysis.

Where possible, we will use the galah package to summarise data. galah can summarise data on the server side before they are downloaded, enabling you to filter or summarise data without first needing them on your local computer. Where both options are available, we will demonstrate how to use galah (prior to download) and other suitable cleaning packages (after download).

2.0.1 Prerequisites

In this chapter, we will use occurrence records for Alcedinidae (Kingfishers) in 2022 from the ALA.

# packages
library(galah)
library(dplyr)
library(ggplot2)
library(tidyr)
library(janitor)
galah_config(email = "your-email-here") # ALA-registered email

birds <- galah_call() |>
  filter(doi == "https://doi.org/10.26197/ala.0e60416d-d9e5-4bf2-a0dc-f0c26ae9105a") |>
  atlas_occurrences()

Note: You don’t need to run this code block to read this chapter. It can, however, be useful to see the original download query. This code will download the latest data from the ALA, which you are welcome to use instead, though the data might not exactly reproduce results in this chapter.

library(galah)
galah_config(email = "your-email-here")

birds <- galah_call() |>
  identify("alcedinidae") |>
  filter(year == 2022) |>
  select(group = "basic", 
         family, genus, species, cl22, eventDate, month) |>
  atlas_occurrences()
Note: We created a custom DOI for our download by using atlas_occurrences(mint_doi = TRUE).

2.1 Taxonomic

2.1.1 Counts

Prior to downloading data, it can be useful to see a taxonomic breakdown of the occurrence records that exist for our query. For example, with the Alcedinidae dataset, we can count the total number of occurrence records…

galah_call() |>
  identify("alcedinidae") |>
  filter(year == 2022) |>
  atlas_counts()
# A tibble: 1 × 1
   count
   <int>
1 143120

…or group by a taxonomic rank like genus…

galah_call() |>
  identify("alcedinidae") |>
  filter(year == 2022) |>
  group_by(genus) |>
  atlas_counts()
# A tibble: 5 × 2
  genus       count
  <chr>       <int>
1 Dacelo      94027
2 Todiramphus 41640
3 Ceyx         6063
4 Tanysiptera   911
5 Syma          358

…or species.

galah_call() |>
  identify("alcedinidae") |>
  filter(year == 2022) |>
  group_by(species) |>
  atlas_counts()
# A tibble: 10 × 2
   species                  count
   <chr>                    <int>
 1 Dacelo novaeguineae      86558
 2 Todiramphus sanctus      26399
 3 Todiramphus macleayii    10054
 4 Dacelo leachii            7464
 5 Ceyx azureus              5388
 6 Todiramphus pyrrhopygius  2386
 7 Tanysiptera sylvia         911
 8 Ceyx pusillus              675
 9 Syma torotoro              358
10 Todiramphus chloris         31

Our results show that the large majority of records are of Dacelo novaeguineae (aka the Laughing Kookaburra).

You can get the same summaries after downloading the data locally using dplyr or janitor.
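
For example, here is a minimal sketch using the birds tibble downloaded earlier, with count() from dplyr and tabyl() from janitor (both packages are loaded above):

# count records per species with dplyr
birds |>
  count(species, sort = TRUE)

# tabulate records per genus with janitor,
# adding a formatted percentage column
birds |>
  tabyl(genus) |>
  adorn_pct_formatting()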

2.2 Spatial

2.2.1 Counts by region

It can be useful to summarise occurrence numbers by a specific region. With galah, you can summarise by region prior to downloading occurrence records.

For example, you might wish to summarise your data by state/territory. Searching the available fields in galah shows that field ID cl22, “Australian States and Territories”, suits our needs best.

search_all(fields, "states")
# A tibble: 6 × 3
  id       description                            type  
  <chr>    <chr>                                  <chr> 
1 cl2013   ASGS Australian States and Territories fields
2 cl22     Australian States and Territories      fields
3 cl927    States including coastal waters        fields
4 cl10925  PSMA States (2016)                     fields
5 cl11174  States and Territories 2021            fields
6 cl110925 PSMA States - Abbreviated (2016)       fields

Now we can use the field ID cl22 to group our counts.

galah_call() |>
  identify("alcedinidae") |>
  filter(year == 2022) |>
  group_by(cl22) |>
  atlas_counts()
# A tibble: 8 × 2
  cl22                         count
  <chr>                        <int>
1 Queensland                   51131
2 New South Wales              39908
3 Victoria                     23959
4 Northern Territory            9889
5 Western Australia             7882
6 Australian Capital Territory  3529
7 South Australia               2548
8 Tasmania                      2509

We can also group our counts by state/territory and a taxonomic rank like genus.

galah_call() |>
  identify("alcedinidae") |>
  filter(year == 2022) |>
  group_by(cl22, genus) |>
  atlas_counts()
# A tibble: 26 × 3
   cl22            genus       count
   <chr>           <chr>       <int>
 1 Queensland      Dacelo      27440
 2 Queensland      Todiramphus 19710
 3 Queensland      Ceyx         2643
 4 Queensland      Tanysiptera   910
 5 Queensland      Syma          356
 6 New South Wales Dacelo      30601
 7 New South Wales Todiramphus  7781
 8 New South Wales Ceyx         1515
 9 Victoria        Dacelo      19668
10 Victoria        Todiramphus  3766
# ℹ 16 more rows

Our results show that we have the most records in Queensland and New South Wales.

You can get the same summaries after downloading the data locally with dplyr and janitor.
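
For example, here is a quick sketch using the cl22 and genus columns included in our download:

# count records by state/territory with dplyr
birds |>
  count(cl22, sort = TRUE)

# cross-tabulate state/territory by genus with janitor
birds |>
  tabyl(cl22, genus)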

2.2.2 Maps

We can use maps to visualise summaries of our data. To illustrate, we will use the sf package to handle spatial data, and the ozmaps package to get maps of Australia (as vector data).

library(sf)
library(ozmaps)

There are a few occurrence records in our birds dataset that are outside of Australia. For simplicity, we will filter our data to records within Australia’s land mass.

# filter records to within Australia
birds_filtered <- birds |>
  filter(decimalLongitude > 110,
         decimalLongitude < 155, 
         decimalLatitude > -45,
         decimalLatitude < -10)

Our first step is to get a map of Australia from the ozmaps package. We will transform its Coordinate Reference System (CRS)¹ to EPSG:4326 to match the CRS of ALA data².

# get map of Australia and transform its projection
aus <- ozmaps::ozmap_states |>
  st_transform(crs = st_crs(4326))

Then we can plot our occurrence points onto our map.
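
One minimal way to do this with ggplot2 is to layer the filtered points over the state polygons; the point colour, size, and transparency below are arbitrary choices:

# plot occurrence points over our map of Australia
ggplot() +
  geom_sf(data = aus,
          fill = "white") +       # state polygons
  geom_point(data = birds_filtered,
             aes(x = decimalLongitude,
                 y = decimalLatitude),
             colour = "#428afe",  # arbitrary point colour
             alpha = 0.25,        # reduce overplotting
             size = 0.8) +
  theme_void()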

2.3 Temporal

2.3.1 Counts by time scales

Understanding the distribution of when observations are recorded can reveal seasonal trends among species. Checking this distribution can also help you determine whether you have enough data to infer patterns over different time spans—such as a week, month, year, decade, or even century—or whether your inferences about temporal trends are limited by the available data.

Year

For example, an easy first summary is the number of records in each year, which you can count in galah prior to downloading data. Searching the available fields in galah shows that field ID year suits our needs best.

search_all(fields, "year")
# A tibble: 8 × 3
  id                  description            type  
  <chr>               <chr>                  <chr> 
1 year                Year                   fields
2 raw_year            Year (unprocessed)     fields
3 endDayOfYear        End Day Of Year        fields
4 startDayOfYear      Start Day Of Year      fields
5 occurrenceYear      Date (by year)         fields
6 raw_endDayOfYear    <NA>                   fields
7 raw_startDayOfYear  <NA>                   fields
8 namePublishedInYear Name Published In Year fields

Now we can use the field ID year to group our counts, returning counts for each year after 2016.

galah_call() |>
  identify("alcedinidae") |>
  filter(year > 2016) |>
  group_by(year) |>
  atlas_counts()
# A tibble: 9 × 2
  year   count
  <chr>  <int>
1 2023  176450
2 2022  143120
3 2021  129652
4 2020  109550
5 2018   96527
6 2019   94885
7 2017   79449
8 2024   70051
9 2025    7962

Alternatively, you can use the lubridate package to summarise records after downloading them.

We’ll convert our column eventDate to a date class in R. Then we can extract relevant date data…

# Using our pre-downloaded dataset
library(lubridate)

birds_date <- birds |>
  mutate(eventDate = date(eventDate), # convert to date
         year = year(eventDate),      # extract year
         month = month(eventDate,     # extract month
                       label = TRUE))

birds_date |>
  select(scientificName, eventDate, year, month)
# A tibble: 143,120 × 4
   scientificName                    eventDate   year month
   <chr>                             <date>     <dbl> <ord>
 1 Dacelo (Dacelo) novaeguineae      2022-04-19  2022 Apr  
 2 Dacelo (Dacelo) novaeguineae      2022-12-25  2022 Dec  
 3 Dacelo (Dacelo) novaeguineae      2022-10-27  2022 Oct  
 4 Dacelo (Dacelo) novaeguineae      2022-01-23  2022 Jan  
 5 Dacelo (Dacelo) novaeguineae      2022-11-09  2022 Nov  
 6 Todiramphus (Todiramphus) sanctus 2022-02-05  2022 Feb  
 7 Todiramphus (Todiramphus) sanctus 2022-11-24  2022 Nov  
 8 Dacelo (Dacelo) novaeguineae      2022-10-01  2022 Oct  
 9 Dacelo (Dacelo) novaeguineae      2022-03-21  2022 Mar  
10 Dacelo (Dacelo) novaeguineae      2022-08-14  2022 Aug  
# ℹ 143,110 more rows

…and summarise using dplyr or janitor.
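
For example, here is a minimal sketch counting records per month, with count() from dplyr or tabyl() from janitor:

# count records per month with dplyr
birds_date |>
  count(month)

# or tabulate months with janitor
birds_date |>
  tabyl(month)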

Line plots

Another way to summarise temporal data is using line plots to visualise trends at different time scales over one or more years.

There are a few records that seem to be from 2021 despite downloading data for 2022³. For simplicity, we’ll filter them out.

# filter dataset to 2022 only
birds_day <- birds_date |>
  filter(year(eventDate) == 2022) |>
  mutate(day = yday(eventDate))

Now we can group our records by each day of the year, and summarise the record count for each day…

birds_day <- birds_day |>
  group_by(day) |>
  summarise(count = n()) 
birds_day
# A tibble: 365 × 2
     day count
   <dbl> <int>
 1     1   893
 2     2   750
 3     3   698
 4     4   594
 5     5   470
 6     6   445
 7     7   400
 8     8   570
 9     9   685
10    10   434
# ℹ 355 more rows

…which we can visualise as a line plot. There are huge fluctuations in our daily count data (from near zero to nearly 1000 observations), so to make the plot easier to read, we can use a log10 scale.

ggplot(birds_day, aes(x = day, y = count)) +
  geom_line() +  # Add lines
  geom_point() + # Add points
  labs(x = "Day", y = "Count (log10)") +
  scale_x_continuous(breaks = seq(1, 365, by = 30)) +
  scale_y_log10() +  # Set logarithmic scale for y-axis
  theme_minimal()  # Set a minimal theme

Number of observations per day (2022)

The same method as above can be used to group record counts by week⁴.

birds_week <- birds_date |>
  filter(year(eventDate) == 2022) |>
  mutate(
    week = week(eventDate)) |>
  group_by(week) |>
  summarise(count = n()) 

ggplot(birds_week, aes(x = week, y = count)) +
  geom_line() +  # Add lines
  geom_point() + # Add points
  labs(x = "Week", y = "Count") +
  scale_x_continuous(breaks = seq(1, 52, by = 4)) + 
  theme_minimal()  # Set a minimal theme

Number of observations per week (2022)

Our temporal plots show that occurrence numbers generally drop in the earlier months of the year, then rise again in the later months.

2.4 Summary

In this chapter we have provided a few ways to summarise your data taxonomically, spatially, and temporally. We hope that these code chunks will help you in summarising your own data. Summarising and visualising data are some of the most useful ways to spot errors for data cleaning. As such, we suggest using these tools often throughout the course of your analysis.

In the next part of this book, we will tackle these issues to clean your dataset.


  1. The Coordinate Reference System (CRS) determines how to display our shape of Australia, which exists on a spherical globe (the Earth), onto a flat surface (our map).

  2. Data from the ALA use EPSG:4326 (also known as “WGS84”) as the Coordinate Reference System. Transforming our map to the same projection as our data ensures the points are plotted in their actual locations on the map.

  3. This is due to timezone conversion when the ALA standardises its data. There are several timezones across Australia, so although these points might have been recorded in 2022, once converted they fell outside of 2022!

  4. Notice, though, that we’ve omitted the log scale because grouping by week produces less variation in counts than grouping by day (above).