Appendix A — Where to get data

There are many types of biodiversity data to work with. Examples include observational data of where a species has been observed, climate or environmental data for a region or area, biological data to compare measures of organisms’ physical or behavioural traits, and genetic data to compare unique DNA or alleles between individuals.

Here, we’ll describe what open-source data are, list some places to look for open-source data and suggest some R packages that are useful for downloading different types of data.

A.0.1 Prerequisites

First, we’ll load the packages that we’ll need to display data and figures throughout the chapter.

library(sf)
library(ggplot2)
library(dplyr)
library(tidyterra)
library(terra)
library(here)

A.1 Open-source data

Open-source data are data made openly accessible for editing and use, licensed under an open license. The following are some places where you can find open-source data.

A.1.1 Biodiversity data

Data infrastructures like the Atlas of Living Australia ingest, aggregate and standardise millions of rows of data from thousands of data providers. Some data comes from large providers with standardised workflows, like state government monitoring programs, iNaturalist Australia or eBird. These data providers use workflows that attempt to remove suspicious records prior to sharing data with a Living Atlas, and, in general, these workflows catch many issues that otherwise might need fixing.

However, not all data providers have standardised workflows. Some data have been transcribed from written survey records and provided by a small or independent data provider. Other data might have been transcribed from archived written records in a museum, or even from a scientist’s backlog from a long-lost research project. These data are valuable but inevitably prone to errors that are difficult to fix: handwriting can be smudged or difficult to read, and records might lack important details about their location or time of observation. Even in data from standardised workflows, errors like taxonomic misidentification or flipped geospatial coordinates can slip through the cracks because expert knowledge is required to identify and amend individual records. These records can also range in their precision or level of detail, and might not be suitable for every type of analysis.

Ultimately, it’s a team effort to remove or fix data issues. Although a data infrastructure can use programmatic data quality checks to try to remove the more extreme outliers, many errors are context dependent and require verification from the original data provider. This means that the responsibility to fix records usually falls on the data provider, because only the data provider has the knowledge required to amend their original data. Inevitably, there will be errors in data from many different sources. Equipped with this knowledge, we still need to clean data from data infrastructures so that they are suitable for our research question or analysis.
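As a small illustration of the kind of programmatic check described above, the sketch below flags records with missing coordinates or an impossible latitude (a common sign that latitude and longitude were swapped). The records are hypothetical and the column names `missing_coords` and `invalid_latitude` are our own; this is one possible check under those assumptions, not a complete cleaning workflow.

```r
library(dplyr)

# Hypothetical records for illustration only; not real observations
records <- tibble(
  recordID = c("r1", "r2", "r3", "r4"),
  decimalLatitude = c(-34.4, 151.2, NA, -28.2),
  decimalLongitude = c(151.0, -33.9, 153.1, 153.0)
)

flagged <- records |>
  mutate(
    # Records that cannot be placed on a map at all
    missing_coords = is.na(decimalLatitude) | is.na(decimalLongitude),
    # Latitude must lie within [-90, 90]; a value outside this range
    # often means latitude and longitude were swapped
    invalid_latitude = !missing_coords & abs(decimalLatitude) > 90
  )

# Show only the records that need a closer look
flagged |>
  filter(missing_coords | invalid_latitude)
```

A check like this only catches the most obvious problems. A swapped coordinate pair whose values both fall within valid ranges would pass, which is why context and expert knowledge are still needed.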

A.1.2 Spatial data

Spatial data contain information that corresponds to an area on the globe and can be plotted onto a map. Spatial data can be represented as vector or raster data.

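There are two types of spatial data that you will probably use: vector data, which represent features as points, lines or polygons, and raster data, which represent a surface as a regular grid of cells that each hold a value.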
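To make the vector and raster distinction concrete, here is a minimal sketch that builds one small object of each kind with sf and terra, then looks up the raster value under each point. The site names, coordinates and grid values are made up for illustration.

```r
library(sf)
library(terra)
library(ggplot2)
library(tidyterra)

set.seed(1)

# Vector data: discrete geometries (here, points) with attributes
pts <- st_as_sf(
  data.frame(
    site = c("a", "b"),
    lon = c(149.1, 151.2),
    lat = c(-35.3, -33.9)
  ),
  coords = c("lon", "lat"),
  crs = 4326
)

# Raster data: a regular grid of cells, each holding a value
r <- rast(
  nrows = 10, ncols = 10,
  xmin = 140, xmax = 155, ymin = -40, ymax = -25,
  crs = "EPSG:4326"
)
values(r) <- runif(ncell(r)) # random placeholder values

# Look up the raster cell value underneath each point
vals <- terra::extract(r, vect(pts))
vals

# Plot the raster grid with the points on top
ggplot() +
  geom_spatraster(data = r) +
  geom_sf(data = pts)
```

In practice, vector data often describe boundaries or observation locations, while raster data often describe continuous surfaces such as temperature or elevation.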

Here are some examples of where to download spatial data.

A.1.3 Taxonomic data

Taxonomy is a complex and broad field of investigation. A comprehensive look into taxonomy is well outside the scope of this book. However, it’s a good idea to consider the taxonomic classification of the organism(s) you’re interested in and any potential naming differences between data sources.

Before deciding on a final taxonomy to download or use, it’s worth being aware of which naming authority your data use as their taxonomic backbone. In some taxonomic groups, names can vary widely depending on which taxonomic authority is used. Double check your data after you download them to make sure the classifications you expect are the ones you find. This check will help prevent errors later on (though you might still need to re-code data manually).
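One simple way to run this check is to tabulate the names in a download and compare them against the names you expect. The sketch below assumes a hypothetical download with a `scientificName` column and an illustrative vector `expected_names`; both are stand-ins for your own data and chosen taxonomy.

```r
library(dplyr)

# Names we expect, based on the taxonomy chosen for the analysis (illustrative)
expected_names <- c("Perameles nasuta", "Perameles gunnii")

# Hypothetical download; in practice this would be your downloaded data
occurrences <- tibble(
  scientificName = c(
    "Perameles nasuta", "Perameles nasuta", "Perameles gunnii",
    "Perameles bougainville", NA
  )
)

# Count records per name and mark whether each name was expected
name_check <- occurrences |>
  count(scientificName, name = "n_records") |>
  mutate(expected = scientificName %in% expected_names)

# Names (including missing names) that need a closer look
unexpected <- name_check |>
  filter(!expected)

unexpected
```

An unexpected name is not necessarily an error. It might reflect a synonym or a different naming authority, which is why the result is a list to review rather than a list to delete.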

We discuss these considerations in more detail in the Taxonomic Validation chapter.

Here are some examples of where to find Australian taxonomic information.

| Name | Description |
|---|---|
| The Australian Faunal Directory (AFD) | An online catalogue of nomenclature and taxonomy of animal species known to occur in Australia |
| The Australian Plant Name Index (APNI) | A tool for the botanical community containing accepted scientific names of plants |
| The Australian Plant Census | Contains the currently accepted scientific names for Australian vascular flora |

A.1.4 Trait data

Trait data contain measurements of organisms’ morphological or behavioural traits (e.g., stem length, leaf size, egg size, migratory distance, soil carbon). These data are useful for comparing spatial or temporal differences between individuals, groups or species.
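As a rough sketch of such a comparison, the example below summarises a single trait by species with dplyr. The species labels, the `leaf_length_mm` column and the values are placeholders, not measurements from any real database.

```r
library(dplyr)

# Hypothetical trait measurements; species names and values are placeholders
traits <- tibble(
  species = c("species_a", "species_a", "species_b", "species_b"),
  leaf_length_mm = c(42, 46, 61, 59)
)

# Compare the trait between species using group-wise summaries
trait_summary <- traits |>
  group_by(species) |>
  summarise(
    n = n(),
    mean_leaf_length_mm = mean(leaf_length_mm),
    .groups = "drop"
  )

trait_summary
```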

The following are some examples of where to find trait data.

| Name | Description |
|---|---|
| AusTraits | A plant trait database that synthesises data from field surveys, published literature, taxonomic monographs, and individual taxon descriptions. The database holds nearly 500 traits across more than 30,000 taxa. |

A.2 Packages for downloading data

There is a range of R packages available for accessing biodiversity data. These packages serve as convenient interfaces to various data providers by making their respective APIs usable directly within R. The functionality offered by these packages typically ranges from querying species occurrence records to more comprehensive taxonomic and spatial download queries.

Below, we highlight some commonly used packages. We encourage you to explore the documentation of each package to understand its capabilities, which will help you select one (or more!) that aligns with your specific needs.

A.2.1 Occurrence data

galah

galah is an interface for accessing biodiversity data like occurrences, counts, species and media (e.g., images & sounds) from the Living Atlases and GBIF.

In the majority of examples throughout this book we will use the galah package. One benefit of using galah is that it uses tidy syntax (much like dplyr) to build and filter download queries. Additionally, galah can access data from 10 other Living Atlases and GBIF.

library(galah)

galah_config(email = "your-email-here") # Registered ALA email

galah_call() |>
  identify("perameles") |>
  filter(year == 2001) |>
  atlas_occurrences()
# A tibble: 340 × 8
   recordID       scientificName taxonConceptID decimalLatitude decimalLongitude
   <chr>          <chr>          <chr>                    <dbl>            <dbl>
 1 0053a1f3-b5e1… Perameles nas… https://biodi…           -34.4             151.
 2 0135db9a-80ff… Perameles nas… https://biodi…           -33.3             151.
 3 013b9cb6-7d89… Perameles nas… https://biodi…           -28.2             153.
 4 01c5d084-f4a5… Perameles nas… https://biodi…           -29.2             153.
 5 02943584-050a… Perameles nas… https://biodi…           -33.3             151.
 6 02af5fdd-4800… Perameles nas… https://biodi…           -29.3             152.
 7 02c5d0db-913f… Perameles nas… https://biodi…           -31.2             153.
 8 04ad578b-af11… Perameles nas… https://biodi…           -26.7             152.
 9 0518bfb9-cf9d… Perameles nas… https://biodi…           -34.3             151.
10 05496ff9-d61e… Perameles nas… https://biodi…           -33.3             151.
# ℹ 330 more rows
# ℹ 3 more variables: eventDate <dttm>, occurrenceStatus <chr>,
#   dataResourceName <chr>
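Before downloading, it can be useful to check how many records a query will return. The sketch below reuses the query above but ends with `atlas_counts()`, grouped by species, so that only counts are returned rather than the records themselves. It assumes a recent version of galah that supports dplyr-style verbs, and it needs an internet connection, so no output is shown here.

```r
library(galah)
library(dplyr)

galah_config(email = "your-email-here") # Registered ALA email

# Count matching records per species without downloading them
galah_call() |>
  identify("perameles") |>
  filter(year == 2001) |>
  group_by(species) |>
  atlas_counts()
```

Checking counts first is a cheap way to catch a query that is much broader or narrower than intended.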

Other packages

A.2.2 Spatial data

A.2.3 Trait data

A.3 Summary

In this chapter, we hope you have found some ideas of where to access biodiversity data. The following chapter will help explain how to work with large datasets in R.