1  Inspect

Before diving into cleaning, it’s a good idea to familiarise yourself with the data. In this chapter we will cover ways to initially inspect the data and metadata of a dataset.

1.0.1 Prerequisites

In this chapter we will use Kingfisher (Alcedinidae) occurrence records in 2023 from the ALA.

# packages
library(galah)
library(ggplot2)
galah_config(email = "your-email-here") # ALA-registered email

birds <- galah_call() |>
  filter(doi == "https://doi.org/10.26197/ala.702e73e4-ba9d-471a-a0a4-872ecb5d0f32") |>
  atlas_occurrences()

Note: You don’t need to run this code block to read this chapter. It can, however, be useful to see the original download query. This code will download the latest data from the ALA, which you are welcome to use instead, though the data might not exactly reproduce results in this chapter.

library(galah)
galah_config(email = "your-email-here")

birds <- galah_call() |>
  identify("alcedinidae") |>
  filter(year == 2023) |>
  select(group = "basic", genus, species) |>
  atlas_occurrences()
Note: We created a custom DOI for our download by using atlas_occurrences(mint_doi = TRUE).

1.1 Getting to know your data

Metadata is a description of data, where information about different aspects of the data is documented. Some examples include definitions of the variables being measured, how and why the data were collected, and any data standards used. Reviewing the metadata associated with a dataset can be very helpful for understanding the types of data you are working with and any considerations that may need to be accounted for in your analyses.

Many datasets include descriptions of the data’s taxonomic, spatial, and temporal parameters. An example of well-formatted metadata is that of FrogID, from the Australian Museum.

Metadata of the FrogID dataset on the ALA

From reading FrogID’s metadata (Rowley and Callaghan 2020), we understand that:

  1. The dataset comprises acoustic data [1]
  2. This is citizen science data [2]
  3. Audio is recorded via a smartphone app [3]
  4. These data record presences, but not absences
  5. The data are under a Creative Commons licence, which is relevant for reuse and republishing

Many data infrastructures like the Atlas of Living Australia also follow and encourage a data standard to help consolidate data from many different data providers [4].

The data standard used by the Atlas of Living Australia is called Darwin Core. Darwin Core works by defining a) a set of standard terms [5] to use across datasets as column names, and b) the values eligible to be recorded under those terms. Darwin Core standards also require that additional files detailing metadata and data structure are supplied along with the dataset. This helps ensure the data are ingested correctly into the data infrastructure.

Knowing whether your dataset follows a standard can allow you to look up term definitions to help you familiarise yourself with the data.
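As a sketch of how this can work in practice, the galah package can search the fields available in the ALA, many of which are Darwin Core terms. The exact output will depend on your galah version and the ALA's current field list:

```r
library(galah)

# Search ALA fields whose names or descriptions match a Darwin Core term
search_fields("basisOfRecord")

# Browse the full list of available fields
show_all(fields)
```

From there, you can look up the matched term on the Darwin Core website to read its full definition.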

1.2 A first glimpse

When starting with a new dataset, we want to get an initial idea:

  • How many rows and columns are there?
  • What are the column names?
  • What types of data are in each column?
  • What are their possible values or ranges?

These answers are useful to know before jumping into wrangling and cleaning data.

There are several ways to return an overview of your data, ranging from a quick glance to a comprehensive summary of your data’s structure.
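As a minimal sketch (assuming the `birds` data frame downloaded earlier), some common options in R include:

```r
library(dplyr)

# How many rows and columns are there?
nrow(birds)
ncol(birds)

# What are the column names?
names(birds)

# A compact overview: each column's name, type, and first few values
glimpse(birds)

# A fuller summary of each column's values and ranges
summary(birds)
```

`glimpse()` is usually the quickest way to see every column at once; `summary()` adds ranges and counts for a closer look.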

At this early stage, it’s helpful to assess whether your dataset meets your expectations. Do the data appear as anticipated? Are the values in each column reasonable? Are there any noticeable gaps or errors that might need to be corrected, or that could potentially render the data unusable?
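For example, we might check that values fall within the ranges our query requested. The column names below follow Darwin Core, but whether they appear in your download depends on the columns you selected, so treat this as an assumption-laden sketch:

```r
library(dplyr)

# Do the records fall within 2023, as our query requested?
range(birds$eventDate, na.rm = TRUE)

# Which species are present, and do any look out of place?
birds |>
  count(species, sort = TRUE)

# How many records are missing coordinates?
sum(is.na(birds$decimalLatitude))
```

Unexpected dates, species, or large amounts of missingness at this stage are worth noting before any cleaning begins.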

1.3 Next steps

We have just learned some ways to initially inspect our dataset. Keep in mind, we don’t expect everything to be perfect. Some issues are expected and may indicate problems with our query or the data itself. This initial inspection is a good opportunity to identify where these issues might be and assess their severity.

When you are confident that the dataset is largely as expected, you are ready to start summarising your data.


  [1] Meaning the majority of individuals recorded are male.

  [2] Suggesting these data could be biased towards populated areas.

  [3] As a result, the authors recommend filtering data to a geographic uncertainty of <3000 m if you require high coordinate precision.

  [4] Making datasets easier to consolidate is also referred to as interoperability, one of the principles of FAIR data.

  [5] We suggest using Ctrl/Cmd + F to search for your variable name on the webpage. Don’t hesitate to Google variable names if you are unsure what they represent.