1 Inspect

Before diving into cleaning, it’s a good idea to familiarise yourself with the data. In this chapter we will cover ways to initially inspect the data and metadata of a dataset.

1.0.1 Prerequisites

In this chapter we will use Kingfisher (Alcedinidae) occurrence records in 2023 from the ALA.

# packages
library(galah)
library(ggplot2)
galah_config(email = "your-email-here") # ALA-registered email

birds <- galah_call() |>
  filter(doi == "https://doi.org /10.26197/ala.702e73e4-ba9d-471a-a0a4-872ecb5d0f32") |>
  atlas_occurrences()

Todiramphus (Todiramphus) sanctus perched on a branch. Photo by Kerri-Lee Harris CC-BY-NC 4.0 (Int)

Original download queries

Note: You don’t need to run this code block to read this chapter. It can, however, be useful to see the original download query. This code will download the latest data from the ALA, which you are welcome to use instead, though the data might not exactly reproduce results in this chapter.

library(galah)
galah_config(email = "your-email-here")

birds <- galah_call() |>
  identify("alcedinidae") |>
  filter(year == 2023) |>
  select(group = "basic", genus, species) |>
  atlas_occurrences()

1: We created a custom DOI for our download by using atlas_occurrences(mint_doi = TRUE).

1.1 Getting to know your data

Metadata is a description of data, where information about different aspects of the data is documented. Some examples include definitions of the variables being measured, how and why the data were collected, and any data standards used. Reviewing the metadata associated with a dataset can be very helpful for understanding the types of data you are working with and any considerations that may need to be accounted for in your analyses.

Many datasets include descriptions of the data’s taxonomic, spatial, and temporal parameters. An example of well formatted metadata is of FrogID from the Australian Museum.

Metadata of the FrogID dataset on the ALA

From reading FrogID’s metadata (Rowley and Callaghan 2020), we understand that:

The dataset comprises acoustic data¹
This is citizen science data²
Audio is recorded via a smartphone app³
These data record presences, but not absences
The data are under a Creative Commons license which is relevant for reuse and republishing

Data standards

Many data infrastructures like the Atlas of Living Australia also follow and encourage a data standard to help consolidate data from many different data providers⁴.

The data standard used by the Atlas of Living Australia is called Darwin Core. Darwin Core works by defining a) a set of standard terms ⁵ to use across datasets as column names, and b) values eligible to be recorded underneath these terms. Darwin Core standards require that additional files detailing metadata and data structure are supplied along with the dataset. This helps make sure the data is ingested correctly into the data infrastructure.

Knowing whether your dataset follows a standard can allow you to look up term definitions to help you familiarise yourself with the data.

1.2 A first glimpse

When starting with a new dataset, we want to get an initial idea:

How many rows and columns are there?
What are the column names?
What types of data are in each column?
What are their possible values or ranges?

These answers are useful to know before jumping into wrangling and cleaning data.

There are several ways to return an overview of your data, ranging in how comprehensively you wish to summarise your data’s structure.

Return a condensed summary of your data’s structure using glimpse() from dplyr.

library(dplyr)

glimpse(birds)

Rows: 140,835
Columns: 13
$ recordID         <chr> "00000eb7-95a6-40ad-acc5-9b58ed2ab95b", "0000c741-b70…
$ scientificName   <chr> "Dacelo (Dacelo) novaeguineae", "Dacelo (Dacelo) nova…
$ taxonConceptID   <chr> "https://biodiversity.org.au/afd/taxa/1fc76c4d-4830-4…
$ decimalLatitude  <dbl> -37.77953, -37.70718, -42.96048, -36.89798, -32.90858…
$ decimalLongitude <dbl> 145.0460, 143.5939, 147.2209, 147.0573, 151.6859, 150…
$ eventDate        <dttm> 2022-04-19 11:30:00, 2022-12-25 07:37:00, 2022-10-27…
$ occurrenceStatus <chr> "PRESENT", "PRESENT", "PRESENT", "PRESENT", "PRESENT"…
$ dataResourceName <chr> "eBird Australia", "eBird Australia", "eBird Australi…
$ family           <chr> "Alcedinidae", "Alcedinidae", "Alcedinidae", "Alcedin…
$ genus            <chr> "Dacelo", "Dacelo", "Dacelo", "Dacelo", "Dacelo", "To…
$ species          <chr> "Dacelo novaeguineae", "Dacelo novaeguineae", "Dacelo…
$ cl22             <chr> "Victoria", "Victoria", "Tasmania", "Victoria", "New …
$ year             <dbl> 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022,…

Return tables of descriptive statistics for each variable, grouped by data type (e.g., numeric, character, date) using skim() from skimr.

library(skimr)

skim(birds)

Data summary
Name	birds
Number of rows	140835
Number of columns	13
_______________________
Column type frequency:
character	9
numeric	3
POSIXct	1
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
recordID	0	1.00	36	36	140835
scientificName	0	1.00	6	42	22
taxonConceptID	0	1.00	73	73	22
occurrenceStatus	0	1.00	7	7	1
dataResourceName	0	1.00	8	58	17
family	0	1.00	11	11	1
genus	116	1.00	4	11	5
species	2859	0.98	12	24	10
cl22	1718	0.99	8	28	8

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
decimalLatitude	1	-28.84	7.91	-43.58	-34.97	-30.54	-26.32	35.6	▇▆▁▁▁
decimalLongitude	1	146.09	8.80	96.87	145.11	148.95	152.38	174.8	▁▁▂▇▁
year	1	2022.00	0.00	2022.00	2022.00	2022.00	2022.00	2022.0	▁▁▇▁▁

Variable type: POSIXct

skim_variable	n_missing	complete_rate	min	max	median	n_unique
eventDate	4	1	2021-12-31 18:07:00	2022-12-31 22:50:00	2022-07-31 08:00:00	77135

Return a quick summary of your data’s structure using base R str()

str(birds)

spc_tbl_ [140,835 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ recordID        : chr [1:140835] "00000eb7-95a6-40ad-acc5-9b58ed2ab95b" "0000c741-b706-49dc-864e-bf9be82982a8" "0001057f-c7e9-4ae2-b7db-f0f03b4e52a0" "000121a3-a27e-4656-ad55-e817ebf05341" ...
 $ scientificName  : chr [1:140835] "Dacelo (Dacelo) novaeguineae" "Dacelo (Dacelo) novaeguineae" "Dacelo (Dacelo) novaeguineae" "Dacelo (Dacelo) novaeguineae" ...
 $ taxonConceptID  : chr [1:140835] "https://biodiversity.org.au/afd/taxa/1fc76c4d-4830-4129-9b86-1c7e944c3c50" "https://biodiversity.org.au/afd/taxa/1fc76c4d-4830-4129-9b86-1c7e944c3c50" "https://biodiversity.org.au/afd/taxa/1fc76c4d-4830-4129-9b86-1c7e944c3c50" "https://biodiversity.org.au/afd/taxa/1fc76c4d-4830-4129-9b86-1c7e944c3c50" ...
 $ decimalLatitude : num [1:140835] -37.8 -37.7 -43 -36.9 -32.9 ...
 $ decimalLongitude: num [1:140835] 145 144 147 147 152 ...
 $ eventDate       : POSIXct[1:140835], format: "2022-04-19 11:30:00" "2022-12-25 07:37:00" ...
 $ occurrenceStatus: chr [1:140835] "PRESENT" "PRESENT" "PRESENT" "PRESENT" ...
 $ dataResourceName: chr [1:140835] "eBird Australia" "eBird Australia" "eBird Australia" "eBird Australia" ...
 $ family          : chr [1:140835] "Alcedinidae" "Alcedinidae" "Alcedinidae" "Alcedinidae" ...
 $ genus           : chr [1:140835] "Dacelo" "Dacelo" "Dacelo" "Dacelo" ...
 $ species         : chr [1:140835] "Dacelo novaeguineae" "Dacelo novaeguineae" "Dacelo novaeguineae" "Dacelo novaeguineae" ...
 $ cl22            : chr [1:140835] "Victoria" "Victoria" "Tasmania" "Victoria" ...
 $ year            : num [1:140835] 2022 2022 2022 2022 2022 ...
 - attr(*, "spec")=
  .. cols(
  ..   recordID = col_character(),
  ..   scientificName = col_character(),
  ..   taxonConceptID = col_character(),
  ..   decimalLatitude = col_double(),
  ..   decimalLongitude = col_double(),
  ..   eventDate = col_datetime(format = ""),
  ..   occurrenceStatus = col_character(),
  ..   dataResourceName = col_character(),
  ..   family = col_character(),
  ..   genus = col_character(),
  ..   species = col_character(),
  ..   cl22 = col_character(),
  ..   year = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 
 - attr(*, "doi")= chr "https://doi.org /10.26197/ala.702e73e4-ba9d-471a-a0a4-872ecb5d0f32"

Dacelo (Dacelo) novaeguineae perched with a fresh worm in its beak. Photo by Rob Shepherd CC-BY-NC 4.0 (Int)

At this early stage, it’s helpful to assess whether your dataset meets your expectations. Consider if the data appear as anticipated. Are the values in each column reasonable? Are there any noticeable gaps or errors that might need to be corrected, or that could potentially render the data unusable?

1.3 Next steps

We have just learned some ways to initially inspect our dataset. Keep in mind, we don’t expect everything to be perfect. Some issues are expected and may indicate problems with our query or the data itself. This initial inspection is a good opportunity to identify where these issues might be and assess their severity.

When you are confident that the dataset is largely as expected, you are ready to start summarising your data.

Meaning the majority of individuals recorded are male.↩︎
Suggesting these data could be biased towards populated areas.↩︎
As a result, the authors recommend filtering data to geographic uncertainty of <3000m if you require high coordinate precision.↩︎
Making datasets easier to consolidate is also referred to as interoperability, one of the principles of FAIR data.↩︎
We suggest using Ctrl/CMD + F and searching your variable name on the webpage. Don’t hesitate to Google variable names if you are unsure what they represent.↩︎