8  Taxonomic validation

Taxonomic classification is in a state of constant change. Advances in taxonomy, especially in molecular biology, have allowed researchers to describe new species more efficiently than ever before (Garraffoni et al. 2019). Modern approaches have also enabled reclassification of organisms that were incorrectly described in the past. As new discoveries are made, taxonomies are frequently updated or amended.

This process of changing taxonomy makes working with open source biodiversity data difficult. Views may differ within the literature or across authorities about which taxonomy is true. In different countries, one taxonomy might suit the native taxonomic diversity better than other taxonomies. Data infrastructures must also make choices about which taxonomic authorities they choose to use, and different infrastructures inevitably make different decisions.

As a result, most taxonomic data will need checking and cleaning before use. You will encounter situations where the same species has several taxonomic names (synonyms) or where the same name can refer to several entirely different taxonomic groups (homonyms). These situations can be tricky to identify and clean when working with taxonomic data.
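
As a small illustration of how a homonym can surface in a table of names, the sketch below flags any genus name that appears under more than one kingdom. The `toy_names` tibble is made up for this example; Morus is used because it is a genus name shared by mulberries (plants) and gannets (birds).

```r
library(dplyr)

# Hypothetical records: `Morus` is both a plant genus (mulberries)
# and a bird genus (gannets)
toy_names <- tibble(
  genus   = c("Morus", "Morus", "Dacelo", "Eucalyptus"),
  kingdom = c("Plantae", "Animalia", "Animalia", "Plantae")
)

# A name that occurs in more than one kingdom is a candidate homonym
possible_homonyms <- toy_names |>
  group_by(genus) |>
  summarise(n_kingdoms = n_distinct(kingdom)) |>
  filter(n_kingdoms > 1)

possible_homonyms
```

A check like this only finds homonyms that cross kingdoms; names reused within a kingdom need a comparison at a lower rank.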

While there is no perfect solution, some tips, tricks and tools do exist. In this chapter we will go through some of these to clean taxonomic data. This includes ways to deal with missing taxonomic information, taxonomic synonyms and homonyms.

Cleaning taxonomic names can require a lot of changes! When cleaning taxonomic names, we recommend that you maintain a clear and explicit record of any decisions and changes made with respect to the data.
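
One possible way to keep such a record, sketched here on made-up data, is to leave the original name in its own column and add a flag showing which rows were changed. The column names `scientific_name_original` and `name_changed` are assumptions for this example, not a convention used by the ALA.

```r
library(dplyr)
library(stringr)

# Hypothetical names before cleaning
raw_names <- tibble(
  record_id = 1:3,
  scientific_name = c("ALCEDINIDAE", "Dacelo", "Ceyx azureus")
)

# Keep the original name, overwrite the working column with the
# cleaned name, and flag which rows were changed
name_log <- raw_names |>
  mutate(
    scientific_name_original = scientific_name,
    scientific_name = if_else(
      scientific_name == str_to_upper(scientific_name), # all-capitals names
      str_to_title(scientific_name),
      scientific_name
    ),
    name_changed = scientific_name != scientific_name_original
  )

name_log
```

Saving a table like `name_log` alongside the cleaned data makes each decision easy to review or reverse later.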

8.0.1 Prerequisites

In this chapter we will use several datasets:

  • Kingfisher (Alcedinidae) occurrence records from 2022 from the ALA
  • Legless lizard (Pygopodidae) occurrence records from 2021-2023 from the ALA
  • A subset of invertebrate occurrence records taken from the Curated Plant and Invertebrate Data for Bushfire Modelling data set, saved in the inverts.parquet file
  • Eucalyptus occurrence records from 2014 from the ALA
  • Eucalyptus species list downloaded from GBIF, saved in the gbif_species_list.parquet file

Download the inverts.parquet and gbif_species_list.parquet files from the Data in this book chapter.

# packages
library(here)
library(readr)
library(dplyr)
library(tidyr)
library(stringr)
library(janitor)
library(galah)
galah_config(email = "your-email-here",       # ALA-registered email
             username = "your-email-here",    # GBIF account email
             password = "your-password-here") # GBIF account password

birds <- galah_call() |>
  filter(doi == "https://doi.org/10.26197/ala.d37501c0-a32b-43f7-b1a8-660deccc9ea7") |>
  atlas_occurrences()

legless_lizards <- galah_call() |>
  filter(doi == "https://doi.org/10.26197/ala.3d88b810-b1f8-4a5e-a71a-1e847f922054") |>
  atlas_occurrences()

inverts <- arrow::read_parquet(
  here("path", "to", "inverts.parquet"))

eucalypts <- galah_call() |>
  filter(doi == "https://doi.org/10.26197/ala.48ccdb77-3d23-4543-8a03-1a1c487f6bc0") |>
  atlas_occurrences()

gbif_species_list <- arrow::read_parquet(
  here("path", "to", "gbif_species_list.parquet"))

Note: You don’t need to run this code block to read this chapter. It can, however, be useful to see the original download query. This code will download the latest data from the ALA, which you are welcome to use instead, though the data might not exactly reproduce results in this chapter.

library(galah)
galah_config(email = "your-email-here")

birds <- galah_call() |>
  identify("alcedinidae") |>
  filter(year == 2022) |>
  select(group = "basic", 
         family, genus, species, 
         cl22, eventDate, year) |>
  atlas_occurrences()

legless_lizards <- galah_call() |>
  identify("pygopodidae") |>
  filter(year > 2020) |>
  select(group = "basic") |>
  atlas_occurrences()

eucalypts <- galah_call() |>
  identify("Eucalyptus") |>
  filter(eventDate > "2014-01-01T00:00:00Z",
         eventDate < "2014-06-01T00:00:00Z") |>
  select(group = "basic", 
         kingdom, phylum, class, order, 
         family, genus, species, taxonRank) |>
  atlas_occurrences()

gbif_species_list <- request_data("species") |>
  identify("Eucalyptus") |>
  collect()
We created a custom DOI for our download by using atlas_occurrences(mint_doi = TRUE).

8.1 Preview names

One of the simplest ways to determine whether there are any immediate issues with taxonomic names is to print some of them. Most biodiversity datasets have a scientificName or scientific_name field that specifies the lowest identifiable scientific name for each record. Looking at scientificName in our birds data, we can already notice some characteristics of the names in our data, namely that:

  1. Records have been identified to different taxonomic ranks (we can see subspecies, species, genus and family names)
  2. Some names are formatted in all capitals while others are not
  3. Some names have bracketed parts
birds |>
  distinct(scientificName) |>
  print(n = 25)
# A tibble: 22 × 1
   scientificName                            
   <chr>                                     
 1 Dacelo (Dacelo) novaeguineae              
 2 Todiramphus (Todiramphus) sanctus         
 3 Ceyx azureus                              
 4 Todiramphus (Lazulena) macleayii          
 5 Dacelo (Dacelo) leachii                   
 6 Tanysiptera (Uralcyon) sylvia             
 7 Ceyx pusillus                             
 8 Todiramphus (Cyanalcyon) pyrrhopygius     
 9 Syma torotoro                             
10 Todiramphus                               
11 ALCEDINIDAE                               
12 Dacelo (Dacelo) novaeguineae novaeguineae 
13 Dacelo (Dacelo) leachii leachii           
14 Todiramphus (Todiramphus) sanctus sanctus 
15 Todiramphus (Todiramphus) chloris         
16 Todiramphus (Todiramphus) sanctus vagans  
17 Ceyx azureus azureus                      
18 Dacelo                                    
19 Ceyx azureus diemenensis                  
20 Todiramphus (Lazulena) macleayii macleayii
21 Todiramphus (Lazulena) macleayii incinctus
22 Ceyx azureus ruficollaris                 

A quick preview helps us determine what to do next to clean them.
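
The three characteristics listed above can also be checked programmatically. The sketch below is one way to do so with stringr, using a handful of names copied from the preview; the column names `has_brackets`, `all_capitals` and `n_words` are chosen for this example only.

```r
library(dplyr)
library(stringr)

# A few names copied from the preview above
preview <- tibble(
  scientificName = c("Dacelo (Dacelo) novaeguineae",
                     "Ceyx azureus",
                     "Todiramphus",
                     "ALCEDINIDAE",
                     "Ceyx azureus azureus")
)

name_checks <- preview |>
  mutate(
    # does the name contain a bracketed part?
    has_brackets = str_detect(scientificName, "\\("),
    # is the name written entirely in capitals?
    all_capitals = scientificName == str_to_upper(scientificName),
    # number of words once any bracketed part is dropped
    n_words = str_count(
      str_squish(str_remove(scientificName, "\\(.*?\\)")),
      "\\S+"
    )
  )

name_checks
```

The word count is only a rough guide to rank (one word for genus or higher, two for species, three for subspecies) and should not replace a rank field such as taxonRank where one is available.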

8.2 Name format

Different data providers might use different formats in their taxonomic names to distinguish between taxonomic ranks. It doesn’t matter which format your data uses as long as it remains consistent.

Example 1: Subspecies

As an example, data from the ALA specifies subspecies of Acacia observations using "subsp." in the scientific name, whereas subspecies of bird observations simply add an additional name.

acacia_2018 <- galah_call() |>
  identify("Acacia") |>
  filter(year == 2018) |>
  atlas_occurrences()

acacia_2018 |>
  filter(str_detect(scientificName, "Acacia brunioides")) |>
  distinct(scientificName)
# A tibble: 2 × 1
  scientificName                     
  <chr>                              
1 Acacia brunioides subsp. brunioides
2 Acacia brunioides                  
birds_2023 <- galah_call() |>
  identify("alcedinidae") |>
  filter(year == 2023) |>
  atlas_occurrences()
  
birds_2023 |>
  filter(str_detect(scientificName, "Dacelo")) |>
  distinct(scientificName)
# A tibble: 6 × 1
  scientificName                           
  <chr>                                    
1 Dacelo (Dacelo) novaeguineae             
2 Dacelo (Dacelo) leachii                  
3 Dacelo (Dacelo) novaeguineae novaeguineae
4 Dacelo                                   
5 Dacelo (Dacelo) leachii occidentalis     
6 Dacelo (Dacelo) leachii leachii          

Although both are correct, be sure to check your data to make sure that this naming format is consistent.
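
If two sources need to be compared and only one of them uses the "subsp." marker, one option is to create a second column with the marker removed and match on that. This is a minimal sketch on names taken from the output above; the column name `name_no_marker` is an assumption for this example. Botanical names conventionally keep the rank marker, so the original column is left untouched.

```r
library(dplyr)
library(stringr)

subspecies_names <- tibble(
  scientificName = c("Acacia brunioides subsp. brunioides",
                     "Acacia brunioides",
                     "Dacelo (Dacelo) leachii leachii")
)

# Add a version of each name without the " subsp." marker;
# names without the marker are returned unchanged
standardised <- subspecies_names |>
  mutate(name_no_marker = str_remove(scientificName, " subsp\\."))

standardised
```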

Example 2: Subgenera

Subgenera are present in many animal clades, but their formatting can vary. In the ALA’s scientificName field, subgenera are specified in brackets between the genus and species names.

birds |>
  filter(str_detect(scientificName, "Dacelo")) |>
  distinct(scientificName)
# A tibble: 5 × 1
  scientificName                           
  <chr>                                    
1 Dacelo (Dacelo) novaeguineae             
2 Dacelo (Dacelo) leachii                  
3 Dacelo (Dacelo) novaeguineae novaeguineae
4 Dacelo (Dacelo) leachii leachii          
5 Dacelo                                   

They are not, however, specified in the species field.

birds |>
  filter(str_detect(species, "Dacelo")) |>
  distinct(species)
# A tibble: 2 × 1
  species            
  <chr>              
1 Dacelo novaeguineae
2 Dacelo leachii     

Again, both are correct, so be sure to use the naming format that suits your needs best.
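
When a dataset only has the bracketed format and a species column is not available, the subgenus can be separated out with a regular expression. The sketch below assumes the subgenus is always the only bracketed part of the name; the column names `subgenus` and `name_no_subgenus` are made up for this example.

```r
library(dplyr)
library(stringr)

kookaburras <- tibble(
  scientificName = c("Dacelo (Dacelo) novaeguineae",
                     "Dacelo (Dacelo) leachii leachii",
                     "Dacelo")
)

kookaburras_clean <- kookaburras |>
  mutate(
    # text between the brackets, or NA when there are no brackets
    subgenus = str_extract(scientificName, "(?<=\\()[^)]+(?=\\))"),
    # the name with the bracketed part removed and spacing tidied
    name_no_subgenus = str_squish(str_remove(scientificName, "\\([^)]+\\)"))
  )

kookaburras_clean
```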

8.3 Matching names to a species list

Many investigations use a taxonomic list of species or groups to help identify which species are relevant. Using lists of introduced, invasive, threatened or sensitive species to identify species records of interest is a common example.

There are several ways to filter records to match names on a species list. First, we’ll use a species list accessed using galah to filter records, which offers other functionality for filtering data prior to download. Then we’ll use an external species list loaded into R to filter records.

galah

The ALA contains national and state-based conservation status lists. For example, if we wanted to use the Victorian Restricted Species list, we can do a text search for available lists for “victoria” using search_all(lists, "victoria").

list_search <- search_all(lists, "victoria")
list_search
# A tibble: 33 × 19
   species_list_uid listName       listType dateCreated lastUpdated lastUploaded
   <chr>            <chr>          <chr>    <chr>       <chr>       <chr>       
 1 dr1266           "2 b) Protect… LOCAL_L… 2014-07-31… 2017-02-15… 2017-02-15T…
 2 dr1782           "Advisory Lis… CONSERV… 2014-10-27… 2022-03-16… 2022-03-16T…
 3 dr967            "Advisory Lis… CONSERV… 2013-11-12… 2023-06-12… 2023-06-12T…
 4 dr2504           "ALT Waterbug… LOCAL_L… 2015-09-08… 2016-06-14… 2016-06-14T…
 5 dr2683           "Dung beetles… LOCAL_L… 2016-01-15… 2020-08-20… 2020-08-20T…
 6 dr4890           "Endangered P… CONSERV… 2016-05-07… 2016-06-14… 2016-06-14T…
 7 dr17134          "Endangered S… CONSERV… 2021-03-30… 2022-11-21… 2022-11-21T…
 8 dr6635           "Gippsland’s … LOCAL_L… 2016-11-15… 2016-11-15… 2016-11-15T…
 9 dr9802           "Great Victor… LOCAL_L… 2018-11-29… 2018-11-29… 2018-11-29T…
10 dr7749           "IBRA Great V… PROFILE  2017-06-19… 2017-07-03… 2017-07-03T…
# ℹ 23 more rows
# ℹ 13 more variables: lastMatched <chr>, username <chr>, itemCount <int>,
#   region <chr>, isAuthoritative <lgl>, isInvasive <lgl>, isThreatened <lgl>,
#   wkt <chr>, category <chr>, generalisation <chr>, authority <chr>,
#   sdsType <chr>, looseSearch <lgl>

Filtering our result to only authoritative lists can help us find official state lists.

list_search |> 
  filter(isAuthoritative == TRUE)
# A tibble: 2 × 19
  species_list_uid listName        listType dateCreated lastUpdated lastUploaded
  <chr>            <chr>           <chr>    <chr>       <chr>       <chr>       
1 dr655            Victoria : Con… CONSERV… 2015-04-04… 2024-05-30… 2024-05-30T…
2 dr490            Victorian Rest… SENSITI… 2013-06-23… 2024-05-30… 2024-05-30T…
# ℹ 13 more variables: lastMatched <chr>, username <chr>, itemCount <int>,
#   region <chr>, isAuthoritative <lgl>, isInvasive <lgl>, isThreatened <lgl>,
#   wkt <chr>, category <chr>, generalisation <chr>, authority <chr>,
#   sdsType <chr>, looseSearch <lgl>

Now that we have found our desired list, we can return its contents by using show_values().

vic_species_list <- search_all(lists, "dr490") |>
  show_values()
We are using the list ID dr490 (specified in the species_list_uid column) to make sure we return the correct list
• Showing values for 'dr490'.
vic_species_list
# A tibble: 137 × 6
        id name                  commonName scientificName lsid  dataResourceUid
     <int> <chr>                 <chr>      <chr>          <chr> <chr>          
 1 5920169 Engaeus australis     Lilly Pil… Engaeus austr… http… dr490          
 2 5920143 Engaeus fultoni       Otway Bur… Engaeus fulto… http… dr490          
 3 5920250 Engaeus mallacoota    Mallacoot… Engaeus malla… http… dr490          
 4 5920180 Engaeus phyllocercus  Narracan … Engaeus phyll… http… dr490          
 5 5920240 Engaeus rostrogaleat… Strzeleck… Engaeus rostr… http… dr490          
 6 5920203 Engaeus sericatus     Hairy Bur… Engaeus seric… http… dr490          
 7 5920217 Engaeus sternalis     Warragul … Engaeus stern… http… dr490          
 8 5920238 Engaeus strictifrons  Portland … Engaeus stric… http… dr490          
 9 5920170 Engaeus urostrictus   Dandenong… Engaeus urost… http… dr490          
10 5920214 Euastacus bidawalus   East Gipp… Euastacus bid… http… dr490          
# ℹ 127 more rows

Now we can use our vic_species_list to identify any restricted species by matching names in our legless_lizards data to names in vic_species_list, and then remove those records.

legless_lizards_filtered <- legless_lizards |>
  filter(!scientificName %in% vic_species_list$scientificName)

legless_lizards_filtered
# A tibble: 1,967 × 8
   recordID       scientificName taxonConceptID decimalLatitude decimalLongitude
   <chr>          <chr>          <chr>                    <dbl>            <dbl>
 1 001129f4-4824… Pygopus lepid… https://biodi…           -34.0             151.
 2 0031c737-922a… Pygopus lepid… https://biodi…           -36.0             150.
 3 005dfdd2-4a93… Lialis burton… https://biodi…           -29.1             152.
 4 0063af2c-e070… Pygopus lepid… https://biodi…           -34.9             139.
 5 00a9ffcd-ec03… Lialis burton… https://biodi…           -27.5             153.
 6 00dc4542-426a… Aprasia pseud… https://biodi…           -34.7             139.
 7 010eb86a-7bd4… Lialis burton… https://biodi…           -30.2             153.
 8 0157207c-3a91… Pygopus lepid… https://biodi…           -33.7             150.
 9 0175d058-1e71… Delma molleri  https://biodi…           -34.7             139.
10 0184f709-9cec… Pygopus lepid… https://biodi…           -33.7             150.
# ℹ 1,957 more rows
# ℹ 3 more variables: eventDate <dttm>, occurrenceStatus <chr>,
#   dataResourceName <chr>

This process has removed 142 records from our data.

nrow(legless_lizards) - nrow(legless_lizards_filtered)
[1] 142
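
Before discarding records, it can be useful to look at what is about to be removed. One way to do this, sketched here on made-up data, is to use filtering joins from dplyr: semi_join() keeps the records whose name is on the list and anti_join() keeps the rest. The `toy_records` and `toy_list` tibbles are hypothetical.

```r
library(dplyr)

# Hypothetical occurrence records and restricted-species list
toy_records <- tibble(
  recordID = c("a", "b", "c", "d"),
  scientificName = c("Aprasia aurita", "Lialis burtonis",
                     "Aprasia parapulchella", "Delma molleri")
)
toy_list <- tibble(
  scientificName = c("Aprasia aurita", "Aprasia parapulchella")
)

# Records whose name is on the list (the ones that would be removed)
on_list <- toy_records |>
  semi_join(toy_list, join_by(scientificName))

# Records whose name is not on the list (the ones that are kept)
not_on_list <- toy_records |>
  anti_join(toy_list, join_by(scientificName))

# Summarise what would be removed, by name
on_list |> count(scientificName)
```

The two results always partition the original records, which gives a quick check that nothing has been lost or duplicated.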

We can also filter our queries prior to downloading data in galah by adding a filter specifying species_list_uid == dr490 to our query.

galah_call() |>
  identify("Pygopodidae") |>
  filter(species_list_uid == dr490) |>
  group_by(species) |>
  atlas_counts()
We are using the list ID dr490 (specified in the species_list_uid column) to make sure we return the correct list
# A tibble: 2 × 2
  species               count
  <chr>                 <int>
1 Aprasia parapulchella   687
2 Aprasia aurita          102

Using an external list

We can also use lists downloaded outside of galah to filter our data. As an example, let’s filter our taxonomic names to only Australian names on the Global Register of Introduced and Invasive Species (GRIIS). After downloading this list and saving it in your working directory, we can read the list into R. Taxonomic names are held in columns with an accepted_name prefix.

griis <- read_csv(here("GRIIS_Australia_20230331-121730.csv"))

glimpse(griis)
Rows: 2,979
Columns: 16
$ scientific_name                  <chr> "Oenothera longiflora L.", "Lampranth…
$ scientific_name_type             <chr> "species", "species", "species", "spe…
$ kingdom                          <chr> "Plantae", "Plantae", "Plantae", "Pla…
$ establishment_means              <chr> "alien", "alien", "alien", "alien", "…
$ is_invasive                      <chr> "null", "null", "null", "null", "null…
$ occurrence_status                <chr> "present", "present", "present", "pre…
$ checklist.name                   <chr> "Australia", "Australia", "Australia"…
$ checklist.iso_countrycode_alpha3 <chr> "AUS", "AUS", "AUS", "AUS", "AUS", "A…
$ accepted_name.species            <chr> "Oenothera longiflora", "Lampranthus …
$ accepted_name.kingdom            <chr> "Plantae", "Plantae", "Plantae", "Pla…
$ accepted_name.phylum             <chr> "Tracheophyta", "Tracheophyta", "Trac…
$ accepted_name.class              <chr> "Magnoliopsida", "Magnoliopsida", "Ma…
$ accepted_name.order              <chr> "Myrtales", "Caryophyllales", "Erical…
$ accepted_name.family             <chr> "Onagraceae", "Aizoaceae", "Ericaceae…
$ accepted_name.habitat            <chr> "[\"terrestrial\"]", "[\"terrestrial\…
$ accepted_name                    <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N…

Now we can check which names in our data match names in griis. Here we check the scientificName column of our eucalypts data against the accepted species names in the GRIIS list.

# Check which species matched the GRIIS list
matches <- eucalypts |> 
  filter(scientificName %in% griis$accepted_name.species)

matches
# A tibble: 0 × 16
# ℹ 16 variables: recordID <chr>, scientificName <chr>, taxonConceptID <chr>,
#   decimalLatitude <dbl>, decimalLongitude <dbl>, eventDate <dttm>,
#   occurrenceStatus <chr>, dataResourceName <chr>, kingdom <chr>,
#   phylum <chr>, class <chr>, order <chr>, family <chr>, genus <chr>,
#   species <chr>, taxonRank <chr>

After looking through the matches and confirming we are happy with the list of matched species, we can exclude these taxa from our data by removing rows with matching names. In this case no names matched, so no rows are removed.

legless_lizards_filtered <- legless_lizards |>
  filter(!scientificName %in% matches$scientificName) # compare against the matched names column

legless_lizards_filtered
# A tibble: 2,109 × 8
   recordID       scientificName taxonConceptID decimalLatitude decimalLongitude
   <chr>          <chr>          <chr>                    <dbl>            <dbl>
 1 001129f4-4824… Pygopus lepid… https://biodi…           -34.0             151.
 2 0027660b-3e75… Aprasia parap… https://biodi…           -35.4             149 
 3 0031c737-922a… Pygopus lepid… https://biodi…           -36.0             150.
 4 005dfdd2-4a93… Lialis burton… https://biodi…           -29.1             152.
 5 0063af2c-e070… Pygopus lepid… https://biodi…           -34.9             139.
 6 00a9ffcd-ec03… Lialis burton… https://biodi…           -27.5             153.
 7 00dc4542-426a… Aprasia pseud… https://biodi…           -34.7             139.
 8 010eb86a-7bd4… Lialis burton… https://biodi…           -30.2             153.
 9 0157207c-3a91… Pygopus lepid… https://biodi…           -33.7             150.
10 0175d058-1e71… Delma molleri  https://biodi…           -34.7             139.
# ℹ 2,099 more rows
# ℹ 3 more variables: eventDate <dttm>, occurrenceStatus <chr>,
#   dataResourceName <chr>
Tip

You can apply this concept of filtering to any list of species, or other fields, that you would like to exclude.
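
Exact matching with %in% is sensitive to capitalisation and stray whitespace, so names that look identical may fail to match. A minimal sketch of one workaround is to normalise both sides before comparing. The helper `normalise_name()` and the toy data are hypothetical and written for this example only.

```r
library(dplyr)
library(stringr)

# Hypothetical helper: collapse stray whitespace and lower-case a name
normalise_name <- function(x) {
  x |> str_squish() |> str_to_lower()
}

toy_records <- tibble(
  scientificName = c("Oenothera longiflora",
                     "oenothera  longiflora ",
                     "Eucalyptus crebra")
)
toy_list <- c("Oenothera longiflora")

# Flag records whose normalised name appears in the normalised list
flagged <- toy_records |>
  mutate(on_list = normalise_name(scientificName) %in% normalise_name(toy_list))

flagged
```

Normalising in a comparison like this leaves the original names untouched, which keeps the record of what the data provider supplied.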

8.4 Taxonomic names matching

8.4.1 Missing higher taxonomic information

It’s not uncommon to receive data that contains some but not all taxonomic rank information. Missing this information can make it difficult to summarise data or create taxonomic visualisations later on.

As an example, here is a small sample of our inverts dataset. You’ll notice that we only have scientific_name, class and family information.

inverts_sample <- inverts |>
  slice(1234:1271)

inverts_sample |> print(n = 5)
# A tibble: 38 × 9
  record_id      scientific_name class family  year latitude longitude sensitive
  <chr>          <chr>           <chr> <chr>  <int>    <dbl>     <dbl>     <int>
1 76213a64-ed41… Helicotylenchu… chro… hoplo…    NA    -23.1      151.         0
2 e74ec2f0-4cef… Iravadia (Irav… gast… irava…  1903    -16.5      140.         0
3 340c2b82-6b85… Monomorium bic… inse… formi…  1998    -24.7      150.         0
4 e7dc1fa1-6524… Saprosites men… inse… scara…  2004    -43.1      147.         0
5 316ad303-efc6… Amitermes darw… inse… termi…  1953    -21.9      118.         0
# ℹ 33 more rows
# ℹ 1 more variable: project <chr>

One way to fill in the missing ranks is to search for name matches in a data infrastructure like the ALA, which has its own taxonomic backbone. We can extract the names from our inverts_sample and save the strings in taxa_sample_names.

taxa_sample_names <- inverts_sample |>
  select(scientific_name) |>
  distinct() |>
  pull()

taxa_sample_names[1:5] # first 5 names
[1] "Helicotylenchus multicinctus"        "Iravadia (Iravadia) carpentariensis"
[3] "Monomorium bicorne"                  "Saprosites mendax"                  
[5] "Amitermes darwini"                  

…and use those names to search using search_taxa() from galah. We’ll save the results in names_matches_ala.

Any time you search for taxonomic matches using names, it’s good to double check the URLs returned in taxon_concept_id to make sure your search matched the result you expected!

names_matches_ala <- search_taxa(taxa_sample_names)
names_matches_ala
# A tibble: 38 × 15
   search_term     scientific_name scientific_name_auth…¹ taxon_concept_id rank 
   <chr>           <chr>           <chr>                  <chr>            <chr>
 1 Helicotylenchu… Helicotylenchu… (Cobb, 1893)           https://biodive… spec…
 2 Iravadia (Irav… Iravadia (Irav… (Hedley, 1912)         https://biodive… spec…
 3 Monomorium bic… Chelaner bicor… (Forel, 1907)          https://biodive… spec…
 4 Saprosites men… Saprosites men… (Blackburn, 1892)      https://biodive… spec…
 5 Amitermes darw… Amitermes darw… (Hill, 1922)           https://biodive… spec…
 6 Schedorhinoter… Schedorhinoter… (Hill, 1933)           https://biodive… spec…
 7 Sorama bicolor  Sorama bicolor  Walker, 1855           https://biodive… spec…
 8 Windbalea warr… Windbalea warr… Rentz, 1993            https://biodive… spec…
 9 Tholymis tilla… Tholymis tilla… (Fabricius, 1798)      https://biodive… spec…
10 Costellipitar … Costellipitar … (Hedley, 1923)         https://biodive… spec…
# ℹ 28 more rows
# ℹ abbreviated name: ¹​scientific_name_authorship
# ℹ 10 more variables: match_type <chr>, kingdom <chr>, phylum <chr>,
#   class <chr>, order <chr>, family <chr>, genus <chr>, species <chr>,
#   vernacular_name <chr>, issues <chr>
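
In the output above, the search term Monomorium bicorne was matched to the name Chelaner bicorne. Matches like this are worth reviewing individually. The sketch below shows one way to list them, using a small hypothetical stand-in for the search result that keeps only the search_term and scientific_name columns.

```r
library(dplyr)

# Hypothetical stand-in for a search result, keeping two columns
toy_matches <- tibble(
  search_term     = c("Monomorium bicorne", "Saprosites mendax", "Amitermes darwini"),
  scientific_name = c("Chelaner bicorne",   "Saprosites mendax", "Amitermes darwini")
)

# Rows where no match was returned, or where the matched name
# differs from the name we searched for
names_to_review <- toy_matches |>
  filter(is.na(scientific_name) | search_term != scientific_name)

names_to_review
```

A differing name is not necessarily a problem: it often means the searched name is treated as a synonym by the backbone being searched.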

Now we can merge this information into our inverts_sample data so we can use it.

First, let’s select relevant columns from names_matches_ala that we want to use. Before joining, let’s rename the columns so we can tell apart our initial names from the ALA names by adding an "_ala" suffix to each column name.

names_matches_renamed <- names_matches_ala |>
  select(scientific_name, kingdom:species) |>
  rename_with(\(column_name) paste0(column_name, "_ala"),
              kingdom:species)
names_matches_renamed
This line uses shorthand to write a function to append a suffix to a column name. An equivalent way of writing this is:
function(column_name) {paste0(column_name, "_ala")}

This is applied to each column name from kingdom to species in the names_matches_ala dataframe.
# A tibble: 38 × 8
   scientific_name         kingdom_ala phylum_ala class_ala order_ala family_ala
   <chr>                   <chr>       <chr>      <chr>     <chr>     <chr>     
 1 Helicotylenchus multic… Animalia    Nematoda   Chromado… Panagrol… Hoplolaim…
 2 Iravadia (Iravadia) ca… Animalia    Mollusca   Gastropo… Hypsogas… Iravadiid…
 3 Chelaner bicorne        Animalia    Arthropoda Insecta   Hymenopt… Formicidae
 4 Saprosites mendax       Animalia    Arthropoda Insecta   Coleopte… Scarabaei…
 5 Amitermes darwini       Animalia    Arthropoda Insecta   Blattodea Termitidae
 6 Schedorhinotermes actu… Animalia    Arthropoda Insecta   Blattodea Rhinoterm…
 7 Sorama bicolor          Animalia    Arthropoda Insecta   Lepidopt… Notodonti…
 8 Windbalea warrooa       Animalia    Arthropoda Insecta   Orthopte… Tettigoni…
 9 Tholymis tillarga       Animalia    Arthropoda Insecta   Odonata   Libelluli…
10 Costellipitar inconsta… Animalia    Mollusca   Bivalvia  Cardiida  Veneridae 
# ℹ 28 more rows
# ℹ 2 more variables: genus_ala <chr>, species_ala <chr>

Now let’s join our matched names in names_matches_renamed to our inverts_sample data. This adds all higher taxonomic names columns to our inverts_sample data.

inverts_sample_with_ranks <- names_matches_renamed |>
  right_join(inverts_sample,                          # join to `inverts_sample`
             join_by(scientific_name == scientific_name)
             )
inverts_sample_with_ranks
# A tibble: 38 × 16
   scientific_name         kingdom_ala phylum_ala class_ala order_ala family_ala
   <chr>                   <chr>       <chr>      <chr>     <chr>     <chr>     
 1 Helicotylenchus multic… Animalia    Nematoda   Chromado… Panagrol… Hoplolaim…
 2 Iravadia (Iravadia) ca… Animalia    Mollusca   Gastropo… Hypsogas… Iravadiid…
 3 Saprosites mendax       Animalia    Arthropoda Insecta   Coleopte… Scarabaei…
 4 Amitermes darwini       Animalia    Arthropoda Insecta   Blattodea Termitidae
 5 Schedorhinotermes actu… Animalia    Arthropoda Insecta   Blattodea Rhinoterm…
 6 Sorama bicolor          Animalia    Arthropoda Insecta   Lepidopt… Notodonti…
 7 Windbalea warrooa       Animalia    Arthropoda Insecta   Orthopte… Tettigoni…
 8 Tholymis tillarga       Animalia    Arthropoda Insecta   Odonata   Libelluli…
 9 Costellipitar inconsta… Animalia    Mollusca   Bivalvia  Cardiida  Veneridae 
10 Placamen lamellosum     Animalia    Mollusca   Bivalvia  Cardiida  Veneridae 
# ℹ 28 more rows
# ℹ 10 more variables: genus_ala <chr>, species_ala <chr>, record_id <chr>,
#   class <chr>, family <chr>, year <int>, latitude <dbl>, longitude <dbl>,
#   sensitive <int>, project <chr>

We can double check that our join worked correctly by making sure names in our original family column all match our new family_ala column. If any matched names had been assigned to a different family, the rows that disagree would be returned. Note that filter() drops rows where the comparison evaluates to NA, so rows that failed to match a scientific_name will not appear in this check and are better found with is.na(family_ala).

Nothing is returned, meaning that family_ala and family agree for every matched name.

inverts_sample_with_ranks |>
  select(scientific_name, family_ala, family) |>
  mutate(family = stringr::str_to_sentence(family)) |> # match formatting
  filter(family_ala != family)
# A tibble: 0 × 3
# ℹ 3 variables: scientific_name <chr>, family_ala <chr>, family <chr>
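
Because filter() drops rows where a comparison evaluates to NA, a record whose name was matched to a different accepted name (such as Monomorium bicorne, matched to Chelaner bicorne) can slip through the check above. The sketch below uses made-up data to show the effect, an explicit NA check, and an alternative join on the original search term. The toy tibbles are hypothetical.

```r
library(dplyr)
library(stringr)

toy_inverts <- tibble(
  scientific_name = c("Monomorium bicorne", "Saprosites mendax"),
  family = c("formicidae", "scarabaeidae")
)

toy_matches <- tibble(
  search_term     = c("Monomorium bicorne", "Saprosites mendax"),
  scientific_name = c("Chelaner bicorne", "Saprosites mendax"),
  family_ala      = c("Formicidae", "Scarabaeidae")
)

# Joining on the matched name leaves the synonym without ranks
joined_on_match <- toy_inverts |>
  left_join(select(toy_matches, scientific_name, family_ala),
            join_by(scientific_name))

# A `!=` comparison does not reveal it: rows with NA are dropped
disagreeing <- joined_on_match |>
  filter(family_ala != str_to_sentence(family))

# An explicit NA check does
unmatched <- joined_on_match |>
  filter(is.na(family_ala))

# Joining on the original search term keeps ranks for every record
joined_on_term <- toy_inverts |>
  left_join(select(toy_matches, search_term, family_ala),
            join_by(scientific_name == search_term))
```

Which join key is appropriate depends on whether you want to keep your original names or adopt the matched ones; either way, counting NA values after the join is a cheap safeguard.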

8.4.2 Identifying mismatches in species lists

Higher taxonomy from different data providers may not always match. If this is the case, you will need to back-fill the higher taxonomic ranks using data from your preferred taxonomic naming authority.

Let’s use data of Eucalyptus observations we downloaded from the ALA as an example.

eucalypts
# A tibble: 8,467 × 16
   recordID       scientificName taxonConceptID decimalLatitude decimalLongitude
   <chr>          <chr>          <chr>                    <dbl>            <dbl>
 1 0009ba6a-8e8e… Eucalyptus re… https://id.bi…           -17.6             145.
 2 002b74ab-b8ce… Eucalyptus ca… https://id.bi…           -34.2             141.
 3 002bde6c-3a7f… Eucalyptus co… https://id.bi…           -30.1             146.
 4 002cb2ce-c8a1… Eucalyptus ca… https://id.bi…           -37.1             141.
 5 0031022c-8e9e… Eucalyptus la… https://id.bi…           -34.4             142.
 6 00407506-383e… Eucalyptus pa… https://id.bi…           -34.1             151.
 7 004413ca-5a95… Eucalyptus po… https://id.bi…           -35.3             149.
 8 005371a8-047e… Eucalyptus ca… https://id.bi…           -35.7             145.
 9 00560db1-bb66… Eucalyptus da… https://id.bi…           -36.3             148.
10 005fcf1f-3c6f… Eucalyptus no… https://id.bi…           -30.4             152.
# ℹ 8,457 more rows
# ℹ 11 more variables: eventDate <dttm>, occurrenceStatus <chr>,
#   dataResourceName <chr>, kingdom <chr>, phylum <chr>, class <chr>,
#   order <chr>, family <chr>, genus <chr>, species <chr>, taxonRank <chr>

These occurrence data contain 373 distinct scientific names, excluding records identified only to genus.

eucalypts |>
  filter(taxonRank != "genus") |>
  distinct(scientificName) |>
  count(name = "n_species")
# A tibble: 1 × 1
  n_species
      <int>
1       373

Let’s say we want to compare these observations to data retrieved outside of the ALA and decide that we’d prefer to use GBIF’s[1] taxonomy. ALA data uses its own taxonomic backbone that differs from GBIF’s (depending on the taxonomic group), so we will need to amend our taxonomic names to match GBIF’s.

Let’s go through the steps to match our taxonomy in our eucalypts data to GBIF’s taxonomy. We can download a species list of Eucalyptus from GBIF. This list returns nearly 1,700 species names.

Download the gbif_species_list.parquet file from the Data in this book chapter.

Note: This is the original query used to download this species list from GBIF. Run it if you would like the most up-to-date version of the list; the download takes several minutes.

library(galah)
gbif_species_list <- request_data("species") |>
  identify("Eucalyptus") |>
  collect()

gbif_species_list
# A tibble: 1,695 × 22
   taxonKey scientificName               acceptedTaxonKey acceptedScientificName
 *    <dbl> <chr>                                   <dbl> <chr>                 
 1  3176716 Eucalyptus calcicola Brooker          3176716 Eucalyptus calcicola …
 2  3176802 Eucalyptus salicola Brooker           3176802 Eucalyptus salicola B…
 3  3176920 Eucalyptus crebra F.Muell.            3176920 Eucalyptus crebra F.M…
 4  3177269 Eucalyptus stricta Sieber e…          3177269 Eucalyptus stricta Si…
 5  3717566 Eucalyptus alpina Lindl.              3717566 Eucalyptus alpina Lin…
 6  8164544 Eucalyptus hemiphloia var. …          7908015 Eucalyptus albens Miq.
 7  9292334 Eucalyptus goniocalyx subsp…          9292334 Eucalyptus goniocalyx…
 8 11127669 Eucalyptus griffithii Maiden         11127669 Eucalyptus griffithii…
 9  3176297 Eucalyptus camfieldii Maiden          3176297 Eucalyptus camfieldii…
10  3176473 Eucalyptus macrorhyncha sub…          3176473 Eucalyptus macrorhync…
# ℹ 1,685 more rows
# ℹ 18 more variables: numberOfOccurrences <dbl>, taxonRank <chr>,
#   taxonomicStatus <chr>, kingdom <chr>, kingdomKey <dbl>, phylum <chr>,
#   phylumKey <dbl>, class <chr>, classKey <dbl>, order <chr>, orderKey <dbl>,
#   family <chr>, familyKey <dbl>, genus <chr>, genusKey <dbl>, species <chr>,
#   speciesKey <dbl>, iucnRedListCategory <chr>

To investigate whether the complete taxonomy—from kingdom to species—matches between our ALA data and GBIF species list, let’s get the columns with taxonomic information from our eucalypts dataframe and our gbif_species_list to compare.

First, we can select columns containing taxonomic names in our ALA eucalypts dataframe (kingdom to species) and use distinct() to remove duplicate rows. This will leave us with one row for each distinct species in our dataset (very similar to a species list).

ala_names <- eucalypts |>
  select(kingdom:species) |>
  distinct()

ala_names
# A tibble: 312 × 7
   kingdom phylum     class         order    family    genus      species       
   <chr>   <chr>      <chr>         <chr>    <chr>     <chr>      <chr>         
 1 Plantae Charophyta Equisetopsida Myrtales Myrtaceae Eucalyptus Eucalyptus re…
 2 Plantae Charophyta Equisetopsida Myrtales Myrtaceae Eucalyptus Eucalyptus ca…
 3 Plantae Charophyta Equisetopsida Myrtales Myrtaceae Eucalyptus Eucalyptus co…
 4 Plantae Charophyta Equisetopsida Myrtales Myrtaceae Eucalyptus Eucalyptus la…
 5 Plantae Charophyta Equisetopsida Myrtales Myrtaceae Eucalyptus Eucalyptus pa…
 6 Plantae Charophyta Equisetopsida Myrtales Myrtaceae Eucalyptus Eucalyptus po…
 7 Plantae Charophyta Equisetopsida Myrtales Myrtaceae Eucalyptus Eucalyptus da…
 8 Plantae Charophyta Equisetopsida Myrtales Myrtaceae Eucalyptus Eucalyptus no…
 9 Plantae Charophyta Equisetopsida Myrtales Myrtaceae Eucalyptus Eucalyptus pl…
10 Plantae Charophyta Equisetopsida Myrtales Myrtaceae Eucalyptus Eucalyptus me…
# ℹ 302 more rows

Now let’s filter gbif_species_list to only “accepted” names[2] and select the same taxonomic names columns.

gbif_names <- gbif_species_list |>
  filter(taxonomicStatus == "ACCEPTED") |> # accepted names
  select(kingdom:species) |> 
  select(!contains("Key")) |> # remove Key columns
  distinct()

We added distinct() to remove duplicate rows of species names. These duplicates appear because there might be multiple subspecies under the same species name. For example, Eucalyptus mannifera has 4 subspecies; Eucalyptus wimmerensis has 5. We aren’t interested in identifying species at that level, so we remove these duplicates to simplify our species list.

gbif_names
# A tibble: 989 × 7
   kingdom phylum       class         order    family    genus      species     
   <chr>   <chr>        <chr>         <chr>    <chr>     <chr>      <chr>       
 1 Plantae Tracheophyta Magnoliopsida Myrtales Myrtaceae Eucalyptus Eucalyptus …
 2 Plantae Tracheophyta Magnoliopsida Myrtales Myrtaceae Eucalyptus Eucalyptus …
 3 Plantae Tracheophyta Magnoliopsida Myrtales Myrtaceae Eucalyptus Eucalyptus …
 4 Plantae Tracheophyta Magnoliopsida Myrtales Myrtaceae Eucalyptus Eucalyptus …
 5 Plantae Tracheophyta Magnoliopsida Myrtales Myrtaceae Eucalyptus Eucalyptus …
 6 Plantae Tracheophyta Magnoliopsida Myrtales Myrtaceae Eucalyptus Eucalyptus …
 7 Plantae Tracheophyta Magnoliopsida Myrtales Myrtaceae Eucalyptus Eucalyptus …
 8 Plantae Tracheophyta Magnoliopsida Myrtales Myrtaceae Eucalyptus Eucalyptus …
 9 Plantae Tracheophyta Magnoliopsida Myrtales Myrtaceae Eucalyptus Eucalyptus …
10 Plantae Tracheophyta Magnoliopsida Myrtales Myrtaceae Eucalyptus Eucalyptus …
# ℹ 979 more rows

We can merge our two names data frames together, matching by species name, which will allow us to compare them. We’ll distinguish which columns came from each data frame by appending an "_ala" or "_gbif" suffix to each column name.

matched_names <- ala_names |>
  left_join(gbif_names, 
            join_by(species == species), 
            suffix = c("_ala", "_gbif")) |>
  select(species, everything()) # reorder columns

matched_names now contains the full taxonomy from the ALA and GBIF for all matched species³.

rmarkdown::paged_table( # print paged table
  matched_names
  )
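Before comparing ranks, it can help to check which ALA species found no match in the GBIF names at all. The following is a minimal sketch on toy data: ala_toy and gbif_toy are hypothetical stand-ins for ala_names and gbif_names, and the species names are placeholders rather than real taxa. It uses dplyr's anti_join(), which keeps rows of the first table that have no match in the second.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-ins for ala_names and gbif_names (placeholder species)
ala_toy <- tibble(
  species = c("Aus alpha", "Aus beta", "Aus gamma"),
  phylum  = c("Charophyta", "Charophyta", "Charophyta")
)
gbif_toy <- tibble(
  species = c("Aus alpha", "Aus gamma"),
  phylum  = c("Tracheophyta", "Tracheophyta")
)

# Species present in the ALA names but absent from the GBIF names
unmatched <- ala_toy |>
  anti_join(gbif_toy, by = join_by(species))

unmatched
```

On the real data, the equivalent call would use ala_names and gbif_names; the species it returns are the ones worth checking for synonyms.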

We are now ready to compare taxonomic names to find mismatches. We can start by finding any species with a mismatch in their kingdom name by filtering to return rows where kingdom_ala and kingdom_gbif are not equal. Our returned tibble is empty, meaning there were no mismatches.

matched_names |>
  filter(kingdom_ala != kingdom_gbif)
# A tibble: 0 × 13
# ℹ 13 variables: species <chr>, kingdom_ala <chr>, phylum_ala <chr>,
#   class_ala <chr>, order_ala <chr>, family_ala <chr>, genus_ala <chr>,
#   kingdom_gbif <chr>, phylum_gbif <chr>, class_gbif <chr>, order_gbif <chr>,
#   family_gbif <chr>, genus_gbif <chr>

If we do the same for phylum and class, however, we return quite a few results. It turns out that there is a difference between the ALA and GBIF in their higher taxonomic ranks of Eucalyptus plants.

In GBIF, Eucalyptus sits in the phylum Tracheophyta and the class Magnoliopsida

# Use GBIF
galah_config(atlas = "gbif")

# Search for taxonomic information
gbif_taxa <- search_taxa("eucalyptus")

# Show relevant columns
gbif_taxa |>
  select(scientific_name, phylum, class, order)
# A tibble: 1 × 4
  scientific_name   phylum       class         order   
  <chr>             <chr>        <chr>         <chr>   
1 Eucalyptus L'Hér. Tracheophyta Magnoliopsida Myrtales

…whereas in the ALA, Eucalyptus sits in the phylum Charophyta and the class Equisetopsida.

# Switch to download from the ALA
galah_config(atlas = "ala")

# Search for taxonomic information
ala_taxa <- search_taxa("Eucalyptus")

# Show relevant columns
ala_taxa |>
  select(scientific_name, phylum, class, order)
# A tibble: 1 × 4
  scientific_name phylum     class         order   
  <chr>           <chr>      <chr>         <chr>   
1 Eucalyptus      Charophyta Equisetopsida Myrtales
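The phylum and class comparison described above can be sketched in the same way as the kingdom check. This example runs on toy data: matched_toy is a hypothetical stand-in for matched_names with placeholder species, and only the first row uses the ALA and GBIF rank names reported in this section. Note that `!=` returns NA when either side is NA, and filter() drops those rows, so species with no GBIF match will not appear among the mismatches.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for matched_names (placeholder species)
matched_toy <- tibble(
  species     = c("Aus alpha", "Aus beta", "Aus gamma"),
  phylum_ala  = c("Charophyta", "Charophyta", "Charophyta"),
  class_ala   = c("Equisetopsida", "Equisetopsida", "Equisetopsida"),
  phylum_gbif = c("Tracheophyta", "Charophyta", NA),    # NA = no GBIF match
  class_gbif  = c("Magnoliopsida", "Equisetopsida", NA)
)

# Rows where either the phylum or the class differs between sources
mismatches <- matched_toy |>
  filter(phylum_ala != phylum_gbif | class_ala != class_gbif)

mismatches
```

Only the first row is returned here: the second row agrees in both ranks, and the third is dropped because its GBIF columns are NA.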

We might not know about this issue when we first decide to match GBIF’s taxonomic names to our data. So it’s important to investigate how well these names match (and where there are any mismatches) before merging them to our complete eucalypts data.

Now that we are aware of the differences between GBIF and ALA names, we can decide which to use. If we would like to use GBIF’s taxonomic names, we can join the columns with the suffix _gbif to our eucalypt occurrence data, and then replace the old taxonomic name columns with the GBIF name columns⁴.

eucalypts_updated_names <- matched_names |>
  # select columns and join to eucalypts data
  select(species, kingdom_gbif:genus_gbif) |>
  right_join(eucalypts,
             join_by(species == species)) |>
  select(-(kingdom:genus)) |> # remove ALA taxonomic columns
  rename_with(                # rename columns...
    ~ str_remove(., "_gbif"), # ...by removing "_gbif" suffix 
    kingdom_gbif:genus_gbif
    ) 

eucalypts_updated_names |> 
  rmarkdown::paged_table()    # paged table output
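Species that did not match GBIF will have NA in their GBIF columns after a join like this. One possible way to back-fill them, sketched here on toy data, is dplyr's coalesce(), which keeps the GBIF value where there is one and falls back to the ALA value otherwise. The data frame joined_toy and its placeholder species are hypothetical, and whether mixing two taxonomies in one column is acceptable depends on your analysis.

```r
library(dplyr)
library(tibble)

# Hypothetical joined names: the second species has no GBIF match
joined_toy <- tibble(
  species     = c("Aus alpha", "Aus beta"),
  phylum_ala  = c("Charophyta", "Charophyta"),
  phylum_gbif = c("Tracheophyta", NA)
)

# Prefer the GBIF name; fall back to the ALA name where GBIF is missing
backfilled <- joined_toy |>
  mutate(phylum = coalesce(phylum_gbif, phylum_ala)) |>
  select(species, phylum)

backfilled
```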

8.5 Detecting synonyms

Scientific discoveries and advances in taxonomic classification can cause taxonomic names to change. A taxonomic synonym is a scientific name that refers to a taxon that now goes by a different accepted name. Synonyms come about when a taxon’s accepted scientific name is changed: the older name remains attached to the taxon as a synonym. Synonyms are important because older records identified under a synonym can still be searched for and linked to other taxonomic records.

Synonyms can be tricky to deal with during data cleaning because they can be difficult to spot. Here are several examples of synonyms.

In the above examples, taxonomic searches match correctly on GBIF because GBIF maintains a very large database of accepted names, superseded names and synonyms, which allows it to match many different names. The ALA, on the other hand, uses a much smaller names database that matches its current taxonomic backbone. This smaller database is easier to store but less complete than GBIF’s.

Using tools like search_taxa() in galah is a useful way to check whether a search returns the taxonomic information you expect.

8.5.1 Checking for synonyms

Some species lists return accepted names and synonyms. For example, here is a species list of Eucalyptus downloaded from GBIF (which we used earlier in the chapter).

Download the gbif_species_list.parquet file from the Data in this book chapter.

Note: This is the original query used to download this species list from GBIF. If you would like the most up-to-date version of the list, you can run it yourself, but the download takes several minutes.

library(galah)
gbif_species_list <- request_data("species") |>
  identify("Eucalyptus") |>
  collect()

gbif_species_list
# A tibble: 1,695 × 22
   taxonKey scientificName               acceptedTaxonKey acceptedScientificName
 *    <dbl> <chr>                                   <dbl> <chr>                 
 1  3176716 Eucalyptus calcicola Brooker          3176716 Eucalyptus calcicola …
 2  3176802 Eucalyptus salicola Brooker           3176802 Eucalyptus salicola B…
 3  3176920 Eucalyptus crebra F.Muell.            3176920 Eucalyptus crebra F.M…
 4  3177269 Eucalyptus stricta Sieber e…          3177269 Eucalyptus stricta Si…
 5  3717566 Eucalyptus alpina Lindl.              3717566 Eucalyptus alpina Lin…
 6  8164544 Eucalyptus hemiphloia var. …          7908015 Eucalyptus albens Miq.
 7  9292334 Eucalyptus goniocalyx subsp…          9292334 Eucalyptus goniocalyx…
 8 11127669 Eucalyptus griffithii Maiden         11127669 Eucalyptus griffithii…
 9  3176297 Eucalyptus camfieldii Maiden          3176297 Eucalyptus camfieldii…
10  3176473 Eucalyptus macrorhyncha sub…          3176473 Eucalyptus macrorhync…
# ℹ 1,685 more rows
# ℹ 18 more variables: numberOfOccurrences <dbl>, taxonRank <chr>,
#   taxonomicStatus <chr>, kingdom <chr>, kingdomKey <dbl>, phylum <chr>,
#   phylumKey <dbl>, class <chr>, classKey <dbl>, order <chr>, orderKey <dbl>,
#   family <chr>, familyKey <dbl>, genus <chr>, genusKey <dbl>, species <chr>,
#   speciesKey <dbl>, iucnRedListCategory <chr>

GBIF species lists include a taxonomicStatus column that indicates whether a taxonomic name is accepted or a synonym. A good example is the list of names for Eucalyptus leucoxylon, which has a number of accepted subspecies names and synonyms.

gbif_species_list |>
  filter(species == "Eucalyptus leucoxylon") |>
  select(species, taxonRank, taxonomicStatus, acceptedScientificName)
# A tibble: 18 × 4
   species               taxonRank  taxonomicStatus acceptedScientificName      
   <chr>                 <chr>      <chr>           <chr>                       
 1 Eucalyptus leucoxylon VARIETY    SYNONYM         Eucalyptus leucoxylon subsp…
 2 Eucalyptus leucoxylon SPECIES    SYNONYM         Eucalyptus leucoxylon subsp…
 3 Eucalyptus leucoxylon SUBSPECIES ACCEPTED        Eucalyptus leucoxylon subsp…
 4 Eucalyptus leucoxylon SPECIES    ACCEPTED        Eucalyptus leucoxylon F.Mue…
 5 Eucalyptus leucoxylon SUBSPECIES ACCEPTED        Eucalyptus leucoxylon subsp…
 6 Eucalyptus leucoxylon VARIETY    SYNONYM         Eucalyptus leucoxylon F.Mue…
 7 Eucalyptus leucoxylon SPECIES    SYNONYM         Eucalyptus leucoxylon subsp…
 8 Eucalyptus leucoxylon VARIETY    SYNONYM         Eucalyptus leucoxylon subsp…
 9 Eucalyptus leucoxylon SUBSPECIES ACCEPTED        Eucalyptus leucoxylon subsp…
10 Eucalyptus leucoxylon VARIETY    ACCEPTED        Eucalyptus leucoxylon var. …
11 Eucalyptus leucoxylon SUBSPECIES ACCEPTED        Eucalyptus leucoxylon subsp…
12 Eucalyptus leucoxylon SUBSPECIES ACCEPTED        Eucalyptus leucoxylon subsp…
13 Eucalyptus leucoxylon VARIETY    SYNONYM         Eucalyptus leucoxylon subsp…
14 Eucalyptus leucoxylon SUBSPECIES ACCEPTED        Eucalyptus leucoxylon subsp…
15 Eucalyptus leucoxylon SUBSPECIES ACCEPTED        Eucalyptus leucoxylon subsp…
16 Eucalyptus leucoxylon SPECIES    SYNONYM         Eucalyptus leucoxylon subsp…
17 Eucalyptus leucoxylon UNRANKED   ACCEPTED        SH0881366.09FU              
18 Eucalyptus leucoxylon VARIETY    SYNONYM         Eucalyptus leucoxylon subsp…
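To turn a list like this into something usable for cleaning, one option is to count names by status and then build a lookup from each synonym to its accepted name. The sketch below uses toy data: species_toy is a hypothetical stand-in for gbif_species_list, with the same three column names but placeholder taxa.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for gbif_species_list (placeholder names)
species_toy <- tibble(
  scientificName         = c("Aus alpha", "Aus alpha var. minor",
                             "Aus betoides", "Aus beta"),
  taxonomicStatus        = c("ACCEPTED", "SYNONYM", "SYNONYM", "ACCEPTED"),
  acceptedScientificName = c("Aus alpha", "Aus alpha", "Aus beta", "Aus beta")
)

# How many names fall under each status?
status_counts <- species_toy |>
  count(taxonomicStatus)

# Lookup table: each synonym alongside the name GBIF accepts for it
synonym_lookup <- species_toy |>
  filter(taxonomicStatus == "SYNONYM") |>
  select(synonym = scientificName, accepted = acceptedScientificName)

synonym_lookup
```

A lookup like this could then be joined to occurrence data to replace synonyms with accepted names, keeping a record of every name that was changed.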

8.6 Detecting homonyms

Taxonomy is a complex field of science that categorises millions of species on the taxonomic tree. With so many species to name and order taxonomically, sometimes one name can have identical spelling to another name in an entirely different place on the taxonomic tree.

For example, the name Morganella refers to a genus of bacteria, a genus of fungi, a genus of scale insect, and a genus of brachiopod from the Devonian period⁵!
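If your own data carry higher taxonomic ranks, a quick way to flag candidate homonyms is to look for names that appear under more than one higher rank. The sketch below is an assumption-laden illustration rather than a complete check: names_toy is a hypothetical table, and it only compares kingdoms.

```r
library(dplyr)
library(tibble)

# Hypothetical names table: one genus name recorded under several kingdoms
names_toy <- tibble(
  genus   = c("Morganella", "Morganella", "Morganella", "Eucalyptus"),
  kingdom = c("Bacteria", "Fungi", "Animalia", "Plantae")
)

# Genus names that occur in more than one kingdom
homonyms <- names_toy |>
  distinct(genus, kingdom) |>
  count(genus, name = "n_kingdoms") |>
  filter(n_kingdoms > 1)

homonyms
```

Comparing kingdoms will miss homonyms that sit within the same kingdom, such as the scale insect and the brachiopod, so a lower rank like phylum or family may be a better grouping column for some datasets.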

When you search for names with search_taxa() from the galah package, you’ll receive a warning that there is a homonym issue.

search_taxa("morganella")
Warning: Search returned multiple taxa due to a homonym issue.
ℹ Please provide another rank in your search to clarify taxa.
ℹ Use a `tibble` to clarify taxa, see `?search_taxa`.
✖ Homonym issue with "morganella".
# A tibble: 1 × 2
  search_term issues 
  <chr>       <chr>  
1 morganella  homonym

You can clarify the taxonomic name by providing other taxonomic ranks in a tibble. Using the returned taxon_concept_id rather than the name then lets you retrieve data using the correct classification.

taxa <- search_taxa(tibble(kingdom = "Fungi", genus = "Morganella"))

taxa |> rmarkdown::paged_table()

# Return record counts, grouped by species
galah_call() |>
  identify(taxa$taxon_concept_id) |>
  group_by(species) |>
  atlas_counts()
# A tibble: 2 × 2
  species                 count
  <chr>                   <int>
1 Morganella compacta        88
2 Morganella purpurascens    38

For more information on advanced taxonomic filtering in galah, you can read this vignette on the package website.

8.7 Packages

There are several packages available that can be used to query different taxonomic databases and check for synonyms.

Download the worms.csv file from the Data in this book chapter.

8.8 Input from experts

Programmatic solutions for validating taxonomy can only go so far. To obtain a high quality species list, it’s good practice to seek validation from experts. Museums or taxonomic societies are great sources of knowledge.

Here is a list of some Australian taxonomic society groups to help validate taxonomies.

8.8.1 Australian taxonomic society groups

VERTEBRATES

INVERTEBRATES

8.8.2 Global taxonomy


  1. Global Biodiversity Information Facility (GBIF)

  2. GBIF’s species list is quite comprehensive, and it includes the taxonomicStatus of a name as “accepted”, “synonym”, “variety” or “doubtful”. To keep our example simpler, we are only using the accepted names.

  3. Several species names did not match to GBIF. In a complete data cleaning workflow, these should be investigated, as the ALA and GBIF might use synonyms to describe the same species or subspecies.

  4. Some names did not match GBIF, meaning their taxonomic columns contain NA values. Be sure to either fix these NA values before merging dataframes, or back-fill them after merging. Otherwise, you might unintentionally introduce missing data into your data set!

  5. Referred to as “the Age of Fishes”, the Devonian Period occurred ~419 to ~359 million years ago.