# packages
library(here)
library(readr)
library(dplyr)
library(tidyr)
library(stringr)
library(janitor)
library(galah)
galah_config(email = "your-email-here", # ALA-registered email
username = "your-email-here", # GBIF account email
password = "your-password-here") # GBIF account password
birds <- galah_call() |>
filter(doi == "https://doi.org/10.26197/ala.0f2a39f4-41ae-4cd3-a400-a6489cba17f0") |>
atlas_occurrences()
legless_lizards <- galah_call() |>
filter(doi == "https://doi.org/10.26197/ala.53669293-5568-40aa-ae53-9e7efcd834db") |>
atlas_occurrences()
inverts <- arrow::read_parquet(
here("path", "to", "inverts.parquet"))
eucalypts <- galah_call() |>
filter(doi == "https://doi.org/10.26197/ala.4cd034cb-83dc-4e6f-b1c1-22dfb522a64c") |>
atlas_occurrences()
gbif_species_list <- arrow::read_parquet(
here("path", "to", "gbif_eucalyptus.parquet"))
8 Taxonomic validation
Taxonomic classification is in a state of constant change. Advances in methods, especially in molecular biology, have allowed researchers to describe new species more efficiently than ever before (Garraffoni et al. 2019). Modern approaches have also enabled the reclassification of organisms that were incorrectly described in the past. As new discoveries are made, taxonomies are frequently updated or amended.
This process of changing taxonomy makes working with open-source biodiversity data challenging. Views may differ within the literature or across authorities about which taxonomy is correct. In different countries, one taxonomy might better describe the native taxonomic diversity than others. Data infrastructures must also make choices about which taxonomic authorities to use, and different infrastructures inevitably make different decisions.
Another important issue is that when you combine data from multiple providers, there is no guarantee that each provider uses the same taxonomic resolution. If, for example, you are interested in subspecies records for birds, many larger datasets (e.g., Birdlife) are shared without subspecies information. To find all relevant records for a subspecies, you may therefore need to search at both the species and subspecies level.
As a result, most taxonomic data will need checking and cleaning before analysis. You will encounter situations where the same taxon has several names (synonyms) or where the same name can refer to several entirely unrelated taxa (homonyms). These situations can be tricky to identify and therefore clean when working with taxonomic data.
While there is no perfect solution, some tips, tricks, and tools do exist. In this chapter we will go through some of these to clean taxonomic data, including ways of dealing with missing taxonomic information, and detecting synonyms and homonyms.
Cleaning taxonomic names can require a lot of changes! For every change, we recommend keeping detailed records of your modifications and your reasons for making those decisions.
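One lightweight way to keep such records is a change log stored alongside your data. The sketch below is only an illustration: the `change_log` table, its column names, and the `log_change()` helper are hypothetical, not part of any package used in this chapter.

```r
# An empty log with one row per name change (column names are our own choice)
change_log <- data.frame(
  original_name = character(),
  new_name = character(),
  reason = character(),
  changed_on = as.Date(character())
)

# Hypothetical helper: append one change, stamped with today's date
log_change <- function(log, original_name, new_name, reason) {
  rbind(
    log,
    data.frame(
      original_name = original_name,
      new_name = new_name,
      reason = reason,
      changed_on = Sys.Date()
    )
  )
}

# Example entry: record why a family name was re-cased
change_log <- log_change(
  change_log,
  original_name = "ALCEDINIDAE",
  new_name = "Alcedinidae",
  reason = "convert family name to sentence case"
)
change_log
```

Saving this table with your cleaned data makes it possible to trace, and if necessary reverse, each decision later.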
8.0.1 Prerequisites
In this chapter we will use several datasets:
- Kingfisher (Alcedinidae) occurrence records from 2022 from the ALA
- Legless lizard (Pygopodidae) occurrence records from 2021-2023 from the ALA
- A subset of invertebrate occurrence records taken from the Curated Plant and Invertebrate Data for Bushfire Modelling data set, saved in the inverts.parquet file
- Eucalyptus occurrence records from 2014 from the ALA
- Eucalyptus species list downloaded from GBIF, saved in the gbif_species_list.parquet file
Download the inverts.parquet and gbif_species_list.parquet files from the Data in this book chapter.
8.1 Preview names
One of the simplest ways to determine whether there are any immediate issues with taxonomic names is to preview a subset of the names. Most biodiversity datasets will have a field for the scientific names of taxa (e.g. scientificName, scientific_name), describing the lowest taxonomic level to which taxa have been identified. Looking at scientificName in our birds data, we can observe some characteristics of the names in this dataset, namely that:
- Records have been identified to different taxonomic ranks (family, genus, species, subspecies)
- Some names are in uppercase, others are in sentence case
- Where subgenera are included, they appear within parentheses
birds |>
distinct(scientificName) |>
print(n = 25)
# A tibble: 23 × 1
scientificName
<chr>
1 Dacelo (Dacelo) novaeguineae
2 Todiramphus (Todiramphus) sanctus
3 Ceyx azureus
4 Todiramphus (Lazulena) macleayii
5 Dacelo (Dacelo) leachii
6 Tanysiptera (Uralcyon) sylvia
7 Ceyx pusillus
8 Todiramphus (Cyanalcyon) pyrrhopygius
9 Syma torotoro
10 Todiramphus
11 ALCEDINIDAE
12 Dacelo (Dacelo) novaeguineae novaeguineae
13 Dacelo (Dacelo) leachii leachii
14 Todiramphus (Todiramphus) sanctus sanctus
15 Todiramphus (Todiramphus) chloris
16 Todiramphus (Todiramphus) sanctus vagans
17 Todiramphus (Lazulena) macleayii macleayii
18 Dacelo (Dacelo) leachii occidentalis
19 Ceyx azureus azureus
20 Ceyx azureus diemenensis
21 Ceyx azureus ruficollaris
22 Dacelo
23 Todiramphus (Lazulena) macleayii incinctus
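With longer lists of names, these characteristics are easier to spot programmatically than by eye. The following is a minimal sketch using a handful of names copied from the preview above; the object and column names (`names_preview`, `is_uppercase`, `has_subgenus`, `n_words`) are our own, and the word count is only a rough proxy for taxonomic rank.

```r
library(dplyr)
library(stringr)

# A few names copied from the preview above
names_preview <- tibble::tibble(
  scientificName = c(
    "Dacelo (Dacelo) novaeguineae",
    "Ceyx azureus",
    "Todiramphus",
    "ALCEDINIDAE",
    "Ceyx azureus diemenensis"
  )
)

names_flagged <- names_preview |>
  mutate(
    # TRUE when the whole name is written in uppercase
    is_uppercase = scientificName == str_to_upper(scientificName),
    # TRUE when the name contains a subgenus in parentheses
    has_subgenus = str_detect(scientificName, "\\(.+\\)"),
    # number of words once any subgenus is dropped
    # (1 = genus or higher, 2 = species, 3 = subspecies, roughly)
    n_words = scientificName |>
      str_remove("\\s*\\(.+?\\)") |>
      str_squish() |>
      str_count("\\S+")
  )

names_flagged
```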
8.2 Name format
Different data providers might use different formats in their taxonomic names to delineate between taxonomic ranks. It doesn’t matter which format your data uses as long as it is consistent.
Example 1: Subspecies
As an example, the ALA uses "subsp." to designate subspecies of Acacia observations in the scientific name, whereas subspecies of bird observations simply include the subspecific epithet after the specific epithet.
acacia_2018 <- galah_call() |>
identify("Acacia") |>
filter(year == 2018) |>
atlas_occurrences()
acacia_2018 |>
filter(str_detect(scientificName, "Acacia brunioides")) |>
distinct(scientificName)
# A tibble: 2 × 1
scientificName
<chr>
1 Acacia brunioides subsp. brunioides
2 Acacia brunioides
birds_2023 <- galah_call() |>
identify("alcedinidae") |>
filter(year == 2023) |>
atlas_occurrences()
birds_2023 |>
filter(str_detect(scientificName, "Dacelo")) |>
distinct(scientificName)
# A tibble: 5 × 1
scientificName
<chr>
1 Dacelo (Dacelo) novaeguineae
2 Dacelo (Dacelo) leachii
3 Dacelo (Dacelo) novaeguineae novaeguineae
4 Dacelo (Dacelo) leachii leachii
5 Dacelo
Although both are correct, be sure to check your data to make sure that this naming format is consistent. Other taxonomic names (like subgenera) can differ between taxonomic groups, and also between sources.
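If you need to combine names that follow different conventions, one option is to derive a standardised name column and keep the original untouched. The sketch below assumes you want to drop both the "subsp." marker and any subgenus in parentheses; `standardise_name()` is a hypothetical helper, and whether this simplification is appropriate depends on your taxonomic group.

```r
library(stringr)

# Names in the two formats shown above
mixed_names <- c(
  "Acacia brunioides subsp. brunioides",
  "Acacia brunioides",
  "Dacelo (Dacelo) novaeguineae novaeguineae",
  "Dacelo (Dacelo) leachii"
)

# Hypothetical helper: reduce names to "Genus species [subspecies]"
standardise_name <- function(x) {
  x |>
    str_remove_all("\\s*\\([^)]*\\)") |> # drop subgenus in parentheses
    str_remove_all("\\bsubsp\\.") |>     # drop the subspecies marker
    str_squish()                         # tidy any leftover whitespace
}

standardised <- standardise_name(mixed_names)
standardised
```

Storing the result in a new column, rather than overwriting scientificName, keeps the provider's original name available for later checks.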
8.3 Matching names to a species list
Many investigations rely on taxonomic lists of species or groups to identify relevant species. A common example is using lists of introduced, invasive, threatened, or sensitive species to identify records of interest.
There are several ways to filter records to match names on a species list. First, we’ll use a species list accessed via galah to filter records, which also provides additional functionality for filtering data prior to download. Then, we’ll use an external species list loaded into R to filter records.
galah
The ALA contains both national and state-based threatened and sensitive lists. For example, if we want to use the Victorian Restricted Species list, which is the list of species in Victoria for which all records have been generalised, we can perform a text search for available lists using the term “victoria” with search_all(lists, "victoria").
list_search <- search_all(lists, "victoria")
list_search
# A tibble: 50 × 22
species_list_uid listName description listType dateCreated lastUpdated
<chr> <chr> <chr> <chr> <chr> <chr>
1 dr1266 "2 b) Protecti… "List gene… LOCAL_L… 2014-07-31… 2017-02-15…
2 dr1782 "Advisory List… "Purpose o… CONSERV… 2014-10-27… 2022-03-16…
3 dr967 "Advisory List… "The advis… CONSERV… 2013-11-12… 2023-06-12…
4 dr2504 "ALT Waterbug … "Agreed Le… LOCAL_L… 2015-09-08… 2016-06-14…
5 dr28924 "Briza (Victor… <NA> OTHER 2024-11-10… 2024-11-10…
6 dr32852 "Climacteris p… <NA> OTHER 2025-09-18… 2025-09-18…
7 dr2683 "Dung beetles … "Dung beet… LOCAL_L… 2016-01-15… 2020-08-20…
8 dr4890 "Endangered Pl… "" CONSERV… 2016-05-07… 2016-06-14…
9 dr17134 "Endangered Sp… <NA> CONSERV… 2021-03-30… 2022-11-21…
10 dr6635 "Gippsland’s N… "List gene… LOCAL_L… 2016-11-15… 2016-11-15…
# ℹ 40 more rows
# ℹ 16 more variables: lastUploaded <chr>, lastMatched <chr>, username <chr>,
# itemCount <int>, region <chr>, isAuthoritative <lgl>, isInvasive <lgl>,
# isThreatened <lgl>, isBIE <lgl>, isSDS <lgl>, wkt <chr>, authority <chr>,
# sdsType <chr>, category <chr>, looseSearch <lgl>, generalisation <chr>
Filtering our results to authoritative lists only can help us find official state lists.
list_search |>
filter(isAuthoritative == TRUE)
# A tibble: 8 × 22
species_list_uid listName description listType dateCreated lastUpdated
<chr> <chr> <chr> <chr> <chr> <chr>
1 dr1148 Museums Field G… "Species p… SPECIES… 2014-06-23… 2024-01-18…
2 dr1147 Museums Field G… "Species p… SPECIES… 2014-06-23… 2024-01-18…
3 dr1145 Museums Field G… "Species p… SPECIES… 2014-06-23… 2024-01-18…
4 dr1146 Museums Field G… "Species p… SPECIES… 2014-06-23… 2024-01-18…
5 dr3320 Protologues of … "This a li… SPECIES… 2016-02-11… 2024-04-29…
6 dr882 VIC State Notif… "Species l… SENSITI… 2013-06-23… 2023-03-15…
7 dr655 Victoria : Cons… "" CONSERV… 2015-04-04… 2025-09-09…
8 dr490 Victorian Restr… "Categorie… SENSITI… 2013-06-23… 2025-09-09…
# ℹ 16 more variables: lastUploaded <chr>, lastMatched <chr>, username <chr>,
# itemCount <int>, region <chr>, isAuthoritative <lgl>, isInvasive <lgl>,
# isThreatened <lgl>, isBIE <lgl>, isSDS <lgl>, wkt <chr>, authority <chr>,
# sdsType <chr>, category <chr>, looseSearch <lgl>, generalisation <chr>
Now that we’ve found our list, we can view the contents of the list using show_values().
vic_species_list <- search_all(lists, "dr490") |>
show_values()
We are using the list ID dr490 (specified in the species_list_uid column) to make sure we return the correct list.
• Showing values for 'dr490'.
vic_species_list
# A tibble: 138 × 6
id name commonName scientificName lsid dataResourceUid
<int> <chr> <chr> <chr> <chr> <chr>
1 7071569 Engaeus australis Lilly Pil… Engaeus austr… http… dr490
2 7071518 Engaeus fultoni Otway Bur… Engaeus fulto… http… dr490
3 7071572 Engaeus mallacoota Mallacoot… Engaeus malla… http… dr490
4 7071573 Engaeus phyllocercus Narracan … Engaeus phyll… http… dr490
5 7071574 Engaeus rostrogaleat… Strzeleck… Engaeus rostr… http… dr490
6 7071502 Engaeus sericatus Hairy Bur… Engaeus seric… http… dr490
7 7071498 Engaeus sternalis Warragul … Engaeus stern… http… dr490
8 7071540 Engaeus strictifrons Portland … Engaeus stric… http… dr490
9 7071489 Engaeus urostrictus Dandenong… Engaeus urost… http… dr490
10 7071503 Euastacus bidawalus East Gipp… Euastacus bid… http… dr490
# ℹ 128 more rows
We can now compare the taxa in vic_species_list to those in our legless_lizards dataset to identify any species whose records have been generalised.
legless_lizards_filtered <- legless_lizards |>
filter(!scientificName %in% vic_species_list$scientificName)
legless_lizards_filtered
# A tibble: 3,792 × 8
recordID scientificName taxonConceptID decimalLatitude decimalLongitude
<chr> <chr> <chr> <dbl> <dbl>
1 001129f4-4824… Pygopus lepid… https://biodi… -34.0 151.
2 0013d553-af89… Lialis burton… https://biodi… -34.0 140.
3 0031c737-922a… Pygopus lepid… https://biodi… -36.0 150.
4 0055b6e3-11e1… Lialis burton… https://biodi… -14.1 143.
5 005c83ad-709e… Delma molleri https://biodi… -35.1 139.
6 005dfdd2-4a93… Lialis burton… https://biodi… -29.1 152.
7 0063af2c-e070… Pygopus lepid… https://biodi… -34.9 139.
8 006de4b6-880d… Delma austral… https://biodi… -33.6 141.
9 007a4f5f-b2ee… Lialis burton… https://biodi… -16.7 146.
10 0081dbcb-6af7… Delma impar https://biodi… -36.3 149.
# ℹ 3,782 more rows
# ℹ 3 more variables: eventDate <dttm>, occurrenceStatus <chr>,
# dataResourceName <chr>
This process has removed 190 records from our dataset.
nrow(legless_lizards) - nrow(legless_lizards_filtered)
[1] 190
We can also filter our queries prior to downloading data in galah by adding a filter specifying species_list_uid == dr490 to our query. We’ll also filter our result to only records in Victoria as this is a state-specific list.
galah_call() |>
identify("Pygopodidae") |>
filter(species_list_uid == dr490,
stateProvince == "Victoria") |>
group_by(species) |>
atlas_counts()
We are using the list ID dr490 (specified in the species_list_uid column) to make sure we return the correct list.
# A tibble: 2 × 2
species count
<chr> <int>
1 Aprasia parapulchella 51
2 Aprasia aurita 32
It’s important to consider the level of taxonomic identification in a given list; some contain genus, species and subspecies names. The results above are limited to names matched to the species level. To return the lowest taxonomic name identified, we can group by scientificName instead (here the results are identical, because the lowest level of identification for these records is the species level).
galah_call() |>
identify("Pygopodidae") |>
filter(species_list_uid == dr490,
stateProvince == "Victoria") |>
group_by(scientificName) |>
atlas_counts()
# A tibble: 2 × 2
scientificName count
<chr> <int>
1 Aprasia parapulchella 51
2 Aprasia aurita 32
Some lists like state conservation lists contain species and subspecies names. Take, for example, the New South Wales threatened species list (dr650).
search_all(lists, "dr650") |>
show_values() |>
head()
• Showing values for 'dr650'.
# A tibble: 6 × 6
id name commonName scientificName lsid dataResourceUid
<int> <chr> <chr> <chr> <chr> <chr>
1 7060798 Delma impar Striped L… Delma impar http… dr650
2 7060516 Callocephalon fimbria… Gang-gang… Callocephalon… http… dr650
3 7061059 Cacophis harriettae White-cro… Cacophis harr… http… dr650
4 7060630 Litoria booroolongens… Booroolon… Litoria booro… http… dr650
5 7060473 Anthochaera phrygia Regent Ho… Anthochaera (… http… dr650
6 7061040 Calidris tenuirostris Great Knot Calidris (Cal… http… dr650
If we wish to use this threatened species list to filter a query, we’ll need to ensure we group_by() either scientificName or taxonConceptID. Both fields capture the lowest taxonomic identification, and will match a list with names at multiple taxonomic levels appropriately. For example, grouping by taxonConceptID will return taxonomic information of matching species and subspecies (see this ALA Labs post for more information).
species_shoalhaven <- galah_call() |>
filter(cl11170 == "Shoalhaven",
year == 2024) |>
group_by(taxonConceptID) |>
atlas_species()
species_shoalhaven |> print(n = 10)
# A tibble: 4,335 × 11
taxon_concept_id species_name scientific_name_auth…¹ taxon_rank kingdom
<chr> <chr> <chr> <chr> <chr>
1 https://biodiversity.… Phascolarct… (Goldfuss, 1817) species Animal…
2 https://biodiversity.… Gymnorhina … (Latham, 1801) species Animal…
3 https://biodiversity.… Malurus (Ma… (Ellis, 1782) species Animal…
4 https://biodiversity.… Macropus gi… Shaw, 1790 species Animal…
5 https://biodiversity.… Corvus coro… Vigors & Horsfield, 1… species Animal…
6 https://biodiversity.… Trichogloss… Stephens, 1826 genus Animal…
7 https://biodiversity.… Vanellus (L… (Boddaert, 1783) species Animal…
8 https://biodiversity.… Anthochaera… (Latham, 1801) species Animal…
9 https://biodiversity.… Dacelo (Dac… (Hermann, 1783) species Animal…
10 https://biodiversity.… Potorous tr… (McCoy, 1865) subspecies Animal…
# ℹ 4,325 more rows
# ℹ abbreviated name: ¹scientific_name_authorship
# ℹ 6 more variables: phylum <chr>, class <chr>, order <chr>, family <chr>,
# genus <chr>, vernacular_name <chr>
Using an external list
We can also use lists downloaded outside of galah to filter our data. As an example, let’s filter our taxonomic names to include only Australian names from the Global Register of Introduced and Invasive Species (GRIIS). After downloading this list and saving it in your working directory, we can read the list into R. Taxonomic names are stored in columns with an accepted_name prefix.
griis <- read_csv(here("griis_australia_20240712.csv"))
glimpse(griis)
We renamed the downloaded file from record 20240712-155356.csv to griis_australia_20240712.csv.
Rows: 2,979
Columns: 16
$ scientific_name <chr> "Oenothera longiflora L.", "Lampranth…
$ scientific_name_type <chr> "species", "species", "species", "spe…
$ kingdom <chr> "Plantae", "Plantae", "Plantae", "Pla…
$ establishment_means <chr> "alien", "alien", "alien", "alien", "…
$ is_invasive <chr> "null", "null", "null", "null", "null…
$ occurrence_status <chr> "present", "present", "present", "pre…
$ checklist.name <chr> "Australia", "Australia", "Australia"…
$ checklist.iso_countrycode_alpha3 <chr> "AUS", "AUS", "AUS", "AUS", "AUS", "A…
$ accepted_name.species <chr> "Oenothera longiflora", "Lampranthus …
$ accepted_name.kingdom <chr> "Plantae", "Plantae", "Plantae", "Pla…
$ accepted_name.phylum <chr> "Tracheophyta", "Tracheophyta", "Trac…
$ accepted_name.class <chr> "Magnoliopsida", "Magnoliopsida", "Ma…
$ accepted_name.order <chr> "Myrtales", "Caryophyllales", "Erical…
$ accepted_name.family <chr> "Onagraceae", "Aizoaceae", "Ericaceae…
$ accepted_name.habitat <chr> "[\"terrestrial\"]", "[\"terrestrial\…
$ accepted_name <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
Now we can check which species names in our legless_lizards dataset match names in griis.
# Check which species matched the GRIIS list
matches <- legless_lizards |>
filter(scientificName %in% griis$accepted_name.species)
matches
# A tibble: 0 × 8
# ℹ 8 variables: recordID <chr>, scientificName <chr>, taxonConceptID <chr>,
# decimalLatitude <dbl>, decimalLongitude <dbl>, eventDate <dttm>,
# occurrenceStatus <chr>, dataResourceName <chr>
After reviewing the matches and confirming we’re happy with the list of matched species, we can exclude these taxa from our data by removing the identified rows. Here no names matched, so all 3,982 records are retained.
legless_lizards_filtered <- legless_lizards |>
filter(!scientificName %in% matches$scientificName)
legless_lizards_filtered
# A tibble: 3,982 × 8
recordID scientificName taxonConceptID decimalLatitude decimalLongitude
<chr> <chr> <chr> <dbl> <dbl>
1 001129f4-4824… Pygopus lepid… https://biodi… -34.0 151.
2 0013d553-af89… Lialis burton… https://biodi… -34.0 140.
3 0027660b-3e75… Aprasia parap… https://biodi… -35.4 149
4 0031c737-922a… Pygopus lepid… https://biodi… -36.0 150.
5 0055b6e3-11e1… Lialis burton… https://biodi… -14.1 143.
6 005c83ad-709e… Delma molleri https://biodi… -35.1 139.
7 005dfdd2-4a93… Lialis burton… https://biodi… -29.1 152.
8 0063af2c-e070… Pygopus lepid… https://biodi… -34.9 139.
9 006de4b6-880d… Delma austral… https://biodi… -33.6 141.
10 007a4f5f-b2ee… Lialis burton… https://biodi… -16.7 146.
# ℹ 3,972 more rows
# ℹ 3 more variables: eventDate <dttm>, occurrenceStatus <chr>,
# dataResourceName <chr>
You can apply this concept of filtering to any list of species, or other fields, that you would like to exclude.
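Exact matching with `%in%` is sensitive to differences in case and whitespace, so it can help to normalise both sets of names before comparing them. Here is a minimal sketch on made-up data; `records`, `exclusion_list`, and `clean_name()` are hypothetical names used only for illustration.

```r
library(dplyr)
library(stringr)

# Made-up occurrence records with inconsistent case and spacing
records <- tibble::tibble(
  recordID = c("a", "b", "c", "d"),
  scientificName = c(
    "Lialis burtonis", "Delma impar", "DELMA IMPAR", "Pygopus  lepidopodus"
  )
)

# Made-up list of names to exclude
exclusion_list <- tibble::tibble(name = "Delma impar")

# Hypothetical helper: lowercase and collapse repeated whitespace
clean_name <- function(x) str_to_lower(str_squish(x))

# Flag matches first so they can be reviewed before anything is removed
records_flagged <- records |>
  mutate(on_list = clean_name(scientificName) %in% clean_name(exclusion_list$name))

records_kept <- records_flagged |>
  filter(!on_list)

records_kept
```

Flagging before filtering leaves a column you can inspect, which makes it easier to confirm that only the intended taxa are excluded.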
8.4 Taxonomic names matching
8.4.1 When a name match returns a different result than expected
Name matching can be tricky. Problems can be caused by taxonomy itself (e.g. changes in scientific names, synonyms, or homonyms) or by simple errors (e.g. spelling errors in scientific names, or the incorrect authority being attached to a scientific name). Sometimes a name might match to an incorrect result, usually because of an accidental text match. Other times, a name might match to a correct result, but the result is unexpected (and might cause confusion later if it goes unrecognised). This scenario usually occurs when a taxonomic name is no longer accepted. The following sections elaborate on each scenario.
Match returns incorrect result
Returning an incorrect match to a taxonomic name most often happens when doing a taxonomic search, like with search_taxa(). For example, let’s say we are interested in returning taxonomic information for the Tasmanian population of the Eastern Barred Bandicoot (Perameles gunnii gunnii). If we do a quick search for the two unique words in its name we return a search result, but the result is information on the species Perameles gunnii rather than the subspecies.
search_taxa("perameles gunnii")
# A tibble: 1 × 15
search_term scientific_name scientific_name_auth…¹ taxon_concept_id rank
<chr> <chr> <chr> <chr> <chr>
1 perameles gunnii Perameles gunn… Gray, 1838 https://biodive… spec…
# ℹ abbreviated name: ¹scientific_name_authorship
# ℹ 10 more variables: match_type <chr>, kingdom <chr>, phylum <chr>,
# class <chr>, order <chr>, family <chr>, genus <chr>, species <chr>,
# vernacular_name <chr>, issues <chr>
Searching for the entire name, however, provides the subspecies information.
search_taxa("perameles gunnii gunnii")
# A tibble: 1 × 14
search_term scientific_name taxon_concept_id rank match_type kingdom phylum
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 perameles gu… Perameles gunn… ALA_DR22912_429 subs… exactMatch Animal… Chord…
# ℹ 7 more variables: class <chr>, order <chr>, family <chr>, genus <chr>,
# species <chr>, vernacular_name <chr>, issues <chr>
Although fixing our search is easy in this situation, there are instances when this error might be less obvious. For example, we might wish to use a function like atlas_species() to return taxonomic information about many species all at once. However, atlas_species() matches to species-level and will return species information by default.
galah_call() |>
identify("perameles gunnii gunnii") |>
atlas_species() |>
select(1:4)
# A tibble: 1 × 4
taxon_concept_id species_name scientific_name_auth…¹ taxon_rank
<chr> <chr> <chr> <chr>
1 https://biodiversity.org.au/af… Perameles g… Gray, 1838 species
# ℹ abbreviated name: ¹scientific_name_authorship
We must specify group_by(taxonConceptID) to avoid returning species-level information only.
galah_call() |>
identify("perameles gunnii gunnii") |>
group_by(taxonConceptID) |>
atlas_species() |>
select(1:4)
# A tibble: 1 × 4
taxon_concept_id species_name scientific_name_authorship taxon_rank
<chr> <chr> <lgl> <chr>
1 ALA_DR22912_429 Perameles gunnii gunnii NA subspecies
Be mindful about whether any taxonomic names you are interested in might return information at the wrong taxonomic level by mistake. Amending searches or replacing incorrect matches with correct information may be required to ensure taxonomic matches are correct.
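When checking many names at once, a quick screen is to compare each search term with the name it matched. The sketch below uses a hand-written table that mimics the search_term, scientific_name and rank columns shown above; it is not downloaded data, and the second row is an invented mismatch included to show what the check catches.

```r
library(dplyr)
library(stringr)

# Hand-written stand-in for name-matching results (not real output)
search_results <- tibble::tibble(
  search_term = c("perameles gunnii", "perameles gunnii gunnii"),
  scientific_name = c("Perameles gunnii", "Perameles gunnii"),
  rank = c("species", "species")
)

# Keep rows where the matched name differs from what we searched for
flagged <- search_results |>
  filter(str_to_lower(search_term) != str_to_lower(scientific_name))

flagged
```

Any rows returned deserve a manual look: some will be legitimate synonym updates, while others will be matches at the wrong taxonomic level.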
Match returns correct but unexpected result
It’s also possible that a taxonomic name returns a different match to what we might expect, but this is the correct match for the name. For example, a search for the eucalypt Eucalyptus x studleyensis returns the genus Eucalyptus. This is the correct result. Even though Eucalyptus x studleyensis is recognised as a threatened species in Victoria, it is not recognised as a separate species by the Australian Plant Census (APC), the combined view of the herbaria of Australia, so a search will correctly match to the higher taxonomic level of genus.
search_taxa("Eucalyptus x studleyensis")
# A tibble: 1 × 14
search_term scientific_name scientific_name_auth…¹ taxon_concept_id rank
<chr> <chr> <chr> <chr> <chr>
1 Eucalyptus x st… Eucalyptus L'Hér. https://id.biod… genus
# ℹ abbreviated name: ¹scientific_name_authorship
# ℹ 9 more variables: match_type <chr>, kingdom <chr>, phylum <chr>,
# class <chr>, order <chr>, family <chr>, genus <chr>, vernacular_name <chr>,
# issues <chr>
An issue arises because existing regional lists like the Victorian threatened species list (dr655) still contain the name Eucalyptus x studleyensis because it is still considered a threatened species at state level.
search_all(lists, "Victoria : Conservation Status") |>
search_values("Eucalyptus x studleyensis")
• Showing values for 'dr655'.
# A tibble: 1 × 6
id name commonName scientificName lsid dataResourceUid
<int> <chr> <chr> <chr> <chr> <chr>
1 7064311 Eucalyptus X studleye… Studley P… Eucalyptus http… dr655
Despite this species list being authoritative, if we attempt to use it to search for matching occurrence records of the name Eucalyptus x studleyensis on the Atlas of Living Australia, we will inevitably match to the entire genus Eucalyptus and unintentionally return occurrence records for the entire genus.
galah_call() |>
identify("Eucalyptus x studleyensis") |>
group_by(scientificName) |>
atlas_counts()
# A tibble: 1,193 × 2
scientificName count
<chr> <int>
1 Eucalyptus 45068
2 Eucalyptus obliqua 43136
3 Eucalyptus camaldulensis 42045
4 Eucalyptus sieberi 25230
5 Eucalyptus melliodora 24275
6 Eucalyptus crebra 23180
7 Eucalyptus globoidea 22715
8 Eucalyptus macrorhyncha 21885
9 Eucalyptus tereticornis 21535
10 Eucalyptus muelleriana 18831
# ℹ 1,183 more rows
The solution here is a little less clear. Depending on the goals of the investigation, it might be easiest to remove this species from the list entirely to avoid mismatching. Alternatively, limiting a search for Eucalyptus to a specific region of interest might be a more targeted solution, though careful cleaning of unwanted records would still be required.
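If you choose to remove such names, one approach is to check the rank each list name matched to and keep only names matched at species level or below. The following sketch assumes a hand-written table standing in for the result of matching a list's names with search_taxa(); `list_names`, `name_matches`, and `safe_names` are hypothetical objects.

```r
library(dplyr)

# Two names taken from lists shown earlier in this chapter
list_names <- c("Eucalyptus x studleyensis", "Delma impar")

# Hand-written stand-in for the matched results (not downloaded)
name_matches <- tibble::tibble(
  search_term = list_names,
  scientific_name = c("Eucalyptus", "Delma impar"),
  rank = c("genus", "species")
)

# Keep only names that matched at species or subspecies rank,
# so a later query cannot expand to an entire genus
safe_names <- name_matches |>
  filter(rank %in% c("species", "subspecies")) |>
  pull(search_term)

safe_names
```

Names dropped at this step are worth recording, since they may still need to be handled manually.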
8.4.2 Missing higher taxonomic information
It’s not uncommon to receive data that are missing information at some taxonomic levels, but this can make it tricky to summarise data or create visualisations based on taxonomy later on. It can also interfere with names matching processes if there are homonyms.
As an example, here is a small sample of our inverts dataset. You’ll notice that we only have information on scientific_name, class, and family.
inverts_sample <- inverts |>
slice(1234:1271)
inverts_sample |> print(n = 5)
# A tibble: 38 × 9
record_id scientific_name class family year latitude longitude sensitive
<chr> <chr> <chr> <chr> <int> <dbl> <dbl> <int>
1 76213a64-ed41… Helicotylenchu… chro… hoplo… NA -23.1 151. 0
2 e74ec2f0-4cef… Iravadia (Irav… gast… irava… 1903 -16.5 140. 0
3 340c2b82-6b85… Monomorium bic… inse… formi… 1998 -24.7 150. 0
4 e7dc1fa1-6524… Saprosites men… inse… scara… 2004 -43.1 147. 0
5 316ad303-efc6… Amitermes darw… inse… termi… 1953 -21.9 118. 0
# ℹ 33 more rows
# ℹ 1 more variable: project <chr>
One way to fill in values at the missing taxonomic levels (e.g. phylum, order) is to get this information from a data infrastructure like the ALA, which has its own taxonomic backbone. We’ll start by extracting the scientific names of taxa in inverts_sample and saving these as taxa_sample_names.
taxa_sample_names <- inverts_sample |>
select(scientific_name) |>
distinct() |>
pull()
taxa_sample_names[1:5] # first 5 names
[1] "Helicotylenchus multicinctus"        "Iravadia (Iravadia) carpentariensis"
[3] "Monomorium bicorne" "Saprosites mendax"
[5] "Amitermes darwini"
We can then search for those names in the ALA using search_taxa() from galah. We’ll save the results in names_matches_ala. The results contain complete taxonomic information from kingdom to species.
Anytime you search for taxonomic matches using names, it’s good practice to double check the URLs returned in taxon_concept_id to make sure your results match the names you expected!
names_matches_ala <- search_taxa(taxa_sample_names)
names_matches_ala
# A tibble: 38 × 15
search_term scientific_name scientific_name_auth…¹ taxon_concept_id rank
<chr> <chr> <chr> <chr> <chr>
1 Helicotylenchu… Helicotylenchu… (Cobb, 1893) https://biodive… spec…
2 Iravadia (Irav… Iravadia (Irav… (Hedley, 1912) https://biodive… spec…
3 Monomorium bic… Chelaner bicor… (Forel, 1907) https://biodive… spec…
4 Saprosites men… Saprosites men… (Blackburn, 1892) https://biodive… spec…
5 Amitermes darw… Amitermes darw… (Hill, 1922) https://biodive… spec…
6 Schedorhinoter… Schedorhinoter… (Hill, 1933) https://biodive… spec…
7 Sorama bicolor Sorama bicolor Walker, 1855 https://biodive… spec…
8 Windbalea warr… Windbalea warr… Rentz, 1993 https://biodive… spec…
9 Tholymis tilla… Tholymis tilla… (Fabricius, 1798) https://biodive… spec…
10 Costellipitar … Costellipitar … (Hedley, 1923) https://biodive… spec…
# ℹ 28 more rows
# ℹ abbreviated name: ¹scientific_name_authorship
# ℹ 10 more variables: match_type <chr>, kingdom <chr>, phylum <chr>,
# class <chr>, order <chr>, family <chr>, genus <chr>, species <chr>,
# vernacular_name <chr>, issues <chr>
Now we can merge this information into our inverts_sample dataset.
First, let’s select the columns from names_matches_ala that we want, and rename those so we can differentiate between the columns in inverts_sample and the ones we just downloaded using galah. We’ll suffix the columns in names_matches_ala with "_ala".
names_matches_renamed <- names_matches_ala |>
select(scientific_name, kingdom:species) |>
rename_with(\(column_name) paste0(column_name, "_ala"),
kingdom:species)
names_matches_renamed
The rename_with() line uses shorthand to write a function that appends a suffix to a column name. An equivalent way of writing this is:
function(column_name) {paste0(column_name, "_ala")}
This is applied to each column name from kingdom to species in the names_matches_ala dataframe.
# A tibble: 38 × 8
scientific_name kingdom_ala phylum_ala class_ala order_ala family_ala
<chr> <chr> <chr> <chr> <chr> <chr>
1 Helicotylenchus multic… Animalia Nematoda Chromado… Panagrol… Hoplolaim…
2 Iravadia (Iravadia) ca… Animalia Mollusca Gastropo… Hypsogas… Iravadiid…
3 Chelaner bicorne Animalia Arthropoda Insecta Hymenopt… Formicidae
4 Saprosites mendax Animalia Arthropoda Insecta Coleopte… Scarabaei…
5 Amitermes darwini Animalia Arthropoda Insecta Blattodea Termitidae
6 Schedorhinotermes actu… Animalia Arthropoda Insecta Blattodea Rhinoterm…
7 Sorama bicolor Animalia Arthropoda Insecta Lepidopt… Notodonti…
8 Windbalea warrooa Animalia Arthropoda Insecta Orthopte… Tettigoni…
9 Tholymis tillarga Animalia Arthropoda Insecta Odonata Libelluli…
10 Costellipitar inconsta… Animalia Mollusca Bivalvia Cardiida Veneridae
# ℹ 28 more rows
# ℹ 2 more variables: genus_ala <chr>, species_ala <chr>
Now let’s join our matched names in names_matches_renamed to our inverts_sample data. This adds all higher taxonomic names columns to our inverts_sample data.
inverts_sample_with_ranks <- names_matches_renamed |>
right_join(inverts_sample,
join_by(scientific_name == scientific_name))
inverts_sample_with_ranks
# A tibble: 38 × 16
scientific_name kingdom_ala phylum_ala class_ala order_ala family_ala
<chr> <chr> <chr> <chr> <chr> <chr>
1 Helicotylenchus multic… Animalia Nematoda Chromado… Panagrol… Hoplolaim…
2 Iravadia (Iravadia) ca… Animalia Mollusca Gastropo… Hypsogas… Iravadiid…
3 Saprosites mendax Animalia Arthropoda Insecta Coleopte… Scarabaei…
4 Amitermes darwini Animalia Arthropoda Insecta Blattodea Termitidae
5 Schedorhinotermes actu… Animalia Arthropoda Insecta Blattodea Rhinoterm…
6 Sorama bicolor Animalia Arthropoda Insecta Lepidopt… Notodonti…
7 Windbalea warrooa Animalia Arthropoda Insecta Orthopte… Tettigoni…
8 Tholymis tillarga Animalia Arthropoda Insecta Odonata Libelluli…
9 Costellipitar inconsta… Animalia Mollusca Bivalvia Cardiida Veneridae
10 Placamen lamellosum Animalia Mollusca Bivalvia Cardiida Veneridae
# ℹ 28 more rows
# ℹ 10 more variables: genus_ala <chr>, species_ala <chr>, record_id <chr>,
# class <chr>, family <chr>, year <int>, latitude <dbl>, longitude <dbl>,
# sensitive <int>, project <chr>
We can check the join by confirming that names in the original family column are identical to those in the new family_ala column. The code below returns any rows where the two differ. Be aware that rows in inverts_sample that did not match a scientific_name have NA in the new columns, and filter() drops rows where a comparison evaluates to NA, so those rows need a separate check with is.na().
Nothing is returned, meaning that wherever a match was found the names in family_ala and family agree.
inverts_sample_with_ranks |>
select(scientific_name, family_ala, family) |>
mutate(family = stringr::str_to_sentence(family)) |> # match formatting
filter(family_ala != family)
# A tibble: 0 × 3
# ℹ 3 variables: scientific_name <chr>, family_ala <chr>, family <chr>
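Because a `!=` comparison against NA evaluates to NA, and filter() keeps only rows where the condition is TRUE, an unmatched name would pass through the check above unnoticed. The following is a minimal sketch using made-up toy data (toy_ranks, the "Unmatched name" row and the "Somefamily" value are purely illustrative, not taken from inverts_sample) that pairs the mismatch check with an explicit NA check.

```r
library(dplyr)

# Toy data (illustrative only): the third row stands in for a name
# that found no match during the join
toy_ranks <- tibble(
  scientific_name = c("Amitermes darwini", "Placamen lamellosum", "Unmatched name"),
  family          = c("Termitidae", "Veneridae", "Somefamily"),
  family_ala      = c("Termitidae", "Veneridae", NA)
)

# Mismatch check: NA != "Somefamily" evaluates to NA, so filter() drops the row
mismatches <- toy_ranks |>
  filter(family_ala != family)

# Explicit check for names that returned no match in the join
unmatched <- toy_ranks |>
  filter(is.na(family_ala))

nrow(mismatches)
nrow(unmatched)
```

Running both checks separates names that matched but disagree from names that did not match at all.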
8.4.3 Identifying mismatches in species lists
Higher taxonomy from different data providers may not always match. If this is the case, you will need to back-fill the higher taxonomic ranks using data from your preferred taxonomic naming authority.
Let’s use Eucalyptus observations that we downloaded from the ALA as an example.
eucalypts
# A tibble: 13,225 × 16
recordID scientificName taxonConceptID decimalLatitude decimalLongitude
<chr> <chr> <chr> <dbl> <dbl>
1 0009ba6a-8e8e… Eucalyptus re… https://id.bi… -17.6 145.
2 000b496d-61ad… Eucalyptus ma… https://id.bi… -35.5 149.
3 0014f2cd-5985… Eucalyptus ro… https://id.bi… -35.5 149.
4 001ee2e9-d353… Eucalyptus pu… https://id.bi… -32.8 152.
5 002771aa-02b9… Eucalyptus mu… https://id.bi… -38.5 147.
6 002b74ab-b8ce… Eucalyptus ca… https://id.bi… -34.2 141.
7 002bde6c-3a7f… Eucalyptus co… https://id.bi… -30.1 146.
8 002cb2ce-c8a1… Eucalyptus ca… https://id.bi… -37.1 141.
9 0031022c-8e9e… Eucalyptus la… https://id.bi… -34.4 142.
10 00407506-383e… Eucalyptus pa… https://id.bi… -34.1 151.
# ℹ 13,215 more rows
# ℹ 11 more variables: eventDate <dttm>, occurrenceStatus <chr>,
# dataResourceName <chr>, kingdom <chr>, phylum <chr>, class <chr>,
# order <chr>, family <chr>, genus <chr>, species <chr>, taxonRank <chr>
This occurrence data contains observations of over 300 species.
eucalypts |>
filter(taxonRank != "genus") |>
distinct(species) |>
count(name = "n_species")
# A tibble: 1 × 1
n_species
<int>
1 305
Let’s say we want to compare these observations to data retrieved from outside the ALA, and we decide to use GBIF’s¹ taxonomy. ALA data uses a taxonomic backbone based on the National Species List, maintained by the Australian Biological Resources Study. Because this backbone differs from GBIF’s, we will need to amend our taxonomic names to match GBIF’s.
Let’s go through the steps to match our taxonomy in our eucalypts data to GBIF’s taxonomy. We can download a species list of Eucalyptus from GBIF. This list returns nearly 1,700 species names.
Download the gbif_species_list.parquet file from the Data in this book chapter.
gbif_species_list
# A tibble: 1,695 × 22
taxonKey scientificName acceptedTaxonKey acceptedScientificName
* <dbl> <chr> <dbl> <chr>
1 3176716 Eucalyptus calcicola Brooker 3176716 Eucalyptus calcicola …
2 3176802 Eucalyptus salicola Brooker 3176802 Eucalyptus salicola B…
3 3176920 Eucalyptus crebra F.Muell. 3176920 Eucalyptus crebra F.M…
4 3177269 Eucalyptus stricta Sieber e… 3177269 Eucalyptus stricta Si…
5 3717566 Eucalyptus alpina Lindl. 3717566 Eucalyptus alpina Lin…
6 8164544 Eucalyptus hemiphloia var. … 7908015 Eucalyptus albens Miq.
7 9292334 Eucalyptus goniocalyx subsp… 9292334 Eucalyptus goniocalyx…
8 11127669 Eucalyptus griffithii Maiden 11127669 Eucalyptus griffithii…
9 3176297 Eucalyptus camfieldii Maiden 3176297 Eucalyptus camfieldii…
10 3176473 Eucalyptus macrorhyncha sub… 3176473 Eucalyptus macrorhync…
# ℹ 1,685 more rows
# ℹ 18 more variables: numberOfOccurrences <dbl>, taxonRank <chr>,
# taxonomicStatus <chr>, kingdom <chr>, kingdomKey <dbl>, phylum <chr>,
# phylumKey <dbl>, class <chr>, classKey <dbl>, order <chr>, orderKey <dbl>,
# family <chr>, familyKey <dbl>, genus <chr>, genusKey <dbl>, species <chr>,
# speciesKey <dbl>, iucnRedListCategory <chr>
To investigate whether the complete taxonomy (from kingdom to species) matches between our ALA data and the GBIF species list, let’s extract the columns with taxonomic information from our eucalypts dataframe and our gbif_species_list and compare them.
First, we can select columns containing taxonomic names in our ALA eucalypts dataframe (kingdom to species) and use distinct() to remove duplicate rows. This will leave us with one row for each distinct species in our dataset (very similar to a species list).
ala_names <- eucalypts |>
select(kingdom:species) |>
distinct()
ala_names
# A tibble: 306 × 7
kingdom phylum class order family genus species
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Plantae Charophyta Equisetopsida Myrtales Myrtaceae Eucalyptus Eucalyptus re…
2 Plantae Charophyta Equisetopsida Myrtales Myrtaceae Eucalyptus Eucalyptus ma…
3 Plantae Charophyta Equisetopsida Myrtales Myrtaceae Eucalyptus Eucalyptus ro…
4 Plantae Charophyta Equisetopsida Myrtales Myrtaceae Eucalyptus Eucalyptus pu…
5 Plantae Charophyta Equisetopsida Myrtales Myrtaceae Eucalyptus Eucalyptus mu…
6 Plantae Charophyta Equisetopsida Myrtales Myrtaceae Eucalyptus Eucalyptus ca…
7 Plantae Charophyta Equisetopsida Myrtales Myrtaceae Eucalyptus Eucalyptus co…
8 Plantae Charophyta Equisetopsida Myrtales Myrtaceae Eucalyptus Eucalyptus la…
9 Plantae Charophyta Equisetopsida Myrtales Myrtaceae Eucalyptus Eucalyptus pa…
10 Plantae Charophyta Equisetopsida Myrtales Myrtaceae Eucalyptus Eucalyptus po…
# ℹ 296 more rows
Now let’s filter gbif_species_list to only “accepted” names² and select the same taxonomic names columns.
gbif_names <- gbif_species_list |>
filter(taxonomicStatus == "ACCEPTED") |> # accepted names
select(kingdom:species) |>
select(!contains("Key")) |> # remove Key columns
distinct()
gbif_names
1. We added distinct() to remove duplicate rows of species names. These duplicates appear because there might be multiple subspecies under the same species name. For example, Eucalyptus mannifera has 4 subspecies; Eucalyptus wimmerensis has 5. We aren’t interested in identifying species at that level, and so we remove these duplicates to simplify our species list.
# A tibble: 989 × 7
kingdom phylum class order family genus species
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Plantae Tracheophyta Magnoliopsida Myrtales Myrtaceae Eucalyptus Eucalyptus …
2 Plantae Tracheophyta Magnoliopsida Myrtales Myrtaceae Eucalyptus Eucalyptus …
3 Plantae Tracheophyta Magnoliopsida Myrtales Myrtaceae Eucalyptus Eucalyptus …
4 Plantae Tracheophyta Magnoliopsida Myrtales Myrtaceae Eucalyptus Eucalyptus …
5 Plantae Tracheophyta Magnoliopsida Myrtales Myrtaceae Eucalyptus Eucalyptus …
6 Plantae Tracheophyta Magnoliopsida Myrtales Myrtaceae Eucalyptus Eucalyptus …
7 Plantae Tracheophyta Magnoliopsida Myrtales Myrtaceae Eucalyptus Eucalyptus …
8 Plantae Tracheophyta Magnoliopsida Myrtales Myrtaceae Eucalyptus Eucalyptus …
9 Plantae Tracheophyta Magnoliopsida Myrtales Myrtaceae Eucalyptus Eucalyptus …
10 Plantae Tracheophyta Magnoliopsida Myrtales Myrtaceae Eucalyptus Eucalyptus …
# ℹ 979 more rows
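To see why distinct() matters in this step, consider a minimal sketch with toy data. Here toy_gbif is a hypothetical stand-in for gbif_species_list, and the subspecies labels "subsp. one" and "subsp. two" are placeholders rather than real names.

```r
library(dplyr)

# Toy stand-in for a GBIF species list (illustrative only):
# three rows share the same value in the species column
toy_gbif <- tibble(
  scientificName = c(
    "Eucalyptus mannifera",
    "Eucalyptus mannifera subsp. one", # placeholder subspecies label
    "Eucalyptus mannifera subsp. two", # placeholder subspecies label
    "Eucalyptus crebra"
  ),
  genus   = "Eucalyptus",
  species = c(rep("Eucalyptus mannifera", 3), "Eucalyptus crebra")
)

# Selecting only the rank columns leaves repeated rows...
species_rows <- toy_gbif |>
  select(genus, species)

# ...which distinct() collapses to one row per species
species_unique <- species_rows |>
  distinct()

nrow(species_rows)
nrow(species_unique)
```

Without distinct(), a later join on species would duplicate every matching row once per subspecies.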
We can merge our two names data frames together, matching by species name, which will allow us to compare them. We’ll distinguish which columns came from each data frame by appending an "_ala" or "_gbif" suffix to each column name.
matched_names <- ala_names |>
left_join(gbif_names,
join_by(species == species),
suffix = c("_ala", "_gbif")) |>
select(species, everything()) # reorder columns
matched_names now contains the full taxonomy from the ALA and GBIF for all matched species³.
rmarkdown::paged_table( # print paged table
matched_names
)
We are now ready to compare taxonomic names to find mismatches. We can start by finding any species with a mismatch in their kingdom name by filtering to return rows where kingdom_ala and kingdom_gbif are not equal. Our returned tibble is empty, meaning there were no mismatches.
matched_names |>
filter(kingdom_ala != kingdom_gbif)
# A tibble: 0 × 13
# ℹ 13 variables: species <chr>, kingdom_ala <chr>, phylum_ala <chr>,
# class_ala <chr>, order_ala <chr>, family_ala <chr>, genus_ala <chr>,
# kingdom_gbif <chr>, phylum_gbif <chr>, class_gbif <chr>, order_gbif <chr>,
# family_gbif <chr>, genus_gbif <chr>
If we do the same for phylum and class, however, we return quite a few results. It turns out that there is a difference between the ALA and GBIF in their higher taxonomic ranks of Eucalyptus plants.
matched_names |>
filter(phylum_ala != phylum_gbif) |>
select(species, phylum_ala, phylum_gbif)
# A tibble: 297 × 3
species phylum_ala phylum_gbif
<chr> <chr> <chr>
1 Eucalyptus resinifera Charophyta Tracheophyta
2 Eucalyptus mannifera Charophyta Tracheophyta
3 Eucalyptus punctata Charophyta Tracheophyta
4 Eucalyptus muelleriana Charophyta Tracheophyta
5 Eucalyptus camaldulensis Charophyta Tracheophyta
6 Eucalyptus coolabah Charophyta Tracheophyta
7 Eucalyptus largiflorens Charophyta Tracheophyta
8 Eucalyptus parramattensis Charophyta Tracheophyta
9 Eucalyptus polyanthemos Charophyta Tracheophyta
10 Eucalyptus dalrympleana Charophyta Tracheophyta
# ℹ 287 more rows
matched_names |>
filter(class_ala != class_gbif) |>
select(species, class_ala, class_gbif)
# A tibble: 297 × 3
species class_ala class_gbif
<chr> <chr> <chr>
1 Eucalyptus resinifera Equisetopsida Magnoliopsida
2 Eucalyptus mannifera Equisetopsida Magnoliopsida
3 Eucalyptus punctata Equisetopsida Magnoliopsida
4 Eucalyptus muelleriana Equisetopsida Magnoliopsida
5 Eucalyptus camaldulensis Equisetopsida Magnoliopsida
6 Eucalyptus coolabah Equisetopsida Magnoliopsida
7 Eucalyptus largiflorens Equisetopsida Magnoliopsida
8 Eucalyptus parramattensis Equisetopsida Magnoliopsida
9 Eucalyptus polyanthemos Equisetopsida Magnoliopsida
10 Eucalyptus dalrympleana Equisetopsida Magnoliopsida
# ℹ 287 more rows
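Rather than filtering one rank at a time, the comparisons above could be condensed into a single summary. The sketch below is one possible approach, not the method used in this chapter. It uses a small toy table (matched_toy) in place of matched_names, and "Eucalyptus hypothetica" is a made-up name standing in for a species with no GBIF match.

```r
library(dplyr)

# Toy stand-in for matched_names (illustrative values only);
# the last row mimics a species with no GBIF match
matched_toy <- tibble(
  species      = c("Eucalyptus resinifera", "Eucalyptus mannifera", "Eucalyptus hypothetica"),
  kingdom_ala  = c("Plantae", "Plantae", "Plantae"),
  kingdom_gbif = c("Plantae", "Plantae", NA),
  phylum_ala   = c("Charophyta", "Charophyta", "Charophyta"),
  phylum_gbif  = c("Tracheophyta", "Tracheophyta", NA)
)

ranks <- c("kingdom", "phylum")

# For each rank, count rows where the two sources disagree, and rows
# where the GBIF value is missing (which a != comparison would skip)
rank_summary <- tibble(
  rank = ranks,
  n_mismatch = vapply(
    ranks,
    function(r) sum(matched_toy[[paste0(r, "_ala")]] != matched_toy[[paste0(r, "_gbif")]], na.rm = TRUE),
    integer(1),
    USE.NAMES = FALSE
  ),
  n_missing_gbif = vapply(
    ranks,
    function(r) sum(is.na(matched_toy[[paste0(r, "_gbif")]])),
    integer(1),
    USE.NAMES = FALSE
  )
)

rank_summary
```

Reporting missing values next to mismatches makes it clear that a count of zero mismatches does not by itself mean every name was matched.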
In GBIF, Eucalyptus sits in the phylum Tracheophyta and the class Magnoliopsida…
# Use GBIF
galah_config(atlas = "gbif")
# Search for taxonomic information
gbif_taxa <- search_taxa("Eucalyptus")
# Show relevant columns
gbif_taxa |>
select(scientific_name, phylum, class, order)
# A tibble: 1 × 4
scientific_name phylum class order
<chr> <chr> <chr> <chr>
1 Eucalyptus L'Hér. Tracheophyta Magnoliopsida Myrtales
…whereas in the ALA, Eucalyptus sits in the phylum Charophyta and the class Equisetopsida.
# Switch to download from the ALA
galah_config(atlas = "ala")
# Search for taxonomic information
ala_taxa <- search_taxa("Eucalyptus")
# Show relevant columns
ala_taxa |>
select(scientific_name, phylum, class, order)
# A tibble: 1 × 4
scientific_name phylum class order
<chr> <chr> <chr> <chr>
1 Eucalyptus Charophyta Equisetopsida Myrtales
We might not know about this issue when we first decide to match GBIF’s taxonomic names to our data. So it’s important to investigate how well these names match (and where there are mismatches) before merging them to our complete eucalypts data.
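As part of that investigation, it can help to list the names that found no match at all. The sketch below shows one way to do this with anti_join(), followed by an optional back-fill with coalesce(). The tables ala_toy and gbif_toy are toy stand-ins for ala_names and gbif_names, and "Eucalyptus hypothetica" is a made-up name used only to create an unmatched row.

```r
library(dplyr)

# Toy stand-ins for ala_names and gbif_names (illustrative only)
ala_toy <- tibble(
  species = c("Eucalyptus resinifera", "Eucalyptus mannifera", "Eucalyptus hypothetica"),
  phylum  = c("Charophyta", "Charophyta", "Charophyta")
)
gbif_toy <- tibble(
  species = c("Eucalyptus resinifera", "Eucalyptus mannifera"),
  phylum  = c("Tracheophyta", "Tracheophyta")
)

# Names present in the ALA list with no counterpart in the GBIF list
unmatched_names <- ala_toy |>
  anti_join(gbif_toy, join_by(species))

# After a left join, unmatched rows carry NA in the GBIF columns;
# coalesce() falls back to the ALA value where the GBIF value is missing
backfilled <- ala_toy |>
  left_join(gbif_toy, join_by(species), suffix = c("_ala", "_gbif")) |>
  mutate(phylum = coalesce(phylum_gbif, phylum_ala))

unmatched_names
backfilled
```

Whether to fall back to the ALA value or to resolve each unmatched name by hand is a judgment call, because back-filling mixes two taxonomies in one column.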
Now that we are aware of the differences between GBIF and ALA names, if we would like to use GBIF’s taxonomic names, we can join the columns with the suffix _gbif to our eucalypt occurrences data, and then replace the old taxonomic names columns with the GBIF names columns⁴.
eucalypts_updated_names <- matched_names |>
# select columns and join to eucalypts data
select(species, kingdom_gbif:genus_gbif) |>
right_join(eucalypts,
join_by(species == species)) |>
select(-(kingdom:genus)) |> # remove ALA taxonomic columns
rename_with( # rename columns...
~ str_remove(., "_gbif"), # ...by removing "_gbif" suffix
kingdom_gbif:genus_gbif
)
eucalypts_updated_names |>
rmarkdown::paged_table() # paged table output
8.5 Detecting synonyms
Scientific discoveries and advances in our understanding of evolutionary relationships can cause changes in taxonomy. Species can be renamed, split into several species, or be placed in different genera or even families. Taxonomic synonyms are created as a result of these changes. Taxonomic synonyms refer to two or more scientific names that denote the same taxon. It can be difficult to spot synonyms in your dataset, but ignoring them can result in errors during analysis, such as artificially inflated numbers of taxa or assuming misleading relationships among taxa.
Here are some examples of synonyms.
Ranoidea caerulea is a synonym of Litoria caerulea, a species of frog. To date, R. caerulea has been the accepted name globally, but L. caerulea is the preferred scientific name used in Australia. Both are now synonyms of Pelodryas caerulea, following a generic revision published in 2025⁵, a name that has not yet been widely adopted. The genus and species returned differ between GBIF and the ALA.
galah_config(atlas = "gbif")
gbif_taxa <- search_taxa("Litoria caerulea")
gbif_taxa |>
select(scientific_name, genus, species)
# A tibble: 1 × 3
scientific_name genus species
<chr> <chr> <chr>
1 Litoria caerulea (White, 1790) Ranoidea Ranoidea caerulea
galah_config(atlas = "ala")
ala_taxa <- search_taxa("Litoria caerulea")
ala_taxa |>
select(scientific_name, genus, species)
# A tibble: 1 × 3
scientific_name genus species
<chr> <chr> <chr>
1 Litoria caerulea Litoria Litoria caerulea
Commersonia rosea is a synonym of Androcalva rosea, a species of mallow. The scientific name returned differs between GBIF and the ALA (ALA autocorrects this synonym whereas GBIF retains its synonym name).
galah_config(atlas = "gbif")
gbif_taxa <- search_taxa("commersonia rosea")
gbif_taxa |>
select(scientific_name, genus, species)
# A tibble: 1 × 3
scientific_name genus species
<chr> <chr> <chr>
1 Commersonia rosea S.A.J.Bell & L.M.Copel. Androcalva Androcalva rosea
galah_config(atlas = "ala")
ala_taxa <- search_taxa("commersonia rosea")
ala_taxa |>
select(scientific_name, genus, species)
# A tibble: 1 × 3
scientific_name genus species
<chr> <chr> <chr>
1 Androcalva rosea Androcalva Androcalva rosea
In the above examples, taxonomic searches match differently in GBIF and the ALA because each accepts a different preferred name. Using tools like search_taxa() in galah is a useful way to check whether a search returns the taxonomic information you expect.
8.5.1 Checking for synonyms
Some species lists return accepted names and synonyms. For example, here is a species list of Eucalyptus downloaded from GBIF (which we used earlier in the chapter).
Download the gbif_species_list.parquet file from the Data in this book chapter.
gbif_species_list
# A tibble: 1,695 × 22
taxonKey scientificName acceptedTaxonKey acceptedScientificName
* <dbl> <chr> <dbl> <chr>
1 3176716 Eucalyptus calcicola Brooker 3176716 Eucalyptus calcicola …
2 3176802 Eucalyptus salicola Brooker 3176802 Eucalyptus salicola B…
3 3176920 Eucalyptus crebra F.Muell. 3176920 Eucalyptus crebra F.M…
4 3177269 Eucalyptus stricta Sieber e… 3177269 Eucalyptus stricta Si…
5 3717566 Eucalyptus alpina Lindl. 3717566 Eucalyptus alpina Lin…
6 8164544 Eucalyptus hemiphloia var. … 7908015 Eucalyptus albens Miq.
7 9292334 Eucalyptus goniocalyx subsp… 9292334 Eucalyptus goniocalyx…
8 11127669 Eucalyptus griffithii Maiden 11127669 Eucalyptus griffithii…
9 3176297 Eucalyptus camfieldii Maiden 3176297 Eucalyptus camfieldii…
10 3176473 Eucalyptus macrorhyncha sub… 3176473 Eucalyptus macrorhync…
# ℹ 1,685 more rows
# ℹ 18 more variables: numberOfOccurrences <dbl>, taxonRank <chr>,
# taxonomicStatus <chr>, kingdom <chr>, kingdomKey <dbl>, phylum <chr>,
# phylumKey <dbl>, class <chr>, classKey <dbl>, order <chr>, orderKey <dbl>,
# family <chr>, familyKey <dbl>, genus <chr>, genusKey <dbl>, species <chr>,
# speciesKey <dbl>, iucnRedListCategory <chr>
GBIF species lists include a taxonomicStatus column that indicates whether a taxonomic name is accepted or a synonym. A good example is the list of names for Eucalyptus leucoxylon, which has a number of accepted subspecies names and synonyms.
e_leucoxylon_names <- gbif_species_list |>
filter(species == "Eucalyptus leucoxylon") |>
select(species, taxonRank, taxonomicStatus, acceptedScientificName)
e_leucoxylon_names
# A tibble: 18 × 4
species taxonRank taxonomicStatus acceptedScientificName
<chr> <chr> <chr> <chr>
1 Eucalyptus leucoxylon VARIETY SYNONYM Eucalyptus leucoxylon subsp…
2 Eucalyptus leucoxylon SPECIES SYNONYM Eucalyptus leucoxylon subsp…
3 Eucalyptus leucoxylon SUBSPECIES ACCEPTED Eucalyptus leucoxylon subsp…
4 Eucalyptus leucoxylon SPECIES ACCEPTED Eucalyptus leucoxylon F.Mue…
5 Eucalyptus leucoxylon SUBSPECIES ACCEPTED Eucalyptus leucoxylon subsp…
6 Eucalyptus leucoxylon VARIETY SYNONYM Eucalyptus leucoxylon F.Mue…
7 Eucalyptus leucoxylon SPECIES SYNONYM Eucalyptus leucoxylon subsp…
8 Eucalyptus leucoxylon VARIETY SYNONYM Eucalyptus leucoxylon subsp…
9 Eucalyptus leucoxylon SUBSPECIES ACCEPTED Eucalyptus leucoxylon subsp…
10 Eucalyptus leucoxylon VARIETY ACCEPTED Eucalyptus leucoxylon var. …
11 Eucalyptus leucoxylon SUBSPECIES ACCEPTED Eucalyptus leucoxylon subsp…
12 Eucalyptus leucoxylon SUBSPECIES ACCEPTED Eucalyptus leucoxylon subsp…
13 Eucalyptus leucoxylon VARIETY SYNONYM Eucalyptus leucoxylon subsp…
14 Eucalyptus leucoxylon SUBSPECIES ACCEPTED Eucalyptus leucoxylon subsp…
15 Eucalyptus leucoxylon SUBSPECIES ACCEPTED Eucalyptus leucoxylon subsp…
16 Eucalyptus leucoxylon SPECIES SYNONYM Eucalyptus leucoxylon subsp…
17 Eucalyptus leucoxylon UNRANKED ACCEPTED SH0881366.09FU
18 Eucalyptus leucoxylon VARIETY SYNONYM Eucalyptus leucoxylon subsp…
All names under species are Eucalyptus leucoxylon, and yet there are lots of names associated with varieties, subspecies and species. The main takeaway from this example is that some species can have many accepted names and synonyms depending on the taxonomic level you are interested in. GBIF species lists are one useful way to determine what accepted names might be suitable for your data.
For Australian data more specifically, the National Species List is the accepted national standard for checking scientific names and the recommended place to start. If you are working with plant data, the APCalign package offers tools to align and update Australian plant taxon names (see the Packages section below for more detail).
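Returning to GBIF-style species lists, the taxonomicStatus and acceptedScientificName columns can also be turned into a lookup table for replacing synonyms in your own records. The sketch below assumes a list with the same column names as gbif_species_list. The names in toy_list and records are invented placeholders ("Examplea alpha" and so on), not real taxa.

```r
library(dplyr)

# Toy species list using GBIF-style column names (placeholder taxa only)
toy_list <- tibble(
  scientificName         = c("Examplea alpha", "Examplea oldalpha", "Examplea oldbeta", "Examplea beta"),
  taxonomicStatus        = c("ACCEPTED", "SYNONYM", "SYNONYM", "ACCEPTED"),
  acceptedScientificName = c("Examplea alpha", "Examplea alpha", "Examplea beta", "Examplea beta")
)

# How many names fall under each status?
status_counts <- toy_list |>
  count(taxonomicStatus)

# Lookup table mapping each synonym to its accepted name
synonym_lookup <- toy_list |>
  filter(taxonomicStatus == "SYNONYM") |>
  select(synonym = scientificName, accepted = acceptedScientificName)

# Hypothetical records that mix accepted names and synonyms
records <- tibble(name = c("Examplea oldalpha", "Examplea alpha", "Examplea oldbeta"))

# Replace synonyms with accepted names; keep names that are not synonyms
records_updated <- records |>
  left_join(synonym_lookup, join_by(name == synonym)) |>
  mutate(name_updated = coalesce(accepted, name))

records_updated
```

Keeping the original name column alongside name_updated makes it easy to audit which records were changed.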
8.6 Detecting homonyms
Homonyms are identical names that are used to refer to different taxa. For example, the name Morganella is a genus of bacteria, a genus of fungi, a genus of scale insect, and a genus of fossil brachiopod from the Devonian period⁶!
When you search for names with search_taxa() from the galah package, you’ll receive a warning if there is a homonym issue.
search_taxa("morganella")
Warning: Search returned multiple taxa due to a homonym issue.
ℹ Please provide another rank in your search to clarify taxa.
ℹ Use a `tibble` to clarify taxa, see `?search_taxa`.
✖ Homonym issue with "morganella".
# A tibble: 1 × 2
search_term issues
<chr> <chr>
1 morganella homonym
You can make your query more specific by providing other taxonomic ranks in a tibble. In a piped workflow, using the taxon_concept_id rather than the name will enable you to retrieve data for the correct taxon.
taxa <- search_taxa(tibble(kingdom = "Fungi", genus = "Morganella"))
taxa |> rmarkdown::paged_table()
# Return record counts, grouped by species
galah_call() |>
identify(taxa$taxon_concept_id) |>
group_by(species) |>
atlas_counts()
# A tibble: 2 × 2
species count
<chr> <int>
1 Morganella compacta 93
2 Morganella purpurascens 45
For more information on advanced taxonomic filtering in galah, you can read this vignette on the package website.
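If your dataset already contains higher taxonomic ranks, you can also screen for possible homonyms locally before querying an atlas. The sketch below flags any genus name that appears under more than one kingdom. The table toy_taxa is a toy example based on the Morganella case described above, and the check is only a heuristic, since homonyms can also occur within a single kingdom.

```r
library(dplyr)

# Toy data (illustrative only): the same genus name under several kingdoms
toy_taxa <- tibble(
  genus   = c("Morganella", "Morganella", "Morganella", "Eucalyptus"),
  kingdom = c("Bacteria", "Fungi", "Animalia", "Plantae")
)

# Count the distinct kingdoms per genus name and keep names with more than one
possible_homonyms <- toy_taxa |>
  distinct(genus, kingdom) |>
  count(genus, name = "n_kingdoms") |>
  filter(n_kingdoms > 1)

possible_homonyms
```

Any name flagged this way is a good candidate for a search_taxa() query with extra ranks supplied.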
8.7 Packages
There are several packages available that can be used to query different taxonomic databases and check for synonyms.
Download the worms.csv file from the Data in this book chapter.
The taxize package allows users to search across many taxonomic data sources for hierarchical taxonomic information, such as species names (scientific and common), and to resolve synonyms and homonyms.
Synonyms
We can match names against up to 118 data sources, including GBIF, Catalogue of Life and the World Register of Marine Species, using gna_verifier(), and return one or more names scored by how well they match these sources.
Let’s search for any synonyms of Litoria caerulea as an example.
library(taxize)
# Resolve names
resolved <- gna_verifier(c("litoria caerulea"), capitalize = TRUE)
resolved
# A tibble: 1 × 30
submittedName dataSourceId dataSourceTitleShort curation recordId entryDate
<chr> <chr> <chr> <chr> <chr> <chr>
1 litoria caerulea 1 Catalogue of Life Curated 3VMJ3 2025-08-…
# ℹ 24 more variables: sortScore <dbl>, matchedNameID <chr>, matchedName <chr>,
# matchedCardinality <dbl>, matchedCanonicalSimple <chr>,
# matchedCanonicalFull <chr>, currentRecordId <chr>, currentNameId <chr>,
# currentName <chr>, currentCardinality <dbl>, currentCanonicalSimple <chr>,
# currentCanonicalFull <chr>, taxonomicStatus <chr>, isSynonym <lgl>,
# editDistance <dbl>, stemEditDistance <dbl>, matchType <chr>,
# cardinalityScore <dbl>, infraSpecificRankScore <dbl>, …
Using the resolved name, we can search for its Taxonomic Serial Number (TSN) using get_tsn(), which taxize uses as a taxonomic identifier. Then we can search for existing synonyms by supplying the tsn to the synonyms() function.
# Retrieve synonyms
tsn <- get_tsn(resolved$matchedCanonicalSimple) # works as of 2025-10-03
══ 1 queries ═══════════════
Retrieving data for taxon 'Litoria caerulea'
✔ Found: Litoria caerulea
══ Results ═════════════════
• Total: 1
• Found: 1
• Not Found: 0
synonyms(tsn)
Accepted name(s) is/are 'Pelodryas caerulea'
Using tsn(s) 1099553
$`662872`
sub_tsn acc_name acc_tsn acc_author syn_author
1 662872 Pelodryas caerulea 1099553 (White, 1790) (White, 1790)
2 662872 Pelodryas caerulea 1099553 (White, 1790) (White, 1790)
3 662872 Pelodryas caerulea 1099553 (White, 1790) White, 1790
4 662872 Pelodryas caerulea 1099553 (White, 1790) Daudin, 1803
5 662872 Pelodryas caerulea 1099553 (White, 1790) (White, 1790)
6 662872 Pelodryas caerulea 1099553 (White, 1790) De Vis, 1884
7 662872 Pelodryas caerulea 1099553 (White, 1790) (De Vis, 1884)
8 662872 Pelodryas caerulea 1099553 (White, 1790) (De Vis, 1884)
9 662872 Pelodryas caerulea 1099553 (White, 1790) (White, 1790)
10 662872 Pelodryas caerulea 1099553 (White, 1790) Schneider, 1799
syn_name syn_tsn
1 Litoria caerulea 662872
2 Ranoidea caerulea 1099550
3 Rana caerulea 1099551
4 Hyla cyanea 1099552
5 Hyla caerulea 1099554
6 Hyla irrorata 1099555
7 Litoria irrorata 1099556
8 Pelodryas irrorata 1099557
9 Hyla caerulea caerulea 1106142
10 Rana austrasiae 1271210
Homonyms
If a name matches multiple names, get_tsn_() will return all matches.
# resolve morganella name
resolved <- gna_verifier("Morganella", capitalize = TRUE)
# Retrieve matches
tsn <- get_tsn_(resolved$matchedCanonicalSimple) # works as of 2025-09-19
Retrieving data for taxon 'Morganella'
tsn$Morganella
# A tibble: 8 × 4
tsn scientificName commonNames nameUsage
<chr> <chr> <chr> <chr>
1 200802 Morganella NA valid
2 200803 Morganella conspicua NA valid
3 200804 Morganella longispina NA valid
4 957632 Morganella NA valid
5 958592 Morganella morganii NA valid
6 963648 Morganella psychrotolerans NA valid
7 969527 Morganella morganii morganii NA valid
8 969528 Morganella morganii sibonii NA valid
You can then use each tsn number to return the complete classification of the taxonomic name.
# Retrieve upstream taxonomy
classification(tsn$Morganella$tsn[1],
upto = "family",
db = "itis"
)
1. tsn$Morganella$tsn[1] indexes the first number in the tsn column, "200802".
2. db = "itis" specifies the database.
$`200802`
name rank id
1 Animalia kingdom 202423
2 Bilateria subkingdom 914154
3 Protostomia infrakingdom 914155
4 Ecdysozoa superphylum 914158
5 Arthropoda phylum 82696
6 Hexapoda subphylum 563886
7 Insecta class 99208
8 Pterygota subclass 100500
9 Neoptera infraclass 563890
10 Acercaria superorder 1227978
11 Hemiptera order 103359
12 Sternorrhyncha suborder 109185
13 Coccoidea superfamily 1234349
14 Diaspididae family 109198
15 Morganella genus 200802
attr(,"class")
[1] "classification"
attr(,"db")
[1] "itis"
If you are using a list of many names, you can use the other names to establish taxonomic context for matching by adding with_context = TRUE to gnr_resolve(). This context reduces the chances of returning taxonomic homonyms.
# example:
list_of_names <- c("name1", "name2", "name3") # placeholder names; replace with your own
resolved <- gnr_resolve(list_of_names, with_context = TRUE)
The worrms package is the R interface to the World Register of Marine Species (WoRMS). When working with data from this database, worrms can cross-check synonyms using the database’s taxonomic ID (AphiaID).
For example, we can return existing synonyms for Lupocyclus inaequalis by supplying its AphiaID to the wm_synonyms() function. We’ll use a subset of the WoRMS dataset, saved in the worms.csv file.
library(worrms)
marine_sp <- read_csv(here::here("worms.csv"))
marine_sp |>
filter(scientificname == "Lupocyclus inaequalis") |>
select(AphiaID, scientificname, status)
# A tibble: 1 × 3
AphiaID scientificname status
<dbl> <chr> <chr>
1 208785 Lupocyclus inaequalis accepted
Our search returns a superseded synonym Goniosoma inaequale.
marine_sp |>
filter(scientificname == "Lupocyclus inaequalis") |>
pull(AphiaID) |>
wm_synonyms() |>
select(AphiaID, scientificname, status)
# A tibble: 1 × 3
AphiaID scientificname status
<int> <chr> <chr>
1 453207 Goniosoma inaequale superseded combination
The APCalign package uses the Australian Plant Census (APC) and Australian Plant Name Index (APNI) to help users update species lists and match them to their established status (native/introduced) within different states/territories. Using our Eucalyptus leucoxylon example, let’s return the aligned, accepted and suggested name of the first 5 accepted scientific names in our e_leucoxylon_names.
library(APCalign)
create_taxonomic_update_lookup(
taxa = e_leucoxylon_names$acceptedScientificName[1:5]
)
# A tibble: 5 × 12
original_name aligned_name accepted_name suggested_name genus taxon_rank
<chr> <chr> <chr> <chr> <chr> <chr>
1 Eucalyptus leucoxy… Eucalyptus … Eucalyptus l… Eucalyptus le… Euca… subspecies
2 Eucalyptus leucoxy… Eucalyptus … Eucalyptus l… Eucalyptus le… Euca… subspecies
3 Eucalyptus leucoxy… Eucalyptus … Eucalyptus l… Eucalyptus le… Euca… subspecies
4 Eucalyptus leucoxy… Eucalyptus … Eucalyptus l… Eucalyptus le… Euca… species
5 Eucalyptus leucoxy… Eucalyptus … Eucalyptus l… Eucalyptus le… Euca… subspecies
# ℹ 6 more variables: taxonomic_dataset <chr>, taxonomic_status <chr>,
# scientific_name <chr>, aligned_reason <chr>, update_reason <chr>,
# number_of_collapsed_taxa <dbl>
Notice that the results are slightly different from those in the gbif_species_list example that we saw above when checking for synonyms. Specifically, several synonyms are categorised as accepted subspecies names, which more accurately aligns with the APC’s accepted taxonomic names.
APCalign provides additional functions like align_taxa() and update_taxonomy() to help update larger lists of taxonomic names and provide explanations about how names were matched.
aligned_taxa <-
APCalign::align_taxa(
original_name = e_leucoxylon_names$acceptedScientificName[1:5],
identifier = "APCalign test"
)
Loading resources into memory...
...done
Checking alignments of 4 taxa
-> of these 1 names have a perfect match to a scientific name in the APC.
Alignments being sought for remaining names.
aligned_taxa |>
rmarkdown::paged_table() # paged table format
8.8 Input from experts
Programmatic solutions for validating taxonomy can only go so far. To obtain a high quality species list, it’s good practice to seek validation from experts. Museums or taxonomic societies are great sources of knowledge.
Here is a list of some Australian taxonomic society groups to help validate taxonomies.
8.8.1 Australian taxonomic society groups
ALL SPECIES IN AUSTRALIA
You can contact the ALA’s helpdesk, who can answer your query or pass you to the relevant experts: support@ala.org.au.
The best first port of call for queries on species names in Australia is the National Species List, which is maintained by the Australian Government’s Australian Biological Resources Study. If the list itself cannot help, then a query to the ABRS will often quickly resolve an issue.
VERTEBRATES
- Amphibians and reptiles - Australian Herpetological Society
- Birds - Birdlife Australia
- Fish - Australian Society for Fish Biology
- Mammals - The Australian Mammal Society
INVERTEBRATES
- Arachnology - Australasian Arachnological Society
- Entomology - Australian Entomological Society
- Malacology - The Malacological Society of Australasia
- Nematology - Australasian Association of Nematologists
8.8.2 Global taxonomy
- GBIF taxonomic backbone - Uses over 100 different sources
- Integrated Taxonomic Information System, ITIS - Authoritative taxonomic information on plants, animals, fungi, and microbes
- Catalogue of Life - Global taxonomic catalogue
2. GBIF’s species list is quite comprehensive, and it includes the taxonomicStatus of a name as “accepted”, “synonym”, “variety” or “doubtful”. To keep our example simpler, we are only using the accepted names.
3. Several species names did not match to GBIF. In a complete data cleaning workflow, these should be investigated as the ALA and GBIF might use synonym names to describe the same species or subspecies.
4. There were some names that did not match GBIF, meaning their taxonomic columns contain NA values. Be sure to either fix these NA values before merging dataframes, or back-fill after merging dataframes. Otherwise, you might add missing data in your data set unintentionally!
5. Stephen C Donnellan, Michael J Mahony, Damien Esquerré, Ian G Brennan, Luke C Price, Alan Lemmon, Emily Moriarty Lemmon, Rainer Günther, Paul Monis, Terry Bertozzi, J Scott Keogh, Glenn M Shea, Stephen J Richards, Phylogenomics informs a generic revision of the Australo-Papuan treefrogs (Anura: Pelodryadidae), Zoological Journal of the Linnean Society, Volume 204, Issue 2, June 2025, https://doi.org/10.1093/zoolinnean/zlaf015
6. Referred to as “the Age of Fishes”, the Devonian Period occurred ~419 to ~359 million years ago.