# packages
library(here)
library(readr)
library(dplyr)
library(tidyr)
library(stringr)
library(janitor)
library(galah)
galah_config(email = "your-email-here",       # ALA-registered email
             username = "your-email-here",    # GBIF account email
             password = "your-password-here") # GBIF account password
birds <- galah_call() |>
  filter(doi == "https://doi.org/10.26197/ala.d3365af7-e802-4ef6-8fee-8d067ae855d4") |>
  atlas_occurrences()
legless_lizards <- galah_call() |>
  filter(doi == "https://doi.org/10.26197/ala.6bea9b2e-b1d8-4547-b63d-20bd2cd89f3a") |>
  atlas_occurrences()
inverts <- arrow::read_parquet(
  here("path", "to", "inverts.parquet"))
eucalypts <- galah_call() |>
  filter(doi == "https://doi.org/10.26197/ala.43003f6e-f8ad-45f2-bd85-d368c1b33e5d") |>
  atlas_occurrences()
gbif_species_list <- arrow::read_parquet(
  here("path", "to", "gbif_eucalyptus.parquet"))
8 Taxonomic validation
Taxonomic classification is in a state of constant change. Advances in methods, especially in molecular biology, have allowed researchers to describe new species more efficiently than ever before (Garraffoni et al. 2019). Modern approaches have also enabled the reclassification of organisms that were incorrectly described in the past. As new discoveries are made, taxonomies are frequently updated or amended.
This process of changing taxonomy makes working with open-source biodiversity data challenging. Views may differ within the literature or across authorities about which taxonomy is correct. In different countries, one taxonomy might better describe the native taxonomic diversity than others. Data infrastructures must also make choices about which taxonomic authorities to use, and different infrastructures inevitably make different decisions.
As a result, most taxonomic data will need checking and cleaning before analysis. You will encounter situations where the same taxon has several names (synonyms) or where the same name refers to several entirely unrelated taxa (homonyms). These situations can be tricky to identify, and therefore tricky to clean.
While there is no perfect solution, some tips, tricks, and tools do exist. In this chapter we will go through some of these to clean taxonomic data, including ways of dealing with missing taxonomic information, and detecting synonyms and homonyms.
Cleaning taxonomic names can require a lot of changes! For every change, we recommend keeping detailed records of your modifications and your reasons for making those decisions.
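One lightweight way to keep such records is a small lookup table listing each original name, its replacement, and the reason for the change. The sketch below is only an illustration: the object names (name_changes, records), the toy records, and the wording of the reason are hypothetical. The single change shown reuses the Monomorium bicorne name that the ALA resolves to Chelaner bicorne later in this chapter.

```r
library(dplyr)
library(tibble)

# Hypothetical log of name changes: one row per modification
name_changes <- tribble(
  ~original_name,       ~updated_name,      ~reason,
  "Monomorium bicorne", "Chelaner bicorne", "Name returned by ALA name matching"
)

# Toy occurrence records (made up for illustration)
records <- tibble(
  record_id = 1:3,
  scientific_name = c("Monomorium bicorne", "Saprosites mendax", "Monomorium bicorne")
)

# Apply the logged changes, keeping the original name alongside the cleaned one
records_updated <- records |>
  left_join(name_changes, join_by(scientific_name == original_name)) |>
  mutate(scientific_name_clean = coalesce(updated_name, scientific_name)) |>
  select(record_id, scientific_name, scientific_name_clean, reason)

records_updated
```

Because the changes live in a table rather than in scattered recode statements, the same table can be saved alongside the cleaned data as a record of what was modified and why.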
8.0.1 Prerequisites
In this chapter we will use several datasets:
- Kingfisher (Alcedinidae) occurrence records from 2022 from the ALA
- Legless lizard (Pygopodidae) occurrence records from 2021-2023 from the ALA
- A subset of invertebrate occurrence records taken from the Curated Plant and Invertebrate Data for Bushfire Modelling data set, saved in the inverts.parquet file
- Eucalyptus occurrence records from 2014 from the ALA
- Eucalyptus species list downloaded from GBIF, saved in the gbif_species_list.parquet file
Download the inverts.parquet and gbif_species_list.parquet files from the Data in this book chapter.
8.1 Preview names
One of the simplest ways to determine whether there are any immediate issues with taxonomic names is to preview a subset of the names. Most biodiversity datasets will have a field for the scientific names of taxa (e.g. scientificName, scientific_name), describing the lowest taxonomic level to which taxa have been identified. Looking at scientificName in our birds data, we can observe some characteristics of the names in this dataset, namely that:
- Records have been identified to different taxonomic ranks (family, genus, species, subspecies)
- Some names are in uppercase, others are in sentence case
- Where subgenera are included, they appear within parentheses
birds |>
  distinct(scientificName) |>
  print(n = 25)
# A tibble: 22 × 1
scientificName
<chr>
1 Dacelo (Dacelo) novaeguineae
2 Todiramphus (Todiramphus) sanctus
3 Ceyx azureus
4 Todiramphus (Lazulena) macleayii
5 Dacelo (Dacelo) leachii
6 Tanysiptera (Uralcyon) sylvia
7 Ceyx pusillus
8 Todiramphus (Cyanalcyon) pyrrhopygius
9 Syma torotoro
10 Todiramphus
11 ALCEDINIDAE
12 Dacelo (Dacelo) novaeguineae novaeguineae
13 Dacelo (Dacelo) leachii leachii
14 Todiramphus (Todiramphus) sanctus sanctus
15 Todiramphus (Todiramphus) chloris
16 Todiramphus (Todiramphus) sanctus vagans
17 Ceyx azureus azureus
18 Dacelo
19 Ceyx azureus diemenensis
20 Todiramphus (Lazulena) macleayii macleayii
21 Todiramphus (Lazulena) macleayii incinctus
22 Ceyx azureus ruficollaris
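For longer lists, scanning names by eye becomes impractical, so it can help to flag these characteristics programmatically. The following is a minimal sketch using a handful of names copied from the output above. The column names is_uppercase, has_subgenus and n_words are our own, and the regular expression assumes subgenera always appear inside a single pair of parentheses.

```r
library(dplyr)
library(stringr)
library(tibble)

# A few names copied from the preview above
names_preview <- tibble(
  scientificName = c(
    "Dacelo (Dacelo) novaeguineae",
    "Ceyx azureus",
    "ALCEDINIDAE",
    "Todiramphus",
    "Ceyx azureus azureus"
  )
)

names_summary <- names_preview |>
  mutate(
    # TRUE when the whole name is written in capitals
    is_uppercase = scientificName == str_to_upper(scientificName),
    # TRUE when a parenthesised subgenus is present
    has_subgenus = str_detect(scientificName, "\\([^)]+\\)"),
    # number of words once any subgenus is removed:
    # 1 = genus or higher, 2 = species, 3 = subspecies
    n_words = str_count(str_remove(scientificName, "\\s*\\([^)]+\\)"), "\\S+")
  )

names_summary
```

The word count is only a rough proxy for taxonomic rank. Where a dataset supplies a dedicated rank field, that field is the more reliable source.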
8.2 Name format
Different data providers might use different formats in their taxonomic names to delineate between taxonomic ranks. It doesn’t matter which format your data uses as long as it is consistent.
Example 1: Subspecies
As an example, the ALA uses "subsp." to designate subspecies of Acacia observations in the scientific name, whereas subspecies of bird observations simply include the subspecific epithet after the specific epithet.
acacia_2018 <- galah_call() |>
  identify("Acacia") |>
  filter(year == 2018) |>
  atlas_occurrences()
acacia_2018 |>
  filter(str_detect(scientificName, "Acacia brunioides")) |>
  distinct(scientificName)
# A tibble: 2 × 1
scientificName
<chr>
1 Acacia brunioides subsp. brunioides
2 Acacia brunioides
birds_2023 <- galah_call() |>
  identify("alcedinidae") |>
  filter(year == 2023) |>
  atlas_occurrences()
birds_2023 |>
  filter(str_detect(scientificName, "Dacelo")) |>
  distinct(scientificName)
# A tibble: 6 × 1
scientificName
<chr>
1 Dacelo (Dacelo) novaeguineae
2 Dacelo (Dacelo) leachii
3 Dacelo (Dacelo) novaeguineae novaeguineae
4 Dacelo
5 Dacelo (Dacelo) leachii occidentalis
6 Dacelo (Dacelo) leachii leachii
Although both are correct, be sure to check your data to make sure that this naming format is consistent. Other taxonomic names (like subgenera) can differ between taxonomic groups, too.
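If you need to combine names that follow different conventions, one option is to convert them to a single format before comparing them. The sketch below is one possible approach, not a general-purpose name parser: it strips the "subsp." marker and any parenthesised subgenus from a few names taken from the outputs above. Whether dropping this information is appropriate depends on your analysis, so keep the original column as well.

```r
library(stringr)

# Names in two different formats, copied from the outputs above
names_raw <- c(
  "Acacia brunioides subsp. brunioides",
  "Dacelo (Dacelo) leachii leachii",
  "Acacia brunioides"
)

names_clean <- names_raw |>
  str_remove_all("\\s*\\([^)]*\\)") |> # drop subgenus in parentheses
  str_remove_all("\\bsubsp\\.\\s*") |> # drop the "subsp." marker
  str_squish()                         # tidy any leftover whitespace

names_clean
```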
8.3 Matching names to a species list
Many investigations rely on taxonomic lists of species or groups to identify relevant species. A common example is using lists of introduced, invasive, threatened, or sensitive species to identify records of interest.
There are several ways to filter records to match names on a species list. First, we’ll use a species list accessed via galah to filter records, which also provides additional functionality for filtering data prior to download. Then, we’ll use an external species list loaded into R to filter records.
galah
The ALA contains both national and state-based conservation status lists. For example, if we want to use the Victorian Restricted Species list, we can perform a text search for available lists using the term “victoria” with search_all(lists, "victoria").
list_search <- search_all(lists, "victoria")
list_search
# A tibble: 33 × 21
species_list_uid listName listType dateCreated lastUpdated lastUploaded
<chr> <chr> <chr> <chr> <chr> <chr>
1 dr1266 "2 b) Protect… LOCAL_L… 2014-07-31… 2017-02-15… 2017-02-15T…
2 dr1782 "Advisory Lis… CONSERV… 2014-10-27… 2022-03-16… 2022-03-16T…
3 dr967 "Advisory Lis… CONSERV… 2013-11-12… 2023-06-12… 2023-06-12T…
4 dr2504 "ALT Waterbug… LOCAL_L… 2015-09-08… 2016-06-14… 2016-06-14T…
5 dr2683 "Dung beetles… LOCAL_L… 2016-01-15… 2020-08-20… 2020-08-20T…
6 dr4890 "Endangered P… CONSERV… 2016-05-07… 2016-06-14… 2016-06-14T…
7 dr17134 "Endangered S… CONSERV… 2021-03-30… 2022-11-21… 2022-11-21T…
8 dr6635 "Gippsland’s … LOCAL_L… 2016-11-15… 2016-11-15… 2016-11-15T…
9 dr9802 "Great Victor… LOCAL_L… 2018-11-29… 2018-11-29… 2018-11-29T…
10 dr7749 "IBRA Great V… PROFILE 2017-06-19… 2017-07-03… 2017-07-03T…
# ℹ 23 more rows
# ℹ 15 more variables: lastMatched <chr>, username <chr>, itemCount <int>,
# region <chr>, isAuthoritative <lgl>, isInvasive <lgl>, isThreatened <lgl>,
# isBIE <lgl>, isSDS <lgl>, wkt <chr>, category <chr>, generalisation <chr>,
# authority <chr>, sdsType <chr>, looseSearch <lgl>
Filtering our results to authoritative lists only can help us find official state lists.
list_search |>
  filter(isAuthoritative == TRUE)
# A tibble: 2 × 21
species_list_uid listName listType dateCreated lastUpdated lastUploaded
<chr> <chr> <chr> <chr> <chr> <chr>
1 dr655 Victoria : Con… CONSERV… 2015-04-04… 2024-06-19… 2024-06-19T…
2 dr490 Victorian Rest… SENSITI… 2013-06-23… 2024-05-30… 2024-05-30T…
# ℹ 15 more variables: lastMatched <chr>, username <chr>, itemCount <int>,
# region <chr>, isAuthoritative <lgl>, isInvasive <lgl>, isThreatened <lgl>,
# isBIE <lgl>, isSDS <lgl>, wkt <chr>, category <chr>, generalisation <chr>,
# authority <chr>, sdsType <chr>, looseSearch <lgl>
Now that we’ve found our list, we can view the contents of the list using show_values().
vic_species_list <- search_all(lists, "dr490") |>
  show_values()
We are using the list ID dr490 (specified in the species_list_uid column) to make sure we return the correct list.
• Showing values for 'dr490'.
vic_species_list
# A tibble: 137 × 6
id name commonName scientificName lsid dataResourceUid
<int> <chr> <chr> <chr> <chr> <chr>
1 5920169 Engaeus australis Lilly Pil… Engaeus austr… http… dr490
2 5920143 Engaeus fultoni Otway Bur… Engaeus fulto… http… dr490
3 5920250 Engaeus mallacoota Mallacoot… Engaeus malla… http… dr490
4 5920180 Engaeus phyllocercus Narracan … Engaeus phyll… http… dr490
5 5920240 Engaeus rostrogaleat… Strzeleck… Engaeus rostr… http… dr490
6 5920203 Engaeus sericatus Hairy Bur… Engaeus seric… http… dr490
7 5920217 Engaeus sternalis Warragul … Engaeus stern… http… dr490
8 5920238 Engaeus strictifrons Portland … Engaeus stric… http… dr490
9 5920170 Engaeus urostrictus Dandenong… Engaeus urost… http… dr490
10 5920214 Euastacus bidawalus East Gipp… Euastacus bid… http… dr490
# ℹ 127 more rows
We can now compare the taxa in vic_species_list to those in our legless_lizards dataset to identify any restricted species.
legless_lizards_filtered <- legless_lizards |>
  filter(!scientificName %in% vic_species_list$scientificName)
legless_lizards_filtered
# A tibble: 2,128 × 8
recordID scientificName taxonConceptID decimalLatitude decimalLongitude
<chr> <chr> <chr> <dbl> <dbl>
1 001129f4-4824… Pygopus lepid… https://biodi… -34.0 151.
2 0031c737-922a… Pygopus lepid… https://biodi… -36.0 150.
3 005dfdd2-4a93… Lialis burton… https://biodi… -29.1 152.
4 0063af2c-e070… Pygopus lepid… https://biodi… -34.9 139.
5 0081dbcb-6af7… Delma impar https://biodi… -36.3 149.
6 00a9ffcd-ec03… Lialis burton… https://biodi… -27.5 153.
7 00dc4542-426a… Aprasia pseud… https://biodi… -34.7 139.
8 010eb86a-7bd4… Lialis burton… https://biodi… -30.2 153.
9 013cb1b9-dc93… Aprasia strio… https://biodi… -35.0 118.
10 0157207c-3a91… Pygopus lepid… https://biodi… -33.7 150.
# ℹ 2,118 more rows
# ℹ 3 more variables: eventDate <dttm>, occurrenceStatus <chr>,
# dataResourceName <chr>
This process has removed more than 140 records from our dataset.
nrow(legless_lizards) - nrow(legless_lizards_filtered)
[1] 149
We can also filter our queries prior to downloading data in galah by adding a filter specifying species_list_uid == dr490 to our query.
galah_call() |>
  identify("Pygopodidae") |>
  filter(species_list_uid == dr490) |>
  group_by(species) |>
  atlas_counts()
We are using the list ID dr490 (specified in the species_list_uid column) to make sure we return the correct list.
# A tibble: 2 × 2
species count
<chr> <int>
1 Aprasia parapulchella 690
2 Aprasia aurita 103
Using an external list
We can also use lists downloaded outside of galah to filter our data. As an example, let’s filter our taxonomic names to include only Australian names from the Global Register of Introduced and Invasive Species (GRIIS). After downloading this list and saving it in your working directory, we can read the list into R. Taxonomic names are stored in columns with an accepted_name prefix.
griis <- read_csv(here("griis_australia_20240712.csv"))
glimpse(griis)
We renamed the downloaded file from record 20240712-155356.csv to griis_australia_20240712.csv.
Rows: 2,979
Columns: 16
$ scientific_name <chr> "Oenothera longiflora L.", "Lampranth…
$ scientific_name_type <chr> "species", "species", "species", "spe…
$ kingdom <chr> "Plantae", "Plantae", "Plantae", "Pla…
$ establishment_means <chr> "alien", "alien", "alien", "alien", "…
$ is_invasive <chr> "null", "null", "null", "null", "null…
$ occurrence_status <chr> "present", "present", "present", "pre…
$ checklist.name <chr> "Australia", "Australia", "Australia"…
$ checklist.iso_countrycode_alpha3 <chr> "AUS", "AUS", "AUS", "AUS", "AUS", "A…
$ accepted_name.species <chr> "Oenothera longiflora", "Lampranthus …
$ accepted_name.kingdom <chr> "Plantae", "Plantae", "Plantae", "Pla…
$ accepted_name.phylum <chr> "Tracheophyta", "Tracheophyta", "Trac…
$ accepted_name.class <chr> "Magnoliopsida", "Magnoliopsida", "Ma…
$ accepted_name.order <chr> "Myrtales", "Caryophyllales", "Erical…
$ accepted_name.family <chr> "Onagraceae", "Aizoaceae", "Ericaceae…
$ accepted_name.habitat <chr> "[\"terrestrial\"]", "[\"terrestrial\…
$ accepted_name <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
Now we can check which species names in our legless_lizards dataset match names in griis.
# Check which species matched the GRIIS list
matches <- legless_lizards |>
  filter(scientificName %in% griis$accepted_name.species)
matches
# A tibble: 0 × 8
# ℹ 8 variables: recordID <chr>, scientificName <chr>, taxonConceptID <chr>,
# decimalLatitude <dbl>, decimalLongitude <dbl>, eventDate <dttm>,
# occurrenceStatus <chr>, dataResourceName <chr>
After reviewing the matches and confirming we’re happy with the list of matched species, we can exclude these taxa from our data by removing the identified rows. Here no names matched, so no rows are removed.
legless_lizards_filtered <- legless_lizards |>
  filter(!scientificName %in% matches$scientificName) # compare against the matched names column
legless_lizards_filtered
# A tibble: 2,277 × 8
recordID scientificName taxonConceptID decimalLatitude decimalLongitude
<chr> <chr> <chr> <dbl> <dbl>
1 001129f4-4824… Pygopus lepid… https://biodi… -34.0 151.
2 0027660b-3e75… Aprasia parap… https://biodi… -35.4 149
3 0031c737-922a… Pygopus lepid… https://biodi… -36.0 150.
4 005dfdd2-4a93… Lialis burton… https://biodi… -29.1 152.
5 0063af2c-e070… Pygopus lepid… https://biodi… -34.9 139.
6 0081dbcb-6af7… Delma impar https://biodi… -36.3 149.
7 00a9ffcd-ec03… Lialis burton… https://biodi… -27.5 153.
8 00dc4542-426a… Aprasia pseud… https://biodi… -34.7 139.
9 010eb86a-7bd4… Lialis burton… https://biodi… -30.2 153.
10 013cb1b9-dc93… Aprasia strio… https://biodi… -35.0 118.
# ℹ 2,267 more rows
# ℹ 3 more variables: eventDate <dttm>, occurrenceStatus <chr>,
# dataResourceName <chr>
You can apply this concept of filtering to any list of species, or other fields, that you would like to exclude.
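The same filtering can also be written with dplyr's filtering joins, which some people find easier to read when the list is itself a data frame. This is a sketch using made-up records and a hypothetical exclusion_list. The species names are borrowed from the legless lizard outputs above, but the data are not real.

```r
library(dplyr)
library(tibble)

# Toy records and a hypothetical list of names to exclude
records <- tibble(
  record_id = 1:4,
  scientificName = c("Delma impar", "Aprasia aurita",
                     "Delma impar", "Aprasia parapulchella")
)
exclusion_list <- tibble(
  scientificName = c("Aprasia aurita", "Aprasia parapulchella")
)

# Rows whose name appears on the list (equivalent to filter(x %in% list))
flagged <- records |>
  semi_join(exclusion_list, by = join_by(scientificName))

# Rows whose name does not appear on the list (equivalent to filter(!x %in% list))
kept <- records |>
  anti_join(exclusion_list, by = join_by(scientificName))

flagged
kept
```

Reviewing flagged before keeping only kept mirrors the review-then-remove workflow described above.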
8.4 Taxonomic names matching
8.4.1 Missing higher taxonomic information
It’s not uncommon to receive data that are missing information at some taxonomic levels, but this can make it tricky to summarise data or create visualisations based on taxonomy later on.
As an example, here is a small sample of our inverts dataset. You’ll notice that we only have information on scientific_name, class, and family.
inverts_sample <- inverts |>
  slice(1234:1271)
inverts_sample |> print(n = 5)
# A tibble: 38 × 9
record_id scientific_name class family year latitude longitude sensitive
<chr> <chr> <chr> <chr> <int> <dbl> <dbl> <int>
1 76213a64-ed41… Helicotylenchu… chro… hoplo… NA -23.1 151. 0
2 e74ec2f0-4cef… Iravadia (Irav… gast… irava… 1903 -16.5 140. 0
3 340c2b82-6b85… Monomorium bic… inse… formi… 1998 -24.7 150. 0
4 e7dc1fa1-6524… Saprosites men… inse… scara… 2004 -43.1 147. 0
5 316ad303-efc6… Amitermes darw… inse… termi… 1953 -21.9 118. 0
# ℹ 33 more rows
# ℹ 1 more variable: project <chr>
One way to fill in values at the missing taxonomic levels (e.g. phylum, order) is to get this information from a data infrastructure like the ALA, which has its own taxonomic backbone. We’ll start by extracting the scientific names of taxa in inverts_sample and saving these as taxa_sample_names.
taxa_sample_names <- inverts_sample |>
  select(scientific_name) |>
  distinct() |>
  pull()
taxa_sample_names[1:5] # first 5 names
[1] "Helicotylenchus multicinctus" "Iravadia (Iravadia) carpentariensis"
[3] "Monomorium bicorne" "Saprosites mendax"
[5] "Amitermes darwini"
We can then search for those names in the ALA using search_taxa() from galah. We’ll save the results in names_matches_ala. The results contain complete taxonomic information from kingdom to species.
Anytime you search for taxonomic matches using names, it’s good practice to double check the URLs returned in taxon_concept_id to make sure your results match the names you expected!
names_matches_ala <- search_taxa(taxa_sample_names)
names_matches_ala
# A tibble: 38 × 15
search_term scientific_name scientific_name_auth…¹ taxon_concept_id rank
<chr> <chr> <chr> <chr> <chr>
1 Helicotylenchu… Helicotylenchu… (Cobb, 1893) https://biodive… spec…
2 Iravadia (Irav… Iravadia (Irav… (Hedley, 1912) https://biodive… spec…
3 Monomorium bic… Chelaner bicor… (Forel, 1907) https://biodive… spec…
4 Saprosites men… Saprosites men… (Blackburn, 1892) https://biodive… spec…
5 Amitermes darw… Amitermes darw… (Hill, 1922) https://biodive… spec…
6 Schedorhinoter… Schedorhinoter… (Hill, 1933) https://biodive… spec…
7 Sorama bicolor Sorama bicolor Walker, 1855 https://biodive… spec…
8 Windbalea warr… Windbalea warr… Rentz, 1993 https://biodive… spec…
9 Tholymis tilla… Tholymis tilla… (Fabricius, 1798) https://biodive… spec…
10 Costellipitar … Costellipitar … (Hedley, 1923) https://biodive… spec…
# ℹ 28 more rows
# ℹ abbreviated name: ¹scientific_name_authorship
# ℹ 10 more variables: match_type <chr>, kingdom <chr>, phylum <chr>,
# class <chr>, order <chr>, family <chr>, genus <chr>, species <chr>,
# vernacular_name <chr>, issues <chr>
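One quick check is to compare the name you searched for with the name that was returned. In the output above, for instance, the search term Monomorium bicorne was returned as Chelaner bicorne. The sketch below shows how such cases could be pulled out. It uses a small hand-made tibble (matches_toy) that mimics the search_term and scientific_name columns rather than a live query, and the last search term is a made-up placeholder standing in for a name with no match.

```r
library(dplyr)
library(tibble)

# Hand-made stand-in for the search_term and scientific_name columns
matches_toy <- tibble(
  search_term = c("Monomorium bicorne", "Saprosites mendax",
                  "Amitermes darwini", "Placeholder unmatchedname"),
  scientific_name = c("Chelaner bicorne", "Saprosites mendax",
                      "Amitermes darwini", NA)
)

# Names that were not found, or that were resolved to a different name
names_to_review <- matches_toy |>
  filter(is.na(scientific_name) | search_term != scientific_name)

names_to_review
```

Rows returned by this check are worth reviewing by hand, because any later join on the original name will not find the returned name.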
Now we can merge this information into our inverts_sample dataset.
First, let’s select the columns from names_matches_ala that we want, and rename those so we can differentiate between the columns in inverts_sample and the ones we just downloaded using galah. We’ll suffix the columns in names_matches_ala with "_ala".
names_matches_renamed <- names_matches_ala |>
  select(scientific_name, kingdom:species) |>
  rename_with(\(column_name) paste0(column_name, "_ala"),
              kingdom:species)
names_matches_renamed
The rename_with() line uses shorthand to write a function that appends a suffix to a column name. An equivalent way of writing this is:
function(column_name) {paste0(column_name, "_ala")}
This is applied to each column name from kingdom to species in the names_matches_ala dataframe.
# A tibble: 38 × 8
scientific_name kingdom_ala phylum_ala class_ala order_ala family_ala
<chr> <chr> <chr> <chr> <chr> <chr>
1 Helicotylenchus multic… Animalia Nematoda Chromado… Panagrol… Hoplolaim…
2 Iravadia (Iravadia) ca… Animalia Mollusca Gastropo… Hypsogas… Iravadiid…
3 Chelaner bicorne Animalia Arthropoda Insecta Hymenopt… Formicidae
4 Saprosites mendax Animalia Arthropoda Insecta Coleopte… Scarabaei…
5 Amitermes darwini Animalia Arthropoda Insecta Blattodea Termitidae
6 Schedorhinotermes actu… Animalia Arthropoda Insecta Blattodea Rhinoterm…
7 Sorama bicolor Animalia Arthropoda Insecta Lepidopt… Notodonti…
8 Windbalea warrooa Animalia Arthropoda Insecta Orthopte… Tettigoni…
9 Tholymis tillarga Animalia Arthropoda Insecta Odonata Libelluli…
10 Costellipitar inconsta… Animalia Mollusca Bivalvia Cardiida Veneridae
# ℹ 28 more rows
# ℹ 2 more variables: genus_ala <chr>, species_ala <chr>
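To see the renaming step in isolation, here is a minimal sketch on a one-row toy tibble (toy is a made-up object, with values taken from the output above). Only the columns in the selected range receive the suffix.

```r
library(dplyr)
library(tibble)

# One-row toy table with values taken from the output above
toy <- tibble(
  scientific_name = "Amitermes darwini",
  kingdom = "Animalia",
  family = "Termitidae"
)

# Append "_ala" to every column from kingdom to family
toy_renamed <- toy |>
  rename_with(\(column_name) paste0(column_name, "_ala"), kingdom:family)

names(toy_renamed)
```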
Now let’s join our matched names in names_matches_renamed to our inverts_sample data. This adds all higher taxonomic names columns to our inverts_sample data.
inverts_sample_with_ranks <- names_matches_renamed |>
  right_join(inverts_sample,
             join_by(scientific_name == scientific_name))
inverts_sample_with_ranks
# A tibble: 38 × 16
scientific_name kingdom_ala phylum_ala class_ala order_ala family_ala
<chr> <chr> <chr> <chr> <chr> <chr>
1 Helicotylenchus multic… Animalia Nematoda Chromado… Panagrol… Hoplolaim…
2 Iravadia (Iravadia) ca… Animalia Mollusca Gastropo… Hypsogas… Iravadiid…
3 Saprosites mendax Animalia Arthropoda Insecta Coleopte… Scarabaei…
4 Amitermes darwini Animalia Arthropoda Insecta Blattodea Termitidae
5 Schedorhinotermes actu… Animalia Arthropoda Insecta Blattodea Rhinoterm…
6 Sorama bicolor Animalia Arthropoda Insecta Lepidopt… Notodonti…
7 Windbalea warrooa Animalia Arthropoda Insecta Orthopte… Tettigoni…
8 Tholymis tillarga Animalia Arthropoda Insecta Odonata Libelluli…
9 Costellipitar inconsta… Animalia Mollusca Bivalvia Cardiida Veneridae
10 Placamen lamellosum Animalia Mollusca Bivalvia Cardiida Veneridae
# ℹ 28 more rows
# ℹ 10 more variables: genus_ala <chr>, species_ala <chr>, record_id <chr>,
# class <chr>, family <chr>, year <int>, latitude <dbl>, longitude <dbl>,
# sensitive <int>, project <chr>
We can check that the join worked correctly by making sure the names in our original family column match those in our new family_ala column. The code below returns any rows where the two family names differ. Nothing is returned, meaning that wherever a family_ala value was joined, it agrees with family. Note that filter() drops rows where the comparison evaluates to NA, so any rows whose scientific_name did not find a match (and therefore have NA in family_ala) will not appear in this result and are worth checking separately.
inverts_sample_with_ranks |>
  select(scientific_name, family_ala, family) |>
  mutate(family = stringr::str_to_sentence(family)) |> # match formatting
  filter(family_ala != family)
# A tibble: 0 × 3
# ℹ 3 variables: scientific_name <chr>, family_ala <chr>, family <chr>
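Because filter() keeps only rows where the condition is TRUE, a comparison involving NA never appears in its output. The sketch below illustrates this on a toy table (joined_toy, made up for this example) in which one name failed to join, and shows a separate is.na() check that catches it.

```r
library(dplyr)
library(tibble)

# Toy joined data: the last row found no match, so family_ala is NA
joined_toy <- tibble(
  scientific_name = c("Amitermes darwini", "Placamen lamellosum",
                      "Monomorium bicorne"),
  family = c("Termitidae", "Veneridae", "Formicidae"),
  family_ala = c("Termitidae", "Veneridae", NA)
)

# Mismatch check: returns zero rows, even though one row is unmatched
mismatched <- joined_toy |>
  filter(family_ala != family)

# Explicit check for rows that did not receive any joined values
unmatched <- joined_toy |>
  filter(is.na(family_ala))

nrow(mismatched)
unmatched
```

Running both checks gives a fuller picture: the first finds disagreements between sources, and the second finds names that were never matched at all.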
8.4.2 Identifying mismatches in species lists
Higher taxonomy from different data providers may not always match. If this is the case, you will need to back-fill the higher taxonomic ranks using data from your preferred taxonomic naming authority.
Let’s use data of Eucalyptus observations we downloaded from the ALA as an example.
eucalypts
# A tibble: 8,476 × 16
recordID scientificName taxonConceptID decimalLatitude decimalLongitude
<chr> <chr> <chr> <dbl> <dbl>
1 0009ba6a-8e8e… Eucalyptus re… https://id.bi… -17.6 145.
2 002b74ab-b8ce… Eucalyptus ca… https://id.bi… -34.2 141.
3 002bde6c-3a7f… Eucalyptus co… https://id.bi… -30.1 146.
4 002cb2ce-c8a1… Eucalyptus ca… https://id.bi… -37.1 141.
5 0031022c-8e9e… Eucalyptus la… https://id.bi… -34.4 142.
6 00407506-383e… Eucalyptus pa… https://id.bi… -34.1 151.
7 004413ca-5a95… Eucalyptus po… https://id.bi… -35.3 149.
8 005371a8-047e… Eucalyptus ca… https://id.bi… -35.7 145.
9 00560db1-bb66… Eucalyptus da… https://id.bi… -36.3 148.
10 005fcf1f-3c6f… Eucalyptus no… https://id.bi… -30.4 152.
# ℹ 8,466 more rows
# ℹ 11 more variables: eventDate <dttm>, occurrenceStatus <chr>,
# dataResourceName <chr>, kingdom <chr>, phylum <chr>, class <chr>,
# order <chr>, family <chr>, genus <chr>, species <chr>, taxonRank <chr>
This occurrence data contains observations of over 300 species.
eucalypts |>
  filter(taxonRank != "genus") |>
  distinct(species) |>
  count(name = "n_species")
# A tibble: 1 × 1
n_species
<int>
1 311
Let’s say we want to compare these observations to data retrieved outside of the ALA and decide that we’d prefer to use GBIF’s1 taxonomy. ALA data uses its own taxonomic backbone that differs to GBIF’s (depending on the taxonomic group), so we will need to amend our taxonomic names to match GBIF’s.
Let’s go through the steps to match our taxonomy in our eucalypts data to GBIF’s taxonomy. We can download a species list of Eucalyptus from GBIF. This list returns nearly 1,700 species names.
Download the gbif_species_list.parquet file from the Data in this book chapter.
gbif_species_list
# A tibble: 1,695 × 22
taxonKey scientificName acceptedTaxonKey acceptedScientificName
* <dbl> <chr> <dbl> <chr>
1 3176716 Eucalyptus calcicola Brooker 3176716 Eucalyptus calcicola …
2 3176802 Eucalyptus salicola Brooker 3176802 Eucalyptus salicola B…
3 3176920 Eucalyptus crebra F.Muell. 3176920 Eucalyptus crebra F.M…
4 3177269 Eucalyptus stricta Sieber e… 3177269 Eucalyptus stricta Si…
5 3717566 Eucalyptus alpina Lindl. 3717566 Eucalyptus alpina Lin…
6 8164544 Eucalyptus hemiphloia var. … 7908015 Eucalyptus albens Miq.
7 9292334 Eucalyptus goniocalyx subsp… 9292334 Eucalyptus goniocalyx…
8 11127669 Eucalyptus griffithii Maiden 11127669 Eucalyptus griffithii…
9 3176297 Eucalyptus camfieldii Maiden 3176297 Eucalyptus camfieldii…
10 3176473 Eucalyptus macrorhyncha sub… 3176473 Eucalyptus macrorhync…
# ℹ 1,685 more rows
# ℹ 18 more variables: numberOfOccurrences <dbl>, taxonRank <chr>,
# taxonomicStatus <chr>, kingdom <chr>, kingdomKey <dbl>, phylum <chr>,
# phylumKey <dbl>, class <chr>, classKey <dbl>, order <chr>, orderKey <dbl>,
# family <chr>, familyKey <dbl>, genus <chr>, genusKey <dbl>, species <chr>,
# speciesKey <dbl>, iucnRedListCategory <chr>
To investigate whether the complete taxonomy (from kingdom to species) matches between our ALA data and GBIF species list, let’s get the columns with taxonomic information from our eucalypts dataframe and our gbif_species_list to compare.
First, we can select columns containing taxonomic names in our ALA eucalypts dataframe (kingdom to species) and use distinct() to remove duplicate rows. This will leave us with one row for each distinct species in our dataset (very similar to a species list).
ala_names <- eucalypts |>
  select(kingdom:species) |>
  distinct()
ala_names
# A tibble: 312 × 7
kingdom phylum class order family genus species
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Plantae Charophyta Equisetopsida Myrtales Myrtaceae Eucalyptus Eucalyptus re…
2 Plantae Charophyta Equisetopsida Myrtales Myrtaceae Eucalyptus Eucalyptus ca…
3 Plantae Charophyta Equisetopsida Myrtales Myrtaceae Eucalyptus Eucalyptus co…
4 Plantae Charophyta Equisetopsida Myrtales Myrtaceae Eucalyptus Eucalyptus la…
5 Plantae Charophyta Equisetopsida Myrtales Myrtaceae Eucalyptus Eucalyptus pa…
6 Plantae Charophyta Equisetopsida Myrtales Myrtaceae Eucalyptus Eucalyptus po…
7 Plantae Charophyta Equisetopsida Myrtales Myrtaceae Eucalyptus Eucalyptus da…
8 Plantae Charophyta Equisetopsida Myrtales Myrtaceae Eucalyptus Eucalyptus no…
9 Plantae Charophyta Equisetopsida Myrtales Myrtaceae Eucalyptus Eucalyptus pl…
10 Plantae Charophyta Equisetopsida Myrtales Myrtaceae Eucalyptus Eucalyptus me…
# ℹ 302 more rows
Now let’s filter gbif_species_list to only “accepted” names2 and select the same taxonomic names columns.
gbif_names <- gbif_species_list |>
  filter(taxonomicStatus == "ACCEPTED") |> # accepted names
  select(kingdom:species) |>
  select(!contains("Key")) |> # remove Key columns
  distinct()
gbif_names
We added distinct() to remove duplicate rows of species names. These duplicates appear because there might be multiple subspecies under the same species name. For example, Eucalyptus mannifera has 4 subspecies; Eucalyptus wimmerensis has 5. We aren’t interested in identifying species at that level, and so we remove these duplicates to simplify our species list.
# A tibble: 989 × 7
kingdom phylum class order family genus species
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Plantae Tracheophyta Magnoliopsida Myrtales Myrtaceae Eucalyptus Eucalyptus …
2 Plantae Tracheophyta Magnoliopsida Myrtales Myrtaceae Eucalyptus Eucalyptus …
3 Plantae Tracheophyta Magnoliopsida Myrtales Myrtaceae Eucalyptus Eucalyptus …
4 Plantae Tracheophyta Magnoliopsida Myrtales Myrtaceae Eucalyptus Eucalyptus …
5 Plantae Tracheophyta Magnoliopsida Myrtales Myrtaceae Eucalyptus Eucalyptus …
6 Plantae Tracheophyta Magnoliopsida Myrtales Myrtaceae Eucalyptus Eucalyptus …
7 Plantae Tracheophyta Magnoliopsida Myrtales Myrtaceae Eucalyptus Eucalyptus …
8 Plantae Tracheophyta Magnoliopsida Myrtales Myrtaceae Eucalyptus Eucalyptus …
9 Plantae Tracheophyta Magnoliopsida Myrtales Myrtaceae Eucalyptus Eucalyptus …
10 Plantae Tracheophyta Magnoliopsida Myrtales Myrtaceae Eucalyptus Eucalyptus …
# ℹ 979 more rows
We can merge our two names data frames together, matching by species name, which will allow us to compare them. We’ll distinguish which columns came from each data frame by appending an "_ala" or "_gbif" suffix to each column name.
matched_names <- ala_names |>
  left_join(gbif_names,
            join_by(species == species),
            suffix = c("_ala", "_gbif")) |>
  select(species, everything()) # reorder columns
matched_names now contains the full taxonomy from the ALA and GBIF for all matched species3.
rmarkdown::paged_table( # print paged table
  matched_names
  )
We are now ready to compare taxonomic names to find mismatches. We can start by finding any species with a mismatch in their kingdom name by filtering to return rows where kingdom_ala and kingdom_gbif are not equal. Our returned tibble is empty, meaning there were no mismatches.
matched_names |>
  filter(kingdom_ala != kingdom_gbif)
# A tibble: 0 × 13
# ℹ 13 variables: species <chr>, kingdom_ala <chr>, phylum_ala <chr>,
# class_ala <chr>, order_ala <chr>, family_ala <chr>, genus_ala <chr>,
# kingdom_gbif <chr>, phylum_gbif <chr>, class_gbif <chr>, order_gbif <chr>,
# family_gbif <chr>, genus_gbif <chr>
If we do the same for phylum and class, however, we return quite a few results. It turns out that there is a difference between the ALA and GBIF in their higher taxonomic ranks of Eucalyptus plants.
matched_names |>
  filter(phylum_ala != phylum_gbif) |>
  select(species, phylum_ala, phylum_gbif)
# A tibble: 303 × 3
species phylum_ala phylum_gbif
<chr> <chr> <chr>
1 Eucalyptus resinifera Charophyta Tracheophyta
2 Eucalyptus camaldulensis Charophyta Tracheophyta
3 Eucalyptus coolabah Charophyta Tracheophyta
4 Eucalyptus largiflorens Charophyta Tracheophyta
5 Eucalyptus parramattensis Charophyta Tracheophyta
6 Eucalyptus polyanthemos Charophyta Tracheophyta
7 Eucalyptus dalrympleana Charophyta Tracheophyta
8 Eucalyptus nobilis Charophyta Tracheophyta
9 Eucalyptus planchoniana Charophyta Tracheophyta
10 Eucalyptus melliodora Charophyta Tracheophyta
# ℹ 293 more rows
matched_names |>
  filter(class_ala != class_gbif) |>
  select(species, class_ala, class_gbif)
# A tibble: 303 × 3
species class_ala class_gbif
<chr> <chr> <chr>
1 Eucalyptus resinifera Equisetopsida Magnoliopsida
2 Eucalyptus camaldulensis Equisetopsida Magnoliopsida
3 Eucalyptus coolabah Equisetopsida Magnoliopsida
4 Eucalyptus largiflorens Equisetopsida Magnoliopsida
5 Eucalyptus parramattensis Equisetopsida Magnoliopsida
6 Eucalyptus polyanthemos Equisetopsida Magnoliopsida
7 Eucalyptus dalrympleana Equisetopsida Magnoliopsida
8 Eucalyptus nobilis Equisetopsida Magnoliopsida
9 Eucalyptus planchoniana Equisetopsida Magnoliopsida
10 Eucalyptus melliodora Equisetopsida Magnoliopsida
# ℹ 293 more rows
In GBIF, Eucalyptus sits in the phylum Tracheophyta and the class Magnoliopsida…
# Use GBIF
galah_config(atlas = "gbif")
# Search for taxonomic information
gbif_taxa <- search_taxa("eucalyptus")
# Show relevant columns
gbif_taxa |>
  select(scientific_name, phylum, class, order)
# A tibble: 1 × 4
scientific_name phylum class order
<chr> <chr> <chr> <chr>
1 Eucalyptus L'Hér. Tracheophyta Magnoliopsida Myrtales
…whereas in the ALA, Eucalyptus sits in the phylum Charophyta and the class Equisetopsida.
# Switch to download from the ALA
galah_config(atlas = "ala")
# Search for taxonomic information
ala_taxa <- search_taxa("Eucalyptus")
# Show relevant columns
ala_taxa |>
  select(scientific_name, phylum, class, order)
# A tibble: 1 × 4
scientific_name phylum class order
<chr> <chr> <chr> <chr>
1 Eucalyptus Charophyta Equisetopsida Myrtales
We might not know about this issue when we first decide to match GBIF’s taxonomic names to our data. So it’s important to investigate how well these names match (and where there are any mismatches) before merging them to our complete eucalypts data.
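Rather than checking one rank at a time, the comparison can be summarised across all ranks in one step. The sketch below is one way to do this, assuming columns follow the rank_ala / rank_gbif naming used in matched_names. It runs on a tiny made-up table (matched_toy) with only two ranks, and the third species name is a placeholder for a species with no GBIF match.

```r
library(tibble)

# Toy version of matched_names with two ranks; the last row has no GBIF match
matched_toy <- tibble(
  species = c("Eucalyptus resinifera", "Eucalyptus coolabah",
              "Eucalyptus placeholder"),
  kingdom_ala = "Plantae",
  phylum_ala = "Charophyta",
  kingdom_gbif = c("Plantae", "Plantae", NA),
  phylum_gbif = c("Tracheophyta", "Tracheophyta", NA)
)

ranks <- c("kingdom", "phylum")

mismatch_summary <- tibble(
  rank = ranks,
  # rows where both sources have a name and the names differ
  n_mismatch = vapply(
    ranks,
    \(r) sum(matched_toy[[paste0(r, "_ala")]] != matched_toy[[paste0(r, "_gbif")]],
             na.rm = TRUE),
    integer(1), USE.NAMES = FALSE
  ),
  # rows with no GBIF name at this rank (no match in the join)
  n_unmatched = vapply(
    ranks,
    \(r) sum(is.na(matched_toy[[paste0(r, "_gbif")]])),
    integer(1), USE.NAMES = FALSE
  )
)

mismatch_summary
```

Counting unmatched rows separately matters here because a != comparison with NA is neither TRUE nor FALSE, so unmatched species would otherwise go unnoticed.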
Now that we are aware of the differences between GBIF and ALA names, if we would like to use GBIF’s taxonomic names, we can join the columns with the suffix _gbif to our eucalypts occurrences data, and then replace the old taxonomic names columns with the GBIF names columns4.
eucalypts_updated_names <- matched_names |>
  # select columns and join to eucalypts data
  select(species, kingdom_gbif:genus_gbif) |>
  right_join(eucalypts,
             join_by(species == species)) |>
  select(-(kingdom:genus)) |> # remove ALA taxonomic columns
  rename_with( # rename columns...
    ~ str_remove(., "_gbif"), # ...by removing "_gbif" suffix
    kingdom_gbif:genus_gbif
  )
eucalypts_updated_names |>
  rmarkdown::paged_table() # paged table output