10  Geospatial cleaning

Geospatial observational data provide essential information about species’ locations over time and space, and can be combined with ecological data to understand species-environment interactions. However, working with geospatial data can be challenging, as seemingly minor issues can significantly impact data validity.

Outliers—data points that are considerably distant from the majority of a species’ observations and can skew overall distribution—are a major challenge in geospatial data cleaning. Identifying outliers can be difficult because it’s not always clear whether they are true outliers or data errors. Errors can result from species misidentification or incorrect geo-referencing. Other accidental errors, such as reversing numeric symbols, mistyping coordinates, or entering incorrect locations, can dramatically affect the reported location of a species. For species with smaller ranges, these errors may be easier to detect. However, for species with larger ranges or analyses involving many species over a large area, these errors become much more difficult to identify.

Every dataset has its own combination of issues requiring bespoke cleaning methods (e.g. Jin and Yang 2020). It is crucial to clean geospatial data effectively to ensure their usefulness, as errors can lead to unexpected results in species range estimates and analytic outputs.

In this chapter, we will highlight common issues with coordinate data and demonstrate how to correct or remove suspicious-seeming records.

Checklist

This chapter can be read more like a checklist of possible geospatial errors in a dataset, how to identify them, and how to fix them.

10.0.1 Prerequisites

In this chapter we’ll use several datasets:

  • MacDonnell’s desert fuchsia (Eremophila macdonnellii) occurrence records from the ALA
  • Red-eyed tree frog (Litoria chloris) occurrence records in 2013 from the ALA
  • Kowari (Dasyuroides byrnei, a small carnivorous marsupial) occurrence records from the ALA
  • Mulga (Acacia aneura) occurrence records from the ALA
  • Common brown butterfly (Heteronympha merope) occurrence records in 2014 from the ALA
  • Bitter pea (Daviesia ulicifolia) occurrence records from the ALA
# packages
library(galah)
library(ggplot2)
library(dplyr)
library(sf)
library(ozmaps)
library(tidyr)
library(stringr)
galah_config(email = "your-email-here") # ALA-registered email

desert_plant <- galah_call() |>
  filter(doi == "https://doi.org/10.26197/ala.96e26768-a725-490f-a4cf-fb92919e16fe") |>
  atlas_occurrences()

frogs <- galah_call() |>
  filter(doi == "https://doi.org/10.26197/ala.625b3655-9fb7-4b0e-acff-30dc820c2272") |>
  atlas_occurrences()

native_mice <- galah_call() |>
  filter(doi == "https://doi.org/10.26197/ala.4af234e0-4cec-4720-917a-f12ccfb83c4f") |>
  atlas_occurrences()

acacias <- galah_call() |>
  filter(doi == "https://doi.org/10.26197/ala.19951ce0-9f3f-4692-b079-02f4f0fd0a6d") |>
  atlas_occurrences()

butterflies <- galah_call() |>
  filter(doi == "https://doi.org/10.26197/ala.b28aaf66-4d18-41f6-8e84-e420656923c9") |>
  atlas_occurrences()

bitter_peas <- galah_call() |>
  filter(doi == "https://doi.org/10.26197/ala.44089ead-b5da-41c7-bdbb-761aec1c8825") |>
  atlas_occurrences()

Note: You don’t need to run this code block to read this chapter. It can, however, be useful to see the original download query. This code will download the latest data from the ALA, which you are welcome to use instead, though the data might not exactly reproduce results in this chapter.

library(galah)
galah_config(email = "your-email-here")

desert_plant <- galah_call() |>
  identify("Eremophila macdonnellii") |>
  select(group = "basic", 
         PRESUMED_SWAPPED_COORDINATE) |> # add assertion column
  atlas_occurrences()

frogs <- galah_call() |>
  identify("Litoria chloris") |>
  filter(year == 2013) |>
  select(group = "basic",
         countryCode, locality,
         family, genus, species, 
         cl22, eventDate) |>
  atlas_occurrences()

native_mice <- galah_call() |>
  identify("Dasyuroides byrnei") |>
  select(scientificName, decimalLongitude, decimalLatitude,
         eventDate,
         country, countryCode, locality, 
         COUNTRY_COORDINATE_MISMATCH,
         group = "assertions") |>
  atlas_occurrences()

acacias <- galah_call() |>
  identify("Acacia aneura") |>
  select(group = "basic",
         ZERO_COORDINATE, # add assertion column
         countryCode, locality) |>
  atlas_occurrences()

butterflies <- galah_call() |>
  identify("Heteronympha merope") |>
  filter(year == 2014,
         decimalLatitude < 0) |>
  select(group = "basic",
         COORDINATES_CENTRE_OF_COUNTRY, # add assertion column
         COORDINATES_CENTRE_OF_STATEPROVINCE, # add assertion column
         countryCode, locality) |>
  atlas_occurrences()

bitter_peas <- galah_call() |>
  identify("Daviesia ulicifolia") |>
  atlas_occurrences()

Note: We created a custom DOI for our download by using atlas_occurrences(mint_doi = TRUE).

10.1 Missing coordinates

As discussed in the Missing Values chapter, many spatial analytical tools are not compatible with missing coordinate data. We recommend identifying the rows that have missing data before deciding to exclude them.

# Identify missing data in coordinates
desert_plant |> 
  filter(is.na(decimalLatitude) | is.na(decimalLongitude))
# A tibble: 74 × 9
   recordID       scientificName taxonConceptID decimalLatitude decimalLongitude
   <chr>          <chr>          <chr>                    <dbl>            <dbl>
 1 000d4874-8c74… Eremophila ma… https://id.bi…              NA               NA
 2 050653b6-a41c… Eremophila ma… https://id.bi…              NA               NA
 3 06e69581-7d6d… Eremophila ma… https://id.bi…              NA               NA
 4 0eead38f-0c16… Eremophila ma… https://id.bi…              NA               NA
 5 0f52e34b-a803… Eremophila ma… https://id.bi…              NA               NA
 6 1190b3b9-90d8… Eremophila ma… https://id.bi…              NA               NA
 7 18d11ae3-e558… Eremophila ma… https://id.bi…              NA               NA
 8 19426c52-9d49… Eremophila ma… https://id.bi…              NA               NA
 9 205d432e-c6bc… Eremophila ma… https://id.bi…              NA               NA
10 2ab6846b-00cb… Eremophila ma… https://id.bi…              NA               NA
# ℹ 64 more rows
# ℹ 4 more variables: eventDate <dttm>, occurrenceStatus <chr>,
#   dataResourceName <chr>, PRESUMED_SWAPPED_COORDINATE <lgl>

You can use drop_na() to remove missing values from your dataset.

# Excluding them
desert_plant <- desert_plant |> 
  tidyr::drop_na(decimalLatitude, decimalLongitude)

Note: You could also use filter(!is.na(decimalLatitude), !is.na(decimalLongitude)) to achieve the same thing.
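
Before dropping rows, it can help to know how many records are affected and in which column. The following is a minimal sketch on a small, hypothetical toy table (`toy_records` and its values are made up for illustration, not taken from the ALA download).

```r
library(dplyr)
library(tidyr)

# Hypothetical toy records; NA marks a missing coordinate
toy_records <- tibble(
  recordID = c("a", "b", "c", "d"),
  decimalLatitude = c(-23.7, NA, -25.1, NA),
  decimalLongitude = c(133.9, 134.2, NA, NA)
)

# Count missing values per coordinate column, and rows missing either one
missing_summary <- toy_records |>
  summarise(
    n_missing_lat = sum(is.na(decimalLatitude)),
    n_missing_lon = sum(is.na(decimalLongitude)),
    n_missing_either = sum(is.na(decimalLatitude) | is.na(decimalLongitude))
  )

# Keep only rows with both coordinates present
toy_complete <- toy_records |>
  drop_na(decimalLatitude, decimalLongitude)
```

Comparing `n_missing_either` with the total number of rows gives a quick sense of how much data the exclusion step will cost.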

10.2 Correcting fixable coordinate errors

Spatial outliers can sometimes result from taxonomic misidentification, but not always. Occasionally, records that appear as outliers are true observations of a species but contain mistakes in their coordinates. To avoid unnecessarily deleting data, it’s good practice to use multiple sources of spatial information to decide whether an unexpected data point is due to a small but fixable error in coordinates.

Many coordinate issues can be solved through data manipulation rather than discarding the data. Here are several coordinate issues that can be identified and corrected.

10.2.1 Swapped numeric sign

If you notice a cluster of points mirrored in the opposite hemisphere, consider correcting the sign instead of discarding the points.

Let’s use MacDonnell’s desert fuchsia occurrence records for our example. Including the PRESUMED_SWAPPED_COORDINATE assertion column when downloading records using the galah package allows us to identify records flagged as potentially having swapped coordinates.

desert_plant <- desert_plant |>
  drop_na(decimalLongitude, decimalLatitude) # remove NA coordinates

desert_plant |>
  select(PRESUMED_SWAPPED_COORDINATE, everything())
# A tibble: 890 × 9
   PRESUMED_SWAPPED_COO…¹ recordID scientificName taxonConceptID decimalLatitude
   <lgl>                  <chr>    <chr>          <chr>                    <dbl>
 1 FALSE                  0009009… Eremophila ma… https://id.bi…           -22.8
 2 FALSE                  002e372… Eremophila ma… https://id.bi…           -25.3
 3 FALSE                  0034dc0… Eremophila ma… https://id.bi…           -23.9
 4 FALSE                  0063223… Eremophila ma… https://id.bi…           -24.5
 5 FALSE                  00d15a5… Eremophila ma… https://id.bi…           -27.8
 6 FALSE                  013049a… Eremophila ma… https://id.bi…           -25.2
 7 FALSE                  015571a… Eremophila ma… https://id.bi…           -22.1
 8 FALSE                  01b5e44… Eremophila ma… https://id.bi…           -25.2
 9 FALSE                  02524e1… Eremophila ma… https://id.bi…           -25.1
10 FALSE                  026c225… Eremophila ma… https://id.bi…           -25.8
# ℹ 880 more rows
# ℹ abbreviated name: ¹​PRESUMED_SWAPPED_COORDINATE
# ℹ 4 more variables: decimalLongitude <dbl>, eventDate <dttm>,
#   occurrenceStatus <chr>, dataResourceName <chr>

If we plot these records on a map and colour the points based on values in the PRESUMED_SWAPPED_COORDINATE assertion column, we can see that there is a single record (in orange) that looks like its coordinates have been mirrored across hemispheres.

# Retrieve map of Australia
aus <- st_transform(ozmap_country, 4326)

ggplot() + 
  geom_sf(data = aus) +
  geom_point(data = desert_plant,
             aes(x = decimalLongitude, 
                 y = decimalLatitude,
                 colour = PRESUMED_SWAPPED_COORDINATE)) + 
  pilot::scale_color_pilot()

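We can correct the numeric signs using if_else() from dplyr. The first statement updates our decimalLongitude column so that when decimalLongitude is less than 0, we remove the negative sign by multiplying by -1; otherwise we keep the original longitude value. The second statement does the reverse for decimalLatitude: when decimalLatitude is greater than 0, we multiply by -1 to make it negative; otherwise we keep the original latitude value. This works here because mainland Australia lies entirely at positive longitudes and negative latitudes.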

desert_plant_filtered <- desert_plant |>
  mutate(
    decimalLongitude = if_else(decimalLongitude < 0,
      decimalLongitude * -1,
      decimalLongitude
    ),
    decimalLatitude = if_else(decimalLatitude > 0,
      decimalLatitude * -1,
      decimalLatitude
    )
  )

And here’s the updated map, with the corrected coordinates.

ggplot() + 
  geom_sf(data = aus) +
  geom_point(data = desert_plant_filtered,
             aes(x = decimalLongitude, 
                 y = decimalLatitude,
                 colour = PRESUMED_SWAPPED_COORDINATE)) + 
  pilot::scale_color_pilot()
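
The correction above is applied to every record in the dataset. A more conservative variant is to change only records that carry the assertion flag. The sketch below shows this on hypothetical toy data (`toy_plants` and its values are invented for illustration), assuming the flag column marks the affected records as it does in this chapter.

```r
library(dplyr)

# Hypothetical toy records; record "b" has a mirrored latitude and is flagged
toy_plants <- tibble(
  recordID = c("a", "b", "c"),
  decimalLatitude = c(-23.7, 24.1, -25.3),
  decimalLongitude = c(133.9, 132.5, 131.0),
  PRESUMED_SWAPPED_COORDINATE = c(FALSE, TRUE, FALSE)
)

# Flip the sign only where the record is flagged AND the sign is unexpected
toy_plants_fixed <- toy_plants |>
  mutate(
    decimalLatitude = if_else(
      PRESUMED_SWAPPED_COORDINATE & decimalLatitude > 0,
      -decimalLatitude,
      decimalLatitude
    ),
    decimalLongitude = if_else(
      PRESUMED_SWAPPED_COORDINATE & decimalLongitude < 0,
      -decimalLongitude,
      decimalLongitude
    )
  )
```

Unflagged records pass through unchanged, which makes it easier to explain later exactly which values were edited.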

10.2.2 Location description doesn’t match coordinates

Misalignment between location metadata and coordinates could indicate errors in the dataset, but it’s sometimes possible to rectify these. Let’s use red-eyed tree frog data as an example.

frogs <- frogs |>
  drop_na(decimalLatitude, decimalLongitude) # remove NA values

frogs
# A tibble: 30 × 14
   recordID       scientificName taxonConceptID decimalLatitude decimalLongitude
   <chr>          <chr>          <chr>                    <dbl>            <dbl>
 1 0bbcdc4a-5638… Litoria chlor… https://biodi…           -28.4             153.
 2 16ecde94-6b9b… Litoria chlor… https://biodi…           -28.4             153.
 3 22b115c9-f799… Litoria chlor… https://biodi…           -28.2             153.
 4 236bda61-799f… Litoria chlor… https://biodi…           -20.3             149.
 5 2ba9c818-e81a… Litoria chlor… https://biodi…           -27.4             152.
 6 4a1b77fa-3538… Litoria chlor… https://biodi…           -30.1             153.
 7 4d3274a4-f9cd… Litoria chlor… https://biodi…           -27.4             153.
 8 4e87c52b-2b80… Litoria chlor… https://biodi…           -28.2             153.
 9 52c1043c-5f79… Litoria chlor… https://biodi…           -29.7             152.
10 52d5551a-816d… Litoria chlor… https://biodi…           -29.9             153.
# ℹ 20 more rows
# ℹ 9 more variables: eventDate <dttm>, occurrenceStatus <chr>,
#   dataResourceName <chr>, countryCode <chr>, locality <chr>, family <chr>,
#   genus <chr>, species <chr>, cl22 <chr>

When we plot the coordinates of our red-eyed tree frog occurrences, there is an unexpected observation near Japan (or where Japan would appear if we had plotted more countries and not just Australia). This is quite surprising—red-eyed tree frogs are not native to Japan!

# Get a map of aus, transform projection
aus <- st_transform(ozmap_country, 4326)

# Map
ggplot() +
  geom_sf(data = aus,
          colour = "grey60") +
  geom_point(data = frogs,
             aes(x = decimalLongitude,
                 y = decimalLatitude),
             colour = "#557755")

Let’s check the countryCode column to see whether this might be an Australian record with a mistake in the coordinates. Using distinct(), we can see that there are 2 country codes…

frogs |>
  distinct(countryCode)
# A tibble: 2 × 1
  countryCode
  <chr>      
1 AU         
2 JP         

…and filtering to Japan ("JP") identifies our stray data point.

frogs |>
  filter(countryCode == "JP")
# A tibble: 1 × 14
  recordID        scientificName taxonConceptID decimalLatitude decimalLongitude
  <chr>           <chr>          <chr>                    <dbl>            <dbl>
1 c08e641e-cf01-… Litoria chlor… https://biodi…            24.5             152.
# ℹ 9 more variables: eventDate <dttm>, occurrenceStatus <chr>,
#   dataResourceName <chr>, countryCode <chr>, locality <chr>, family <chr>,
#   genus <chr>, species <chr>, cl22 <chr>

So far this observation does seem to be in Japan. To be extra certain, we can also use the column locality, which provides additional information from the data collector about the record’s location.

frogs |>
  filter(countryCode == "JP") |>
  select(countryCode, locality, scientificName, decimalLatitude, decimalLongitude)
# A tibble: 1 × 5
  countryCode locality scientificName  decimalLatitude decimalLongitude
  <chr>       <chr>    <chr>                     <dbl>            <dbl>
1 JP          mt bucca Litoria chloris            24.5             152.

The locality column reveals the observation was made in “mt bucca”. This is surprising to see because Mt Bucca is a mountain in Queensland!

When we look at our Japan data point’s decimalLongitude and decimalLatitude alongside other values in our data, it becomes clear that the Japan data point seems to sit within the same numerical range as other points, but the decimalLatitude is positive rather than negative.

frogs |>
  arrange(desc(countryCode)) |>
  select(countryCode, decimalLongitude, decimalLatitude) |>
  print(n = 5)
# A tibble: 30 × 3
  countryCode decimalLongitude decimalLatitude
  <chr>                  <dbl>           <dbl>
1 JP                      152.            24.5
2 AU                      153.           -28.4
3 AU                      153.           -28.4
4 AU                      153.           -28.2
5 AU                      149.           -20.3
# ℹ 25 more rows

All of this evidence suggests that our Japan “outlier” might instead be an occurrence point with a mis-entered latitude coordinate.

Let’s fix this by making the record’s latitude negative. We’ll use case_when() from dplyr to specify that if the countryCode == "JP", then we’ll multiply the decimalLatitude by -1, reversing its sign.

frogs_fixed <- frogs |>
  mutate(
    decimalLatitude = case_when(
      countryCode == "JP" ~ decimalLatitude * -1, 
      .default = decimalLatitude 
    ))

frogs_fixed |>
  filter(countryCode == "JP") |> 
  select(decimalLatitude, decimalLongitude, countryCode)
# A tibble: 1 × 3
  decimalLatitude decimalLongitude countryCode
            <dbl>            <dbl> <chr>      
1           -24.5             152. JP         

Mapping our data again shows our outlier is an outlier no longer!

ggplot() +
  geom_sf(data = aus,
          colour = "grey60") +
  geom_point(data = frogs_fixed,
             aes(x = decimalLongitude,
                 y = decimalLatitude),
             colour = "#557755")
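
The reasoning used for the frog record (the latitude has the right magnitude but the wrong sign) can also be checked programmatically. This is a minimal sketch on hypothetical toy data (`toy_frogs` is invented for illustration): it tests whether negating the latitude of a non-Australian record places it inside the latitude range of the Australian records.

```r
library(dplyr)

# Hypothetical toy records; the "JP" record mimics the stray point
toy_frogs <- tibble(
  countryCode = c("AU", "AU", "AU", "JP"),
  decimalLatitude = c(-28.4, -20.3, -30.1, 24.5),
  decimalLongitude = c(153.2, 149.0, 153.0, 152.4)
)

# Latitude range covered by records with an Australian country code
au_range <- toy_frogs |>
  filter(countryCode == "AU") |>
  summarise(
    min_lat = min(decimalLatitude),
    max_lat = max(decimalLatitude)
  )

# For the remaining records, would a sign flip land inside that range?
candidates <- toy_frogs |>
  filter(countryCode != "AU") |>
  mutate(
    sign_flip_plausible = between(
      -decimalLatitude, au_range$min_lat, au_range$max_lat
    )
  )
```

A TRUE value is only supporting evidence. It is still worth checking fields such as locality before editing a coordinate.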

10.3 Excluding unfixable coordinate errors

Some coordinate issues cannot be fixed or inferred. In this case, it is important that you identify which records have issues and remove them prior to analysis. Here are some examples of geospatial errors that might need to be identified and removed in your dataset.

10.3.1 Flipped coordinates

Records with flipped coordinates typically appear as a group of points in an unexpected location. Although sometimes they can be fixed, this is not always the case.

Let’s use occurrence records of Kowari (a small, carnivorous marsupial) as an example. Including the COUNTRY_COORDINATE_MISMATCH assertion column when downloading records using the galah package allows us to identify records flagged as having mismatches between coordinates and country metadata.

native_mice <- native_mice |>
  drop_na(decimalLongitude, decimalLatitude)
  
native_mice |>
  select(COUNTRY_COORDINATE_MISMATCH, everything())
# A tibble: 1,334 × 131
   COUNTRY_COORDINATE_MISMATCH scientificName   decimalLongitude decimalLatitude
   <lgl>                       <chr>                       <dbl>           <dbl>
 1 FALSE                       Dasyuroides byr…             140.           -24.1
 2 FALSE                       Dasyuroides byr…             141.           -23.8
 3 FALSE                       Dasyuroides byr…             140.           -27.0
 4 FALSE                       Dasyuroides byr…             139.           -26.8
 5 FALSE                       Dasyuroides byr…             140.           -27.0
 6 FALSE                       Dasyuroides byr…             140.           -26.9
 7 FALSE                       Dasyuroides byr…             141.           -23.8
 8 FALSE                       Dasyuroides byr…             139.           -26.8
 9 FALSE                       Dasyuroides byr…             139.           -25.7
10 FALSE                       Dasyuroides byr…             140.           -26.9
# ℹ 1,324 more rows
# ℹ 127 more variables: eventDate <dttm>, country <chr>, countryCode <chr>,
#   locality <chr>, AMBIGUOUS_COLLECTION <lgl>, AMBIGUOUS_INSTITUTION <lgl>,
#   BASIS_OF_RECORD_INVALID <lgl>, biosecurityIssue <lgl>,
#   COLLECTION_MATCH_FUZZY <lgl>, COLLECTION_MATCH_NONE <lgl>,
#   CONTINENT_COORDINATE_MISMATCH <lgl>, CONTINENT_COUNTRY_MISMATCH <lgl>,
#   CONTINENT_DERIVED_FROM_COORDINATES <lgl>, …

Sometimes, flipped coordinates can be fixed by switching the latitude and longitude coordinates. Other times, like in this example, the way to fix the coordinates isn’t obvious.

ggplot() + 
  geom_sf(data = aus) +
  geom_point(data = native_mice,
             aes(x = decimalLongitude,
                 y = decimalLatitude,
                 colour = COUNTRY_COORDINATE_MISMATCH)) + 
  pilot::scale_color_pilot()

To remove these data, we can filter the dataset to exclude records that do not fall within Australia’s minimum and maximum coordinates.

native_mice_filtered <- native_mice |>
  filter(decimalLongitude > 100,
         decimalLongitude < 155,
         decimalLatitude > -45,
         decimalLatitude < -10)

ggplot() + 
  geom_sf(data = aus) +
  geom_point(data = native_mice_filtered,
             aes(x = decimalLongitude,
                 y = decimalLatitude,
                 colour = COUNTRY_COORDINATE_MISMATCH)) + 
  pilot::scale_color_pilot()
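
For the cases where flipped coordinates can be fixed by switching latitude and longitude, one possible approach is to swap only those records that fall outside the bounding box as recorded but inside it once swapped. The sketch below uses hypothetical toy data and a hypothetical helper `in_bbox()`; the bounding box values are the same ones used in the filter above.

```r
library(dplyr)

# Hypothetical toy records: "b" has latitude and longitude switched,
# "c" is outside Australia either way
toy_mice <- tibble(
  recordID = c("a", "b", "c"),
  decimalLongitude = c(140.1, -26.8, 12.5),
  decimalLatitude = c(-24.1, 139.5, 55.0)
)

# Hypothetical helper: is a point inside the rough Australian bounding box?
in_bbox <- function(lon, lat) {
  lon > 100 & lon < 155 & lat > -45 & lat < -10
}

toy_mice_checked <- toy_mice |>
  mutate(
    inside = in_bbox(decimalLongitude, decimalLatitude),
    # swapping helps only if the swapped pair falls inside the box
    fixable_by_swap = !inside & in_bbox(decimalLatitude, decimalLongitude),
    # build new columns so both are computed from the original values
    new_lon = if_else(fixable_by_swap, decimalLatitude, decimalLongitude),
    new_lat = if_else(fixable_by_swap, decimalLongitude, decimalLatitude)
  )

# Keep records that were fine or fixable; drop the rest
toy_mice_fixed <- toy_mice_checked |>
  filter(inside | fixable_by_swap) |>
  transmute(recordID,
            decimalLongitude = new_lon,
            decimalLatitude = new_lat)
```

Records like "c", which fall outside the box in both orientations, correspond to the unfixable case in this section and are removed.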

10.3.2 Zero coordinates

Sometimes latitude and/or longitude data are recorded as having zero values; these values are not accurate representations of locations and thus should be removed.

Let’s use acacia data as an example. Including the ZERO_COORDINATE assertion column in your download allows us to identify records flagged as having zero values in the coordinate fields.

acacias <- acacias |>
  drop_na(decimalLatitude, decimalLongitude) # remove NA values

acacias |>
  select(ZERO_COORDINATE, everything())
# A tibble: 10,804 × 11
   ZERO_COORDINATE recordID        scientificName taxonConceptID decimalLatitude
   <lgl>           <chr>           <chr>          <chr>                    <dbl>
 1 FALSE           0013ae12-fda4-… Acacia aneura  https://id.bi…           -31.5
 2 FALSE           00197d65-f235-… Acacia aneura  https://id.bi…           -29.5
 3 FALSE           001a3cbb-a370-… Acacia aneura… https://id.bi…           -28.0
 4 FALSE           00238db7-6c4e-… Acacia aneura… https://id.bi…           -34.1
 5 FALSE           00276566-b590-… Acacia aneura  https://id.bi…           -29.7
 6 FALSE           0029f3cf-b541-… Acacia aneura  https://id.bi…           -29.0
 7 FALSE           0034f771-a3e1-… Acacia aneura  https://id.bi…           -31.1
 8 FALSE           0035cb4f-85e5-… Acacia aneura… https://id.bi…           -29.5
 9 FALSE           003cce3b-f3f8-… Acacia aneura  https://id.bi…           -29.1
10 FALSE           0049bcbb-c8c2-… Acacia aneura  https://id.bi…           -25.5
# ℹ 10,794 more rows
# ℹ 6 more variables: decimalLongitude <dbl>, eventDate <dttm>,
#   occurrenceStatus <chr>, dataResourceName <chr>, countryCode <chr>,
#   locality <chr>

We can see the flagged record in orange on our map.

ggplot() + 
  geom_sf(data = aus) +
  geom_point(data = acacias,
             aes(x = decimalLongitude, 
                 y = decimalLatitude,
                 colour = ZERO_COORDINATE)) +
  pilot::scale_color_pilot()

We can remove this record by filtering our dataset to remove records with longitude or latitude coordinates that equal zero.

acacias_filtered <- acacias |>
  filter(decimalLongitude != 0,
         decimalLatitude != 0)

ggplot() + 
  geom_sf(data = aus) +
  geom_point(data = acacias_filtered,
             aes(x = decimalLongitude, 
                 y = decimalLatitude,
                 colour = ZERO_COORDINATE)) +
  pilot::scale_color_pilot()
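
When records are excluded, it can be useful to keep a copy of what was removed so the decision can be reviewed later. This is a minimal sketch on hypothetical toy data (`toy_acacias` is invented for illustration), using anti_join() from dplyr to recover the excluded rows.

```r
library(dplyr)

# Hypothetical toy records; "b" and "c" contain zero coordinates
toy_acacias <- tibble(
  recordID = c("a", "b", "c", "d"),
  decimalLatitude = c(-31.5, 0, -29.5, -28.0),
  decimalLongitude = c(141.2, 0, 0, 135.7)
)

# Comma-separated conditions are combined with AND, so a row is kept
# only if BOTH coordinates are non-zero
toy_acacias_filtered <- toy_acacias |>
  filter(decimalLongitude != 0,
         decimalLatitude != 0)

# Rows present in the original but absent from the filtered data
toy_acacias_removed <- toy_acacias |>
  anti_join(toy_acacias_filtered, by = "recordID")
```

Here `toy_acacias_removed` holds the rows with a zero in either coordinate field, which can be saved alongside the cleaned data.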

10.3.3 Centroids

Centroids, or coordinates that mark the exact centre point of an area, are sometimes assigned to an occurrence record when the original observation location was provided as a description. If a record was collected using a vague locality description or through incorrect geo-referencing, centroids can be used to categorise the record into broadly the correct area (see footnote 1).

Let’s use common brown butterfly data for our example. Including the COORDINATES_CENTRE_OF_COUNTRY and/or COORDINATES_CENTRE_OF_STATEPROVINCE assertion columns in your download allows us to identify records flagged as containing centroid coordinates.

butterflies <- butterflies |>
  drop_na(decimalLatitude, decimalLongitude) # remove NA values

butterflies |>
  select(COORDINATES_CENTRE_OF_COUNTRY,
         COORDINATES_CENTRE_OF_STATEPROVINCE,
         everything())
# A tibble: 338 × 12
   COORDINATES_CENTRE_OF_COUNTRY COORDINATES_CENTRE_OF…¹ recordID scientificName
   <lgl>                         <lgl>                   <chr>    <chr>         
 1 FALSE                         FALSE                   018f5a5… Heteronympha …
 2 FALSE                         FALSE                   02eef43… Heteronympha …
 3 FALSE                         FALSE                   03b39bb… Heteronympha …
 4 FALSE                         FALSE                   04c2ac2… Heteronympha …
 5 FALSE                         FALSE                   05cced1… Heteronympha …
 6 FALSE                         FALSE                   05ceb8b… Heteronympha …
 7 FALSE                         FALSE                   06679b8… Heteronympha …
 8 FALSE                         FALSE                   0704e7b… Heteronympha …
 9 FALSE                         FALSE                   0756bd4… Heteronympha …
10 FALSE                         FALSE                   0774f2e… Heteronympha …
# ℹ 328 more rows
# ℹ abbreviated name: ¹​COORDINATES_CENTRE_OF_STATEPROVINCE
# ℹ 8 more variables: taxonConceptID <chr>, decimalLatitude <dbl>,
#   decimalLongitude <dbl>, eventDate <dttm>, occurrenceStatus <chr>,
#   dataResourceName <chr>, countryCode <chr>, locality <chr>

Filtering our data to flagged records, we return one record.

butterflies |>
  filter(
    COORDINATES_CENTRE_OF_COUNTRY == TRUE |
    COORDINATES_CENTRE_OF_STATEPROVINCE == TRUE
    )
# A tibble: 1 × 12
  recordID        scientificName taxonConceptID decimalLatitude decimalLongitude
  <chr>           <chr>          <chr>                    <dbl>            <dbl>
1 89186e67-be72-… Heteronympha … https://biodi…           -31.3             147.
# ℹ 7 more variables: eventDate <dttm>, occurrenceStatus <chr>,
#   dataResourceName <chr>, countryCode <chr>, locality <chr>,
#   COORDINATES_CENTRE_OF_COUNTRY <lgl>,
#   COORDINATES_CENTRE_OF_STATEPROVINCE <lgl>

The flagged record is the single orange point on our map.

ggplot() + 
  geom_sf(data = aus) +
  geom_point(data = butterflies,
             aes(x = decimalLongitude, 
                 y = decimalLatitude,
                 colour = COORDINATES_CENTRE_OF_STATEPROVINCE)) +
  pilot::scale_color_pilot() +
  theme(legend.position = "none")

We can remove this data point by excluding this record from our dataset.

butterflies_filtered <- butterflies |>
  filter(COORDINATES_CENTRE_OF_STATEPROVINCE == FALSE)

ggplot() + 
  geom_sf(data = aus) +
  geom_point(data = butterflies_filtered,
             aes(x = decimalLongitude, 
                 y = decimalLatitude,
                 colour = COORDINATES_CENTRE_OF_STATEPROVINCE)) +
  pilot::scale_color_pilot() +
  theme(legend.position = "none")
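
One detail to keep in mind when filtering on a logical flag: filter() drops rows where the condition evaluates to NA. The toy example below is hypothetical (this chapter's butterfly data may contain no missing flags at all) and contrasts the strict filter with a variant that treats a missing flag as "not flagged".

```r
library(dplyr)

# Hypothetical toy records; record "c" has no value for the assertion
toy_butterflies <- tibble(
  recordID = c("a", "b", "c"),
  COORDINATES_CENTRE_OF_STATEPROVINCE = c(FALSE, TRUE, NA)
)

# Strict: keeps only rows where the flag is explicitly FALSE
strict <- toy_butterflies |>
  filter(COORDINATES_CENTRE_OF_STATEPROVINCE == FALSE)

# Lenient: treats a missing flag as FALSE, so those rows are kept
lenient <- toy_butterflies |>
  filter(!coalesce(COORDINATES_CENTRE_OF_STATEPROVINCE, FALSE))
```

Which behaviour is appropriate depends on whether an unassessed record should be trusted in your analysis.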

10.3.4 Cities, zoos, aquariums, museums & herbaria

Some observations are recorded in locations where animals and plants live but do not naturally occur. A common example is observations recorded at public facilities like zoos, aquariums, and botanic gardens.

Other times, observations are recorded in places where specimens of animals and plants might be stored, but not where they were observed. Common examples are museums and herbaria.

In some cases, like with records of the Gorse Bitter-pea, these locations can appear suspicious but not overly obvious. When we map these observations, there is a trailing distribution of points in Western Australia, with several points located near the west coast of Australia.

bitter_peas <- bitter_peas |>
  drop_na(decimalLongitude, decimalLatitude) # remove NA values

ggplot() + 
  geom_sf(data = aus) +
  geom_point(data = bitter_peas,
             aes(x = decimalLongitude,
                 y = decimalLatitude),
             colour = "#204466")

Suspiciously, if we Google the coordinates of the Western Australian Herbarium, the coordinates overlap with one of the points. We have highlighted this point in orange.

ggplot() + 
  geom_sf(data = aus) +
  geom_point(data = bitter_peas,
             aes(x = decimalLongitude,
                 y = decimalLatitude),
             colour = "#204466") +
  geom_point(aes(x = 115.8, y = -31.9), # point coordinates
             colour = "#f28100") +
  theme(legend.position = "none")

Filtering our data to the two left-most data points reveals that the data resources that supplied those records are both state herbaria.

bitter_peas |>
  filter(decimalLongitude < 120) |>
  select(dataResourceName)
# A tibble: 2 × 1
  dataResourceName                             
  <chr>                                        
1 National Herbarium of Victoria (MEL) AVH data
2 NSW AVH feed                                 

Having identified this, these records can now be removed from our dataset.

bitter_peas_filtered <- bitter_peas |>
  filter(decimalLongitude > 120)

ggplot() + 
  geom_sf(data = aus) +
  geom_point(data = bitter_peas_filtered,
             aes(x = decimalLongitude,
                 y = decimalLatitude),
             colour = "#204466")
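
Rather than filtering by longitude alone, records can be flagged by their distance to a known institution. The sketch below is one possible approach on hypothetical toy data: `haversine_km()` is a helper written here for illustration (it assumes a spherical Earth with a radius of 6371 km), the 10 km threshold is an arbitrary assumption, and the institution coordinates are the approximate point plotted above.

```r
library(dplyr)

# Hypothetical helper: great-circle distance in km (haversine formula),
# assuming a spherical Earth of radius 6371 km
haversine_km <- function(lon1, lat1, lon2, lat2) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * 6371 * asin(sqrt(a))
}

# Hypothetical toy records; "a" sits on the institution's coordinates
toy_peas <- tibble(
  recordID = c("a", "b", "c"),
  decimalLongitude = c(115.8, 145.3, 149.1),
  decimalLatitude = c(-31.9, -37.8, -35.3)
)

# Flag records within 10 km (arbitrary threshold) of the institution
toy_peas_flagged <- toy_peas |>
  mutate(
    km_to_institution = haversine_km(decimalLongitude, decimalLatitude,
                                     115.8, -31.9),
    near_institution = km_to_institution < 10
  )
```

Flagged records can then be inspected (for example, by checking dataResourceName) before deciding whether to remove them.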

You can use the field basisOfRecord to avoid including records from museums and herbaria when creating your query in galah.

library(galah)

# Show values in `basisOfRecord` field
search_all(fields, "basisOfRecord") |>
  show_values()

# Filter basis of record to only human observations
galah_call() |>
  identify("Daviesia ulicifolia") |>
  filter(basisOfRecord == "HUMAN_OBSERVATION") |>
  atlas_counts()

10.4 Packages

Other packages exist to make identifying and cleaning geospatial coordinates more streamlined. The advantage of using these packages is that they can run many checks over coordinates at one time, rather than identifying each error separately as we did throughout this chapter. This process can make finding possible spatial outliers faster. The disadvantage is that checks might be more difficult to tweak compared to manual checks. Manual checks can also make the steps you took to clean your data clearer (and easier to edit later) in a complete data cleaning workflow.

Choose the package (or mix of packages and functions) that works best for you and your data cleaning needs.

CoordinateCleaner

The CoordinateCleaner package provides automated flagging of common spatial and temporal errors in biological and palaeontological data. It is particularly useful for cleaning data from GBIF.

Here is an example of a general cleaning function, but there are many more bespoke options that the package offers.

library(CoordinateCleaner)

# Run record-level tests
coordinate_tests <- clean_coordinates(x = butterflies, 
                                      species = "scientificName")
Reading layer `ne_50m_land' from data source 
  `C:\Users\KEL329\AppData\Local\Temp\RtmpknkwLr\ne_50m_land.shp' 
  using driver `ESRI Shapefile'
Simple feature collection with 1420 features and 3 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -180 ymin: -89.99893 xmax: 180 ymax: 83.59961
Geodetic CRS:  WGS 84
summary(coordinate_tests)
    .val     .equ     .zer     .cap     .cen     .sea     .otl     .gbf 
       0        0        0       22        0       11       60        0 
   .inst .summary 
      13      100 
plot(coordinate_tests)
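
The flag columns returned by clean_coordinates() can be used to subset the data. In these columns TRUE means a record passed a test and FALSE means it was flagged. The sketch below uses hypothetical toy data and restricts the run to the "equal" and "zeros" tests, which should not need reference map layers (an assumption worth checking against the package documentation for your installed version).

```r
library(CoordinateCleaner)

# Hypothetical toy records; the second record has zero coordinates
toy_records <- data.frame(
  scientificName = "Heteronympha merope",
  decimalLongitude = c(147.2, 0, 145.0),
  decimalLatitude = c(-31.3, 0, -37.8)
)

# Run only two record-level tests, naming the coordinate columns explicitly
flagged <- clean_coordinates(
  x = toy_records,
  lon = "decimalLongitude",
  lat = "decimalLatitude",
  species = "scientificName",
  tests = c("equal", "zeros")
)

# .summary is TRUE for records that passed every test that was run
cleaned <- toy_records[flagged$.summary, ]
```

Keeping `flagged` alongside `cleaned` preserves a record of which test caught each excluded row.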

10.5 Summary

The cleaning steps in this chapter do not have to be run in order, or even at all. Whether they are used is context- and taxon-dependent. As an example, what is one species that has many “wrong” coordinates based on many of the steps listed above?

The Great White Shark.

# Download occurrence records
sharks <- galah_call() |>
  identify("Carcharodon carcharias") |>
  filter(basisOfRecord == "HUMAN_OBSERVATION") |>
  apply_profile(ALA) |>
  atlas_occurrences()

# Retrieve map of Australia
aus <- st_transform(ozmap_country, 4326)

# Map occurrences
sharks |>
  drop_na(decimalLongitude, decimalLatitude) |>
  ggplot() + 
  geom_sf(data = aus,
          colour = "grey60",
          fill = "white") +
  geom_point(aes(x = decimalLongitude, # uses the NA-free data piped into ggplot()
                 y = decimalLatitude),
             colour = "#135277") +
  theme_light()

The difficulty with cleaning Great White Shark occurrence data is that these sharks have a massive habitat range, and these locations along (what appear to be) the North American coast and Madagascar could very well be true occurrences. Be sure to consider the taxonomic and spatial range of your species before jumping into data cleaning!


  1. This can happen when record locations are incorrectly given as the physical location of the specimen, or because they represent individuals from captivity or grown in horticulture (but were not clearly labelled as such).