9  Geospatial investigation

An important part of observational data is the location, specifying where each observation of an organism or species took place. These locations can range from locality descriptions (e.g. “Near Tellera Hill station”) to exact longitude and latitude coordinates tracked by a GPS system. The accuracy of these geospatial data will determine the types of ecological analyses you can perform. It is important to know the precision of these observations, along with the range of uncertainty around an observation’s location, to contextualize any findings or conclusions made using the data.

In this chapter, we will discuss some different ways to assess the precision and uncertainty of coordinates associated with occurrence records, and highlight how to identify spatial characteristics or errors when visualizing occurrences on maps.

9.0.1 Prerequisites

In this chapter we’ll use Banksia serrata occurrence records since 2022 and quokka (Setonix brachyurus) occurrence records, both downloaded from the ALA.

# packages
library(galah)
library(dplyr)
library(ggplot2)
library(ozmaps)
library(sf)
library(stringr)
galah_config(email = "your-email-here") # ALA-registered email

banksia <- galah_call() |>
  filter(doi == "https://doi.org /10.26197/ala.461d2169-48a7-47fe-9419-13ba1b93160a") |>
  atlas_occurrences()

quokkas <- galah_call() |>
  filter(doi == "https://doi.org /10.26197/ala.5b75fbae-6bd8-481d-aea7-c5112f2345e1") |>
  atlas_occurrences()

Note: You don’t need to run this code block to read this chapter. It can, however, be useful to see the original download query. This code will download the latest data from the ALA, which you are welcome to use instead, though the data might not exactly reproduce results in this chapter.

library(galah)
galah_config(email = "your-email-here")

banksia <- galah_call() |>
  identify("banksia serrata") |>
  filter(year > 2022) |>
  select(group = "basic",
         coordinatePrecision, 
         coordinateUncertaintyInMeters) |>
  atlas_occurrences()

quokkas <- galah_call() |>
  identify("Setonix brachyurus") |>
  select(group = "basic", 
         dataGeneralizations) |>
  atlas_occurrences()
1. We created a custom DOI for the download in the first code block by using atlas_occurrences(mint_doi = TRUE).

To check data quality, data infrastructures like the Atlas of Living Australia run assertions: automated data quality tests that flag when a record has an issue. The results of these tests are saved in assertions columns, which users can access if they wish.

If we use galah to download records, we can add assertions columns to our query to help identify and clean suspicious records.
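
For example, a query along these lines should attach assertion columns to each record in the download. This is a minimal sketch: quokkas_checked is just an illustrative name, and it assumes the group = "assertions" option of select(), which requests all assertion columns at once (individual assertions can also be named directly).

# Download quokka records with assertion columns attached
quokkas_checked <- galah_call() |>
  identify("Setonix brachyurus") |>
  select(group = c("basic", "assertions")) |>
  atlas_occurrences()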

If you would like to view all available assertions, use show_all(assertions).

assertions <- show_all(assertions)
assertions |>
  print(n = 7)
# A tibble: 124 × 4
  id                            description                   category type     
  <chr>                         <chr>                         <chr>    <chr>    
1 AMBIGUOUS_COLLECTION          Ambiguous collection          Comment  assertio…
2 AMBIGUOUS_INSTITUTION         Ambiguous institution         Comment  assertio…
3 BASIS_OF_RECORD_INVALID       Basis of record badly formed  Warning  assertio…
4 biosecurityIssue              Biosecurity issue             Error    assertio…
5 COLLECTION_MATCH_FUZZY        Collection match fuzzy        Comment  assertio…
6 COLLECTION_MATCH_NONE         Collection not matched        Comment  assertio…
7 CONTINENT_COORDINATE_MISMATCH continent coordinate mismatch Warning  assertio…
# ℹ 117 more rows

You can use the stringr package to search for text matches.

assertions |>
  filter(str_detect(id, "COORDINATE")) |>
  print(n = 5)
# A tibble: 17 × 4
  id                                 description                  category type 
  <chr>                              <chr>                        <chr>    <chr>
1 CONTINENT_COORDINATE_MISMATCH      continent coordinate mismat… Warning  asse…
2 CONTINENT_DERIVED_FROM_COORDINATES Continent derived from coor… Warning  asse…
3 COORDINATE_INVALID                 Coordinate invalid           Warning  asse…
4 COORDINATE_OUT_OF_RANGE            Coordinates are out of rang… Warning  asse…
5 COORDINATE_PRECISION_INVALID       Coordinate precision invalid Warning  asse…
# ℹ 12 more rows

In this chapter, we will detail when an assertion column can help identify occurrence records with geospatial issues.

9.1 Quick visualisation

As mentioned in the Inspect chapter, one of the most straightforward ways to check for spatial errors is to plot your data onto a map; obvious spatial errors are much easier to spot visually.

In most spatial datasets, the most important columns are decimalLatitude and decimalLongitude (or similarly named columns). These contain the latitude and longitude of each observation in decimal degrees (rather than degrees, minutes and seconds).

# Retrieve map of Australia
aus <- st_transform(ozmap_country, 4326)

# A quick plot of banksia occurrences
ggplot() + 
  geom_sf(data = aus, colour = "black", fill = NA) + 
  geom_point(data = banksia, 
             aes(x = decimalLongitude,
                 y = decimalLatitude),
             colour = "orchid")
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_point()`).

At a quick glance, we can check whether there are any records in places they shouldn’t be. Are there records in the ocean? Are there records in states or territories where our species definitely doesn’t live? Are the data too sparse for our intended analysis?

Lucky for us, the banksia data we just plotted doesn’t seem to have any obvious issues!
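
You can also back up the visual check with a programmatic one by flagging points that don’t intersect the land polygon. The sketch below assumes the aus object created above and the sf package; banksia_sf and offshore are illustrative names, not part of the original query.

# Flag records that don't fall on the Australian land polygon (e.g. at sea)
banksia_sf <- banksia |>
  tidyr::drop_na(decimalLongitude, decimalLatitude) |>
  st_as_sf(coords = c("decimalLongitude", "decimalLatitude"),
           crs = 4326)

offshore <- lengths(st_intersects(banksia_sf, aus)) == 0
sum(offshore) # number of records not intersecting the land polygon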

9.2 Precision

Not all observations have the same degree of precision. Coordinate precision can vary between data sources and recording equipment. For example, coordinates recorded with a GPS unit or a phone generally have higher precision than coordinates recorded manually from a locality description.

The degree of precision you require will depend on the granularity of your research question and analysis. A fine-scale question requires data measured at a fine scale to answer it; national- or global-scale questions can tolerate less precise data.

When downloading data from the ALA with the galah package, it’s possible to include the coordinatePrecision field with your data; this provides a decimal representation of the precision of coordinates for each observation.

banksia |>
  select(scientificName, 
         coordinatePrecision
         ) |>
  filter(!is.na(coordinatePrecision))
1. Not all records have this information recorded, so we also filter to only records with a coordinatePrecision value.
# A tibble: 84 × 2
   scientificName  coordinatePrecision
   <chr>                         <dbl>
 1 Banksia serrata         0.000000001
 2 Banksia serrata         0.000000001
 3 Banksia serrata         0.000000001
 4 Banksia serrata         0.000000001
 5 Banksia serrata         0.000000001
 6 Banksia serrata         0.000000001
 7 Banksia serrata         0.000000001
 8 Banksia serrata         0.000000001
 9 Banksia serrata         0.000000001
10 Banksia serrata         0.000000001
# ℹ 74 more rows

Only a few records have coordinatePrecision recorded, but that subset of records is very precise.

banksia |> 
  group_by(coordinatePrecision) |>
  count()
# A tibble: 2 × 2
# Groups:   coordinatePrecision [2]
  coordinatePrecision     n
                <dbl> <int>
1         0.000000001    84
2        NA             804

Filter your records to only those at or below a specific precision value (a smaller coordinatePrecision indicates a more precise coordinate).

# Keep records with a coordinatePrecision of 0.001 degrees or finer
banksia <- banksia |>
  filter(coordinatePrecision <= 0.001)
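
If it helps to think of these thresholds in metres, one degree of latitude is roughly 111 km, so a precision value in decimal degrees can be converted to an approximate ground distance. The helper below is a rough illustration only (it ignores the fact that a degree of longitude shrinks with latitude), and precision_to_metres is a hypothetical name, not a galah function.

# Rough conversion from decimal degrees to metres (1 degree latitude ~ 111 km)
precision_to_metres <- function(precision_deg) {
  precision_deg * 111320
}

precision_to_metres(0.001)       # ~111 m
precision_to_metres(0.000000001) # a fraction of a millimetre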

9.3 Uncertainty

Similarly, not all observations have the same degree of accuracy. An organism’s exact location usually has an area of uncertainty around it, which can grow or shrink depending on the method of observation and the species observed, much like coordinate precision. A key distinction between uncertainty and precision, however, is that data infrastructures like the ALA can add uncertainty to a record, usually to obscure a record’s exact location for sensitivity reasons. Although obscuring data is important for protecting individual species, uncertainty inevitably affects how robust the results of species distribution models are, so it is important to be aware of location uncertainty.

When downloading data from the ALA with the galah package, it’s possible to include the coordinateUncertaintyInMeters field with your data. This refers to the margin of error, represented as a circular area, around the true location of the recorded observation. We added this column in our original galah query.

banksia |>
  select(scientificName,
         coordinateUncertaintyInMeters
         )
# A tibble: 888 × 2
   scientificName  coordinateUncertaintyInMeters
   <chr>                                   <dbl>
 1 Banksia serrata                           316
 2 Banksia serrata                            10
 3 Banksia serrata                          2268
 4 Banksia serrata                            NA
 5 Banksia serrata                            NA
 6 Banksia serrata                            NA
 7 Banksia serrata                            15
 8 Banksia serrata                             4
 9 Banksia serrata                            NA
10 Banksia serrata                            NA
# ℹ 878 more rows

There is a range of coordinate uncertainty in our data, with many records falling within 10 m of uncertainty.

banksia |> 
  count(coordinateUncertaintyInMeters) 
# A tibble: 173 × 2
   coordinateUncertaintyInMeters     n
                           <dbl> <int>
 1                             1     4
 2                             2    10
 3                             3    38
 4                             4   156
 5                             5    59
 6                             6    21
 7                             7    15
 8                             8    29
 9                             9    23
10                            10    76
# ℹ 163 more rows
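
Before settling on a threshold, it can help to see what fraction of records you would keep. The summary below is a quick sketch; the column names it creates are illustrative, and records with missing uncertainty are excluded from the numerator.

# Proportion of records with coordinate uncertainty of 10 m or less
banksia |>
  summarise(
    n_total = n(),
    n_within_10m = sum(coordinateUncertaintyInMeters <= 10, na.rm = TRUE),
    prop_within_10m = n_within_10m / n_total
  )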

If your analysis requires greater certainty, you can then filter your records to a smaller area of uncertainty.

# Filter to records with coordinate uncertainty of 5 metres or less
banksia <- banksia |>
  filter(coordinateUncertaintyInMeters <= 5)

9.4 Obscured location

Occurrence records of sensitive, endangered, or critically endangered species may be deliberately obscured (i.e. generalised or obfuscated) to protect the true locations of these species. This process blurs an organism’s actual location to avoid risks like poaching or capture while still allowing their data to be included in broader summaries.

In the ALA, the field dataGeneralizations indicates whether a record has been obscured and provides information on the size of the area to which the point has been generalised.

The dataGeneralizations field will only be available to use or download when there are records in your query that have been generalised/obscured.

search_all(fields, "dataGeneralization")
# A tibble: 2 × 3
  id                      description                        type  
  <chr>                   <chr>                              <chr> 
1 dataGeneralizations     Data Generalised during processing fields
2 raw_dataGeneralizations <NA>                               fields

For example, the Western Swamp Tortoise is a critically endangered species in Western Australia. There are 127 records of this species in the ALA.

galah_call() |>
  identify("Pseudemydura umbrina") |>
  atlas_counts()
# A tibble: 1 × 1
  count
  <int>
1   127

Grouping record counts by the dataGeneralizations column shows that 126 of the 127 records have been generalised to 10 km.

galah_call() |>
  identify("Pseudemydura umbrina") |>
  group_by(dataGeneralizations) |>
  atlas_counts()
# A tibble: 1 × 2
  dataGeneralizations                                                      count
  <chr>                                                                    <int>
1 Record is Critically Endangered in Western Australia. Generalised to 10…   126

What do obscured data look like?

Quokka data offer a nice example of what to look for when data points have been obscured. When plotted, obscured occurrence data appear as if points were placed onto a grid¹.

# remove records with missing coordinates
quokkas <- quokkas |>
  tidyr::drop_na(decimalLatitude, decimalLongitude)

# aus map
aus <- ozmap_country |> st_transform(4326)

# map quokka occurrences
ggplot() + 
  geom_sf(data = aus, colour = "black", fill = NA) + 
  geom_point(data = quokkas, 
             aes(x = decimalLongitude,
                 y = decimalLatitude,
                 colour = dataGeneralizations |>
                   str_wrap(18))) +
  scale_colour_manual(values = c("sienna3", "snow4"),
                      guide = guide_legend(position = "bottom")) +
  guides(colour = guide_legend(title = "Data\ngeneralizations")) +
  xlim(114,120) + 
  ylim(-36,-31)
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

Keep in mind that survey data can also appear gridded if survey locations were evenly spaced, so be sure to double check before assuming data have been obscured!
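
One way to double-check is to look at the coordinates themselves. The sketch below is an illustration, not a definitive test: it assumes the dataGeneralizations text contains the word “Generalised” (as in the counts above). Points snapped to a 10 km grid tend to share exact coordinates and to sit at regular longitude intervals of roughly 0.1 degrees.

# Duplicated coordinate pairs are common when locations are snapped to a grid
quokkas |>
  count(decimalLongitude, decimalLatitude, sort = TRUE) |>
  head(5)

# Spacing between neighbouring unique longitudes of generalised records
quokkas |>
  filter(str_detect(dataGeneralizations, "Generalised")) |>
  distinct(decimalLongitude) |>
  arrange(decimalLongitude) |>
  mutate(spacing = decimalLongitude - lag(decimalLongitude)) |>
  head(5)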

For more information, check out the ALA’s support article about working with threatened, migratory and sensitive species.

9.5 Summary

In this chapter, we showed some ways to investigate the geospatial coordinates of your data and determine their level of precision, uncertainty, and obfuscation.

In the next chapter, we’ll see examples of issues with coordinates that require correcting or removing.


  1. This is what actually happened—locations have been “snapped” onto a grid determined by the generalised distance.↩︎