```r
# packages
library(galah)
library(dplyr)
library(ggplot2)  # plotting
library(ozmaps)   # provides ozmap_country
library(sf)       # provides st_transform()
library(stringr)  # provides str_detect() and str_wrap()
galah_config(email = "your-email-here") # ALA-registered email

banksia <- galah_call() |>
  filter(doi == "https://doi.org/10.26197/ala.461d2169-48a7-47fe-9419-13ba1b93160a") |>
  atlas_occurrences()

quokkas <- galah_call() |>
  filter(doi == "https://doi.org/10.26197/ala.5b75fbae-6bd8-481d-aea7-c5112f2345e1") |>
  atlas_occurrences()
```
9 Geospatial investigation
An important part of observational data is the location, specifying where each observation of an organism or species took place. These locations can range from locality descriptions (e.g. “Near Tellera Hill station”) to exact longitude and latitude coordinates tracked by a GPS system. The accuracy of these geospatial data will determine the types of ecological analyses you can perform. It is important to know the precision of these observations, along with the range of uncertainty around an observation’s location, to contextualize any findings or conclusions made using the data.
In this chapter, we will discuss some different ways to assess the precision and uncertainty of coordinates associated with occurrence records, and highlight how to identify spatial characteristics or errors when visualizing occurrences on maps.
9.0.1 Prerequisites
In this chapter we’ll use data of Banksia serrata occurrence records since 2022 and quokka occurrence records from the ALA.
To check data quality, data infrastructures like the Atlas of Living Australia run assertions—data quality tests that flag when a record has an issue. The results of these assertions are saved in assertions columns that users can access if they would like them.
If we use galah to download records, we can add assertions columns to our query to help identify and clean suspicious records.
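To get a feel for what an assertion tests, here is a toy, base-R version of an out-of-range coordinate check in the spirit of the ALA's COORDINATE_OUT_OF_RANGE assertion. This is illustrative only; the column names mirror the ALA's, but it is not the ALA's implementation.

```r
# Toy check in the spirit of the COORDINATE_OUT_OF_RANGE assertion:
# valid latitudes lie in [-90, 90] and longitudes in [-180, 180]
coord_out_of_range <- function(lat, lon) {
  abs(lat) > 90 | abs(lon) > 180
}

# two deliberately impossible coordinates (rows 2 and 3)
obs <- data.frame(
  decimalLatitude  = c(-33.9, 95.0, -42.1),
  decimalLongitude = c(151.2, 115.9, 200.0)
)
obs$flagged <- coord_out_of_range(obs$decimalLatitude, obs$decimalLongitude)
obs$flagged
# FALSE TRUE TRUE
```

In practice you rarely need to write such checks yourself; requesting the assertions columns from the ALA gives you the results of many more tests than this.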
If you would like to view assertions, use `show_all()`.
```r
assertions <- show_all(assertions)
assertions |>
  print(n = 7)
```
```
# A tibble: 124 × 4
  id                            description                   category type
  <chr>                         <chr>                         <chr>    <chr>
1 AMBIGUOUS_COLLECTION          Ambiguous collection          Comment  assertio…
2 AMBIGUOUS_INSTITUTION         Ambiguous institution         Comment  assertio…
3 BASIS_OF_RECORD_INVALID       Basis of record badly formed  Warning  assertio…
4 biosecurityIssue              Biosecurity issue             Error    assertio…
5 COLLECTION_MATCH_FUZZY        Collection match fuzzy        Comment  assertio…
6 COLLECTION_MATCH_NONE         Collection not matched        Comment  assertio…
7 CONTINENT_COORDINATE_MISMATCH continent coordinate mismatch Warning  assertio…
# ℹ 117 more rows
```
You can use the stringr package to search for text matches.
```r
assertions |>
  filter(str_detect(id, "COORDINATE")) |>
  print(n = 5)
```
```
# A tibble: 17 × 4
  id                                 description                  category type
  <chr>                              <chr>                        <chr>    <chr>
1 CONTINENT_COORDINATE_MISMATCH      continent coordinate mismat… Warning  asse…
2 CONTINENT_DERIVED_FROM_COORDINATES Continent derived from coor… Warning  asse…
3 COORDINATE_INVALID                 Coordinate invalid           Warning  asse…
4 COORDINATE_OUT_OF_RANGE            Coordinates are out of rang… Warning  asse…
5 COORDINATE_PRECISION_INVALID       Coordinate precision invalid Warning  asse…
# ℹ 12 more rows
```
In this chapter, we will detail when an assertion column can help identify occurrence records with geospatial issues.
9.1 Quick visualisation
As mentioned in the Inspect chapter, one of the most straightforward ways to check for spatial errors is to plot your data onto a map. More obvious spatial errors are much easier to spot visually.
In most spatial datasets, the most important columns are `decimalLatitude` and `decimalLongitude` (or similarly named columns). These contain the latitude and longitude of each observation in decimal degrees (rather than degrees, minutes and seconds).
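As a quick aside, a coordinate reported in degrees, minutes and seconds can be converted to decimal degrees by dividing minutes by 60 and seconds by 3600. A minimal sketch (the example coordinate is made up):

```r
# Convert degrees-minutes-seconds to decimal degrees.
# Southern and western hemisphere coordinates are negative.
dms_to_decimal <- function(degrees, minutes, seconds, negative = FALSE) {
  dd <- degrees + minutes / 60 + seconds / 3600
  if (negative) -dd else dd
}

dms_to_decimal(35, 17, 24, negative = TRUE)  # 35° 17' 24" S -> -35.29
```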
```r
# Retrieve map of Australia
aus <- st_transform(ozmap_country, 4326)

# A quick plot of banksia occurrences
ggplot() +
  geom_sf(data = aus, colour = "black", fill = NA) +
  geom_point(data = banksia,
             aes(x = decimalLongitude,
                 y = decimalLatitude),
             colour = "orchid")
```

```
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_point()`).
```
At a glance, we can check whether there are any records in places they shouldn't be. Are there records in the ocean? Are there records in states or territories where our species definitely doesn't live? Is the data too sparse to use for our expected analysis plan?
Luckily for us, the `banksia` data we just plotted doesn't seem to have any obvious issues!
9.2 Precision
Not all observations have the same degree of precision. Coordinate precision can vary between data sources and recording equipment. For example, coordinates recorded with a GPS unit or a phone generally have higher precision than coordinates recorded manually from a locality description.
The degree of precision you require will depend on the granularity of your research question and analysis. A fine-scale question requires data measured at a fine scale to answer it; national- or global-scale questions can tolerate less precise data.
When downloading data from the ALA with the galah package, it’s possible to include the coordinatePrecision
field with your data; this provides a decimal representation of the precision of coordinates for each observation.
```r
banksia |>
  select(scientificName,
         coordinatePrecision) |>
  filter(!is.na(coordinatePrecision))
```

Not all records have this information recorded, so we also filter to only records with a `coordinatePrecision` value.
```
# A tibble: 84 × 2
   scientificName  coordinatePrecision
   <chr>                         <dbl>
 1 Banksia serrata         0.000000001
 2 Banksia serrata         0.000000001
 3 Banksia serrata         0.000000001
 4 Banksia serrata         0.000000001
 5 Banksia serrata         0.000000001
 6 Banksia serrata         0.000000001
 7 Banksia serrata         0.000000001
 8 Banksia serrata         0.000000001
 9 Banksia serrata         0.000000001
10 Banksia serrata         0.000000001
# ℹ 74 more rows
```
Only a few records have `coordinatePrecision` recorded, but that subset of records is very precise.
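To put these values in context: one degree of latitude spans roughly 111 km on the ground, so a `coordinatePrecision` of 0.000000001 implies sub-millimetre resolution, far finer than any field GPS. A rough back-of-envelope conversion (assuming ~111,320 m per degree of latitude; longitude degrees shrink towards the poles):

```r
# Approximate ground distance represented by a decimal-degree precision
# (latitude only; a rule of thumb, not a geodesic calculation)
precision_in_metres <- function(precision_deg, metres_per_degree = 111320) {
  precision_deg * metres_per_degree
}

precision_in_metres(0.000000001)  # ~0.0001 m (about 0.1 mm)
precision_in_metres(0.001)        # ~111 m
```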
```r
banksia |>
  group_by(coordinatePrecision) |>
  count()
```
```
# A tibble: 2 × 2
# Groups:   coordinatePrecision [2]
  coordinatePrecision     n
                <dbl> <int>
1         0.000000001    84
2        NA             804
```
You can then filter your records to only those below a specific precision value (smaller values indicate more precise coordinates).
```r
# Filter by number of decimal places
banksia <- banksia |>
  filter(coordinatePrecision <= 0.001)
```
9.3 Uncertainty
Similarly, not all observations have the same degree of accuracy. An organism’s exact location will likely have an area of uncertainty around it, which can grow or shrink depending on the method of observation and the species observed, similar to coordinate precision. However, a main distinction between record uncertainty and record precision is that data infrastructures like the ALA can add uncertainty to a record. Obscuring a record’s exact location is usually for sensitivity purposes. Although obscuring data is important for protecting individual species, uncertainty inevitably affects how robust the results from species distribution models are, so it is important to be aware of location uncertainty.
When downloading data from the ALA with the galah package, it’s possible to include the coordinateUncertaintyInMeters
field with your data. This refers to the margin of error, represented as a circular area, around the true location of the recorded observation. We added this column in our original galah query.
```r
banksia |>
  select(scientificName,
         coordinateUncertaintyInMeters)
```
```
# A tibble: 888 × 2
   scientificName  coordinateUncertaintyInMeters
   <chr>                                   <dbl>
 1 Banksia serrata                           316
 2 Banksia serrata                            10
 3 Banksia serrata                          2268
 4 Banksia serrata                            NA
 5 Banksia serrata                            NA
 6 Banksia serrata                            NA
 7 Banksia serrata                            15
 8 Banksia serrata                             4
 9 Banksia serrata                            NA
10 Banksia serrata                            NA
# ℹ 878 more rows
```
There is a range of coordinate uncertainty in our data, with many records falling within 10 m of uncertainty.
```r
banksia |>
  count(coordinateUncertaintyInMeters)
```
```
# A tibble: 173 × 2
   coordinateUncertaintyInMeters     n
                           <dbl> <int>
 1                             1     4
 2                             2    10
 3                             3    38
 4                             4   156
 5                             5    59
 6                             6    21
 7                             7    15
 8                             8    29
 9                             9    23
10                            10    76
# ℹ 163 more rows
```
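Before filtering, it can help to picture what a given uncertainty radius means on the ground. Below is a rough sketch that converts a circular uncertainty (in metres) into a bounding box in decimal degrees, assuming ~111,320 m per degree of latitude. This is an approximation, not a geodesic calculation, and the example coordinates are made up:

```r
# Rough bounding box (decimal degrees) around a point, given a
# circular uncertainty radius in metres. Longitude degrees shrink
# by cos(latitude), so the box is wider in degrees at high latitudes.
uncertainty_bbox <- function(lat, lon, uncertainty_m) {
  m_per_deg_lat <- 111320
  m_per_deg_lon <- 111320 * cos(lat * pi / 180)
  dlat <- uncertainty_m / m_per_deg_lat
  dlon <- uncertainty_m / m_per_deg_lon
  c(ymin = lat - dlat, ymax = lat + dlat,
    xmin = lon - dlon, xmax = lon + dlon)
}

# a point near Sydney with 316 m of uncertainty
uncertainty_bbox(-33.87, 151.21, 316)
```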
If your analysis requires greater certainty, you can then filter your records to a smaller area of uncertainty.
```r
# Filter to records with no more than 5 m of uncertainty
banksia <- banksia |>
  filter(coordinateUncertaintyInMeters <= 5)
```
9.4 Obscured location
Occurrence records of sensitive, endangered, or critically endangered species may be deliberately obscured (i.e. generalised or obfuscated) to protect the true locations of these species. This process blurs an organism’s actual location to avoid risks like poaching or capture while still allowing their data to be included in broader summaries.
In the ALA, the field `dataGeneralizations` indicates whether a record has been obscured and provides information on the size of the area to which the point has been generalised.
**dataGeneralizations**

The `dataGeneralizations` field will only be available to use or download when there are records in your query that have been generalised/obscured.
```r
search_all(fields, "dataGeneralization")
```
```
# A tibble: 2 × 3
  id                      description                        type
  <chr>                   <chr>                              <chr>
1 dataGeneralizations     Data Generalised during processing fields
2 raw_dataGeneralizations <NA>                               fields
```
For example, the Western Swamp Tortoise is a critically endangered species in Western Australia. There are 127 records of this species in the ALA.
```r
galah_call() |>
  identify("Pseudemydura umbrina") |>
  atlas_counts()
```
```
# A tibble: 1 × 1
  count
  <int>
1   127
```
Grouping record counts by the dataGeneralizations
column shows that 126 of the 127 records have been obscured by 10 km.
```r
galah_call() |>
  identify("Pseudemydura umbrina") |>
  group_by(dataGeneralizations) |>
  atlas_counts()
```
```
# A tibble: 1 × 2
  dataGeneralizations                                                      count
  <chr>                                                                    <int>
1 Record is Critically Endangered in Western Australia. Generalised to 10…   126
```
What do obscured data look like?
Quokka data offer a nice example of what to look for when data points have been obscured. When plotted, obscured occurrence data appear as if points were placed onto a grid.[^1]
```r
# remove records with missing coordinates
quokkas <- quokkas |>
  tidyr::drop_na(decimalLatitude, decimalLongitude)

# aus map
aus <- ozmap_country |> st_transform(4326)

# map quokka occurrences
ggplot() +
  geom_sf(data = aus, colour = "black", fill = NA) +
  geom_point(data = quokkas,
             aes(x = decimalLongitude,
                 y = decimalLatitude,
                 colour = dataGeneralizations |> str_wrap(18))) +
  scale_colour_manual(values = c("sienna3", "snow4"),
                      guide = guide_legend(position = "bottom")) +
  guides(colour = guide_legend(title = "Data\ngeneralizations")) +
  xlim(114, 120) +
  ylim(-36, -31)
```

```
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
```
Keep in mind that survey data can also appear gridded if survey locations were evenly spaced, so be sure to double check before assuming data have been obscured!
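To see why generalised points line up this way, here is a toy sketch that snaps coordinates to a fixed cell size. It mimics the gridded pattern only and is not the ALA's actual generalisation algorithm (10 km is roughly 0.09 degrees of latitude):

```r
# Toy "generalisation": snap each coordinate to the nearest grid line.
# Purely illustrative; real generalisation handles projections,
# cell placement, and sensitive-species rules properly.
snap_to_grid <- function(coord, cell_deg = 0.09) {
  round(coord / cell_deg) * cell_deg
}

lats <- c(-32.123, -33.456, -34.789)
snap_to_grid(lats)
# every snapped value is a multiple of 0.09
```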
For more information, check out the ALA’s support article about working with threatened, migratory and sensitive species.
9.5 Summary
In this chapter, we showed some ways to investigate the geospatial coordinates of your data and determine their level of precision, uncertainty (accuracy), or obfuscation.
In the next chapter, we’ll see examples of issues with coordinates that require correcting or removing.
[^1]: This is what actually happened—locations have been "snapped" onto a grid determined by the generalised distance.