4  Duplicates

Duplicate records can occur for a number of reasons. For instance, a duplicate record might appear in an individual dataset due to errors in data collection or entry, or occur when aggregating multiple data sources. Alternatively, a record might be considered a duplicate in the context of one type of analysis, but not another. For example, prior to running species distribution models, records in the same location—even if they are separate observations—are considered duplicates and should be removed to avoid spatial bias. If you’re running multiple models for several time-periods, however, you may need to include records in the same location if they occurred in different time-periods. Context is key when determining how to identify and clean duplicate records in your dataset.

Identifying duplicates is important to avoid misleading analyses or visualisations. Duplicates can give the impression that there are more data than there really are and bias your analyses to favour certain species, locations, or time periods. In this chapter we will introduce ways of detecting and handling duplicate records in biodiversity data.

4.0.1 Prerequisites

In this chapter, we will use kingfisher (Alcedinidae) occurrence data in 2023 from the ALA.

# packages
library(galah)
library(dplyr)
library(janitor)
galah_config(email = "your-email-here") # ALA-registered email

birds <- galah_call() |>
  filter(doi == "https://doi.org/10.26197/ala.7fbf98a2-e224-48b4-9fb5-b046b5644fd6") |>
  atlas_occurrences()

Note: You don’t need to run this code block to read this chapter. It can, however, be useful to see the original download query. This code will download the latest data from the ALA, which you are welcome to use instead, though the data might not exactly reproduce results in this chapter.

library(galah)
galah_config(email = "your-email-here")

birds <- galah_call() |>
  identify("alcedinidae") |>
  filter(year == 2023) |>
  select(group = "basic", 
         family, genus, species, cl22, eventDate, month) |>
  atlas_occurrences()
1
We created a custom DOI for our download by using atlas_occurrences(mint_doi = TRUE).

4.1 Find duplicates

As an first example, let’s remove all spatially-duplicated records, based on latitude and longitude coordinate values.

The first thing to do is find the duplicate records.

In the above tibble our results show that there are just over 27,000 records that overlap spatially with duplicate coordinates. That seems like a lot! It would be rare to remove duplicates so broadly without considering why we need to remove duplicates; we don’t necessarily want to remove all of them.

Instead, if we are interested in comparing species in our data, it might be more useful to find duplicate spatial records for each species. We can split our data by species and remove records where there is more than one observation of the same species in the same location. This should leave one observation for each species in each location.

To filter our duplicate data by species, we can first split our data by species…

birds |>
  group_split(species)
<list_of<
  tbl_df<
    recordID        : character
    scientificName  : character
    taxonConceptID  : character
    decimalLatitude : double
    decimalLongitude: double
    eventDate       : datetime<UTC>
    occurrenceStatus: character
    dataResourceName: character
    family          : character
    genus           : character
    species         : character
    cl22            : character
    month           : double
  >
>[11]>
[[1]]
# A tibble: 2,066 × 13
   recordID       scientificName taxonConceptID decimalLatitude decimalLongitude
   <chr>          <chr>          <chr>                    <dbl>            <dbl>
 1 0007f679-255f… Ceyx azureus   https://biodi…           -35.3             149.
 2 00237f72-7b95… Ceyx azureus   https://biodi…           -33.6             151.
 3 002d4683-fdeb… Ceyx azureus   https://biodi…           -22.8             151.
 4 0030b417-ad83… Ceyx azureus   https://biodi…           -23.5             151.
 5 005c21b0-3066… Ceyx azureus   https://biodi…           -16.2             145.
 6 00671765-ed23… Ceyx azureus   https://biodi…           -36.1             145.
 7 0086cd20-926a… Ceyx azureus   https://biodi…           -35.6             150.
 8 00a29f49-65aa… Ceyx azureus   https://biodi…           -34.9             151.
 9 00a39798-5118… Ceyx azureus   https://biodi…           -17.1             146.
10 00c02d79-58ef… Ceyx azureus   https://biodi…           -16.2             145.
# ℹ 2,056 more rows
# ℹ 8 more variables: eventDate <dttm>, occurrenceStatus <chr>,
#   dataResourceName <chr>, family <chr>, genus <chr>, species <chr>,
#   cl22 <chr>, month <dbl>

[[2]]
# A tibble: 197 × 13
   recordID       scientificName taxonConceptID decimalLatitude decimalLongitude
   <chr>          <chr>          <chr>                    <dbl>            <dbl>
 1 00b3fbaf-20d1… Ceyx pusillus  https://biodi…           -12.3             131.
 2 02ec2065-820c… Ceyx pusillus  https://biodi…           -12.4             131.
 3 02f2b1e9-f706… Ceyx pusillus  https://biodi…           -16.9             146.
 4 030dd9ae-3f67… Ceyx pusillus  https://biodi…           -12.4             131.
 5 03a15b6a-b666… Ceyx pusillus  https://biodi…           -16.9             146.
 6 04c838f4-8641… Ceyx pusillus  https://biodi…           -12.4             131.
 7 058ef10d-247b… Ceyx pusillus  https://biodi…           -12.6             131.
 8 078b046a-9cf3… Ceyx pusillus  https://biodi…           -12.8             143.
 9 0791d280-a26c… Ceyx pusillus  https://biodi…           -12.4             131.
10 0843ea42-367f… Ceyx pusillus  https://biodi…           -16.1             145.
# ℹ 187 more rows
# ℹ 8 more variables: eventDate <dttm>, occurrenceStatus <chr>,
#   dataResourceName <chr>, family <chr>, genus <chr>, species <chr>,
#   cl22 <chr>, month <dbl>

[[3]]
# A tibble: 1,378 × 13
   recordID       scientificName taxonConceptID decimalLatitude decimalLongitude
   <chr>          <chr>          <chr>                    <dbl>            <dbl>
 1 00085f92-58d1… Dacelo (Dacel… https://biodi…           -14.2             132.
 2 005e94f6-d0d7… Dacelo (Dacel… https://biodi…           -12.7             143.
 3 0083fbfd-3f14… Dacelo (Dacel… https://biodi…           -17.5             141.
 4 00bd28e3-20aa… Dacelo (Dacel… https://biodi…           -13.4             132.
 5 00bde3fa-6095… Dacelo (Dacel… https://biodi…           -19.4             147.
 6 00c8ff2c-9282… Dacelo (Dacel… https://biodi…           -16.7             146.
 7 01462284-b0aa… Dacelo (Dacel… https://biodi…           -12.4             131.
 8 01db4718-55f5… Dacelo (Dacel… https://biodi…           -16.5             145.
 9 0268d927-927e… Dacelo (Dacel… https://biodi…           -14.5             132.
10 02712b54-8b8f… Dacelo (Dacel… https://biodi…           -19.4             147.
# ℹ 1,368 more rows
# ℹ 8 more variables: eventDate <dttm>, occurrenceStatus <chr>,
#   dataResourceName <chr>, family <chr>, genus <chr>, species <chr>,
#   cl22 <chr>, month <dbl>

[[4]]
# A tibble: 28,107 × 13
   recordID       scientificName taxonConceptID decimalLatitude decimalLongitude
   <chr>          <chr>          <chr>                    <dbl>            <dbl>
 1 00018120-2238… Dacelo (Dacel… https://biodi…           -27.4             153.
 2 0005aa70-a30f… Dacelo (Dacel… https://biodi…           -25.3             153.
 3 0006ff02-853a… Dacelo (Dacel… https://biodi…           -37.9             145.
 4 00076c8d-957e… Dacelo (Dacel… https://biodi…           -28.0             153.
 5 00080e38-ee2d… Dacelo (Dacel… https://biodi…           -33.1             150.
 6 000c7c7c-3603… Dacelo (Dacel… https://biodi…           -38.5             144.
 7 000f692d-8013… Dacelo (Dacel… https://biodi…           -19.3             147.
 8 000f87c2-9028… Dacelo (Dacel… https://biodi…           -35.3             149.
 9 0010012d-435e… Dacelo (Dacel… https://biodi…           -16.8             146.
10 001021bd-b8f1… Dacelo (Dacel… https://biodi…           -27.5             153.
# ℹ 28,097 more rows
# ℹ 8 more variables: eventDate <dttm>, occurrenceStatus <chr>,
#   dataResourceName <chr>, family <chr>, genus <chr>, species <chr>,
#   cl22 <chr>, month <dbl>

[[5]]
# A tibble: 155 × 13
   recordID       scientificName taxonConceptID decimalLatitude decimalLongitude
   <chr>          <chr>          <chr>                    <dbl>            <dbl>
 1 0120314d-38b4… Syma torotoro  https://biodi…           -12.6             143.
 2 04b2e915-4419… Syma torotoro  https://biodi…           -12.8             143.
 3 04d3acf0-acdb… Syma torotoro  https://biodi…           -12.7             143.
 4 0bf14fe4-01fc… Syma torotoro  https://biodi…           -12.7             143.
 5 0c0a3bc9-431f… Syma torotoro  https://biodi…           -12.7             143.
 6 0e428bdf-d6fb… Syma torotoro  https://biodi…           -10.8             142.
 7 0f29d245-adc8… Syma torotoro  https://biodi…           -12.6             143.
 8 0ff604e2-fbe3… Syma torotoro  https://biodi…           -10.8             142.
 9 1121aa01-3173… Syma torotoro  https://biodi…           -12.8             143.
10 131d66ad-46d8… Syma torotoro  https://biodi…           -12.8             143.
# ℹ 145 more rows
# ℹ 8 more variables: eventDate <dttm>, occurrenceStatus <chr>,
#   dataResourceName <chr>, family <chr>, genus <chr>, species <chr>,
#   cl22 <chr>, month <dbl>

[[6]]
# A tibble: 566 × 13
   recordID       scientificName taxonConceptID decimalLatitude decimalLongitude
   <chr>          <chr>          <chr>                    <dbl>            <dbl>
 1 004c61ec-ed26… Tanysiptera (… https://biodi…           -16.6             145.
 2 014b3666-a322… Tanysiptera (… https://biodi…           -16.6             145.
 3 0152bed4-9459… Tanysiptera (… https://biodi…           -12.8             143.
 4 01685613-fe2f… Tanysiptera (… https://biodi…           -12.7             143.
 5 019804dd-c3fc… Tanysiptera (… https://biodi…           -16.6             145.
 6 02a51caa-861f… Tanysiptera (… https://biodi…           -12.7             143.
 7 02f21d71-27c2… Tanysiptera (… https://biodi…           -10.8             142.
 8 02f96918-b627… Tanysiptera (… https://biodi…           -12.7             143.
 9 044df2a6-7d32… Tanysiptera (… https://biodi…           -16.2             145.
10 04baf9c0-fd82… Tanysiptera (… https://biodi…           -16.6             145.
# ℹ 556 more rows
# ℹ 8 more variables: eventDate <dttm>, occurrenceStatus <chr>,
#   dataResourceName <chr>, family <chr>, genus <chr>, species <chr>,
#   cl22 <chr>, month <dbl>

[[7]]
# A tibble: 12 × 13
   recordID       scientificName taxonConceptID decimalLatitude decimalLongitude
   <chr>          <chr>          <chr>                    <dbl>            <dbl>
 1 3aa845a0-902e… Todiramphus (… https://biodi…            1.32             104.
 2 5f59e40a-87d2… Todiramphus (… https://biodi…            1.28             104.
 3 789e7794-af27… Todiramphus (… https://biodi…            1.32             104.
 4 7c6e2ffa-73a8… Todiramphus (… https://biodi…          -28.2              154.
 5 806d0434-8d8f… Todiramphus (… https://biodi…            1.28             104.
 6 83d392d7-a15f… Todiramphus (… https://biodi…            1.28             104.
 7 85bd0100-ea5d… Todiramphus (… https://biodi…            1.28             104.
 8 aeda1763-cd85… Todiramphus (… https://biodi…            1.28             104.
 9 dac72d12-bf2c… Todiramphus (… https://biodi…            1.28             104.
10 e156dd38-95f5… Todiramphus (… https://biodi…            1.28             104.
11 e5574549-7cec… Todiramphus (… https://biodi…            1.28             104.
12 fd6435b4-32d6… Todiramphus (… https://biodi…            1.31             104.
# ℹ 8 more variables: eventDate <dttm>, occurrenceStatus <chr>,
#   dataResourceName <chr>, family <chr>, genus <chr>, species <chr>,
#   cl22 <chr>, month <dbl>

[[8]]
# A tibble: 2,367 × 13
   recordID       scientificName taxonConceptID decimalLatitude decimalLongitude
   <chr>          <chr>          <chr>                    <dbl>            <dbl>
 1 0024aef4-4a8b… Todiramphus (… https://biodi…           -19.4             147.
 2 004b8b6c-8a89… Todiramphus (… https://biodi…           -12.4             131.
 3 007b39cf-2660… Todiramphus (… https://biodi…           -27.4             153.
 4 00993728-10d2… Todiramphus (… https://biodi…           -27.1             153.
 5 009e7664-1b3e… Todiramphus (… https://biodi…           -27.4             153.
 6 00b1c419-0612… Todiramphus (… https://biodi…           -26.2             153.
 7 00cdcbdf-36a6… Todiramphus (… https://biodi…           -12.5             131.
 8 00deabe0-1d59… Todiramphus (… https://biodi…           -12.3             131.
 9 0120a6ea-66f1… Todiramphus (… https://biodi…           -30.4             153.
10 012e4a46-ead1… Todiramphus (… https://biodi…           -20.0             146.
# ℹ 2,357 more rows
# ℹ 8 more variables: eventDate <dttm>, occurrenceStatus <chr>,
#   dataResourceName <chr>, family <chr>, genus <chr>, species <chr>,
#   cl22 <chr>, month <dbl>

[[9]]
# A tibble: 292 × 13
   recordID       scientificName taxonConceptID decimalLatitude decimalLongitude
   <chr>          <chr>          <chr>                    <dbl>            <dbl>
 1 001c48ef-8329… Todiramphus (… https://biodi…           -31.9             142.
 2 016ec933-df3e… Todiramphus (… https://biodi…           -23.7             134.
 3 01833f43-b3d5… Todiramphus (… https://biodi…           -28.0             146.
 4 01e0d185-50ed… Todiramphus (… https://biodi…           -29.8             151.
 5 02d54da5-cfb0… Todiramphus (… https://biodi…           -19.0             146.
 6 033bf94d-cd11… Todiramphus (… https://biodi…           -20.0             140.
 7 06d7a1fc-9464… Todiramphus (… https://biodi…           -23.5             144.
 8 06e22048-0569… Todiramphus (… https://biodi…           -18.3             143.
 9 07fc074c-fff9… Todiramphus (… https://biodi…           -31.9             141.
10 09e67cf4-3fbf… Todiramphus (… https://biodi…           -17.7             140.
# ℹ 282 more rows
# ℹ 8 more variables: eventDate <dttm>, occurrenceStatus <chr>,
#   dataResourceName <chr>, family <chr>, genus <chr>, species <chr>,
#   cl22 <chr>, month <dbl>

[[10]]
# A tibble: 9,851 × 13
   recordID       scientificName taxonConceptID decimalLatitude decimalLongitude
   <chr>          <chr>          <chr>                    <dbl>            <dbl>
 1 00030a7e-a6cc… Todiramphus (… https://biodi…           -35.1             147.
 2 0005e748-e148… Todiramphus (… https://biodi…           -29.3             149.
 3 00072b9e-b843… Todiramphus (… https://biodi…           -33.7             151.
 4 00280b9f-8fc4… Todiramphus (… https://biodi…           -27.5             153.
 5 002e666e-a69c… Todiramphus (… https://biodi…           -19.2             147.
 6 002fcf2c-7a6c… Todiramphus (… https://biodi…           -33.1             151.
 7 003327aa-c684… Todiramphus (… https://biodi…           -36.5             147.
 8 0036208a-7764… Todiramphus (… https://biodi…           -32.8             117.
 9 0036fdd3-947e… Todiramphus (… https://biodi…           -35.3             149.
10 003ba830-1f22… Todiramphus (… https://biodi…           -27.3             153.
# ℹ 9,841 more rows
# ℹ 8 more variables: eventDate <dttm>, occurrenceStatus <chr>,
#   dataResourceName <chr>, family <chr>, genus <chr>, species <chr>,
#   cl22 <chr>, month <dbl>

[[11]]
# A tibble: 1,006 × 13
   recordID       scientificName taxonConceptID decimalLatitude decimalLongitude
   <chr>          <chr>          <chr>                    <dbl>            <dbl>
 1 0034efe7-4a01… Todiramphus    https://biodi…           -27.2             153.
 2 00478798-95f7… Todiramphus    https://biodi…           -28.2             154.
 3 0064cb36-4eee… Todiramphus    https://biodi…           -27.5             153.
 4 00bcefdb-f852… Todiramphus    https://biodi…           -25.6             153.
 5 00ebd2bb-0c34… Todiramphus    https://biodi…           -27.3             153.
 6 0116442d-a7f6… Todiramphus    https://biodi…           -16.9             146.
 7 01d19d01-c721… Todiramphus    https://biodi…           -27.5             153.
 8 020b9283-447f… Todiramphus    https://biodi…           -16.9             146.
 9 02511840-3813… Todiramphus    https://biodi…           -16.9             146.
10 03a9e4bd-37db… Todiramphus    https://biodi…           -25.6             153.
# ℹ 996 more rows
# ℹ 8 more variables: eventDate <dttm>, occurrenceStatus <chr>,
#   dataResourceName <chr>, family <chr>, genus <chr>, species <chr>,
#   cl22 <chr>, month <dbl>

…and use purrr::map()1 to remove duplicates for each species group, binding our dataframes together again with bind_rows().

library(purrr)

Attaching package: 'purrr'
The following object is masked from 'package:base':

    %||%
birds |>
  group_split(species) |>
  map(\(df) 
      df |> 
        filter(duplicated(decimalLongitude) & duplicated(decimalLatitude))
      ) |>
  bind_rows()
# A tibble: 23,776 × 13
   recordID       scientificName taxonConceptID decimalLatitude decimalLongitude
   <chr>          <chr>          <chr>                    <dbl>            <dbl>
 1 04b5c741-1afb… Ceyx azureus   https://biodi…           -37.8             145.
 2 06eb5cd7-413f… Ceyx azureus   https://biodi…           -34.5             151.
 3 083ec28b-68d3… Ceyx azureus   https://biodi…           -26.3             153.
 4 087d63fc-2505… Ceyx azureus   https://biodi…           -27.5             153.
 5 0afd32d4-c759… Ceyx azureus   https://biodi…           -37.8             145.
 6 0b2b6aab-1283… Ceyx azureus   https://biodi…           -28.8             154.
 7 0b8ea27e-2ca2… Ceyx azureus   https://biodi…           -34.5             151.
 8 0ba0afc4-1cf2… Ceyx azureus   https://biodi…           -27.3             153.
 9 0c570ead-3759… Ceyx azureus   https://biodi…           -33.0             151.
10 0c9f8e48-1c44… Ceyx azureus   https://biodi…           -27.3             153.
# ℹ 23,766 more rows
# ℹ 8 more variables: eventDate <dttm>, occurrenceStatus <chr>,
#   dataResourceName <chr>, family <chr>, genus <chr>, species <chr>,
#   cl22 <chr>, month <dbl>

Splitting by species has reduced the total number of duplicate records by ~3,500 rows because we’ve made it possible for multiple species to have records with the same spatial coordinates.

4.2 Remove duplicates

To now remove these duplicates from our dataframe, we can use the ! operator to return records that are not duplicated, rather than those that are.

birds_filtered <- birds |>
  group_split(species) |>
  map(\(df) 
      df |>
        filter(!duplicated(decimalLongitude) & !duplicated(decimalLatitude))) |>
  bind_rows()
birds_filtered
# A tibble: 22,011 × 13
   recordID       scientificName taxonConceptID decimalLatitude decimalLongitude
   <chr>          <chr>          <chr>                    <dbl>            <dbl>
 1 0007f679-255f… Ceyx azureus   https://biodi…           -35.3             149.
 2 00237f72-7b95… Ceyx azureus   https://biodi…           -33.6             151.
 3 002d4683-fdeb… Ceyx azureus   https://biodi…           -22.8             151.
 4 0030b417-ad83… Ceyx azureus   https://biodi…           -23.5             151.
 5 005c21b0-3066… Ceyx azureus   https://biodi…           -16.2             145.
 6 00671765-ed23… Ceyx azureus   https://biodi…           -36.1             145.
 7 0086cd20-926a… Ceyx azureus   https://biodi…           -35.6             150.
 8 00a29f49-65aa… Ceyx azureus   https://biodi…           -34.9             151.
 9 00a39798-5118… Ceyx azureus   https://biodi…           -17.1             146.
10 00c02d79-58ef… Ceyx azureus   https://biodi…           -16.2             145.
# ℹ 22,001 more rows
# ℹ 8 more variables: eventDate <dttm>, occurrenceStatus <chr>,
#   dataResourceName <chr>, family <chr>, genus <chr>, species <chr>,
#   cl22 <chr>, month <dbl>

To check our results, we can grab a random row from our unfiltered dataframe…

test_row <- birds |>
  filter(duplicated(decimalLongitude) & duplicated(decimalLatitude)) |>
  slice(10)

test_row |>
  select(species, decimalLatitude, decimalLongitude, recordID) # show relevant columns
# A tibble: 1 × 4
  species             decimalLatitude decimalLongitude recordID                 
  <chr>                         <dbl>            <dbl> <chr>                    
1 Todiramphus sanctus           -19.2             147. 01006006-78b2-4d19-bb99-…

…and see whether any rows in birds_filtered have the same combination of longitude and latitude coordinates.

birds_filtered |>
  filter(
    decimalLatitude %in% test_row$decimalLatitude & 
      decimalLongitude %in% test_row$decimalLongitude
    ) |>
  select(species, decimalLatitude, decimalLongitude, recordID) # show relevant columns
# A tibble: 4 × 4
  species               decimalLatitude decimalLongitude recordID               
  <chr>                           <dbl>            <dbl> <chr>                  
1 Dacelo leachii                  -19.2             147. 0bac3731-116a-4f7b-8b2…
2 Dacelo novaeguineae             -19.2             147. 161a96bb-e9af-4a6b-ae3…
3 Todiramphus macleayii           -19.2             147. 0208aa99-b3ce-449a-99d…
4 Todiramphus sanctus             -19.2             147. 00c3f294-01dd-4f66-b8a…

As expected, there are a few species with those latitude and longitude coordinates, but we now only have 1 row for each species in that location in birds_filtered.

Using %in% can be a powerful tool for finding duplicates in your dataframe. Extracting rows like we did above with our test_row example above (or a list of values in a column) can help you weed out more specific duplicate records you are interested in.

Our kingfisher data, birds_filtered, is now clean from spatially duplicated records!

Code
birds_filtered |>
  rmarkdown::paged_table()

4.3 Summary

This chapter has introduced some ways to find duplicated records, remove them from datasets, and check if the changes were correctly made. These methods can be more broadly applied to other types of data as well, not just spatial data. Depending on your analysis, you may need to use bespoke methods for handling duplicates. Later chapters like Taxonomic validation and Geospatial cleaning cover more advanced detection and cleaning methods.

In the next chapter, we will discuss ways of handling missing values in your dataset.


  1. We have used \(df) as shorthand within purrr::map(). This shorthand can be rewritten as map(.x = df, function(.x) {}).

    We provide an input, in this case the piped dataframe which we’ve called df, and use it in a custom function (defined within {}). This function is run over each dataframe in our list of dataframes.

    Check out this description from a recent purrr package update for another example.↩︎