# packages
library(galah)
library(dplyr)
library(janitor)
galah_config(email = "your-email-here") # ALA-registered email
<- galah_call() |>
birds filter(doi == "https://doi.org /10.26197/ala.37497b54-2bcb-4d47-bf43-823ee137d816") |>
atlas_occurrences()
4 Duplicates
Duplicate records can occur for a number of reasons. For instance, a duplicate record might appear in an individual dataset due to errors in data collection or entry, or occur when aggregating multiple data sources. Alternatively, a record might be considered a duplicate in the context of one type of analysis, but not another. For example, prior to running species distribution models, records in the same location—even if they are separate observations—are considered duplicates and should be removed to avoid spatial bias. If you’re running multiple models for several time-periods, however, you may need to include records in the same location if they occurred in different time-periods. Context is key when determining how to identify and clean duplicate records in your dataset.
Identifying duplicates is important to avoid misleading analyses or visualisations. Duplicates can give the impression that there are more data than there really are and bias your analyses to favour certain species, locations, or time periods. In this chapter we will introduce ways of detecting and handling duplicate records in biodiversity data.
4.0.1 Prerequisites
In this chapter, we will use kingfisher (Alcedinidae) occurrence data in 2023 from the ALA.
4.1 Find duplicates
As an first example, let’s remove all spatially-duplicated records, based on latitude and longitude coordinate values.
The first thing to do is find the duplicate records.
Return a summary of the number of duplicates for each set of coordinates.
|>
birds group_by(decimalLongitude, decimalLatitude) |>
filter(n() > 1) |>
summarise(n = n(), .groups = "drop")
# A tibble: 5,263 × 3
decimalLongitude decimalLatitude n
<dbl> <dbl> <int>
1 115. -34.1 2
2 115. -33.8 2
3 115. -33.7 2
4 115. -33.6 2
5 115. -34.0 2
6 115. -33.9 12
7 115. -33.9 2
8 115. -33.6 2
9 115. -33.6 2
10 115. -33.6 2
# ℹ 5,253 more rows
Return a summary of duplicate decimal longitude and latitude rows in the entire dataset.
|>
birds filter(duplicated(decimalLongitude) & duplicated(decimalLatitude))
# A tibble: 27,297 × 13
recordID scientificName taxonConceptID decimalLatitude decimalLongitude
<chr> <chr> <chr> <dbl> <dbl>
1 0036fdd3-947e… Todiramphus (… https://biodi… -35.3 149.
2 006cea57-9ec8… Dacelo (Dacel… https://biodi… -33.9 151.
3 007a75a3-aba2… Dacelo (Dacel… https://biodi… -27.5 153.
4 007b39cf-2660… Todiramphus (… https://biodi… -27.4 153.
5 0090348a-41bb… Todiramphus (… https://biodi… -27.4 153.
6 009dd9a8-e12b… Todiramphus (… https://biodi… -27.4 153.
7 00bde3fa-6095… Dacelo (Dacel… https://biodi… -19.4 147.
8 00e8c718-f305… Dacelo (Dacel… https://biodi… -31.7 116.
9 00fe93fc-86b9… Dacelo (Dacel… https://biodi… -27.4 153.
10 01006006-78b2… Todiramphus (… https://biodi… -19.2 147.
# ℹ 27,287 more rows
# ℹ 8 more variables: eventDate <dttm>, occurrenceStatus <chr>,
# dataResourceName <chr>, family <chr>, genus <chr>, species <chr>,
# cl22 <chr>, month <dbl>
Return duplicated rows and the number of duplicates of decimalLatitude
OR decimalLongitude
(note that this differs from the dplyr example because janitor uses commas as an OR statement).
|>
birds get_dupes(decimalLatitude, decimalLongitude)
# A tibble: 32,533 × 14
decimalLatitude decimalLongitude dupe_count recordID scientificName
<dbl> <dbl> <int> <chr> <chr>
1 -27.4 153. 686 00018120-2238-412… Dacelo (Dacel…
2 -27.4 153. 686 007b39cf-2660-415… Todiramphus (…
3 -27.4 153. 686 0090348a-41bb-4f3… Todiramphus (…
4 -27.4 153. 686 009dd9a8-e12b-45d… Todiramphus (…
5 -27.4 153. 686 00fe93fc-86b9-4cd… Dacelo (Dacel…
6 -27.4 153. 686 017e052c-dd85-43d… Todiramphus (…
7 -27.4 153. 686 01bda731-b532-43c… Dacelo (Dacel…
8 -27.4 153. 686 023f2c30-93a9-4d6… Dacelo (Dacel…
9 -27.4 153. 686 03d44a01-7863-469… Todiramphus (…
10 -27.4 153. 686 03dee6d8-4ae9-46a… Todiramphus (…
# ℹ 32,523 more rows
# ℹ 9 more variables: taxonConceptID <chr>, eventDate <dttm>,
# occurrenceStatus <chr>, dataResourceName <chr>, family <chr>, genus <chr>,
# species <chr>, cl22 <chr>, month <dbl>
In the above tibble
our results show that there are just over 27,000 records that overlap spatially with duplicate coordinates. That seems like a lot! It would be rare to remove duplicates so broadly without considering why we need to remove duplicates; we don’t necessarily want to remove all of them.
Instead, if we are interested in comparing species in our data, it might be more useful to find duplicate spatial records for each species. We can split our data by species and remove records where there is more than one observation of the same species in the same location. This should leave one observation for each species in each location.
To filter our duplicate data by species, we can first split our data by species…
|>
birds group_split(species)
<list_of<
tbl_df<
recordID : character
scientificName : character
taxonConceptID : character
decimalLatitude : double
decimalLongitude: double
eventDate : datetime<UTC>
occurrenceStatus: character
dataResourceName: character
family : character
genus : character
species : character
cl22 : character
month : double
>
>[11]>
[[1]]
# A tibble: 2,080 × 13
recordID scientificName taxonConceptID decimalLatitude decimalLongitude
<chr> <chr> <chr> <dbl> <dbl>
1 0007f679-255f… Ceyx azureus https://biodi… -35.3 149.
2 00237f72-7b95… Ceyx azureus https://biodi… -33.6 151.
3 002d4683-fdeb… Ceyx azureus https://biodi… -22.8 151.
4 0030b417-ad83… Ceyx azureus https://biodi… -23.5 151.
5 005c21b0-3066… Ceyx azureus https://biodi… -16.2 145.
6 00671765-ed23… Ceyx azureus https://biodi… -36.1 145.
7 0086cd20-926a… Ceyx azureus https://biodi… -35.6 150.
8 00a29f49-65aa… Ceyx azureus https://biodi… -34.9 151.
9 00a39798-5118… Ceyx azureus https://biodi… -17.1 146.
10 00c02d79-58ef… Ceyx azureus https://biodi… -16.2 145.
# ℹ 2,070 more rows
# ℹ 8 more variables: eventDate <dttm>, occurrenceStatus <chr>,
# dataResourceName <chr>, family <chr>, genus <chr>, species <chr>,
# cl22 <chr>, month <dbl>
[[2]]
# A tibble: 198 × 13
recordID scientificName taxonConceptID decimalLatitude decimalLongitude
<chr> <chr> <chr> <dbl> <dbl>
1 00b3fbaf-20d1… Ceyx pusillus https://biodi… -12.3 131.
2 02ec2065-820c… Ceyx pusillus https://biodi… -12.4 131.
3 02f2b1e9-f706… Ceyx pusillus https://biodi… -16.9 146.
4 030dd9ae-3f67… Ceyx pusillus https://biodi… -12.4 131.
5 03a15b6a-b666… Ceyx pusillus https://biodi… -16.9 146.
6 04c838f4-8641… Ceyx pusillus https://biodi… -12.4 131.
7 058ef10d-247b… Ceyx pusillus https://biodi… -12.6 131.
8 078b046a-9cf3… Ceyx pusillus https://biodi… -12.8 143.
9 0791d280-a26c… Ceyx pusillus https://biodi… -12.4 131.
10 0843ea42-367f… Ceyx pusillus https://biodi… -16.1 145.
# ℹ 188 more rows
# ℹ 8 more variables: eventDate <dttm>, occurrenceStatus <chr>,
# dataResourceName <chr>, family <chr>, genus <chr>, species <chr>,
# cl22 <chr>, month <dbl>
[[3]]
# A tibble: 1,379 × 13
recordID scientificName taxonConceptID decimalLatitude decimalLongitude
<chr> <chr> <chr> <dbl> <dbl>
1 00085f92-58d1… Dacelo (Dacel… https://biodi… -14.2 132.
2 005e94f6-d0d7… Dacelo (Dacel… https://biodi… -12.7 143.
3 0083fbfd-3f14… Dacelo (Dacel… https://biodi… -17.5 141.
4 00bd28e3-20aa… Dacelo (Dacel… https://biodi… -13.4 132.
5 00bde3fa-6095… Dacelo (Dacel… https://biodi… -19.4 147.
6 00c8ff2c-9282… Dacelo (Dacel… https://biodi… -16.7 146.
7 01462284-b0aa… Dacelo (Dacel… https://biodi… -12.4 131.
8 01db4718-55f5… Dacelo (Dacel… https://biodi… -16.5 145.
9 0268d927-927e… Dacelo (Dacel… https://biodi… -14.5 132.
10 02712b54-8b8f… Dacelo (Dacel… https://biodi… -19.4 147.
# ℹ 1,369 more rows
# ℹ 8 more variables: eventDate <dttm>, occurrenceStatus <chr>,
# dataResourceName <chr>, family <chr>, genus <chr>, species <chr>,
# cl22 <chr>, month <dbl>
[[4]]
# A tibble: 28,186 × 13
recordID scientificName taxonConceptID decimalLatitude decimalLongitude
<chr> <chr> <chr> <dbl> <dbl>
1 00018120-2238… Dacelo (Dacel… https://biodi… -27.4 153.
2 0005aa70-a30f… Dacelo (Dacel… https://biodi… -25.3 153.
3 0006ff02-853a… Dacelo (Dacel… https://biodi… -37.9 145.
4 00076c8d-957e… Dacelo (Dacel… https://biodi… -28.0 153.
5 00080e38-ee2d… Dacelo (Dacel… https://biodi… -33.1 150.
6 000c7c7c-3603… Dacelo (Dacel… https://biodi… -38.5 144.
7 000f692d-8013… Dacelo (Dacel… https://biodi… -19.3 147.
8 000f87c2-9028… Dacelo (Dacel… https://biodi… -35.3 149.
9 0010012d-435e… Dacelo (Dacel… https://biodi… -16.8 146.
10 001021bd-b8f1… Dacelo (Dacel… https://biodi… -27.5 153.
# ℹ 28,176 more rows
# ℹ 8 more variables: eventDate <dttm>, occurrenceStatus <chr>,
# dataResourceName <chr>, family <chr>, genus <chr>, species <chr>,
# cl22 <chr>, month <dbl>
[[5]]
# A tibble: 155 × 13
recordID scientificName taxonConceptID decimalLatitude decimalLongitude
<chr> <chr> <chr> <dbl> <dbl>
1 0120314d-38b4… Syma torotoro https://biodi… -12.6 143.
2 04b2e915-4419… Syma torotoro https://biodi… -12.8 143.
3 04d3acf0-acdb… Syma torotoro https://biodi… -12.7 143.
4 0bf14fe4-01fc… Syma torotoro https://biodi… -12.7 143.
5 0c0a3bc9-431f… Syma torotoro https://biodi… -12.7 143.
6 0e428bdf-d6fb… Syma torotoro https://biodi… -10.8 142.
7 0f29d245-adc8… Syma torotoro https://biodi… -12.6 143.
8 0ff604e2-fbe3… Syma torotoro https://biodi… -10.8 142.
9 1121aa01-3173… Syma torotoro https://biodi… -12.8 143.
10 131d66ad-46d8… Syma torotoro https://biodi… -12.8 143.
# ℹ 145 more rows
# ℹ 8 more variables: eventDate <dttm>, occurrenceStatus <chr>,
# dataResourceName <chr>, family <chr>, genus <chr>, species <chr>,
# cl22 <chr>, month <dbl>
[[6]]
# A tibble: 566 × 13
recordID scientificName taxonConceptID decimalLatitude decimalLongitude
<chr> <chr> <chr> <dbl> <dbl>
1 004c61ec-ed26… Tanysiptera (… https://biodi… -16.6 145.
2 014b3666-a322… Tanysiptera (… https://biodi… -16.6 145.
3 0152bed4-9459… Tanysiptera (… https://biodi… -12.8 143.
4 01685613-fe2f… Tanysiptera (… https://biodi… -12.7 143.
5 019804dd-c3fc… Tanysiptera (… https://biodi… -16.6 145.
6 02a51caa-861f… Tanysiptera (… https://biodi… -12.7 143.
7 02f21d71-27c2… Tanysiptera (… https://biodi… -10.8 142.
8 02f96918-b627… Tanysiptera (… https://biodi… -12.7 143.
9 044df2a6-7d32… Tanysiptera (… https://biodi… -16.2 145.
10 04baf9c0-fd82… Tanysiptera (… https://biodi… -16.6 145.
# ℹ 556 more rows
# ℹ 8 more variables: eventDate <dttm>, occurrenceStatus <chr>,
# dataResourceName <chr>, family <chr>, genus <chr>, species <chr>,
# cl22 <chr>, month <dbl>
[[7]]
# A tibble: 12 × 13
recordID scientificName taxonConceptID decimalLatitude decimalLongitude
<chr> <chr> <chr> <dbl> <dbl>
1 3aa845a0-902e… Todiramphus (… https://biodi… 1.32 104.
2 5f59e40a-87d2… Todiramphus (… https://biodi… 1.28 104.
3 789e7794-af27… Todiramphus (… https://biodi… 1.32 104.
4 7c6e2ffa-73a8… Todiramphus (… https://biodi… -28.2 154.
5 806d0434-8d8f… Todiramphus (… https://biodi… 1.28 104.
6 83d392d7-a15f… Todiramphus (… https://biodi… 1.28 104.
7 85bd0100-ea5d… Todiramphus (… https://biodi… 1.28 104.
8 aeda1763-cd85… Todiramphus (… https://biodi… 1.28 104.
9 dac72d12-bf2c… Todiramphus (… https://biodi… 1.28 104.
10 e156dd38-95f5… Todiramphus (… https://biodi… 1.28 104.
11 e5574549-7cec… Todiramphus (… https://biodi… 1.28 104.
12 fd6435b4-32d6… Todiramphus (… https://biodi… 1.31 104.
# ℹ 8 more variables: eventDate <dttm>, occurrenceStatus <chr>,
# dataResourceName <chr>, family <chr>, genus <chr>, species <chr>,
# cl22 <chr>, month <dbl>
[[8]]
# A tibble: 2,374 × 13
recordID scientificName taxonConceptID decimalLatitude decimalLongitude
<chr> <chr> <chr> <dbl> <dbl>
1 0024aef4-4a8b… Todiramphus (… https://biodi… -19.4 147.
2 004b8b6c-8a89… Todiramphus (… https://biodi… -12.4 131.
3 007b39cf-2660… Todiramphus (… https://biodi… -27.4 153.
4 00993728-10d2… Todiramphus (… https://biodi… -27.1 153.
5 009e7664-1b3e… Todiramphus (… https://biodi… -27.4 153.
6 00b1c419-0612… Todiramphus (… https://biodi… -26.2 153.
7 00cdcbdf-36a6… Todiramphus (… https://biodi… -12.5 131.
8 00deabe0-1d59… Todiramphus (… https://biodi… -12.3 131.
9 0120a6ea-66f1… Todiramphus (… https://biodi… -30.4 153.
10 012e4a46-ead1… Todiramphus (… https://biodi… -20.0 146.
# ℹ 2,364 more rows
# ℹ 8 more variables: eventDate <dttm>, occurrenceStatus <chr>,
# dataResourceName <chr>, family <chr>, genus <chr>, species <chr>,
# cl22 <chr>, month <dbl>
[[9]]
# A tibble: 296 × 13
recordID scientificName taxonConceptID decimalLatitude decimalLongitude
<chr> <chr> <chr> <dbl> <dbl>
1 001c48ef-8329… Todiramphus (… https://biodi… -31.9 142.
2 016ec933-df3e… Todiramphus (… https://biodi… -23.7 134.
3 01833f43-b3d5… Todiramphus (… https://biodi… -28.0 146.
4 01e0d185-50ed… Todiramphus (… https://biodi… -29.8 151.
5 02d54da5-cfb0… Todiramphus (… https://biodi… -19.0 146.
6 033bf94d-cd11… Todiramphus (… https://biodi… -20.0 140.
7 06d7a1fc-9464… Todiramphus (… https://biodi… -23.5 144.
8 06e22048-0569… Todiramphus (… https://biodi… -18.3 143.
9 07fc074c-fff9… Todiramphus (… https://biodi… -31.9 141.
10 09e67cf4-3fbf… Todiramphus (… https://biodi… -17.7 140.
# ℹ 286 more rows
# ℹ 8 more variables: eventDate <dttm>, occurrenceStatus <chr>,
# dataResourceName <chr>, family <chr>, genus <chr>, species <chr>,
# cl22 <chr>, month <dbl>
[[10]]
# A tibble: 9,871 × 13
recordID scientificName taxonConceptID decimalLatitude decimalLongitude
<chr> <chr> <chr> <dbl> <dbl>
1 00030a7e-a6cc… Todiramphus (… https://biodi… -35.1 147.
2 0005e748-e148… Todiramphus (… https://biodi… -29.3 149.
3 00072b9e-b843… Todiramphus (… https://biodi… -33.7 151.
4 00280b9f-8fc4… Todiramphus (… https://biodi… -27.5 153.
5 002e666e-a69c… Todiramphus (… https://biodi… -19.2 147.
6 002fcf2c-7a6c… Todiramphus (… https://biodi… -33.1 151.
7 003327aa-c684… Todiramphus (… https://biodi… -36.5 147.
8 0036208a-7764… Todiramphus (… https://biodi… -32.8 117.
9 0036fdd3-947e… Todiramphus (… https://biodi… -35.3 149.
10 003ba830-1f22… Todiramphus (… https://biodi… -27.3 153.
# ℹ 9,861 more rows
# ℹ 8 more variables: eventDate <dttm>, occurrenceStatus <chr>,
# dataResourceName <chr>, family <chr>, genus <chr>, species <chr>,
# cl22 <chr>, month <dbl>
[[11]]
# A tibble: 1,008 × 13
recordID scientificName taxonConceptID decimalLatitude decimalLongitude
<chr> <chr> <chr> <dbl> <dbl>
1 0034efe7-4a01… Todiramphus https://biodi… -27.2 153.
2 00478798-95f7… Todiramphus https://biodi… -28.2 154.
3 0064cb36-4eee… Todiramphus https://biodi… -27.5 153.
4 00bcefdb-f852… Todiramphus https://biodi… -25.6 153.
5 00ebd2bb-0c34… Todiramphus https://biodi… -27.3 153.
6 0116442d-a7f6… Todiramphus https://biodi… -16.9 146.
7 01d19d01-c721… Todiramphus https://biodi… -27.5 153.
8 020b9283-447f… Todiramphus https://biodi… -16.9 146.
9 02511840-3813… Todiramphus https://biodi… -16.9 146.
10 03a9e4bd-37db… Todiramphus https://biodi… -25.6 153.
# ℹ 998 more rows
# ℹ 8 more variables: eventDate <dttm>, occurrenceStatus <chr>,
# dataResourceName <chr>, family <chr>, genus <chr>, species <chr>,
# cl22 <chr>, month <dbl>
…and use purrr::map()
1 to remove duplicates for each species group, binding our dataframes together again with bind_rows()
.
library(purrr)
Attaching package: 'purrr'
The following object is masked from 'package:base':
%||%
|>
birds group_split(species) |>
map(\(df)
|>
df filter(duplicated(decimalLongitude) & duplicated(decimalLatitude))
|>
) bind_rows()
# A tibble: 23,788 × 13
recordID scientificName taxonConceptID decimalLatitude decimalLongitude
<chr> <chr> <chr> <dbl> <dbl>
1 04b5c741-1afb… Ceyx azureus https://biodi… -37.8 145.
2 06eb5cd7-413f… Ceyx azureus https://biodi… -34.5 151.
3 083ec28b-68d3… Ceyx azureus https://biodi… -26.3 153.
4 087d63fc-2505… Ceyx azureus https://biodi… -27.5 153.
5 0afd32d4-c759… Ceyx azureus https://biodi… -37.8 145.
6 0b2b6aab-1283… Ceyx azureus https://biodi… -28.8 154.
7 0b8ea27e-2ca2… Ceyx azureus https://biodi… -34.5 151.
8 0ba0afc4-1cf2… Ceyx azureus https://biodi… -27.3 153.
9 0c570ead-3759… Ceyx azureus https://biodi… -33.0 151.
10 0c9f8e48-1c44… Ceyx azureus https://biodi… -27.3 153.
# ℹ 23,778 more rows
# ℹ 8 more variables: eventDate <dttm>, occurrenceStatus <chr>,
# dataResourceName <chr>, family <chr>, genus <chr>, species <chr>,
# cl22 <chr>, month <dbl>
Splitting by species has reduced the total number of duplicate records by ~3,500 rows because we’ve made it possible for multiple species to have records with the same spatial coordinates.
4.2 Remove duplicates
To now remove these duplicates from our dataframe, we can use the !
operator to return records that are not duplicated, rather than those that are.
<- birds |>
birds_filtered group_split(species) |>
map(\(df)
|>
df filter(!duplicated(decimalLongitude) & !duplicated(decimalLatitude))) |>
bind_rows()
birds_filtered
# A tibble: 22,128 × 13
recordID scientificName taxonConceptID decimalLatitude decimalLongitude
<chr> <chr> <chr> <dbl> <dbl>
1 0007f679-255f… Ceyx azureus https://biodi… -35.3 149.
2 00237f72-7b95… Ceyx azureus https://biodi… -33.6 151.
3 002d4683-fdeb… Ceyx azureus https://biodi… -22.8 151.
4 0030b417-ad83… Ceyx azureus https://biodi… -23.5 151.
5 005c21b0-3066… Ceyx azureus https://biodi… -16.2 145.
6 00671765-ed23… Ceyx azureus https://biodi… -36.1 145.
7 0086cd20-926a… Ceyx azureus https://biodi… -35.6 150.
8 00a29f49-65aa… Ceyx azureus https://biodi… -34.9 151.
9 00a39798-5118… Ceyx azureus https://biodi… -17.1 146.
10 00c02d79-58ef… Ceyx azureus https://biodi… -16.2 145.
# ℹ 22,118 more rows
# ℹ 8 more variables: eventDate <dttm>, occurrenceStatus <chr>,
# dataResourceName <chr>, family <chr>, genus <chr>, species <chr>,
# cl22 <chr>, month <dbl>
To check our results, we can grab a random row from our unfiltered dataframe…
<- birds |>
test_row filter(duplicated(decimalLongitude) & duplicated(decimalLatitude)) |>
slice(10)
|>
test_row select(species, decimalLatitude, decimalLongitude, recordID) # show relevant columns
# A tibble: 1 × 4
species decimalLatitude decimalLongitude recordID
<chr> <dbl> <dbl> <chr>
1 Todiramphus sanctus -19.2 147. 01006006-78b2-4d19-bb99-…
…and see whether any rows in birds_filtered
have the same combination of longitude and latitude coordinates.
|>
birds_filtered filter(
%in% test_row$decimalLatitude &
decimalLatitude %in% test_row$decimalLongitude
decimalLongitude |>
) select(species, decimalLatitude, decimalLongitude, recordID) # show relevant columns
# A tibble: 4 × 4
species decimalLatitude decimalLongitude recordID
<chr> <dbl> <dbl> <chr>
1 Dacelo leachii -19.2 147. 0bac3731-116a-4f7b-8b2…
2 Dacelo novaeguineae -19.2 147. 161a96bb-e9af-4a6b-ae3…
3 Todiramphus macleayii -19.2 147. 0208aa99-b3ce-449a-99d…
4 Todiramphus sanctus -19.2 147. 00c3f294-01dd-4f66-b8a…
As expected, there are a few species with those latitude and longitude coordinates, but we now only have 1 row for each species in that location in birds_filtered
.
Using %in%
can be a powerful tool for finding duplicates in your dataframe. Extracting rows like we did above with our test_row
example above (or a list of values in a column) can help you weed out more specific duplicate records you are interested in.
Our kingfisher data, birds_filtered
, is now clean from spatially duplicated records!
Code
|>
birds_filtered ::paged_table() rmarkdown