# packages
library(galah)
library(dplyr)
library(janitor)
galah_config(email = "your-email-here") # ALA-registered email
birds <- galah_call() |>
  filter(doi == "https://doi.org/10.26197/ala.43c590cb-79f5-4be4-94fe-798adb8fd9b4") |>
  atlas_occurrences()
4 Duplicates
Duplicate records can occur for a number of reasons. For instance, a duplicate record might appear in an individual dataset due to errors in data collection or entry, or occur when aggregating multiple data sources. Alternatively, a record might be considered a duplicate in the context of one type of analysis, but not another. For example, prior to running species distribution models, records in the same location—even if they are separate observations—are considered duplicates and should be removed to avoid spatial bias. If you’re running multiple models for several time-periods, however, you may need to include records in the same location if they occurred in different time-periods. Context is key when determining how to identify and clean duplicate records in your dataset.
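To make the point about context concrete, here is a minimal sketch using a small, hypothetical table (the `obs` data and its `year` column are invented for illustration and are not part of the kingfisher download). It shows how the same records give a different number of "unique" rows depending on whether time period is part of the definition of a duplicate.

```r
library(dplyr)

# Hypothetical toy data: three records at one site, two in 2022 and one in 2023
obs <- tibble(
  species = c("Ceyx azureus", "Ceyx azureus", "Ceyx azureus"),
  decimalLatitude = c(-35.3, -35.3, -35.3),
  decimalLongitude = c(149.1, 149.1, 149.1),
  year = c(2022, 2022, 2023)
)

# One model for all years: any repeat of a location counts as a duplicate
one_per_location <- obs |>
  distinct(decimalLatitude, decimalLongitude, .keep_all = TRUE)

# One model per year: a repeat only counts within the same year
one_per_location_year <- obs |>
  distinct(decimalLatitude, decimalLongitude, year, .keep_all = TRUE)

nrow(one_per_location)      # 1
nrow(one_per_location_year) # 2
```

Which columns go into `distinct()` is the analytical decision; the code itself stays the same.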
Identifying duplicates is important to avoid misleading analyses or visualisations. Duplicates can give the impression that there are more data than there really are and bias your analyses to favour certain species, locations, or time periods. In this chapter we will introduce ways of detecting and handling duplicate records in biodiversity data.
4.0.1 Prerequisites
In this chapter, we will use kingfisher (Alcedinidae) occurrence data in 2023 from the ALA.
4.1 Find duplicates
As a first example, let’s remove all spatially duplicated records, based on latitude and longitude coordinate values.
The first thing to do is find the duplicate records.
Return a summary of the number of duplicates for each set of coordinates.
birds |>
  group_by(decimalLongitude, decimalLatitude) |>
  filter(n() > 1) |>
  summarise(n = n(), .groups = "drop")
# A tibble: 15,760 × 3
decimalLongitude decimalLatitude n
<dbl> <dbl> <int>
1 114. -24.9 2
2 114. -24.9 2
3 114. -24.9 4
4 114. -24.9 2
5 114. -24.9 3
6 114. -25.1 2
7 114. -24.8 2
8 114. -22.0 5
9 114. -22.0 2
10 114. -22.0 4
# ℹ 15,750 more rows
Return the rows whose decimal longitude and decimal latitude values have both already appeared earlier in the dataset.
birds |>
  filter(duplicated(decimalLongitude) & duplicated(decimalLatitude))
# A tibble: 135,263 × 13
recordID scientificName taxonConceptID decimalLatitude decimalLongitude
<chr> <chr> <chr> <dbl> <dbl>
1 00072b9e-b843… Todiramphus (… https://biodi… -33.7 151.
2 001740ce-c2ed… Todiramphus (… https://biodi… -12.9 133.
3 00280b9f-8fc4… Todiramphus (… https://biodi… -27.5 153.
4 002ffe45-9e4d… Dacelo (Dacel… https://biodi… -34.3 151.
5 0036fdd3-947e… Todiramphus (… https://biodi… -35.3 149.
6 00396f3a-e3a4… Todiramphus (… https://biodi… -27.5 153.
7 004198eb-80e8… Ceyx azureus https://biodi… -12.9 133.
8 004768fe-5319… Todiramphus (… https://biodi… -33.7 151.
9 00493c37-f535… Todiramphus (… https://biodi… -19.3 147.
10 00495552-a826… Todiramphus (… https://biodi… -32.4 152.
# ℹ 135,253 more rows
# ℹ 8 more variables: eventDate <dttm>, occurrenceStatus <chr>,
# dataResourceName <chr>, family <chr>, genus <chr>, species <chr>,
# cl22 <chr>, month <dbl>
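One caveat is worth illustrating. `duplicated()` applied to each column separately flags a row when its longitude has been seen before and its latitude has been seen before, even if the two values have never occurred together. The sketch below uses hypothetical toy data (`pts`) to compare that per-column test with a test on the coordinate pair. It is an illustration of the difference rather than a replacement for the code above.

```r
library(dplyr)

# Hypothetical toy data: four distinct coordinate pairs, no repeated location
pts <- tibble(
  decimalLongitude = c(150, 151, 150, 151),
  decimalLatitude  = c(-30, -31, -31, -30)
)

# Per-column test: rows 3 and 4 are flagged because each value,
# taken on its own, has appeared before
per_column <- pts |>
  filter(duplicated(decimalLongitude) & duplicated(decimalLatitude))

# Pair test: keep only the second and later rows of each coordinate pair
per_pair <- pts |>
  group_by(decimalLongitude, decimalLatitude) |>
  filter(row_number() > 1) |>
  ungroup()

nrow(per_column) # 2
nrow(per_pair)   # 0
```

If your definition of a duplicate is "same point on the map", the pair test is the one that matches it.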
Return duplicated rows and the number of duplicates for each combination of decimalLatitude and decimalLongitude (note that this differs from the dplyr example because janitor::get_dupes() tests the listed columns together, and returns every row in a duplicated set, including the first).
birds |>
  get_dupes(decimalLatitude, decimalLongitude)
# A tibble: 150,994 × 14
decimalLatitude decimalLongitude dupe_count recordID scientificName
<dbl> <dbl> <int> <chr> <chr>
1 -27.4 153. 2132 00018120-2238-412… Dacelo (Dacel…
2 -27.4 153. 2132 006475ff-87b0-462… Todiramphus (…
3 -27.4 153. 2132 007b39cf-2660-415… Todiramphus (…
4 -27.4 153. 2132 0090348a-41bb-4f3… Todiramphus (…
5 -27.4 153. 2132 009c5758-624d-4c8… Dacelo (Dacel…
6 -27.4 153. 2132 009dd9a8-e12b-45d… Todiramphus (…
7 -27.4 153. 2132 00b32686-a8c8-427… Dacelo (Dacel…
8 -27.4 153. 2132 00fe93fc-86b9-4cd… Dacelo (Dacel…
9 -27.4 153. 2132 010eb7a8-d969-4b2… Todiramphus (…
10 -27.4 153. 2132 0160b334-d620-465… Ceyx azureus
# ℹ 150,984 more rows
# ℹ 9 more variables: taxonConceptID <chr>, eventDate <dttm>,
# occurrenceStatus <chr>, dataResourceName <chr>, family <chr>, genus <chr>,
# species <chr>, cl22 <chr>, month <dbl>
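If you would rather not add janitor as a dependency, a similar table can be approximated with dplyr alone. The sketch below uses hypothetical toy data (`pts`). It is not how `get_dupes()` is implemented, and the column and row order will differ from janitor's output.

```r
library(dplyr)

# Hypothetical toy data: records "a" and "b" share a coordinate pair
pts <- tibble(
  recordID = c("a", "b", "c", "d"),
  decimalLatitude = c(-27.4, -27.4, -33.9, -12.9),
  decimalLongitude = c(153.0, 153.0, 151.2, 133.0)
)

# Count the rows in each coordinate pair, then keep pairs seen more than once
dupes <- pts |>
  add_count(decimalLatitude, decimalLongitude, name = "dupe_count") |>
  filter(dupe_count > 1)

dupes$recordID   # "a" "b"
dupes$dupe_count # 2 2
```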
The results above show that just over 150,000 records share their coordinates with at least one other record. That seems like a lot! It would be rare to remove duplicates so broadly without considering why we need to remove them; we don’t necessarily want to remove all of them.
Instead, if we are interested in comparing species in our data, it might be more useful to find duplicate spatial records for each species. We can split our data by species and remove records where there is more than one observation of the same species in the same location. This should leave one observation for each species in each location.
To filter our duplicate data by species, we can first split our data by species…
group_split() is experimental
dplyr::group_split() is an amazing function and makes any split operation very easy, especially in a pipe. However, the documentation of group_split() states that it’s an experimental function, “is not stable”, and “may be deprecated in the future.” If you are concerned about the longevity of your code, you can use base::split(). An equivalent way to do the split is:
# limiting to first 10 rows for example purposes
birds_n10 <- birds[1:10,]

# split by species name
birds_n10 |>
  split(birds_n10$species)
$`Dacelo novaeguineae`
# A tibble: 8 × 13
recordID scientificName taxonConceptID decimalLatitude decimalLongitude
<chr> <chr> <chr> <dbl> <dbl>
1 00006a53-0c17-… Dacelo (Dacel… https://biodi… -33.9 151.
2 00016be4-6e8b-… Dacelo (Dacel… https://biodi… -36.0 145.
3 00017457-07a7-… Dacelo (Dacel… https://biodi… -33.8 151.
4 00018120-2238-… Dacelo (Dacel… https://biodi… -27.4 153.
5 0002f310-faff-… Dacelo (Dacel… https://biodi… -33.7 151.
6 00030252-79df-… Dacelo (Dacel… https://biodi… -27.3 153.
7 0003a585-136c-… Dacelo (Dacel… https://biodi… -30.9 153.
8 000400e3-8aed-… Dacelo (Dacel… https://biodi… -35.5 149.
# ℹ 8 more variables: eventDate <dttm>, occurrenceStatus <chr>,
# dataResourceName <chr>, family <chr>, genus <chr>, species <chr>,
# cl22 <chr>, month <dbl>
$`Todiramphus sanctus`
# A tibble: 2 × 13
recordID scientificName taxonConceptID decimalLatitude decimalLongitude
<chr> <chr> <chr> <dbl> <dbl>
1 0001ab91-d17a-… Todiramphus (… https://biodi… -36.2 146.
2 00030a7e-a6cc-… Todiramphus (… https://biodi… -35.1 147.
# ℹ 8 more variables: eventDate <dttm>, occurrenceStatus <chr>,
# dataResourceName <chr>, family <chr>, genus <chr>, species <chr>,
# cl22 <chr>, month <dbl>
However, note that the order of variables is not preserved in this base solution, and the data will need to be reordered if we want to match group_split() output exactly (this is the complicated part, which makes this option less convenient than group_split()).
birds |>
  group_split(species)
<list_of<
tbl_df<
recordID : character
scientificName : character
taxonConceptID : character
decimalLatitude : double
decimalLongitude: double
eventDate : datetime<UTC>
occurrenceStatus: character
dataResourceName: character
family : character
genus : character
species : character
cl22 : character
month : double
>
>[11]>
[[1]]
# A tibble: 8,205 × 13
recordID scientificName taxonConceptID decimalLatitude decimalLongitude
<chr> <chr> <chr> <dbl> <dbl>
1 0007f679-255f… Ceyx azureus https://biodi… -35.3 149.
2 001c3405-83cb… Ceyx azureus https://biodi… -27.6 153.
3 0021a3e1-a4ab… Ceyx azureus https://biodi… -35.7 144.
4 00237f72-7b95… Ceyx azureus https://biodi… -33.6 151.
5 002c006d-c3fc… Ceyx azureus https://biodi… -12.9 133.
6 002d4683-fdeb… Ceyx azureus https://biodi… -22.8 151.
7 002e02ae-9462… Ceyx azureus https://biodi… -26.3 153.
8 0030b417-ad83… Ceyx azureus https://biodi… -23.5 151.
9 004134c8-9a61… Ceyx azureus https://biodi… -33.6 151.
10 004198eb-80e8… Ceyx azureus https://biodi… -12.9 133.
# ℹ 8,195 more rows
# ℹ 8 more variables: eventDate <dttm>, occurrenceStatus <chr>,
# dataResourceName <chr>, family <chr>, genus <chr>, species <chr>,
# cl22 <chr>, month <dbl>
[[2]]
# A tibble: 929 × 13
recordID scientificName taxonConceptID decimalLatitude decimalLongitude
<chr> <chr> <chr> <dbl> <dbl>
1 00b3fbaf-20d1… Ceyx pusillus https://biodi… -12.3 131.
2 013ce8f4-edd4… Ceyx pusillus https://biodi… -18.7 146.
3 0146c29b-861c… Ceyx pusillus https://biodi… -16.9 146.
4 01785024-7b33… Ceyx pusillus https://biodi… -16.4 145.
5 01cf73af-ef2e… Ceyx pusillus https://biodi… -16.3 145.
6 0238883f-1a78… Ceyx pusillus https://biodi… -16.8 146.
7 02627174-f767… Ceyx pusillus https://biodi… -16.2 145.
8 0269a3b6-878a… Ceyx pusillus https://biodi… -19.2 147.
9 027f4483-679a… Ceyx pusillus https://biodi… -16.9 146.
10 02a92e5e-f62a… Ceyx pusillus https://biodi… -16.9 146.
# ℹ 919 more rows
# ℹ 8 more variables: eventDate <dttm>, occurrenceStatus <chr>,
# dataResourceName <chr>, family <chr>, genus <chr>, species <chr>,
# cl22 <chr>, month <dbl>
[[3]]
# A tibble: 7,756 × 13
recordID scientificName taxonConceptID decimalLatitude decimalLongitude
<chr> <chr> <chr> <dbl> <dbl>
1 00085f92-58d1… Dacelo (Dacel… https://biodi… -14.2 132.
2 00094a12-7434… Dacelo (Dacel… https://biodi… -12.4 131.
3 000ce3fc-eecd… Dacelo (Dacel… https://biodi… -12.9 133.
4 000e06de-d33c… Dacelo (Dacel… https://biodi… -12.8 143.
5 00135153-020a… Dacelo (Dacel… https://biodi… -12.3 131.
6 0025a674-b6dd… Dacelo (Dacel… https://biodi… -17.8 122.
7 0029ff35-6a51… Dacelo (Dacel… https://biodi… -12.4 131.
8 003cf680-13e7… Dacelo (Dacel… https://biodi… -13.9 136.
9 00402f16-b7de… Dacelo (Dacel… https://biodi… -23.4 151.
10 0040e041-afc4… Dacelo (Dacel… https://biodi… -12.5 131.
# ℹ 7,746 more rows
# ℹ 8 more variables: eventDate <dttm>, occurrenceStatus <chr>,
# dataResourceName <chr>, family <chr>, genus <chr>, species <chr>,
# cl22 <chr>, month <dbl>
[[4]]
# A tibble: 104,816 × 13
recordID scientificName taxonConceptID decimalLatitude decimalLongitude
<chr> <chr> <chr> <dbl> <dbl>
1 00006a53-0c17… Dacelo (Dacel… https://biodi… -33.9 151.
2 00016be4-6e8b… Dacelo (Dacel… https://biodi… -36.0 145.
3 00017457-07a7… Dacelo (Dacel… https://biodi… -33.8 151.
4 00018120-2238… Dacelo (Dacel… https://biodi… -27.4 153.
5 0002f310-faff… Dacelo (Dacel… https://biodi… -33.7 151.
6 00030252-79df… Dacelo (Dacel… https://biodi… -27.3 153.
7 0003a585-136c… Dacelo (Dacel… https://biodi… -30.9 153.
8 000400e3-8aed… Dacelo (Dacel… https://biodi… -35.5 149.
9 00040158-5b2d… Dacelo (Dacel… https://biodi… -27.2 153.
10 00047f1d-20cd… Dacelo (Dacel… https://biodi… -33.7 151.
# ℹ 104,806 more rows
# ℹ 8 more variables: eventDate <dttm>, occurrenceStatus <chr>,
# dataResourceName <chr>, family <chr>, genus <chr>, species <chr>,
# cl22 <chr>, month <dbl>
[[5]]
# A tibble: 537 × 13
recordID scientificName taxonConceptID decimalLatitude decimalLongitude
<chr> <chr> <chr> <dbl> <dbl>
1 00094e28-c5d4… Syma torotoro https://biodi… -12.7 143.
2 002588ad-cbb2… Syma torotoro https://biodi… -12.6 142.
3 0120314d-38b4… Syma torotoro https://biodi… -12.6 143.
4 01b1fa95-cdca… Syma torotoro https://biodi… -12.6 143.
5 0224d64b-03e8… Syma torotoro https://biodi… -12.7 143.
6 0258f862-e91d… Syma torotoro https://biodi… -10.8 142.
7 036186ad-eff9… Syma torotoro https://biodi… -12.7 143.
8 03ee928b-6946… Syma torotoro https://biodi… -12.7 143.
9 040cbdc9-7b8e… Syma torotoro https://biodi… -12.7 143.
10 04ae08c8-e3bb… Syma torotoro https://biodi… -12.7 143.
# ℹ 527 more rows
# ℹ 8 more variables: eventDate <dttm>, occurrenceStatus <chr>,
# dataResourceName <chr>, family <chr>, genus <chr>, species <chr>,
# cl22 <chr>, month <dbl>
[[6]]
# A tibble: 880 × 13
recordID scientificName taxonConceptID decimalLatitude decimalLongitude
<chr> <chr> <chr> <dbl> <dbl>
1 004c61ec-ed26… Tanysiptera (… https://biodi… -16.6 145.
2 010058e0-868f… Tanysiptera (… https://biodi… -16.6 145.
3 014b3666-a322… Tanysiptera (… https://biodi… -16.6 145.
4 0152bed4-9459… Tanysiptera (… https://biodi… -12.8 143.
5 01685613-fe2f… Tanysiptera (… https://biodi… -12.7 143.
6 019804dd-c3fc… Tanysiptera (… https://biodi… -16.6 145.
7 02a51caa-861f… Tanysiptera (… https://biodi… -12.7 143.
8 02f21d71-27c2… Tanysiptera (… https://biodi… -10.8 142.
9 02f96918-b627… Tanysiptera (… https://biodi… -12.7 143.
10 039d66ec-8c2b… Tanysiptera (… https://biodi… -12.7 143.
# ℹ 870 more rows
# ℹ 8 more variables: eventDate <dttm>, occurrenceStatus <chr>,
# dataResourceName <chr>, family <chr>, genus <chr>, species <chr>,
# cl22 <chr>, month <dbl>
[[7]]
# A tibble: 16 × 13
recordID scientificName taxonConceptID decimalLatitude decimalLongitude
<chr> <chr> <chr> <dbl> <dbl>
1 3aa845a0-902e… Todiramphus (… https://biodi… 1.32 104.
2 59f96106-4fcc… Todiramphus (… https://biodi… -33.9 151.
3 5f59e40a-87d2… Todiramphus (… https://biodi… 1.28 104.
4 61350c27-d10c… Todiramphus (… https://biodi… -15.5 145.
5 789e7794-af27… Todiramphus (… https://biodi… 1.32 104.
6 7c6e2ffa-73a8… Todiramphus (… https://biodi… -28.2 154.
7 806d0434-8d8f… Todiramphus (… https://biodi… 1.28 104.
8 83d392d7-a15f… Todiramphus (… https://biodi… 1.28 104.
9 85bd0100-ea5d… Todiramphus (… https://biodi… 1.28 104.
10 aeda1763-cd85… Todiramphus (… https://biodi… 1.28 104.
11 dac72d12-bf2c… Todiramphus (… https://biodi… 1.28 104.
12 e156dd38-95f5… Todiramphus (… https://biodi… 1.28 104.
13 e4387c95-12bb… Todiramphus (… https://biodi… -28.2 154.
14 e5574549-7cec… Todiramphus (… https://biodi… 1.28 104.
15 fb434a9e-92e8… Todiramphus (… https://biodi… -28.5 154.
16 fd6435b4-32d6… Todiramphus (… https://biodi… 1.31 104.
# ℹ 8 more variables: eventDate <dttm>, occurrenceStatus <chr>,
# dataResourceName <chr>, family <chr>, genus <chr>, species <chr>,
# cl22 <chr>, month <dbl>
[[8]]
# A tibble: 11,491 × 13
recordID scientificName taxonConceptID decimalLatitude decimalLongitude
<chr> <chr> <chr> <dbl> <dbl>
1 00047d96-6f95… Todiramphus (… https://biodi… -12.4 133.
2 00089b20-396d… Todiramphus (… https://biodi… -12.3 131.
3 001740ce-c2ed… Todiramphus (… https://biodi… -12.9 133.
4 00196c89-1bf7… Todiramphus (… https://biodi… -27.3 153.
5 0021f1a4-54f2… Todiramphus (… https://biodi… -16.8 146.
6 0022c504-5d69… Todiramphus (… https://biodi… -27.1 153.
7 0024aef4-4a8b… Todiramphus (… https://biodi… -19.4 147.
8 004b8b6c-8a89… Todiramphus (… https://biodi… -12.4 131.
9 004faee9-0f3c… Todiramphus (… https://biodi… -12.3 131.
10 0056703b-c4d2… Todiramphus (… https://biodi… -17.3 146.
# ℹ 11,481 more rows
# ℹ 8 more variables: eventDate <dttm>, occurrenceStatus <chr>,
# dataResourceName <chr>, family <chr>, genus <chr>, species <chr>,
# cl22 <chr>, month <dbl>
[[9]]
# A tibble: 3,045 × 13
recordID scientificName taxonConceptID decimalLatitude decimalLongitude
<chr> <chr> <chr> <dbl> <dbl>
1 001c48ef-8329… Todiramphus (… https://biodi… -31.9 142.
2 00236c61-1cd0… Todiramphus (… https://biodi… -12.4 131.
3 00821760-1944… Todiramphus (… https://biodi… -22.7 131.
4 008f4ee7-dc49… Todiramphus (… https://biodi… -22.5 143.
5 00bc94a7-8079… Todiramphus (… https://biodi… -31.9 141.
6 00c129db-e68f… Todiramphus (… https://biodi… -34.1 151.
7 00c7eaaf-0061… Todiramphus (… https://biodi… -12.6 131.
8 00d80d10-940d… Todiramphus (… https://biodi… -29.4 142.
9 00e3d9a2-65de… Todiramphus (… https://biodi… -20.7 140.
10 00ec3e4b-2f02… Todiramphus (… https://biodi… -31.9 141.
# ℹ 3,035 more rows
# ℹ 8 more variables: eventDate <dttm>, occurrenceStatus <chr>,
# dataResourceName <chr>, family <chr>, genus <chr>, species <chr>,
# cl22 <chr>, month <dbl>
[[10]]
# A tibble: 35,256 × 13
recordID scientificName taxonConceptID decimalLatitude decimalLongitude
<chr> <chr> <chr> <dbl> <dbl>
1 0001ab91-d17a… Todiramphus (… https://biodi… -36.2 146.
2 00030a7e-a6cc… Todiramphus (… https://biodi… -35.1 147.
3 00057295-2cc2… Todiramphus (… https://biodi… -12.3 131.
4 0005e748-e148… Todiramphus (… https://biodi… -29.3 149.
5 00072b9e-b843… Todiramphus (… https://biodi… -33.7 151.
6 000ca2e8-54f5… Todiramphus (… https://biodi… -24.8 152.
7 000cc80f-7d01… Todiramphus (… https://biodi… -12.9 133.
8 000fe09c-0bc2… Todiramphus (… https://biodi… -27.5 153.
9 0010996d-dff9… Todiramphus (… https://biodi… -27.4 153.
10 001206f4-5086… Todiramphus (… https://biodi… -27.3 153.
# ℹ 35,246 more rows
# ℹ 8 more variables: eventDate <dttm>, occurrenceStatus <chr>,
# dataResourceName <chr>, family <chr>, genus <chr>, species <chr>,
# cl22 <chr>, month <dbl>
[[11]]
# A tibble: 3,513 × 13
recordID scientificName taxonConceptID decimalLatitude decimalLongitude
<chr> <chr> <chr> <dbl> <dbl>
1 002e804d-5f9b… Todiramphus https://biodi… -16.9 146.
2 00300f92-57a8… Todiramphus https://biodi… -12.3 131.
3 0034efe7-4a01… Todiramphus https://biodi… -27.2 153.
4 00478798-95f7… Todiramphus https://biodi… -28.2 154.
5 0064cb36-4eee… Todiramphus https://biodi… -27.5 153.
6 0082f31d-ab89… Todiramphus https://biodi… -23.9 151.
7 0084e451-f1cb… Todiramphus https://biodi… -27.4 153.
8 009aaf27-cc8c… Todiramphus https://biodi… -12.4 131.
9 00bcefdb-f852… Todiramphus https://biodi… -25.6 153.
10 00c3f4d5-81e1… Todiramphus https://biodi… -12.4 131.
# ℹ 3,503 more rows
# ℹ 8 more variables: eventDate <dttm>, occurrenceStatus <chr>,
# dataResourceName <chr>, family <chr>, genus <chr>, species <chr>,
# cl22 <chr>, month <dbl>
…and use purrr::map()¹ to find duplicates within each species group, binding our dataframes together again with bind_rows().
library(purrr)

birds |>
  group_split(species) |>
  map(\(df)
      df |>
        filter(duplicated(decimalLongitude) & duplicated(decimalLatitude))
      ) |>
  bind_rows()
# A tibble: 124,206 × 13
recordID scientificName taxonConceptID decimalLatitude decimalLongitude
<chr> <chr> <chr> <dbl> <dbl>
1 004198eb-80e8… Ceyx azureus https://biodi… -12.9 133.
2 00d154d7-f084… Ceyx azureus https://biodi… -36.1 145.
3 012bb77c-67c4… Ceyx azureus https://biodi… -12.9 133.
4 016863ee-fbdb… Ceyx azureus https://biodi… -27.4 153.
5 01821694-f938… Ceyx azureus https://biodi… -33.6 151.
6 01cb3b38-3ebb… Ceyx azureus https://biodi… -16.2 145.
7 02255898-bbec… Ceyx azureus https://biodi… -33.7 151.
8 0258c5cb-daeb… Ceyx azureus https://biodi… -12.9 133.
9 0282fb2f-daac… Ceyx azureus https://biodi… -27.3 153.
10 02a7de07-7df4… Ceyx azureus https://biodi… -27.6 153.
# ℹ 124,196 more rows
# ℹ 8 more variables: eventDate <dttm>, occurrenceStatus <chr>,
# dataResourceName <chr>, family <chr>, genus <chr>, species <chr>,
# cl22 <chr>, month <dbl>
Splitting by species has reduced the total number of duplicate records by ~11,000 rows (from 135,263 to 124,206) because multiple species can now have records with the same spatial coordinates.
4.2 Remove duplicates
To remove these duplicates from our dataframe, we can use the ! operator to return records that are not duplicated, rather than those that are.
birds_filtered <- birds |>
  group_split(species) |>
  map(\(df)
      df |>
        filter(!duplicated(decimalLongitude) & !duplicated(decimalLatitude))) |>
  bind_rows()

birds_filtered
# A tibble: 51,814 × 13
recordID scientificName taxonConceptID decimalLatitude decimalLongitude
<chr> <chr> <chr> <dbl> <dbl>
1 0007f679-255f… Ceyx azureus https://biodi… -35.3 149.
2 001c3405-83cb… Ceyx azureus https://biodi… -27.6 153.
3 0021a3e1-a4ab… Ceyx azureus https://biodi… -35.7 144.
4 00237f72-7b95… Ceyx azureus https://biodi… -33.6 151.
5 002c006d-c3fc… Ceyx azureus https://biodi… -12.9 133.
6 002d4683-fdeb… Ceyx azureus https://biodi… -22.8 151.
7 002e02ae-9462… Ceyx azureus https://biodi… -26.3 153.
8 0030b417-ad83… Ceyx azureus https://biodi… -23.5 151.
9 004134c8-9a61… Ceyx azureus https://biodi… -33.6 151.
10 004d0c33-b0cf… Ceyx azureus https://biodi… -34.7 150.
# ℹ 51,804 more rows
# ℹ 8 more variables: eventDate <dttm>, occurrenceStatus <chr>,
# dataResourceName <chr>, family <chr>, genus <chr>, species <chr>,
# cl22 <chr>, month <dbl>
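A more compact way to keep one record per species per location is dplyr::distinct(). The sketch below uses hypothetical toy data (`obs`). Because `distinct()` tests the full combination of species, latitude and longitude, it keeps a record that shares only one coordinate value with an earlier record, so on the real data it would likely return more rows than the output shown above.

```r
library(dplyr)

# Hypothetical toy data:
#   "b" repeats "a" exactly; "c" is a different species at the same point;
#   "d" shares its longitude (but not its latitude) with "a"
obs <- tibble(
  recordID = c("a", "b", "c", "d"),
  species = c("Ceyx azureus", "Ceyx azureus", "Dacelo novaeguineae", "Ceyx azureus"),
  decimalLatitude = c(-32.4, -32.4, -32.4, -35.3),
  decimalLongitude = c(152.0, 152.0, 152.0, 152.0)
)

# Keep the first record of each species-by-coordinate-pair combination
obs_unique <- obs |>
  distinct(species, decimalLatitude, decimalLongitude, .keep_all = TRUE)

obs_unique$recordID # "a" "c" "d"
```

`.keep_all = TRUE` retains the remaining columns of the first row in each combination, which is usually what you want for occurrence records.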
To check our results, we can grab one of the duplicate rows (here, the tenth) from our unfiltered dataframe…
test_row <- birds |>
  filter(duplicated(decimalLongitude) & duplicated(decimalLatitude)) |>
  slice(10)

test_row |>
  select(species, decimalLatitude, decimalLongitude, recordID) # show relevant columns
# A tibble: 1 × 4
species decimalLatitude decimalLongitude recordID
<chr> <dbl> <dbl> <chr>
1 Todiramphus sanctus -32.4 152. 00495552-a826-4628-aec4-…
…and see whether any rows in birds_filtered have the same combination of longitude and latitude coordinates.
birds_filtered |>
  filter(
    decimalLatitude %in% test_row$decimalLatitude &
      decimalLongitude %in% test_row$decimalLongitude
  ) |>
  select(species, decimalLatitude, decimalLongitude, recordID) # show relevant columns
# A tibble: 3 × 4
species decimalLatitude decimalLongitude recordID
<chr> <dbl> <dbl> <chr>
1 Ceyx azureus -32.4 152. 0c49a015-2899-4456-b95b-…
2 Dacelo novaeguineae -32.4 152. 03a9ac6a-ba9a-475b-a5d0-…
3 Todiramphus sanctus -32.4 152. 00389cc3-3e0d-410f-b402-…
As expected, there are a few species with those latitude and longitude coordinates, but we now only have one row for each species in that location in birds_filtered.
Using %in% can be a powerful tool for finding duplicates in your dataframe. Extracting rows as we did with our test_row example (or a list of values in a column) can help you weed out the specific duplicate records you are interested in.
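When you check against more than one reference row, note that two separate `%in%` tests match each column independently. A hedged alternative is dplyr::semi_join(), which matches on the coordinate pair. The sketch below uses hypothetical toy data (`obs` and `targets`) to show the difference.

```r
library(dplyr)

# Hypothetical toy data: "c" and "d" mix the coordinates of "a" and "b"
obs <- tibble(
  recordID = c("a", "b", "c", "d"),
  decimalLatitude = c(-32.4, -12.9, -32.4, -12.9),
  decimalLongitude = c(152.0, 133.0, 133.0, 152.0)
)

# Two reference locations we want to look up
targets <- tibble(
  decimalLatitude = c(-32.4, -12.9),
  decimalLongitude = c(152.0, 133.0)
)

# Column-wise %in%: every row matches, including the mixed pairs
by_column <- obs |>
  filter(
    decimalLatitude %in% targets$decimalLatitude &
      decimalLongitude %in% targets$decimalLongitude
  )

# semi_join(): only rows whose coordinate pair appears in targets
by_pair <- obs |>
  semi_join(targets, by = c("decimalLatitude", "decimalLongitude"))

by_column$recordID # "a" "b" "c" "d"
by_pair$recordID   # "a" "b"
```

With a single reference row, as in the test_row check, the two approaches return the same rows.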
Our kingfisher data, birds_filtered, is now free of spatially duplicated records!
birds_filtered |>
  rmarkdown::paged_table()
4.3 Summary
This chapter has introduced some ways to find duplicated records, remove them from datasets, and check if the changes were correctly made. These methods can be more broadly applied to other types of data as well, not just spatial data. Depending on your analysis, you may need to use bespoke methods for handling duplicates. Later chapters like Taxonomic validation and Geospatial cleaning cover more advanced detection and cleaning methods.
In the next chapter, we will discuss ways of handling missing values in your dataset.
¹ We have used \(df) as shorthand within purrr::map(). This shorthand is equivalent to writing function(df) {} in full. We provide an input, in this case each dataframe in the piped list, which we’ve called df, and use it in a custom function (defined within {}). This function is run over each dataframe in our list of dataframes. Check out this description from a recent purrr package update for another example.
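As a small illustration of the shorthand, the sketch below runs the same function over a hypothetical list of two dataframes (`dfs`), once with `\(df)` and once with `function(df)`. The backslash form requires R 4.1 or later.

```r
library(purrr)

# Hypothetical list of dataframes standing in for the output of group_split()
dfs <- list(
  data.frame(x = 1:3),
  data.frame(x = 1:5)
)

# Shorthand anonymous function
short <- map(dfs, \(df) nrow(df))

# The same call with the function written out in full
long <- map(dfs, function(df) nrow(df))

identical(short, long) # TRUE
```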