Appendix B — Big data
Once the size of your dataset starts to approach the size of your computer’s memory (or vastly exceeds it!), it becomes difficult to process this data using R. There are a few different approaches to managing this, such as using the arrow package for larger-than-memory data workflows or the duckdb package for accessing and managing databases.
Here we briefly summarise how arrow and duckdb work. For more complete summaries of these packages, check out the arrow chapter and databases chapter in R for Data Science.
B.0.1 Prerequisites
In this chapter, we will use Pardalote occurrence data. You will need to save the pardalotes data locally (i.e. write it to disk) as a csv file to use throughout this chapter:
# packages
library(galah)
library(here)
library(readr)
galah_config(email = "your-email-here") # ALA-registered email
galah_call() |>
  filter(doi == "https://doi.org/10.26197/ala.4730caca-0570-4e63-9652-22f92d0b2e1a") |>
  atlas_occurrences() |>
  write_csv(here("data", "pardalotes.csv"))
B.0.2 arrow
The arrow package allows users to analyse larger-than-memory datasets using dplyr verbs. A common workflow might consist of reading in some data, filtering or summarising according to some property of the data, and writing the result to a new file. This is relatively simple to do with a small csv file, and we can do something conceptually similar using arrow for larger datasets¹.
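As a point of comparison, here is a minimal sketch of that small-data workflow using readr and dplyr, with everything held in memory (the summary file name is illustrative):
# small-data version: read into memory, filter/summarise, write the result out
library(readr)
library(dplyr)
library(here)

small_data <- read_csv(here("data", "pardalotes.csv"))

small_data |>
  filter(!is.na(species)) |>
  count(species, year, name = "count") |>
  write_csv(here("data", "pardalotes-summary.csv"))
This works because read_csv() loads the entire file into memory; the rest of this section shows what changes when that is no longer possible.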
We can open the csv file we just downloaded as a dataset instead of reading it into memory…
library(arrow)
library(dplyr)

pardalotes <- open_dataset(here("data", "pardalotes.csv"), format = "csv")
glimpse(pardalotes)
FileSystemDataset with 1 csv file
557,184 rows x 8 columns
$ genus <string> "Pardalotus", "Pardalotus", "Pardalotus", "Pardalotu…
$ species <string> "Pardalotus striatus", "Pardalotus striatus", "Parda…
$ scientificName <string> "Pardalotus (Pardalotinus) striatus", "Pardalotus (P…
$ cl22 <string> "Queensland", "New South Wales", "New South Wales", …
$ year <int64> 2019, 2018, 2020, 2021, 2016, 2015, 2022, 2020, 2019…
$ month <int64> 6, 6, 6, 3, 7, 11, 9, 3, 6, 11, 8, 2, 6, 10, 8, 1, 1…
$ decimalLatitude <double> -27.57044, -28.85477, -34.25083, -36.53735, -35.3981…
$ decimalLongitude <double> 153.2038, 153.5448, 150.6065, 143.8612, 149.1175, 15…
…and perform some common operations on it before finally bringing it into memory with collect().
pardalotes |>
  select(species, year, cl22) |>
  filter(species != "NA", cl22 != "NA") |>
  group_by(species, year, cl22) |>
  summarise(count = n(), .groups = "drop") |>
  collect()
# A tibble: 208 × 4
species year cl22 count
<chr> <int> <chr> <int>
1 Pardalotus striatus 2019 Queensland 9484
2 Pardalotus striatus 2018 New South Wales 6702
3 Pardalotus punctatus 2021 Victoria 19799
4 Pardalotus punctatus 2016 Australian Capital Territory 2315
5 Pardalotus punctatus 2020 New South Wales 7990
6 Pardalotus striatus 2015 New South Wales 3294
7 Pardalotus striatus 2022 Queensland 14627
8 Pardalotus punctatus 2020 Queensland 1174
9 Pardalotus striatus 2020 Victoria 11009
10 Pardalotus striatus 2018 Victoria 9484
# ℹ 198 more rows
arrow has many other functions that make working with big data as smooth as possible, such as reading and writing different file formats (including parquet² and feather for efficient storage), partitioning datasets for more effective querying, working with multi-file datasets, and interacting with cloud storage.
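For example, here is a minimal sketch of writing the pardalotes data out as a parquet dataset partitioned by year and then querying it again (the output directory name is illustrative):
# write the dataset as parquet files, partitioned by year
pardalotes |>
  write_dataset(here("data", "pardalotes_parquet"), format = "parquet", partitioning = "year")

# reopen the partitioned dataset; a filter on year only needs to touch the matching files
open_dataset(here("data", "pardalotes_parquet")) |>
  filter(year == 2021) |>
  collect()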
B.0.3 duckdb and dbplyr
DuckDB is a database management system that can read, write, and manipulate larger-than-memory datasets using SQL (a standard language for accessing and manipulating databases). This can be very useful for working with large relational tables, for instance, or for appending large amounts of data to existing datasets. To access and manipulate these datasets programmatically in R without having to use SQL, we can use dbplyr together with duckdb.
As a small example, we’ll perform a similar operation to the one we did above using arrow. Here, we register the pardalotes data as a table in an in-memory database and then summarise it.
library(duckdb)
library(dbplyr)

con <- dbConnect(duckdb())
duckdb_register(con, "pardalotes", pardalotes)

dplyr::tbl(con, "pardalotes") |>
  select(species, year, cl22) |>
  filter(species != "NA", cl22 != "NA") |>
  group_by(species, year, cl22) |>
  summarise(count = n(), .groups = "drop") |>
  collect()
# A tibble: 208 × 4
species year cl22 count
<chr> <int> <chr> <dbl>
1 Pardalotus punctatus 2020 New South Wales 7990
2 Pardalotus punctatus 2021 Australian Capital Territory 5265
3 Pardalotus striatus 2020 Northern Territory 742
4 Pardalotus punctatus 2018 New South Wales 10337
5 Pardalotus striatus 2023 Western Australia 851
6 Pardalotus striatus 2021 Victoria 12208
7 Pardalotus striatus 2021 Western Australia 4311
8 Pardalotus striatus 2017 Northern Territory 628
9 Pardalotus striatus 2019 Australian Capital Territory 2778
10 Pardalotus punctatus 2022 Australian Capital Territory 4421
# ℹ 198 more rows
dbDisconnect(con)
One of the really cool things about duckdb and arrow is zero-copy integration, which allows you to pass datasets back and forth between the two engines within an analysis pipeline with very little loss of efficiency. If there is some functionality that is not supported by one of the packages but is supported by the other, you can simply switch in the middle of an analysis!
The duckdb and arrow blogs have both written about this integration in greater detail.
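As a minimal sketch of what that switching looks like (assuming the pardalotes arrow dataset opened earlier in this appendix), the to_duckdb() and to_arrow() helpers in arrow move a query between engines without copying the underlying data:
library(arrow)
library(dplyr)

pardalotes |>
  to_duckdb() |>    # the query now runs in duckdb, no data copied
  filter(year >= 2020) |>
  group_by(species) |>
  summarise(count = n(), .groups = "drop") |>
  to_arrow() |>     # hand the result back to arrow, again without copying
  collect()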
¹ The dataset used in this example is relatively small (~ half a million rows), but the benefits of using arrow become obvious once the number of rows in your dataset approaches tens or hundreds of millions.
² Parquet files use a columnar storage format (rather than a row storage format, as csv files do) and provide an especially efficient way to store and retrieve large datasets.