Appendix B — Big data
Once the size of your dataset starts to approach the size of your computer's memory (or vastly exceeds it!), it becomes difficult to process that data in R. There are a few different approaches to managing this, such as using the arrow package for larger-than-memory data workflows or the duckdb package for accessing and managing databases.
Here we briefly summarise how arrow and duckdb work. For more complete summaries of these packages, check out the arrow chapter and databases chapter in R for Data Science.
B.0.1 Prerequisites
In this chapter, we will use Pardalote occurrence data. You will need to save the pardalotes data locally (i.e. write it to disk) as a csv file to use throughout this chapter. The code below downloads the data from the ALA and writes it to data/pardalotes.csv.
# packages
library(galah)
library(here)
library(readr)
galah_config(email = "your-email-here") # ALA-registered email

galah_call() |>
  filter(doi == "https://doi.org/10.26197/ala.ebc86d15-47ca-45fe-b9bb-4fd2e9c76bc0") |>
  atlas_occurrences() |>
  write_csv(here("data", "pardalotes.csv"))
B.0.2 arrow
The arrow package allows users to analyse larger-than-memory datasets using dplyr verbs. A common workflow might consist of reading in some data, filtering or summarising it according to some property of the data, and writing the result to a new file. This is relatively simple to do with a small csv file, and we can do something conceptually similar using arrow for larger datasets1.
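For a csv file that fits comfortably in memory, that workflow is just read, manipulate, write. A minimal sketch (the year filter and the pardalotes_recent.csv output name are arbitrary examples, not part of the workflow above):
library(dplyr)
read_csv(here("data", "pardalotes.csv")) |>        # read the whole file into memory
  filter(year >= 2020) |>                          # keep only recent records
  write_csv(here("data", "pardalotes_recent.csv")) # write the result to a new csv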
We can open the csv file we just downloaded as a dataset instead of reading it into memory…
library(arrow)
pardalotes <- open_dataset(here("data", "pardalotes.csv"), format = "csv")
glimpse(pardalotes)
FileSystemDataset with 1 csv file
696,212 rows x 8 columns
$ genus <string> "Pardalotus", "Pardalotus", "Pardalotus", "Pardalotu…
$ species <string> "Pardalotus striatus", "Pardalotus striatus", "Parda…
$ scientificName <string> "Pardalotus (Pardalotinus) striatus", "Pardalotus (P…
$ cl22 <string> "New South Wales", "Victoria", "South Australia", "S…
$ year <int64> 2024, 2021, 2018, 2017, 2016, 2020, 2020, 2016, 2017…
$ month <int64> 3, 2, 10, 12, 1, 6, 9, 4, 11, 10, 6, 5, 9, 4, 5, 10,…
$ decimalLatitude <double> -35.15871, -37.05037, -34.17722, -33.13484, -16.5000…
$ decimalLongitude <double> 148.4348, 145.1741, 138.8347, 136.4116, 144.8970, 15…
…and perform some common operations on it before finally bringing it into memory with collect().
pardalotes |>
  select(species, year, cl22) |>
  filter(species != "NA", cl22 != "NA") |>
  group_by(species, year, cl22) |>
  summarise(count = n(), .groups = "drop") |>
  collect()
# A tibble: 230 × 4
species year cl22 count
<chr> <int> <chr> <int>
1 Pardalotus striatus 2024 New South Wales 3300
2 Pardalotus striatus 2021 Victoria 12163
3 Pardalotus striatus 2018 South Australia 2126
4 Pardalotus striatus 2017 South Australia 2649
5 Pardalotus striatus 2016 Queensland 6946
6 Pardalotus striatus 2020 New South Wales 6364
7 Pardalotus striatus 2020 Victoria 11009
8 Pardalotus punctatus 2016 Victoria 8344
9 Pardalotus striatus 2017 New South Wales 5138
10 Pardalotus striatus 2020 South Australia 1881
# ℹ 220 more rows
arrow has many other functions that make working with big data as smooth as possible, such as reading and writing different file formats (including parquet2 and feather for efficient storage), partitioning datasets for more effective querying, working with multi-file datasets, and interacting with cloud storage.
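For example, a partitioned parquet copy of the pardalotes data could be written and re-opened like this (a minimal sketch using the pardalotes dataset opened above; the pardalotes_parquet directory name and the year filter are arbitrary examples):
# write the dataset out as parquet files, split into one directory per year
pardalotes |>
  write_dataset(here("data", "pardalotes_parquet"),
                format = "parquet",
                partitioning = "year")

# re-open the partitioned dataset; filtering on year only touches matching files
open_dataset(here("data", "pardalotes_parquet")) |>
  filter(year == 2023) |>
  collect()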
B.0.3 duckdb and dbplyr
DuckDB is a database management system that can read, write, and manipulate larger-than-memory datasets using SQL (a standard language for accessing and manipulating databases). This can be very useful for working with large relational tables, for instance, or for appending large amounts of data to existing datasets. To access and manipulate these datasets programmatically in R without having to use SQL, we can use dbplyr together with duckdb.
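To give a flavour of the SQL side, a DuckDB database can also live on disk and be queried directly. A rough sketch (the pardalotes.duckdb file name and the more_records table are hypothetical examples):
library(duckdb)
con <- dbConnect(duckdb(), dbdir = here("data", "pardalotes.duckdb")) # on-disk database
dbWriteTable(con, "pardalotes", read_csv(here("data", "pardalotes.csv"))) # create a table
# later, append more rows to the existing table without re-reading everything:
# dbWriteTable(con, "pardalotes", more_records, append = TRUE)
dbGetQuery(con, "SELECT species, COUNT(*) AS n FROM pardalotes GROUP BY species")
dbDisconnect(con, shutdown = TRUE)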
As a small example, we’ll perform a similar operation to the one we did above using arrow. Here, we register the pardalotes dataset as a table in an in-memory database and then summarise it.
library(duckdb)
library(dbplyr)
con <- dbConnect(duckdb())
duckdb_register(con, "pardalotes", pardalotes)

dplyr::tbl(con, "pardalotes") |>
  select(species, year, cl22) |>
  filter(species != "NA", cl22 != "NA") |>
  group_by(species, year, cl22) |>
  summarise(count = n(), .groups = "drop") |>
  collect()
# A tibble: 230 × 4
species year cl22 count
<chr> <int> <chr> <dbl>
1 Pardalotus striatus 2020 New South Wales 6364
2 Pardalotus striatus 2018 New South Wales 6726
3 Pardalotus punctatus 2022 Australian Capital Territory 4426
4 Pardalotus rubricatus 2018 Queensland 212
5 Pardalotus striatus 2024 Western Australia 1704
6 Pardalotus punctatus 2024 South Australia 891
7 Pardalotus striatus 2024 Australian Capital Territory 1146
8 Pardalotus punctatus 2015 Australian Capital Territory 2322
9 Pardalotus punctatus 2022 Western Australia 1010
10 Pardalotus rubricatus 2021 Northern Territory 206
# ℹ 220 more rows
dbDisconnect(con)
One of the really cool things about duckdb and arrow is zero-copy integration, which allows you to pass datasets back and forth between the two engines within an analysis pipeline with very little loss of efficiency. If there is some functionality that is not supported by one of the packages but is supported by the other, you can simply switch in the middle of an analysis!
The duckdb and arrow blogs have both written about this integration in greater detail.
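As a rough sketch of what that switching can look like (assuming the pardalotes arrow dataset opened earlier; to_duckdb() and to_arrow() are provided by the arrow package, and the filter and grouping here are arbitrary examples):
pardalotes |>
  to_duckdb() |>            # hand the arrow dataset to duckdb without copying
  filter(year >= 2020) |>   # ...do this step with duckdb...
  to_arrow() |>             # ...then hand the result back to arrow
  group_by(species) |>
  summarise(count = n()) |>
  collect()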
The dataset used in this example is relatively small (~ half a million rows), but the benefits of using arrow become obvious once the number of rows in your dataset approaches tens or hundreds of millions.↩︎
Parquet files use a columnar storage format (rather than a row storage format, as csv files do) and provide an especially efficient way to store and retrieve large datasets.↩︎