Appendix B — Big data

Once the size of your dataset starts to approach the size of your computer’s memory (or vastly exceeds it!), it becomes difficult to process this data in R. There are a few different approaches to managing this, such as using the arrow package for larger-than-memory data workflows or the duckdb package for accessing and managing databases.

Here we briefly summarise how arrow and duckdb work. For more complete summaries of these packages, check out the arrow chapter and databases chapter in R for Data Science.

B.0.1 Prerequisites

In this chapter, we will use Pardalote occurrence data. You will need to save the pardalotes data locally (i.e. write it to disk) as a csv file to use throughout this chapter.

# packages
library(galah)
library(here)
library(readr)
galah_config(email = "your-email-here") # ALA-registered email

galah_call() |>
  filter(doi == "https://doi.org/10.26197/ala.ebc86d15-47ca-45fe-b9bb-4fd2e9c76bc0") |>
  atlas_occurrences() |>
  write_csv(here("data", "pardalotes.csv"))

Note: You don’t need to run the code block below to read this chapter; it is included because it can be useful to see the original download query. It downloads the latest data from the ALA, which you are welcome to use instead, though it might not exactly reproduce the results in this chapter.

library(galah)
galah_config(email = "your-email-here")

pardalotes <- galah_call() |>
  identify("Pardalotus") |>
  filter(year >= 2015) |>
  select(genus, 
         species, 
         scientificName, 
         cl22,
         year,
         month, 
         decimalLatitude,
         decimalLongitude) |> 
  atlas_occurrences()
Note: We created a custom DOI for our download by using atlas_occurrences(mint_doi = TRUE).

B.0.2 arrow

The arrow package allows users to analyse larger-than-memory datasets using dplyr verbs. A common workflow might consist of reading in some data, filtering or summarising it according to some property of the data, and writing the result to a new file. This is relatively simple to do with a small csv file, and we can do something conceptually similar using arrow for larger datasets1.
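
For a dataset that fits comfortably in memory, that workflow might look something like the following minimal sketch using readr and dplyr (the output filename pardalotes_summary.csv is just an illustrative name):

library(readr)
library(dplyr)
library(here)

# read the whole csv into memory, summarise it, and write the result to a new file
read_csv(here("data", "pardalotes.csv")) |>
  filter(!is.na(species)) |>
  count(species, year, name = "count") |>
  write_csv(here("data", "pardalotes_summary.csv"))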

We can open the csv file we just downloaded as a dataset instead of reading it into memory…

library(arrow)
library(dplyr) # dplyr verbs (glimpse, select, filter, summarise, ...) work on arrow datasets
pardalotes <- open_dataset(here("data", "pardalotes.csv"), format = "csv")
glimpse(pardalotes)
FileSystemDataset with 1 csv file
696,212 rows x 8 columns
$ genus            <string> "Pardalotus", "Pardalotus", "Pardalotus", "Pardalotu…
$ species          <string> "Pardalotus striatus", "Pardalotus striatus", "Parda…
$ scientificName   <string> "Pardalotus (Pardalotinus) striatus", "Pardalotus (P…
$ cl22             <string> "New South Wales", "Victoria", "South Australia", "S…
$ year              <int64> 2024, 2021, 2018, 2017, 2016, 2020, 2020, 2016, 2017…
$ month             <int64> 3, 2, 10, 12, 1, 6, 9, 4, 11, 10, 6, 5, 9, 4, 5, 10,…
$ decimalLatitude  <double> -35.15871, -37.05037, -34.17722, -33.13484, -16.5000…
$ decimalLongitude <double> 148.4348, 145.1741, 138.8347, 136.4116, 144.8970, 15…

…and perform some common operations on it before finally bringing it into memory with collect().

pardalotes |> 
  select(species, year, cl22) |> 
  filter(species != "NA", cl22 != "NA") |> 
  group_by(species, year, cl22) |> 
  summarise(count = n(), .groups = "drop") |>
  collect()
# A tibble: 230 × 4
   species               year cl22            count
   <chr>                <int> <chr>           <int>
 1 Pardalotus striatus   2024 New South Wales  3300
 2 Pardalotus striatus   2021 Victoria        12163
 3 Pardalotus striatus   2018 South Australia  2126
 4 Pardalotus striatus   2017 South Australia  2649
 5 Pardalotus striatus   2016 Queensland       6946
 6 Pardalotus striatus   2020 New South Wales  6364
 7 Pardalotus striatus   2020 Victoria        11009
 8 Pardalotus punctatus  2016 Victoria         8344
 9 Pardalotus striatus   2017 New South Wales  5138
10 Pardalotus striatus   2020 South Australia  1881
# ℹ 220 more rows

arrow has many other functions that make working with big data as smooth as possible, such as reading and writing different file formats (including parquet2 and feather for efficient storage), partitioning datasets for more effective querying, working with multi-file datasets, and interacting with cloud storage.
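
For example, using the arrow, dplyr, and here packages loaded above, a partitioned parquet copy of the pardalotes data can be written with write_dataset() and queried with open_dataset(). The directory data/pardalotes_parquet below is just an illustrative path:

# write the dataset to parquet, partitioned into one folder per state/territory
pardalotes |>
  group_by(cl22) |>
  write_dataset(here("data", "pardalotes_parquet"), format = "parquet")

# open the partitioned dataset and query it as before
open_dataset(here("data", "pardalotes_parquet")) |>
  filter(year >= 2020) |>
  group_by(species) |>
  summarise(count = n()) |>
  collect()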

B.0.3 duckdb and dbplyr

DuckDB is a database management system that can read, write, and manipulate larger-than-memory datasets using SQL (a standard language for accessing and manipulating databases). This can be very useful for working with large relational tables, for instance, or appending large amounts of data to existing datasets. To access and manipulate these datasets programmatically in R without having to use SQL, we can use dbplyr together with duckdb.
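
To get a feel for what dbplyr does behind the scenes, here is a self-contained sketch that copies a small built-in data frame (mtcars) into an in-memory DuckDB database, builds a query with dplyr verbs, and uses show_query() to print the SQL that dbplyr generates:

library(DBI)
library(duckdb)
library(dplyr)
library(dbplyr)

con <- dbConnect(duckdb())           # in-memory DuckDB database
dbWriteTable(con, "mtcars", mtcars)  # copy a small data frame into the database

tbl(con, "mtcars") |>                # a lazy reference to the database table
  group_by(cyl) |>
  summarise(mean_mpg = mean(mpg, na.rm = TRUE)) |>
  show_query()                       # print the SQL translation instead of running it

dbDisconnect(con, shutdown = TRUE)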

As a small example, we’ll perform a similar operation to the one we did above using arrow. Here, we register the pardalotes dataset we opened with arrow as a virtual table in an in-memory DuckDB database and then summarise it.

library(duckdb)
library(dbplyr)

con <- dbConnect(duckdb())                            # in-memory DuckDB database
duckdb_register_arrow(con, "pardalotes", pardalotes)  # register the arrow dataset as a virtual table
  
dplyr::tbl(con, "pardalotes") |>
  select(species, year, cl22) |> 
  filter(species != "NA", cl22 != "NA") |> 
  group_by(species, year, cl22) |> 
  summarise(count = n(), .groups = "drop") |>
  collect()
# A tibble: 230 × 4
   species                year cl22                         count
   <chr>                 <int> <chr>                        <dbl>
 1 Pardalotus striatus    2020 New South Wales               6364
 2 Pardalotus striatus    2018 New South Wales               6726
 3 Pardalotus punctatus   2022 Australian Capital Territory  4426
 4 Pardalotus rubricatus  2018 Queensland                     212
 5 Pardalotus striatus    2024 Western Australia             1704
 6 Pardalotus punctatus   2024 South Australia                891
 7 Pardalotus striatus    2024 Australian Capital Territory  1146
 8 Pardalotus punctatus   2015 Australian Capital Territory  2322
 9 Pardalotus punctatus   2022 Western Australia             1010
10 Pardalotus rubricatus  2021 Northern Territory             206
# ℹ 220 more rows
dbDisconnect(con, shutdown = TRUE)

One of the really cool things about duckdb and arrow is zero-copy integration, which allows you to pass datasets back and forth between the two engines within an analysis pipeline with very little loss of efficiency. If there is some functionality that is not supported by one of the packages but is supported by the other, you can simply switch in the middle of an analysis!
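
As a rough sketch of what this looks like in practice (using the arrow and duckdb packages loaded above), to_duckdb() hands an arrow dataset to duckdb without copying it, and to_arrow() hands the result back:

pardalotes |>
  to_duckdb() |>             # pass the arrow dataset to duckdb (zero-copy)
  filter(year >= 2020) |>    # this step is evaluated by duckdb
  to_arrow() |>              # pass the result back to arrow (zero-copy)
  group_by(species) |>
  summarise(count = n()) |>  # this step is evaluated by arrow
  collect()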

The duckdb and arrow blogs have both written about this integration in greater detail.


  1. The dataset used in this example is relatively small (~700,000 rows), but the benefits of using arrow become obvious once the number of rows in your dataset approaches tens or hundreds of millions.↩︎

  2. Parquet files use a columnar storage format (rather than a row storage format, as csv files do) and provide an especially efficient way to store and retrieve large datasets.↩︎