Appendix B — Big data

Once the size of your dataset starts to approach the size of your computer’s memory (or vastly exceeds it!), it becomes difficult to process that data in R. There are a few different approaches to managing this, such as using the arrow package for larger-than-memory data workflows or the duckdb package for accessing and managing databases.

Here we briefly summarise how arrow and duckdb work. For more complete summaries of these packages, check out the arrow chapter and databases chapter in R for Data Science.

B.0.1 Prerequisites

In this chapter, we will use Pardalote occurrence data. You will need to save the pardalotes data locally (i.e. write to disk) as a csv file to use throughout this chapter.

# packages
library(galah)
library(here)
library(readr)
galah_config(email = "your-email-here") # ALA-registered email

galah_call() |>
  filter(doi == "https://doi.org/10.26197/ala.4730caca-0570-4e63-9652-22f92d0b2e1a") |>
  atlas_occurrences() |>
  write_csv(here("data", "pardalotes.csv"))

Note: You don’t need to run this code block to read this chapter. It can, however, be useful to see the original download query. This code will download the latest data from the ALA, which you are welcome to use instead, though the data might not exactly reproduce results in this chapter.

library(galah)
galah_config(email = "your-email-here")

pardalotes <- galah_call() |>
  identify("Pardalotus") |>
  filter(year >= 2015) |>
  select(genus, 
         species, 
         scientificName, 
         cl22,
         year,
         month, 
         decimalLatitude,
         decimalLongitude) |> 
  atlas_occurrences()
We created a custom DOI for our download by using atlas_occurrences(mint_doi = TRUE).

B.0.2 arrow

The arrow package allows users to analyse larger-than-memory datasets using dplyr verbs. A common workflow might consist of reading in some data, filtering or summarising it according to some property of the data, and writing the result to a new file. This is relatively simple to do with a small csv file, and we can do something conceptually similar using arrow for larger datasets1.
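For comparison, the small-data version of that workflow might look something like the sketch below, which reads the whole csv into memory with readr and dplyr (the output file name pardalotes_summary.csv is just illustrative).

library(readr)
library(dplyr)
library(here)

# The in-memory version: read_csv() loads the entire file into memory
# before any filtering or summarising happens.
read_csv(here("data", "pardalotes.csv")) |>
  filter(!is.na(species), !is.na(cl22)) |>
  group_by(species, year, cl22) |>
  summarise(count = n(), .groups = "drop") |>
  write_csv(here("data", "pardalotes_summary.csv")) # illustrative output file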

We can open the csv file we just downloaded as a dataset instead of reading it into memory…

library(arrow)
library(dplyr) # dplyr verbs (select, filter, group_by, summarise) work on arrow datasets
pardalotes <- open_dataset(here("data", "pardalotes.csv"), format = "csv")
glimpse(pardalotes)
FileSystemDataset with 1 csv file
557,184 rows x 8 columns
$ genus            <string> "Pardalotus", "Pardalotus", "Pardalotus", "Pardalotu…
$ species          <string> "Pardalotus striatus", "Pardalotus striatus", "Parda…
$ scientificName   <string> "Pardalotus (Pardalotinus) striatus", "Pardalotus (P…
$ cl22             <string> "Queensland", "New South Wales", "New South Wales", …
$ year              <int64> 2019, 2018, 2020, 2021, 2016, 2015, 2022, 2020, 2019…
$ month             <int64> 6, 6, 6, 3, 7, 11, 9, 3, 6, 11, 8, 2, 6, 10, 8, 1, 1…
$ decimalLatitude  <double> -27.57044, -28.85477, -34.25083, -36.53735, -35.3981…
$ decimalLongitude <double> 153.2038, 153.5448, 150.6065, 143.8612, 149.1175, 15…

…and perform some common operations on it before finally bringing it into memory with collect().

pardalotes |> 
  select(species, year, cl22) |> 
  filter(species != "NA", cl22 != "NA") |> 
  group_by(species, year, cl22) |> 
  summarise(count = n(), .groups = "drop") |>
  collect()
# A tibble: 208 × 4
   species               year cl22                         count
   <chr>                <int> <chr>                        <int>
 1 Pardalotus striatus   2019 Queensland                    9484
 2 Pardalotus striatus   2018 New South Wales               6702
 3 Pardalotus punctatus  2021 Victoria                     19799
 4 Pardalotus punctatus  2016 Australian Capital Territory  2315
 5 Pardalotus punctatus  2020 New South Wales               7990
 6 Pardalotus striatus   2015 New South Wales               3294
 7 Pardalotus striatus   2022 Queensland                   14627
 8 Pardalotus punctatus  2020 Queensland                    1174
 9 Pardalotus striatus   2020 Victoria                     11009
10 Pardalotus striatus   2018 Victoria                      9484
# ℹ 198 more rows

arrow has many other functions that make working with big data as smooth as possible, such as reading and writing different file formats (including parquet2 and feather for efficient storage), partitioning datasets for more effective querying, working with multi-file datasets, and interacting with cloud storage.
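As a rough sketch of what that can look like (the pardalotes_parquet directory name is just an example), we could write our occurrence data out as a parquet dataset partitioned by state/territory, then query only the partitions we need:

# Write the arrow dataset to parquet, partitioned by cl22 (state/territory).
# Each value of cl22 becomes its own folder of parquet files, so queries
# that filter on cl22 only need to read the relevant folders.
pardalotes |>
  group_by(cl22) |>
  write_dataset(here("data", "pardalotes_parquet"), format = "parquet")

# Re-open the partitioned dataset and query it lazily, as before
open_dataset(here("data", "pardalotes_parquet")) |>
  filter(cl22 == "Tasmania") |> # assuming Tasmania appears in cl22
  group_by(species, year) |>
  summarise(count = n(), .groups = "drop") |>
  collect()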

B.0.3 duckdb and dbplyr

DuckDB is a database management system that can read, write, and manipulate larger-than-memory datasets using SQL (a standard language for accessing and manipulating databases). This can be very useful for working with large relational tables, for instance, or for appending large amounts of data to existing datasets. To access and manipulate these datasets programmatically in R without having to write SQL, we can use dbplyr together with duckdb.
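To make the SQL side of this concrete, here is a minimal sketch (using the csv file we saved earlier) that queries the data directly with SQL through DBI. DuckDB's read_csv_auto() function reads the file and infers column types for us; as in the arrow example, missing values were written to the csv as the string "NA", so we filter on that.

library(duckdb) # also attaches DBI

con <- dbConnect(duckdb())

# Run SQL directly against the csv file on disk
dbGetQuery(con, "
  SELECT species, year, COUNT(*) AS count
  FROM read_csv_auto('data/pardalotes.csv')
  WHERE species != 'NA'
  GROUP BY species, year
  ORDER BY count DESC
  LIMIT 5
")

dbDisconnect(con)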

As a small example, we’ll perform a similar operation to the one we did above using arrow. Here, we register the pardalotes data as a table in an in-memory DuckDB database and then summarise it.

library(duckdb)
library(dbplyr)  

con <- dbConnect(duckdb()) 
duckdb_register(con, "pardalotes", pardalotes)  
  
dplyr::tbl(con, "pardalotes") |>
  select(species, year, cl22) |> 
  filter(species != "NA", cl22 != "NA") |> 
  group_by(species, year, cl22) |> 
  summarise(count = n(), .groups = "drop") |>
  collect()
# A tibble: 208 × 4
   species               year cl22                         count
   <chr>                <int> <chr>                        <dbl>
 1 Pardalotus punctatus  2020 New South Wales               7990
 2 Pardalotus punctatus  2021 Australian Capital Territory  5265
 3 Pardalotus striatus   2020 Northern Territory             742
 4 Pardalotus punctatus  2018 New South Wales              10337
 5 Pardalotus striatus   2023 Western Australia              851
 6 Pardalotus striatus   2021 Victoria                     12208
 7 Pardalotus striatus   2021 Western Australia             4311
 8 Pardalotus striatus   2017 Northern Territory             628
 9 Pardalotus striatus   2019 Australian Capital Territory  2778
10 Pardalotus punctatus  2022 Australian Capital Territory  4421
# ℹ 198 more rows
dbDisconnect(con)  

One of the really cool things about duckdb and arrow is zero-copy integration, which allows you to pass datasets back and forth between the two engines within an analysis pipeline with very little loss of efficiency. If there is some functionality that is not supported by one of the packages but is supported by the other, you can simply switch in the middle of an analysis!
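For example, here is a minimal sketch (assuming the pardalotes arrow dataset opened earlier is still available) using arrow's to_duckdb() and to_arrow() helpers to hop between the two engines mid-pipeline:

library(arrow)
library(dplyr)

pardalotes |>
  to_duckdb() |>           # hand the data to duckdb without copying it
  filter(species != "NA") |>
  count(species, year) |>  # this summary is computed by duckdb
  to_arrow() |>            # hand the result back to arrow, again without copying
  collect()                # bring the final (small) result into R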

The duckdb and arrow blogs have both written about this integration in greater detail.


  1. The dataset used in this example is relatively small (~ half a million rows), but the benefits of using arrow become obvious once the number of rows in your dataset approaches tens or hundreds of millions.↩︎

  2. Parquet files use a columnar storage format (rather than a row storage format, as csv files do) and provide an especially efficient way to store and retrieve large datasets.↩︎