7  Dates

Information about dates and times can be formatted in many different ways in datasets. For example, you might encounter:

It’s common to find multiple date formats within a single dataset, especially when combining data from various sources. Often, you’ll need to extract different types of information from dates, such as the week number or the interval between two dates. You may also need to filter your dataset by dates or times.

In this chapter, we’ll introduce some basic functions for cleaning dates and times. We’ll also demonstrate how to reformat and filter your dataset using dates.

For a more comprehensive guide on handling dates and times in R, the Dates and Times chapter in R for Data Science is a fantastic resource.

7.0.1 Prerequisites

In this chapter, we will use Spider flower (Grevillea) occurrence records in the ALA.

# packages
library(galah)
library(dplyr)
galah_config(email = "your-email-here") # ALA-registered email

plants <- galah_call() |>
  filter(doi == "https://doi.org /10.26197/ala.27544a26-31b5-4a36-955d-30a1bb8a4636") |>
  atlas_occurrences()

Note: You don’t need to run this code block to read this chapter. It can, however, be useful to see the original download query. This code will download the latest data from the ALA, which you are welcome to use instead, though the data might not exactly reproduce results in this chapter.

library(galah)
galah_config(email = "your-email-here")

plants <- galah_call() |>
  identify("grevillea") |>
  atlas_occurrences()
1
We created a custom DOI for our download by using atlas_occurrences(mint_doi = TRUE).

7.1 Basic date manipulation

One of the most useful data cleaning packages for dates is the lubridate package. Below are some examples of common date cleaning functions.

library(lubridate)

7.1.1 Reformat

We can use functions in the lubridate package to reformat dates written in different ways to YYYY-MM-DD format.

date("2017-10-11T14:02:00")
[1] "2017-10-11"
dmy("11 October 2020")
[1] "2020-10-11"
mdy("10/11/2020")
[1] "2020-10-11"

Sometimes dates are presented in formats that do not translate cleanly into R. For example, the following date format isn’t converted correctly when we try to convert it to a date.

df <- tibble(
  date = c("X2020.01.22",
           "X2020.01.22",
           "X2020.01.22",
           "X2020.01.22")
)

df |> 
  mutate(
    date = as_date(date)
  )
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `date = as_date(date)`.
Caused by warning:
! All formats failed to parse. No formats found.
# A tibble: 4 × 1
  date  
  <date>
1 NA    
2 NA    
3 NA    
4 NA    

We can use % to be more explicit about what information is in each part of our date column, specifying where the 4-digit year (%Y), 2-digit month (%m) and 2 digit day (%d) are within each string. Learn more about date formats in the dates chapter in R for Data Science.

df |> 
  mutate(
    date = as_date(date, format = "X%Y.%m.%d")
  )
# A tibble: 4 × 1
  date      
  <date>    
1 2020-01-22
2 2020-01-22
3 2020-01-22
4 2020-01-22

7.1.2 Extract

Sometimes we might need to extract certain elements of a longer date-time value for summarising, filtering, or plotting data.

Date information

year("2017-11-28T14:02:00")
[1] 2017
month("2017-11-28T14:02:00")
[1] 11
week("2017-11-28T14:02:00")
[1] 48
day("2017-11-28T14:02:00")
[1] 28

Time information

ymd_hms("2017-11-28T14:02:00")
[1] "2017-11-28 14:02:00 UTC"
ymd_hms("2017-11-28T14:02:00", tz = "Australia/Melbourne")
[1] "2017-11-28 14:02:00 AEDT"
ymd_hms("2017-11-28T14:02:00") |> hour()
[1] 14
ymd_hms("2017-11-28T14:02:00") |> minute()
[1] 2
am("2017-11-28T14:02:00")
[1] TRUE
pm("2017-11-28T14:02:00")
[1] FALSE

7.1.3 An example using galah

Data downloaded using the galah package are loaded into R as date and time data (class POSIXct). As a result, you can immediately begin extracting date/time information using the functions above.

plants |>
  mutate(
    year = year(eventDate),
    month = month(eventDate),
    week = isoweek(eventDate),
    day_julian = yday(eventDate)
    ) |>
  select(eventDate, year, month, 
         week, day_julian)
# A tibble: 194,926 × 5
   eventDate            year month  week day_julian
   <dttm>              <dbl> <dbl> <dbl>      <dbl>
 1 2018-07-08 00:00:00  2018     7    27        189
 2 2019-11-14 00:00:00  2019    11    46        318
 3 2003-09-19 00:00:00  2003     9    38        262
 4 2019-09-06 00:00:00  2019     9    36        249
 5 2022-08-31 00:00:00  2022     8    35        243
 6 1996-10-30 00:00:00  1996    10    44        304
 7 1948-09-20 00:00:00  1948     9    39        264
 8 1988-10-13 00:00:00  1988    10    41        287
 9 2022-10-10 00:00:00  2022    10    41        283
10 2020-10-21 00:00:00  2020    10    43        295
# ℹ 194,916 more rows

7.2 Filter

We can filter datasets to include or exclude data from certain dates or date ranges.

# return records after 2015
plants |>
  filter(eventDate >= ymd("2016-01-01"))
# A tibble: 57,771 × 8
   recordID       scientificName taxonConceptID decimalLatitude decimalLongitude
   <chr>          <chr>          <chr>                    <dbl>            <dbl>
 1 00005eb8-4b84… Grevillea mur… https://id.bi…           -35.8             138.
 2 000143a7-20e2… Grevillea jun… https://id.bi…           -33.7             151.
 3 00020893-9823… Grevillea aqu… https://id.bi…           -37.3             142.
 4 0002cd90-4099… Grevillea tre… https://id.bi…           -30.9             134.
 5 00040117-eae2… Grevillea par… https://id.bi…           -32.8             152.
 6 00045361-870d… Grevillea bux… https://id.bi…           -33.7             151.
 7 00053e34-1bdb… Grevillea ros… https://id.bi…           -37.7             145.
 8 0006e628-69d5… Grevillea rhi… https://id.bi…           -29.5             152.
 9 00075a74-e3de… Grevillea par… https://id.bi…           -34.2             151.
10 0008295a-79de… Grevillea cel… https://id.bi…           -37.7             148.
# ℹ 57,761 more rows
# ℹ 3 more variables: eventDate <dttm>, occurrenceStatus <chr>,
#   dataResourceName <chr>
# return records between 2015 & 2018
plants |> 
  filter(eventDate >= ymd("2016-01-01") & 
           eventDate <= ymd("2017-12-31"))
# A tibble: 9,246 × 8
   recordID       scientificName taxonConceptID decimalLatitude decimalLongitude
   <chr>          <chr>          <chr>                    <dbl>            <dbl>
 1 000b499b-dc80… Grevillea bux… https://id.bi…           -33.7             151.
 2 000e2c2b-953e… Grevillea hil… https://id.bi…            NA                NA 
 3 0014b88f-ecc5… Grevillea gut… https://id.bi…           -32.5             152.
 4 001ac6ed-1de5… Grevillea ser… https://id.bi…           -32.2             150.
 5 001c6599-0f27… Grevillea aca… https://id.bi…           -33.4             150.
 6 001f301f-0192… Grevillea dry… https://id.bi…           -12.5             131.
 7 002a8fb4-e1f9… Grevillea cal… https://id.bi…           -33.7             151.
 8 002e9fb9-ff3b… Grevillea cal… https://id.bi…           -33.7             151.
 9 0038f556-0591… Grevillea lin… https://id.bi…           -33.7             151.
10 003d2b67-6db4… Grevillea par… https://id.bi…           -34.3             151.
# ℹ 9,236 more rows
# ℹ 3 more variables: eventDate <dttm>, occurrenceStatus <chr>,
#   dataResourceName <chr>

7.3 Summary

In this chapter, we introduced common functions for cleaning dates and times in ecological datasets. For some data cleaning tasks, you may need to explore more advanced workflows to handle specific date and time formatting requirements.

In the next chapter, we look at issues that require more advanced ecological data cleaning techniques.