# packages
library(galah)
library(dplyr)
galah_config(email = "your-email-here") # ALA-registered email
plants <- galah_call() |>
  filter(doi == "https://doi.org/10.26197/ala.96edc0d3-b0be-475c-a51f-0a40b8ad2162") |>
  atlas_occurrences()7 Dates
Information about dates and times can be formatted in many different ways in datasets. For example, you might encounter:
- "2020-10-11T14:02:00"
- "11 October 2020"
- "10/11/2020"
It’s common to find multiple date formats within a single dataset, especially when combining data from various sources. Often, you’ll need to extract different types of information from dates, such as the week number or the interval between two dates. You may also need to filter your dataset by dates or times.
In this chapter, we’ll introduce some basic functions for cleaning dates and times. We’ll also demonstrate how to reformat and filter your dataset using dates.
For a more comprehensive guide on handling dates and times in R, the Dates and Times chapter in R for Data Science is a fantastic resource.
7.0.1 Prerequisites
In this chapter, we will use Spider flower (Grevillea) occurrence records in the ALA.
7.1 Basic date manipulation
One of the most useful data cleaning packages for dates is the lubridate package. Below are some examples of common date cleaning functions.
library(lubridate)7.1.1 Reformat
We can use functions in the lubridate package to reformat dates written in different ways to YYYY-MM-DD format.
date("2017-10-11T14:02:00")[1] "2017-10-11"dmy("11 October 2020")[1] "2020-10-11"mdy("10/11/2020")[1] "2020-10-11"Sometimes dates are presented in formats that do not translate cleanly into R. For example, the following date format isn’t converted correctly when we try to convert it to a date.
df <- tibble(
  date = c("X2020.01.22",
           "X2020.01.22",
           "X2020.01.22",
           "X2020.01.22")
)
df |> 
  mutate(
    date = as_date(date)
  )Warning: There was 1 warning in `mutate()`.
ℹ In argument: `date = as_date(date)`.
Caused by warning:
! All formats failed to parse. No formats found.# A tibble: 4 × 1
  date  
  <date>
1 NA    
2 NA    
3 NA    
4 NA    We can use % to be more explicit about what information is in each part of our date column, specifying where the 4-digit year (%Y), 2-digit month (%m) and 2 digit day (%d) are within each string. Learn more about date formats in the dates chapter in R for Data Science.
df |> 
  mutate(
    date = as_date(date, format = "X%Y.%m.%d")
  )# A tibble: 4 × 1
  date      
  <date>    
1 2020-01-22
2 2020-01-22
3 2020-01-22
4 2020-01-227.1.2 Extract
Sometimes we might need to extract certain elements of a longer date-time value for summarising, filtering, or plotting data.
Date information
year("2017-11-28T14:02:00")[1] 2017month("2017-11-28T14:02:00")[1] 11week("2017-11-28T14:02:00")[1] 48day("2017-11-28T14:02:00")[1] 28Time information
ymd_hms("2017-11-28T14:02:00")[1] "2017-11-28 14:02:00 UTC"ymd_hms("2017-11-28T14:02:00", tz = "Australia/Melbourne")[1] "2017-11-28 14:02:00 AEDT"ymd_hms("2017-11-28T14:02:00") |> hour()[1] 14ymd_hms("2017-11-28T14:02:00") |> minute()[1] 2am("2017-11-28T14:02:00")[1] TRUEpm("2017-11-28T14:02:00")[1] FALSE7.1.3 An example using galah
Data downloaded using the galah package are loaded into R as date and time data (class POSIXct). As a result, you can immediately begin extracting date/time information using the functions above.
plants |>
  mutate(
    year = year(eventDate),
    month = month(eventDate),
    week = isoweek(eventDate),
    day_julian = yday(eventDate)
    ) |>
  select(eventDate, year, month, 
         week, day_julian)# A tibble: 216,833 × 5
   eventDate            year month  week day_julian
   <dttm>              <dbl> <dbl> <dbl>      <dbl>
 1 NA                     NA    NA    NA         NA
 2 NA                     NA    NA    NA         NA
 3 2018-07-08 00:00:00  2018     7    27        189
 4 1953-11-19 00:00:00  1953    11    47        323
 5 2019-11-14 00:00:00  2019    11    46        318
 6 2003-09-19 00:00:00  2003     9    38        262
 7 2019-09-06 00:00:00  2019     9    36        249
 8 2022-08-31 00:00:00  2022     8    35        243
 9 1996-10-30 00:00:00  1996    10    44        304
10 1948-09-20 00:00:00  1948     9    39        264
# ℹ 216,823 more rowsKeep in mind that some occurrence records don’t have date information (either it wasn’t supplied or was deemed unreliable eg a non-existent date). Rows with missing dates will appear as NA values.
7.2 Filter
We can filter datasets to include or exclude data from certain dates or date ranges.
# return records after 2015
plants |>
  filter(eventDate >= ymd("2016-01-01"))# A tibble: 75,308 × 8
   recordID       scientificName taxonConceptID decimalLatitude decimalLongitude
   <chr>          <chr>          <chr>                    <dbl>            <dbl>
 1 00005eb8-4b84… Grevillea mur… https://id.bi…           -35.8             138.
 2 000143a7-20e2… Grevillea jun… https://id.bi…           -33.7             151.
 3 00020893-9823… Grevillea aqu… https://id.bi…           -37.3             142.
 4 0002cd90-4099… Grevillea tre… https://id.bi…           -30.9             134.
 5 00040117-eae2… Grevillea par… https://id.bi…           -32.8             152.
 6 00045361-870d… Grevillea bux… https://id.bi…           -33.7             151.
 7 00053e34-1bdb… Grevillea ros… https://id.bi…           -37.7             145.
 8 0006e628-69d5… Grevillea rhi… https://id.bi…           -29.5             152.
 9 00075a74-e3de… Grevillea par… https://id.bi…           -34.2             151.
10 0008295a-79de… Grevillea cel… https://id.bi…           -37.7             148.
# ℹ 75,298 more rows
# ℹ 3 more variables: eventDate <dttm>, occurrenceStatus <chr>,
#   dataResourceName <chr># return records between 2015 & 2018
plants |> 
  filter(eventDate >= ymd("2016-01-01") & 
           eventDate <= ymd("2017-12-31"))# A tibble: 9,830 × 8
   recordID       scientificName taxonConceptID decimalLatitude decimalLongitude
   <chr>          <chr>          <chr>                    <dbl>            <dbl>
 1 000b499b-dc80… Grevillea bux… https://id.bi…           -33.7             151.
 2 000e2c2b-953e… Grevillea hil… https://id.bi…            NA                NA 
 3 0014b88f-ecc5… Grevillea gut… https://id.bi…           -32.5             152.
 4 001ac6ed-1de5… Grevillea ser… https://id.bi…           -32.2             150.
 5 001c6599-0f27… Grevillea aca… https://id.bi…           -33.4             150.
 6 001f301f-0192… Grevillea dry… https://id.bi…           -12.5             131.
 7 002a8fb4-e1f9… Grevillea cal… https://id.bi…           -33.7             151.
 8 002e9fb9-ff3b… Grevillea cal… https://id.bi…           -33.7             151.
 9 003484c8-b1f3… Grevillea eri… https://id.bi…           -25.4             131.
10 0038f556-0591… Grevillea lin… https://id.bi…           -33.7             151.
# ℹ 9,820 more rows
# ℹ 3 more variables: eventDate <dttm>, occurrenceStatus <chr>,
#   dataResourceName <chr>7.3 Summary
In this chapter, we introduced common functions for cleaning dates and times in ecological datasets. For some data cleaning tasks, you may need to explore more advanced workflows to handle specific date and time formatting requirements.
In the next chapter, we look at issues that require more advanced ecological data cleaning techniques.
