# packages
library(galah)
library(dplyr)
galah_config(email = "your-email-here") # ALA-registered email
<- galah_call() |>
plants filter(doi == "https://doi.org /10.26197/ala.27544a26-31b5-4a36-955d-30a1bb8a4636") |>
atlas_occurrences()
7 Dates
Information about dates and times can be formatted in many different ways in datasets. For example, you might encounter:
"2020-10-11T14:02:00"
"11 October 2020"
"10/11/2020"
It’s common to find multiple date formats within a single dataset, especially when combining data from various sources. Often, you’ll need to extract different types of information from dates, such as the week number or the interval between two dates. You may also need to filter your dataset by dates or times.
In this chapter, we’ll introduce some basic functions for cleaning dates and times. We’ll also demonstrate how to reformat and filter your dataset using dates.
For a more comprehensive guide on handling dates and times in R, the Dates and Times chapter in R for Data Science is a fantastic resource.
7.0.1 Prerequisites
In this chapter, we will use Spider flower (Grevillea) occurrence records in the ALA.
7.1 Basic date manipulation
One of the most useful data cleaning packages for dates is the lubridate package. Below are some examples of common date cleaning functions.
library(lubridate)
7.1.1 Reformat
We can use functions in the lubridate package to reformat dates written in different ways to YYYY-MM-DD format.
date("2017-10-11T14:02:00")
[1] "2017-10-11"
dmy("11 October 2020")
[1] "2020-10-11"
mdy("10/11/2020")
[1] "2020-10-11"
Sometimes dates are presented in formats that do not translate cleanly into R. For example, the following date format isn’t converted correctly when we try to convert it to a date.
<- tibble(
df date = c("X2020.01.22",
"X2020.01.22",
"X2020.01.22",
"X2020.01.22")
)
|>
df mutate(
date = as_date(date)
)
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `date = as_date(date)`.
Caused by warning:
! All formats failed to parse. No formats found.
# A tibble: 4 × 1
date
<date>
1 NA
2 NA
3 NA
4 NA
We can use %
to be more explicit about what information is in each part of our date column, specifying where the 4-digit year (%Y
), 2-digit month (%m
) and 2 digit day (%d
) are within each string. Learn more about date formats in the dates chapter in R for Data Science.
|>
df mutate(
date = as_date(date, format = "X%Y.%m.%d")
)
# A tibble: 4 × 1
date
<date>
1 2020-01-22
2 2020-01-22
3 2020-01-22
4 2020-01-22
7.1.2 Extract
Sometimes we might need to extract certain elements of a longer date-time value for summarising, filtering, or plotting data.
Date information
year("2017-11-28T14:02:00")
[1] 2017
month("2017-11-28T14:02:00")
[1] 11
week("2017-11-28T14:02:00")
[1] 48
day("2017-11-28T14:02:00")
[1] 28
Time information
ymd_hms("2017-11-28T14:02:00")
[1] "2017-11-28 14:02:00 UTC"
ymd_hms("2017-11-28T14:02:00", tz = "Australia/Melbourne")
[1] "2017-11-28 14:02:00 AEDT"
ymd_hms("2017-11-28T14:02:00") |> hour()
[1] 14
ymd_hms("2017-11-28T14:02:00") |> minute()
[1] 2
am("2017-11-28T14:02:00")
[1] TRUE
pm("2017-11-28T14:02:00")
[1] FALSE
7.1.3 An example using galah
Data downloaded using the galah package are loaded into R as date and time data (class POSIXct
). As a result, you can immediately begin extracting date/time information using the functions above.
|>
plants mutate(
year = year(eventDate),
month = month(eventDate),
week = isoweek(eventDate),
day_julian = yday(eventDate)
|>
) select(eventDate, year, month,
week, day_julian)
# A tibble: 194,926 × 5
eventDate year month week day_julian
<dttm> <dbl> <dbl> <dbl> <dbl>
1 2018-07-08 00:00:00 2018 7 27 189
2 2019-11-14 00:00:00 2019 11 46 318
3 2003-09-19 00:00:00 2003 9 38 262
4 2019-09-06 00:00:00 2019 9 36 249
5 2022-08-31 00:00:00 2022 8 35 243
6 1996-10-30 00:00:00 1996 10 44 304
7 1948-09-20 00:00:00 1948 9 39 264
8 1988-10-13 00:00:00 1988 10 41 287
9 2022-10-10 00:00:00 2022 10 41 283
10 2020-10-21 00:00:00 2020 10 43 295
# ℹ 194,916 more rows
7.2 Filter
We can filter datasets to include or exclude data from certain dates or date ranges.
# return records after 2015
|>
plants filter(eventDate >= ymd("2016-01-01"))
# A tibble: 57,771 × 8
recordID scientificName taxonConceptID decimalLatitude decimalLongitude
<chr> <chr> <chr> <dbl> <dbl>
1 00005eb8-4b84… Grevillea mur… https://id.bi… -35.8 138.
2 000143a7-20e2… Grevillea jun… https://id.bi… -33.7 151.
3 00020893-9823… Grevillea aqu… https://id.bi… -37.3 142.
4 0002cd90-4099… Grevillea tre… https://id.bi… -30.9 134.
5 00040117-eae2… Grevillea par… https://id.bi… -32.8 152.
6 00045361-870d… Grevillea bux… https://id.bi… -33.7 151.
7 00053e34-1bdb… Grevillea ros… https://id.bi… -37.7 145.
8 0006e628-69d5… Grevillea rhi… https://id.bi… -29.5 152.
9 00075a74-e3de… Grevillea par… https://id.bi… -34.2 151.
10 0008295a-79de… Grevillea cel… https://id.bi… -37.7 148.
# ℹ 57,761 more rows
# ℹ 3 more variables: eventDate <dttm>, occurrenceStatus <chr>,
# dataResourceName <chr>
# return records between 2015 & 2018
|>
plants filter(eventDate >= ymd("2016-01-01") &
<= ymd("2017-12-31")) eventDate
# A tibble: 9,246 × 8
recordID scientificName taxonConceptID decimalLatitude decimalLongitude
<chr> <chr> <chr> <dbl> <dbl>
1 000b499b-dc80… Grevillea bux… https://id.bi… -33.7 151.
2 000e2c2b-953e… Grevillea hil… https://id.bi… NA NA
3 0014b88f-ecc5… Grevillea gut… https://id.bi… -32.5 152.
4 001ac6ed-1de5… Grevillea ser… https://id.bi… -32.2 150.
5 001c6599-0f27… Grevillea aca… https://id.bi… -33.4 150.
6 001f301f-0192… Grevillea dry… https://id.bi… -12.5 131.
7 002a8fb4-e1f9… Grevillea cal… https://id.bi… -33.7 151.
8 002e9fb9-ff3b… Grevillea cal… https://id.bi… -33.7 151.
9 0038f556-0591… Grevillea lin… https://id.bi… -33.7 151.
10 003d2b67-6db4… Grevillea par… https://id.bi… -34.3 151.
# ℹ 9,236 more rows
# ℹ 3 more variables: eventDate <dttm>, occurrenceStatus <chr>,
# dataResourceName <chr>
7.3 Summary
In this chapter, we introduced common functions for cleaning dates and times in ecological datasets. For some data cleaning tasks, you may need to explore more advanced workflows to handle specific date and time formatting requirements.
In the next chapter, we look at issues that require more advanced ecological data cleaning techniques.