# packages
library(galah)
library(dplyr)
galah_config(email = "your-email-here") # ALA-registered email
frogs <- galah_call() |>
  filter(doi == "https://doi.org /10.26197/ala.dde3e89f-28bf-4515-9128-ab18bce19a08") |>
  atlas_occurrences()
3 Column classes & names
Each column in a dataset contains values of a specific type, or class. A class defines the type of data in a column and determines how those data are interpreted in R, and how we can modify those data. For instance, it doesn’t make sense to apply a mathematical equation to a word or a sentence. The phrase “hello” doesn’t reveal whether something is true or false. Knowing what types, or classes, of data are in each column of your table will ensure those data behave as expected later on. Classes are important to understand because, generally, functions only work on compatible data types.
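As a minimal illustration of why class matters, the sketch below stores the same values once as numbers and once as text. The vectors lat_num and lat_chr are made up for this example and are not part of the frogs data.

```r
# Hypothetical example vectors: the same values stored as two different classes
lat_num <- c(-42.9, -41.2, -43.0)
lat_chr <- c("-42.9", "-41.2", "-43.0")

class(lat_num) # "numeric"
class(lat_chr) # "character"

# Arithmetic works on numeric data
mean(lat_num)

# The same call on character data returns NA (with a warning),
# because text cannot be averaged
suppressWarnings(mean(lat_chr))

# Converting the text to numbers first makes the calculation possible
mean(as.numeric(lat_chr))
```

The values look identical when printed, which is why checking the class directly is more reliable than inspecting the data by eye.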
Column names can also cause compatibility issues when working with a dataset. Depending on the source of your dataset, existing column names may be uninformative (e.g. col1, tga42.D), oddly formatted once imported into R (e.g. How.Much.Soil.Is.In.This.Plot..), or internally inconsistent (e.g. species_name, scientificName). Modifying these can make it much easier to work with the data and avoid errors caused by mismatched or confusing column names.
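One common source of oddly formatted names is base R's read.csv(), which by default passes column headers through make.names() so that they become syntactically valid. The sketch below assumes a hypothetical spreadsheet header, "How Much Soil Is In This Plot??", to show how a name like the one above might arise.

```r
# Hypothetical column header as it might appear in a spreadsheet
raw_name <- "How Much Soil Is In This Plot??"

# make.names() is what read.csv() applies to headers when
# check.names = TRUE (the default); invalid characters become "."
make.names(raw_name)
#> [1] "How.Much.Soil.Is.In.This.Plot.."

# Reading with check.names = FALSE keeps the header exactly as written
csv_text <- "How Much Soil Is In This Plot??,site\n12.5,A"
names(read.csv(text = csv_text, check.names = FALSE))
```

Keeping the original header is not necessarily better, since names with spaces need backticks in code. Either way, renaming the column is usually the cleaner fix.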
This chapter explains how to check the class of each column and edit column names so that they are consistent and ready to use for analyses.
3.0.1 Prerequisites
In this chapter, we will use occurrence records of Litoria frogs in Tasmania since 2020, downloaded from the ALA.
3.1 Column classes
Every column has a class that describes the type of data it holds. It’s important to know what these classes are because R handles each class differently.
Viewing your data using functions we introduced in the Inspect chapter allows you to get a quick overview of each column’s class.
Using glimpse() displays the class beside each column name (e.g. <chr>).
library(dplyr)
glimpse(frogs)
Rows: 2,763
Columns: 10
$ recordID <chr> "00052544-d943-42e9-bd85-83693c6dd824", "00168ca6-84d…
$ scientificName <chr> "Litoria ewingii", "Litoria raniformis", "Litoria ewi…
$ taxonConceptID <chr> "https://biodiversity.org.au/afd/taxa/1c89eedb-42b6-4…
$ decimalLatitude <dbl> -42.87917, -41.19207, -42.98559, -41.15305, -42.85886…
$ decimalLongitude <dbl> 147.4754, 146.4331, 147.0589, 146.5241, 147.6137, 147…
$ eventDate <dttm> 2022-09-19 00:00:00, 2023-12-20 23:20:19, 2021-08-07…
$ occurrenceStatus <chr> "PRESENT", "PRESENT", "PRESENT", "PRESENT", "PRESENT"…
$ dataResourceName <chr> "FrogID", "iNaturalist Australia", "FrogID", "iNatura…
$ genus <chr> "Litoria", "Litoria", "Litoria", "Litoria", "Litoria"…
$ species <chr> "Litoria ewingii", "Litoria raniformis", "Litoria ewi…
Using str() displays the class after the column name and before the number of rows (e.g. chr).
str(frogs)
spc_tbl_ [2,763 × 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ recordID : chr [1:2763] "00052544-d943-42e9-bd85-83693c6dd824" "00168ca6-84d0-4af1-8fa8-875fd69d25da" "001a43fe-8586-4064-9e76-7373a837a759" "00250163-ec50-4eda-a5d5-58ae98bc5834" ...
$ scientificName : chr [1:2763] "Litoria ewingii" "Litoria raniformis" "Litoria ewingii" "Litoria raniformis" ...
$ taxonConceptID : chr [1:2763] "https://biodiversity.org.au/afd/taxa/1c89eedb-42b6-4675-b688-2e0b59ea689e" "https://biodiversity.org.au/afd/taxa/89a7a289-bf04-40e0-aaef-7ec6bc968a9c" "https://biodiversity.org.au/afd/taxa/1c89eedb-42b6-4675-b688-2e0b59ea689e" "https://biodiversity.org.au/afd/taxa/89a7a289-bf04-40e0-aaef-7ec6bc968a9c" ...
$ decimalLatitude : num [1:2763] -42.9 -41.2 -43 -41.2 -42.9 ...
$ decimalLongitude: num [1:2763] 147 146 147 147 148 ...
$ eventDate : POSIXct[1:2763], format: "2022-09-19 00:00:00" "2023-12-20 23:20:19" ...
$ occurrenceStatus: chr [1:2763] "PRESENT" "PRESENT" "PRESENT" "PRESENT" ...
$ dataResourceName: chr [1:2763] "FrogID" "iNaturalist Australia" "FrogID" "iNaturalist Australia" ...
$ genus : chr [1:2763] "Litoria" "Litoria" "Litoria" "Litoria" ...
$ species : chr [1:2763] "Litoria ewingii" "Litoria raniformis" "Litoria ewingii" "Litoria raniformis" ...
- attr(*, "spec")=
.. cols(
.. recordID = col_character(),
.. scientificName = col_character(),
.. taxonConceptID = col_character(),
.. decimalLatitude = col_double(),
.. decimalLongitude = col_double(),
.. eventDate = col_datetime(format = ""),
.. occurrenceStatus = col_character(),
.. dataResourceName = col_character(),
.. genus = col_character(),
.. species = col_character()
.. )
- attr(*, "problems")=<externalptr>
- attr(*, "doi")= chr "https://doi.org /10.26197/ala.dde3e89f-28bf-4515-9128-ab18bce19a08"
The skim() function groups columns by their type/class.
library(skimr)
skim(frogs)
Name | frogs |
Number of rows | 2763 |
Number of columns | 10 |
_______________________ | |
Column type frequency: | |
character | 7 |
numeric | 2 |
POSIXct | 1 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
recordID | 0 | 1 | 36 | 36 | 0 | 2763 | 0 |
scientificName | 0 | 1 | 7 | 18 | 0 | 5 | 0 |
taxonConceptID | 0 | 1 | 73 | 73 | 0 | 5 | 0 |
occurrenceStatus | 0 | 1 | 7 | 7 | 0 | 1 | 0 |
dataResourceName | 0 | 1 | 6 | 27 | 0 | 3 | 0 |
genus | 0 | 1 | 7 | 7 | 0 | 1 | 0 |
species | 1 | 1 | 15 | 18 | 0 | 4 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
decimalLatitude | 0 | 1 | -42.17 | 0.85 | -43.49 | -42.95 | -42.50 | -41.34 | -39.64 | ▇▁▆▂▁ |
decimalLongitude | 0 | 1 | 147.01 | 0.70 | 143.84 | 146.92 | 147.15 | 147.34 | 148.30 | ▁▁▁▇▂ |
Variable type: POSIXct
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
eventDate | 0 | 1 | 2020-01-02 | 2024-06-05 12:06:00 | 2021-11-16 | 1181 |
You can return the class of every column using sapply() from base R.
sapply(frogs, class)
$recordID
[1] "character"
$scientificName
[1] "character"
$taxonConceptID
[1] "character"
$decimalLatitude
[1] "numeric"
$decimalLongitude
[1] "numeric"
$eventDate
[1] "POSIXct" "POSIXt"
$occurrenceStatus
[1] "character"
$dataResourceName
[1] "character"
$genus
[1] "character"
$species
[1] "character"
You can return the class of every column using map() from the purrr package.
library(purrr)

frogs |>
  purrr::map(class)
$recordID
[1] "character"
$scientificName
[1] "character"
$taxonConceptID
[1] "character"
$decimalLatitude
[1] "numeric"
$decimalLongitude
[1] "numeric"
$eventDate
[1] "POSIXct" "POSIXt"
$occurrenceStatus
[1] "character"
$dataResourceName
[1] "character"
$genus
[1] "character"
$species
[1] "character"
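Both approaches return a list because eventDate has two classes (POSIXct and POSIXt). If one label per column is enough, a possible alternative is vapply() from base R, keeping only the first class of each column. The sketch below uses a small made-up stand-in, frogs_demo, so that it runs without downloading data.

```r
# Small stand-in for frogs (made-up values) so the example runs on its own
frogs_demo <- data.frame(
  scientificName = c("Litoria ewingii", "Litoria raniformis"),
  decimalLatitude = c(-42.9, -41.2),
  eventDate = as.POSIXct(c("2022-09-19", "2023-12-20"), tz = "UTC"),
  stringsAsFactors = FALSE
)

# Keep only the first class of each column so the result is a named
# character vector rather than a list
first_class <- vapply(frogs_demo, function(x) class(x)[1], character(1))
first_class

# Names of the columns that are not stored as text
names(first_class)[first_class != "character"]
```

A named character vector is easier to filter or compare against an expected set of classes than a list.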
If you are using a tibble, the class is also displayed below each column name when you view your table. Depending on whether your output is in the console or inline, your tibble may be formatted as a paged table in RStudio.
frogs
# A tibble: 2,763 × 10
recordID scientificName taxonConceptID decimalLatitude decimalLongitude
<chr> <chr> <chr> <dbl> <dbl>
1 00052544-d943… Litoria ewing… https://biodi… -42.9 147.
2 00168ca6-84d0… Litoria ranif… https://biodi… -41.2 146.
3 001a43fe-8586… Litoria ewing… https://biodi… -43.0 147.
4 00250163-ec50… Litoria ranif… https://biodi… -41.2 147.
5 003e0f63-9f95… Litoria ewing… https://biodi… -42.9 148.
6 0070521f-bb45… Litoria ewing… https://biodi… -43.1 147.
7 00898021-7ad3… Litoria ewing… https://biodi… -42.8 147.
8 00a64bc4-f727… Litoria ewing… https://biodi… -43.1 148.
9 00ab3454-6e77… Litoria ewing… https://biodi… -41.6 147.
10 00d33f5b-9cd7… Litoria ewing… https://biodi… -43.0 147.
# ℹ 2,753 more rows
# ℹ 5 more variables: eventDate <dttm>, occurrenceStatus <chr>,
# dataResourceName <chr>, genus <chr>, species <chr>
frogs |> rmarkdown::paged_table()
From these quick overviews of the data, we’ve learned:
- Column scientificName is strings of text (type character)
- Columns decimalLatitude and decimalLongitude are numbers with decimal points (type double)
- The eventDate column contains a date + time (type POSIXct/dttm)
- Columns like recordID and taxonConceptID contain both text and numbers, but are of type character because this type prevents any loss of data¹.
Here, the column classes are what we’d expect given the types of data in each column. However, this is not always the case.
For instance, changing just one of the values in decimalLatitude from its assigned numeric value to a “degrees minutes seconds” format causes the entire column class to be changed to character to prevent loss of data.
# duplicate data
frogs_class <- frogs

# check class
class(frogs_class$decimalLatitude)
[1] "numeric"
# change one of the values to a degrees minutes seconds format
frogs_class$decimalLatitude[5] <- "40° 51' 59 N"

# check class
class(frogs_class$decimalLatitude)
[1] "character"
A simple typo in the dataset you import into R could be all it takes to change the class of an entire column, so be sure to keep an eye out for unexpected column classes!
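When a column that should be numeric turns out to be character, one way to find the offending values is to attempt the conversion and see which entries fail. This is a minimal sketch using a made-up vector, lat, rather than the frogs data.

```r
# Made-up latitude values as they might look after import: one entry uses
# degrees minutes seconds, so the whole vector is stored as character
lat <- c("-42.87917", "-41.19207", "40° 51' 59 N", "-41.15305")
class(lat)

# as.numeric() returns NA (with a warning) for anything it cannot parse
lat_num <- suppressWarnings(as.numeric(lat))

# Positions that failed to convert but were not missing to begin with
bad <- which(is.na(lat_num) & !is.na(lat))
bad

# The original values at those positions
lat[bad]
```

Checking !is.na(lat) as well separates values that failed to parse from values that were already missing.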
3.2 Column names
There are many reasons why you might need to change the name of one or more columns in a table. We’ve outlined a few of the more common use cases here.
3.2.1 Make column names consistent
Column names should use consistent naming conventions. R is case sensitive, so two names with the same letters but different capitalisations are considered different names (e.g. event vs. Event). Using a naming convention which is both human- and machine-readable (e.g. camel case, snake case), and being consistent in your usage of it, makes it less likely that you will make these sorts of errors.
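The effect of case sensitivity can be seen with a small made-up table, obs, which is hypothetical and used only for illustration.

```r
# Hypothetical table with a lowercase column name
obs <- data.frame(event = c("survey_1", "survey_2"))

"event" %in% names(obs) # TRUE
"Event" %in% names(obs) # FALSE: different capitalisation, different name

# Asking a data.frame for a column that does not exist returns NULL
# rather than an error, so the mistake can go unnoticed
is.null(obs$Event)

# A case-insensitive comparison can help track down near-matches
names(obs)[tolower(names(obs)) == tolower("Event")]
```

Because the mismatch produces NULL silently, it tends to surface later as a confusing error further down a script.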
Camel case begins in lowercase and uses uppercase for the first letter of every subsequent word (e.g. scientificName, dataResourceName, eventDate).
Snake case uses lowercase letters only, with words separated by an underscore _ (e.g. scientific_name, data_resource_name, event_date).
Snake case is more popular in R, and is the naming convention we recommend. Data downloaded from the ALA is in camel case².
colnames(frogs)
[1] "recordID" "scientificName" "taxonConceptID" "decimalLatitude"
[5] "decimalLongitude" "eventDate" "occurrenceStatus" "dataResourceName"
[9] "genus" "species"
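To see how the two conventions relate, camel case names like these can be converted to snake case with a short regular expression in base R. This is an illustrative sketch only. It handles simple names like those above, but not every edge case (for example, runs of capitals in the middle of a name).

```r
camel <- c("recordID", "scientificName", "taxonConceptID", "dataResourceName")

# Insert an underscore wherever a lowercase letter or digit is followed by
# an uppercase letter, then convert everything to lowercase
snake <- tolower(gsub("([a-z0-9])([A-Z])", "\\1_\\2", camel))
snake
#> [1] "record_id" "scientific_name" "taxon_concept_id" "data_resource_name"
```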
One of the most useful column name cleaning functions is clean_names() from the janitor package. This function will make all of your column names consistent, based on your preferred naming convention (defaults to snake case).
library(janitor)

frogs_clean <- frogs |>
  clean_names() |>
  colnames()
frogs_clean
[1] "record_id" "scientific_name" "taxon_concept_id"
[4] "decimal_latitude" "decimal_longitude" "event_date"
[7] "occurrence_status" "data_resource_name" "genus"
[10] "species"
Now our names are in a consistent snake_case format.
frogs |>
  clean_names() |>
  rmarkdown::paged_table() # nice format
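If you prefer a convention other than snake case, clean_names() accepts a case argument. The sketch below uses a made-up table, messy, with deliberately inconsistent names. The table and its values are assumptions for illustration only.

```r
library(janitor)

# Made-up table with inconsistent column names
messy <- data.frame(
  scientificName = "Litoria ewingii",
  `Species Name` = "Litoria ewingii",
  EVENT_DATE = "2022-09-19",
  check.names = FALSE
)

# Default: snake case
snake_names <- colnames(clean_names(messy))
snake_names

# The case argument selects a different convention, e.g. lower camel case
camel_names <- colnames(clean_names(messy, case = "lower_camel"))
camel_names
```

Whichever convention you choose, applying it to the whole table in one step avoids a mix of styles.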