# packages
library(galah)
library(dplyr)
galah_config(email = "your-email-here") # ALA-registered email
frogs <- galah_call() |>
  filter(doi == "https://doi.org /10.26197/ala.dde3e89f-28bf-4515-9128-ab18bce19a08") |>
  atlas_occurrences()
3 Column classes & names
Each column in a dataset contains values of a specific type, or class. A class defines the type of data in a column and determines how those data are interpreted in R, and how we can modify those data. For instance, it doesn’t make sense to apply a mathematical equation to a word or a sentence. The phrase “hello” doesn’t reveal whether something is true or false. Knowing what types, or classes, of data are in each column of your table will ensure those data behave as expected later on. Classes are important to understand because, generally, functions only work on compatible data types.
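As a small illustration of why classes matter, the sketch below uses only base R and made-up values (none of them come from the frogs dataset). It shows the same numbers stored as numeric and as text, and how a function like mean() behaves on each.

```r
# Each value has a class that determines which operations make sense
class("hello")  # "character"
class(42.5)     # "numeric"
class(TRUE)     # "logical"

# Arithmetic works on numbers...
numeric_values <- c(1.5, 2.5, 3.0)
mean_numeric <- mean(numeric_values)

# ...but not on text: mean() warns and returns NA for a character vector
text_values <- c("1.5", "2.5", "3.0")
mean_text <- suppressWarnings(mean(text_values))

# Converting to a compatible class first gives the expected result
mean_converted <- mean(as.numeric(text_values))
```

Here mean_text is NA, while mean_converted matches mean_numeric, because only the converted values are of a class that mean() can work with.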
Column names can also cause compatibility issues when working with a dataset. Depending on the source of your dataset, existing column names may be uninformative (e.g. col1, tga42.D), oddly formatted once imported into R (e.g. How.Much.Soil.Is.In.This.Plot..), or internally inconsistent (e.g. species_name, scientificName). Modifying these can make it much easier to work with the data and avoid errors caused by mismatched or confusing column names.
This chapter explains how to check the class of each column and edit column names so that they are consistent and ready to use for analyses.
3.0.1 Prerequisites
In this chapter, we will use occurrence records of Litoria frogs observed in Tasmania since 2020, downloaded from the ALA.
3.1 Column classes
Columns define what type of data they contain by having a class, and it’s important to know what these classes are because R handles each class differently.
Viewing your data using functions we introduced in the Inspect chapter allows you to get a quick overview of each column’s class.
Using glimpse() displays the class beside each column name (e.g. <chr>).
library(dplyr)
glimpse(frogs)
Rows: 2,763
Columns: 10
$ recordID <chr> "00052544-d943-42e9-bd85-83693c6dd824", "00168ca6-84d…
$ scientificName <chr> "Litoria ewingii", "Litoria raniformis", "Litoria ewi…
$ taxonConceptID <chr> "https://biodiversity.org.au/afd/taxa/1c89eedb-42b6-4…
$ decimalLatitude <dbl> -42.87917, -41.19207, -42.98559, -41.15305, -42.85886…
$ decimalLongitude <dbl> 147.4754, 146.4331, 147.0589, 146.5241, 147.6137, 147…
$ eventDate <dttm> 2022-09-19 00:00:00, 2023-12-20 23:20:19, 2021-08-07…
$ occurrenceStatus <chr> "PRESENT", "PRESENT", "PRESENT", "PRESENT", "PRESENT"…
$ dataResourceName <chr> "FrogID", "iNaturalist Australia", "FrogID", "iNatura…
$ genus <chr> "Litoria", "Litoria", "Litoria", "Litoria", "Litoria"…
$ species <chr> "Litoria ewingii", "Litoria raniformis", "Litoria ewi…
Using str() displays the class after the column name and before the number of rows (e.g. chr).
str(frogs)
spc_tbl_ [2,763 × 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ recordID : chr [1:2763] "00052544-d943-42e9-bd85-83693c6dd824" "00168ca6-84d0-4af1-8fa8-875fd69d25da" "001a43fe-8586-4064-9e76-7373a837a759" "00250163-ec50-4eda-a5d5-58ae98bc5834" ...
$ scientificName : chr [1:2763] "Litoria ewingii" "Litoria raniformis" "Litoria ewingii" "Litoria raniformis" ...
$ taxonConceptID : chr [1:2763] "https://biodiversity.org.au/afd/taxa/1c89eedb-42b6-4675-b688-2e0b59ea689e" "https://biodiversity.org.au/afd/taxa/89a7a289-bf04-40e0-aaef-7ec6bc968a9c" "https://biodiversity.org.au/afd/taxa/1c89eedb-42b6-4675-b688-2e0b59ea689e" "https://biodiversity.org.au/afd/taxa/89a7a289-bf04-40e0-aaef-7ec6bc968a9c" ...
$ decimalLatitude : num [1:2763] -42.9 -41.2 -43 -41.2 -42.9 ...
$ decimalLongitude: num [1:2763] 147 146 147 147 148 ...
$ eventDate : POSIXct[1:2763], format: "2022-09-19 00:00:00" "2023-12-20 23:20:19" ...
$ occurrenceStatus: chr [1:2763] "PRESENT" "PRESENT" "PRESENT" "PRESENT" ...
$ dataResourceName: chr [1:2763] "FrogID" "iNaturalist Australia" "FrogID" "iNaturalist Australia" ...
$ genus : chr [1:2763] "Litoria" "Litoria" "Litoria" "Litoria" ...
$ species : chr [1:2763] "Litoria ewingii" "Litoria raniformis" "Litoria ewingii" "Litoria raniformis" ...
- attr(*, "spec")=
.. cols(
.. recordID = col_character(),
.. scientificName = col_character(),
.. taxonConceptID = col_character(),
.. decimalLatitude = col_double(),
.. decimalLongitude = col_double(),
.. eventDate = col_datetime(format = ""),
.. occurrenceStatus = col_character(),
.. dataResourceName = col_character(),
.. genus = col_character(),
.. species = col_character()
.. )
- attr(*, "problems")=<externalptr>
- attr(*, "doi")= chr "https://doi.org /10.26197/ala.dde3e89f-28bf-4515-9128-ab18bce19a08"
The skim() function groups columns by their type/class.
library(skimr)
skim(frogs)
Name | frogs |
Number of rows | 2763 |
Number of columns | 10 |
_______________________ | |
Column type frequency: | |
character | 7 |
numeric | 2 |
POSIXct | 1 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
recordID | 0 | 1 | 36 | 36 | 0 | 2763 | 0 |
scientificName | 0 | 1 | 7 | 18 | 0 | 5 | 0 |
taxonConceptID | 0 | 1 | 73 | 73 | 0 | 5 | 0 |
occurrenceStatus | 0 | 1 | 7 | 7 | 0 | 1 | 0 |
dataResourceName | 0 | 1 | 6 | 27 | 0 | 3 | 0 |
genus | 0 | 1 | 7 | 7 | 0 | 1 | 0 |
species | 1 | 1 | 15 | 18 | 0 | 4 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
decimalLatitude | 0 | 1 | -42.17 | 0.85 | -43.49 | -42.95 | -42.50 | -41.34 | -39.64 | ▇▁▆▂▁ |
decimalLongitude | 0 | 1 | 147.01 | 0.70 | 143.84 | 146.92 | 147.15 | 147.34 | 148.30 | ▁▁▁▇▂ |
Variable type: POSIXct
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
eventDate | 0 | 1 | 2020-01-02 | 2024-06-05 12:06:00 | 2021-11-16 | 1181 |
You can return the class of every column using sapply() from base R.
sapply(frogs, class)
$recordID
[1] "character"
$scientificName
[1] "character"
$taxonConceptID
[1] "character"
$decimalLatitude
[1] "numeric"
$decimalLongitude
[1] "numeric"
$eventDate
[1] "POSIXct" "POSIXt"
$occurrenceStatus
[1] "character"
$dataResourceName
[1] "character"
$genus
[1] "character"
$species
[1] "character"
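Because a date-time column such as eventDate carries two classes, sapply() returns a list rather than a simple vector. One possible variation, sketched here on a hypothetical toy data frame (not the frogs data), is to collapse each column's classes into a single string with vapply(), which gives one compact named vector.

```r
# Toy data frame standing in for a real dataset (values are illustrative)
toy <- data.frame(
  name = c("a", "b"),
  lat  = c(-42.9, -41.2),
  when = as.POSIXct(c("2022-09-19 00:00:00", "2023-12-20 23:20:19"), tz = "UTC"),
  stringsAsFactors = FALSE
)

# Paste multiple classes together so every column yields exactly one string
col_classes <- vapply(
  toy,
  function(col) paste(class(col), collapse = "/"),
  character(1)
)

col_classes
```

vapply() also checks that each result is a single character value, so an unexpected result fails loudly instead of silently changing the output's shape.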
You can return the class of every column using map() from the purrr package.
library(purrr)
frogs |>
  purrr::map(class)
$recordID
[1] "character"
$scientificName
[1] "character"
$taxonConceptID
[1] "character"
$decimalLatitude
[1] "numeric"
$decimalLongitude
[1] "numeric"
$eventDate
[1] "POSIXct" "POSIXt"
$occurrenceStatus
[1] "character"
$dataResourceName
[1] "character"
$genus
[1] "character"
$species
[1] "character"
If you are using a tibble, the class is also displayed below each column name when you view your table. Depending on whether your output is in the console or inline, your tibble may be formatted as a paged table in RStudio.
frogs
# A tibble: 2,763 × 10
recordID scientificName taxonConceptID decimalLatitude decimalLongitude
<chr> <chr> <chr> <dbl> <dbl>
1 00052544-d943… Litoria ewing… https://biodi… -42.9 147.
2 00168ca6-84d0… Litoria ranif… https://biodi… -41.2 146.
3 001a43fe-8586… Litoria ewing… https://biodi… -43.0 147.
4 00250163-ec50… Litoria ranif… https://biodi… -41.2 147.
5 003e0f63-9f95… Litoria ewing… https://biodi… -42.9 148.
6 0070521f-bb45… Litoria ewing… https://biodi… -43.1 147.
7 00898021-7ad3… Litoria ewing… https://biodi… -42.8 147.
8 00a64bc4-f727… Litoria ewing… https://biodi… -43.1 148.
9 00ab3454-6e77… Litoria ewing… https://biodi… -41.6 147.
10 00d33f5b-9cd7… Litoria ewing… https://biodi… -43.0 147.
# ℹ 2,753 more rows
# ℹ 5 more variables: eventDate <dttm>, occurrenceStatus <chr>,
# dataResourceName <chr>, genus <chr>, species <chr>
frogs |> rmarkdown::paged_table()
From these quick overviews of the data, we’ve learned:
- Column scientificName is strings of text (type character)
- Columns decimalLatitude and decimalLongitude are numbers with decimal points (type double)
- The eventDate column contains a date + time (type POSIXct/dttm)
- Columns like recordID and taxonConceptID contain both text and numbers, but are of type character because this type prevents any loss of data (see footnote 1).
Here, the column classes are what we’d expect given the types of data in each column. However, this is not always the case.
For instance, changing just one of the values in decimalLatitude from its assigned numeric value to a “degrees minutes seconds” format causes the entire column class to be changed to character to prevent loss of data.
# duplicate data
frogs_class <- frogs
# check class
class(frogs_class$decimalLatitude)
[1] "numeric"
# change one of the values to a degrees minutes seconds format
frogs_class$decimalLatitude[5] <- "40° 51' 59 N"
# check class
class(frogs_class$decimalLatitude)
[1] "character"
A simple typo in the dataset you import into R could be all it takes to change the class of an entire column, so be sure to keep an eye out for unexpected column classes!
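When a column you expected to be numeric arrives as character, it helps to find which values caused the change. The following is a minimal base R sketch on a made-up vector (the values are illustrative, not taken from the frogs data): it attempts the conversion and flags entries that were present but could not be read as numbers.

```r
# A latitude column that was read in as text because of one odd value
lat <- c("-42.87917", "-41.19207", "40° 51' 59 N", NA)

# Try the conversion; values that are not valid numbers become NA
converted <- suppressWarnings(as.numeric(lat))

# Offending values are those that were not missing before conversion
# but are missing afterwards
bad_positions <- which(is.na(converted) & !is.na(lat))

bad_positions       # position(s) of the problem values
lat[bad_positions]  # the problem values themselves
```

Comparing missingness before and after the conversion separates genuinely missing values from values that failed to convert, so only the latter are flagged for correction.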
3.2 Column names
There are many reasons why you might need to change the name of one or more columns in a table. We’ve outlined a few of the more common use cases here.
3.2.1 Make column names consistent
Column names should use consistent naming conventions. R is case sensitive, so two names with the same letters but different capitalisations are considered different names (e.g. event vs. Event). Using a naming convention which is both human- and machine-readable (e.g. camel case, snake case), and being consistent in your usage of it, makes it less likely that you will make these sorts of errors.
Camel case begins in lowercase and uses uppercase for the first letter of every subsequent word (e.g. scientificName, dataResourceName, eventDate).
Snake case uses lowercase letters only, with words separated by an underscore _ (e.g. scientific_name, data_resource_name, event_date).
Snake case is more popular in R, and is the naming convention we recommend. Data downloaded from the ALA is in camel case (see footnote 2).
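To make the relationship between the two conventions concrete, here is a rough base R sketch of a camel-to-snake conversion. The helper to_snake_case() is hypothetical, written only for illustration; it handles simple names like those above and is not a substitute for a dedicated name-cleaning function.

```r
# Hypothetical helper: convert simple camel case names to snake case
to_snake_case <- function(x) {
  # Insert an underscore between a lowercase letter (or digit)
  # and the uppercase letter that follows it
  with_underscores <- gsub("([a-z0-9])([A-Z])", "\\1_\\2", x)
  # Snake case is all lowercase
  tolower(with_underscores)
}

camel_names <- c("scientificName", "dataResourceName", "eventDate", "recordID")
snake_names <- to_snake_case(camel_names)
snake_names
```

Runs of capitals such as the ID in recordID are kept together because an underscore is only inserted where a lowercase character meets an uppercase one.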
colnames(frogs)
[1] "recordID" "scientificName" "taxonConceptID" "decimalLatitude"
[5] "decimalLongitude" "eventDate" "occurrenceStatus" "dataResourceName"
[9] "genus" "species"
One of the most useful column name cleaning functions is clean_names()
from the janitor package. This function will make all of your column names consistent, based on your preferred naming convention (defaults to snake case).
library(janitor)
frogs_clean <- frogs |>
  clean_names() |>
  colnames()
frogs_clean
[1] "record_id" "scientific_name" "taxon_concept_id"
[4] "decimal_latitude" "decimal_longitude" "event_date"
[7] "occurrence_status" "data_resource_name" "genus"
[10] "species"
Now our names are in a consistent snake_case format.
frogs |>
  clean_names() |>
  rmarkdown::paged_table() # nice format
3.2.2 Rename columns
Renaming columns is a common data cleaning task. It may be necessary to rename columns to clarify the data they contain or to ensure consistency with another dataset before merging them.
There are several ways to rename columns in R.
dplyr::rename() provides an easy way to rename one or more columns.
frogs |>
  select(decimalLatitude, decimalLongitude) |>
  rename(latitude = decimalLatitude,
         longitude = decimalLongitude)
# A tibble: 2,763 × 2
latitude longitude
<dbl> <dbl>
1 -42.9 147.
2 -41.2 146.
3 -43.0 147.
4 -41.2 147.
5 -42.9 148.
6 -43.1 147.
7 -42.8 147.
8 -43.1 148.
9 -41.6 147.
10 -43.0 147.
# ℹ 2,753 more rows
rename_with() is a more powerful version of rename(). It allows more advanced renaming by using functions to rename matching columns. Here we convert column names starting with “decimal” to uppercase.
frogs |>
  select(decimalLatitude, decimalLongitude) |>
  rename_with(toupper, starts_with("decimal"))
# A tibble: 2,763 × 2
DECIMALLATITUDE DECIMALLONGITUDE
<dbl> <dbl>
1 -42.9 147.
2 -41.2 146.
3 -43.0 147.
4 -41.2 147.
5 -42.9 148.
6 -43.1 147.
7 -42.8 147.
8 -43.1 148.
9 -41.6 147.
10 -43.0 147.
# ℹ 2,753 more rows
And here we replace “decimal” with the prefix “new_” in the column names, and convert them to lowercase.
frogs |>
  select(decimalLatitude, decimalLongitude) |>
  rename_with(~ tolower(gsub("decimal", "new_", .x, fixed = TRUE)))
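Note: .x is shorthand for the input the function is applied to. In rename_with(), .x is the character vector of selected column names (here, the names of the two columns kept by select()), rather than the data frame itself.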
# A tibble: 2,763 × 2
new_latitude new_longitude
<dbl> <dbl>
1 -42.9 147.
2 -41.2 146.
3 -43.0 147.
4 -41.2 147.
5 -42.9 148.
6 -43.1 147.
7 -42.8 147.
8 -43.1 148.
9 -41.6 147.
10 -43.0 147.
# ℹ 2,753 more rows
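The function given to rename_with() is an ordinary function that takes a character vector of column names and returns new names of the same length. The sketch below uses a hypothetical toy data frame and a hypothetical helper, strip_decimal(), to show that the same function can be tried on a plain character vector before it is used for renaming.

```r
library(dplyr)

# Toy data frame with illustrative values
toy <- data.frame(
  decimalLatitude  = c(-42.9, -41.2),
  decimalLongitude = c(147.5, 146.4),
  genus            = c("Litoria", "Litoria"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop the "decimal" prefix and lowercase the rest
strip_decimal <- function(column_names) {
  tolower(sub("decimal", "", column_names, fixed = TRUE))
}

# The helper works on any character vector of names...
preview <- strip_decimal(c("decimalLatitude", "decimalLongitude"))

# ...and rename_with() applies it to the names of the selected columns only
renamed <- toy |>
  rename_with(strip_decimal, starts_with("decimal"))

names(renamed)
```

Testing the renaming function on a character vector first is a cheap way to check the new names before changing the table.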
Index a specific column name in base R with the help of names(). Assign a new column name to replace an old column name using the assignment operator <-.
names(frogs)[names(frogs) == "decimalLatitude"] <- "latitude"
names(frogs)[names(frogs) == "decimalLongitude"] <- "longitude"
frogs[, c("latitude", "longitude")]
# A tibble: 2,763 × 2
latitude longitude
<dbl> <dbl>
1 -42.9 147.
2 -41.2 146.
3 -43.0 147.
4 -41.2 147.
5 -42.9 148.
6 -43.1 147.
7 -42.8 147.
8 -43.1 148.
9 -41.6 147.
10 -43.0 147.
# ℹ 2,753 more rows
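When several columns need renaming at once, the base R approach can be extended with a named lookup vector. This is one possible sketch on a hypothetical toy data frame; note that the lookup here is written as old name = new name, which is the reverse of the new = old order used by dplyr::rename().

```r
# Toy data frame with illustrative values
toy <- data.frame(
  decimalLatitude  = c(-42.9, -41.2),
  decimalLongitude = c(147.5, 146.4),
  genus            = c("Litoria", "Litoria"),
  stringsAsFactors = FALSE
)

# Lookup vector: names are the old column names, values are the new ones
lookup <- c(decimalLatitude = "latitude", decimalLongitude = "longitude")

# Find which existing columns appear in the lookup, then swap in the new names
matched <- names(toy) %in% names(lookup)
names(toy)[matched] <- unname(lookup[names(toy)[matched]])

names(toy)
```

Columns that are not listed in the lookup, such as genus, keep their original names.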
3.2.3 Separate columns
Sometimes it is useful to split information from one column into several columns. One good example is if genus and species names are contained in one column like scientificName. We can separate these names into two columns using separate() from the tidyr package.
library(tidyr)
frogs_separate <- frogs |>
  separate(scientificName,
           c("genus", "species"), # new column names
           fill = "right",        # fill missing values in right column
           remove = FALSE         # keep input column
           ) |>
  select(scientificName, genus, species)
frogs_separate |> rmarkdown::paged_table() # nice format
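To see how separate() treats names with fewer or more words than expected, here is a small sketch on a hypothetical toy data frame. The sep and extra arguments are additions for illustration and are not part of the example above: sep = " " splits on spaces only, and extra = "merge" keeps any additional words in the last column.

```r
library(tidyr)

# Toy names: two words, one word, and three words (illustrative values)
toy <- data.frame(
  scientificName = c("Litoria ewingii", "Litoria", "Litoria ewingii ewingii"),
  stringsAsFactors = FALSE
)

toy_separate <- toy |>
  separate(scientificName,
           into = c("genus", "species"),
           sep = " ",        # split on spaces only
           fill = "right",   # a one-word name gets NA in the species column
           extra = "merge",  # extra words stay in the species column
           remove = FALSE    # keep the input column
           )

toy_separate
```

With these settings, the one-word name has a missing species, and the three-word name keeps both trailing words together in the species column.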
3.2.4 Join columns
Conversely, we might want to combine information from multiple columns into a single column. We can rejoin the genus and species columns we created in the previous section using unite() from the tidyr package.
frogs_united <- frogs_separate |>
  unite("single_name",
        genus:species,  # select columns to join
        sep = " ",      # separate with a space
        na.rm = TRUE,   # remove NA values
        remove = FALSE  # keep input column
        ) |>
  select(genus, species, single_name)
frogs_united |> rmarkdown::paged_table() # nice format
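The effect of na.rm = TRUE is easiest to see when one of the columns has a missing value. A minimal sketch on a hypothetical toy data frame (illustrative values only):

```r
library(tidyr)

# Toy data: the second row has no species name
toy <- data.frame(
  genus   = c("Litoria", "Litoria"),
  species = c("ewingii", NA),
  stringsAsFactors = FALSE
)

toy_united <- toy |>
  unite("single_name",
        genus:species,
        sep = " ",
        na.rm = TRUE,   # drop missing values before joining
        remove = FALSE  # keep the input columns
        )

toy_united$single_name
```

With na.rm = TRUE the second row becomes just the genus name; without it, the missing value would be pasted into the result as the text "NA".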
3.3 Summary
In this chapter, we explored different ways to check the class of each column in your table to make sure R is interpreting your data correctly. We also demonstrated how to rename columns for easier handling and how to split or combine columns to access data more conveniently.
In the next chapter, we will learn how to efficiently clean duplicate data. Duplicates can arise from errors in data collection or entry, or from merging data from multiple sources.
1. To avoid conflicts, R has an internal coercion hierarchy rule to avoid data loss. The rule of thumb is that if a data type can’t exist in a child data type, then the parent data type is used instead. The R coercion hierarchy is: logical -> integer -> numeric -> complex -> character. You don’t need to memorise this, but it’s worth being aware of this hierarchy, as R might make decisions to prevent a class error and you might not know why! Learn more in this article.
2. Queries to the ALA use other coding languages, namely solr and JSON, and column names in these languages are typically in camel case. To maintain consistency with what’s in the ALA and to avoid hidden name cleaning, galah also returns names in camel case.