3  Column classes & names

Each column in a dataset contains values of a specific type, or class. A class defines the type of data in a column, and determines both how R interprets those data and how we can modify them. For instance, it doesn’t make sense to apply a mathematical equation to a word or a sentence, and the word “hello” doesn’t reveal whether something is true or false. Knowing which classes of data are in each column of your table helps ensure those data behave as expected later on. Classes are important to understand because, generally, functions only work on compatible data types.

Column names can also cause compatibility issues when working with a dataset. Depending on the source of your dataset, existing column names may be uninformative (e.g. col1, tga42.D), oddly formatted once imported into R (e.g. How.Much.Soil.Is.In.This.Plot..), or internally inconsistent (e.g. species_name, scientificName). Modifying these can make it much easier to work with the data and avoid errors caused by mismatched or confusing column names.

This chapter explains how to check the class of each column and edit column names so that they are consistent and ready to use for analyses.

3.0.1 Prerequisites

In this chapter, we will use occurrence records of Litoria frogs observed in Tasmania since 2020, downloaded from the Atlas of Living Australia (ALA).

# packages
library(galah)
library(dplyr)
galah_config(email = "your-email-here") # ALA-registered email

frogs <- galah_call() |>
  filter(doi == "https://doi.org/10.26197/ala.dde3e89f-28bf-4515-9128-ab18bce19a08") |>
  atlas_occurrences()

Note: You don’t need to run the code block below to read this chapter, but it can be useful to see the original download query. Running it will download the latest data from the ALA, which you are welcome to use instead, though the results might not exactly match those shown in this chapter.

library(galah)
galah_config(email = "your-email-here")

frogs <- galah_call() |>
  identify("Litoria") |>
  filter(year >= 2020, 
         cl22 == "Tasmania") |>
  select(group = "basic",
         genus, species) |>
  atlas_occurrences()

We created a custom DOI for our download by using atlas_occurrences(mint_doi = TRUE).

3.1 Column classes

Every column has a class that defines the type of data it contains, and it’s important to know these classes because R handles each one differently.

Viewing your data using functions we introduced in the Inspect chapter allows you to get a quick overview of each column’s class.

If you are using a tibble, the class is also displayed below each column name when you view your table. Depending on whether your output is in the console or inline, your tibble may be formatted as a paged table in RStudio.
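As a minimal sketch of such an overview, the block below builds a small stand-in table and checks its column classes with glimpse() from dplyr and class() from base R. The name frogs_example and all of its values are hypothetical; they are not taken from the download used in this chapter.

```r
library(tibble)
library(dplyr)

# Hypothetical stand-in for a few columns of the frogs data (values are made up)
frogs_example <- tibble(
  scientificName  = c("Litoria ewingii", "Litoria raniformis"),
  decimalLatitude = c(-42.88, -41.45),
  eventDate       = as.POSIXct(c("2021-03-01 10:00:00", "2022-11-15 21:30:00"),
                               tz = "UTC")
)

# One line per column: name, abbreviated class, and the first few values
glimpse(frogs_example)

# Full class of every column, returned as a named list
col_classes <- lapply(frogs_example, class)
col_classes
```

Date-time columns report more than one class (POSIXct and POSIXt), which is why lapply() is used here to return a list rather than a single vector.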

Our data classes:

From these quick overviews of the data, we’ve learned:

  • Column scientificName is strings of text (type character)
  • Columns decimalLatitude and decimalLongitude are numbers with decimal points (type double)
  • The eventDate column contains a date + time (type POSIXct/dttm)
  • Columns like recordID and taxonConceptID contain both text and numbers, but are of type character because this type prevents any loss of data¹.

Here, the column classes are what we’d expect given the types of data in each column. However, this is not always the case.

For instance, changing just one value in decimalLatitude from a number to a “degrees minutes seconds” string changes the class of the entire column to character. A column can hold only one class, so R converts every value to text rather than lose the one that can’t be stored as a number.

# duplicate data
frogs_class <- frogs

# check class
class(frogs_class$decimalLatitude)
[1] "numeric"
# change one of the values to a degrees minutes seconds format
frogs_class$decimalLatitude[5] <- "40° 51' 59 N"

# check class
class(frogs_class$decimalLatitude)
[1] "character"

A simple typo in the dataset you import into R could be all it takes to change the class of an entire column, so be sure to keep an eye out for unexpected column classes!
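One way to track down the offending values, sketched here with a made-up vector rather than the frogs data, is to attempt the numeric conversion and list whatever fails to convert. The names lat, lat_num and bad are hypothetical.

```r
# Hypothetical latitude column that was read in as character
lat <- c("-42.88", "-41.45", "40° 51' 59 N", "-43.10")

# Attempt the conversion; values that cannot be parsed become NA.
# suppressWarnings() hides the "NAs introduced by coercion" warning.
lat_num <- suppressWarnings(as.numeric(lat))

# Values that were present originally but failed to convert
bad <- lat[is.na(lat_num) & !is.na(lat)]
bad
```

Here bad contains only the degrees-minutes-seconds entry. Once such values have been corrected or removed, as.numeric() can restore the column to its expected class.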

3.2 Column names

There are many reasons why you might need to change the name of one or more columns in a table. We’ve outlined a few of the more common use cases here.

3.2.1 Make column names consistent

Column names should use consistent naming conventions. R is case sensitive, so two names with the same letters but different capitalisations are considered different names (e.g. event vs. Event). Using a naming convention which is both human- and machine-readable (e.g. camel case, snake case), and being consistent in your usage of it, makes it less likely that you will make these sorts of errors.

Camel case begins in lowercase and uses uppercase for the first letter of every subsequent word (e.g. scientificName, dataResourceName, eventDate).

Snake case uses lowercase letters only, with words separated by an underscore _ (e.g. scientific_name, data_resource_name, event_date).
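To make the relationship between the two conventions concrete, here is a rough base R sketch that converts camel case names to snake case. The helper to_snake() is hypothetical and deliberately simple: it only handles a lowercase letter or digit followed by an uppercase letter, so it is not a substitute for a dedicated cleaning function.

```r
# Hypothetical helper: convert camel case names to snake case
to_snake <- function(x) {
  # Insert an underscore wherever a lowercase letter or digit
  # is immediately followed by an uppercase letter
  with_underscores <- gsub("([a-z0-9])([A-Z])", "\\1_\\2", x, perl = TRUE)
  # Then convert everything to lowercase
  tolower(with_underscores)
}

camel_names <- c("scientificName", "dataResourceName", "eventDate", "recordID")
snake_names <- to_snake(camel_names)
snake_names
```

With these inputs the result is scientific_name, data_resource_name, event_date and record_id.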

Snake case is more popular in R, and is the naming convention we recommend. Data downloaded from the ALA is in camel case².

colnames(frogs)
 [1] "recordID"         "scientificName"   "taxonConceptID"   "decimalLatitude" 
 [5] "decimalLongitude" "eventDate"        "occurrenceStatus" "dataResourceName"
 [9] "genus"            "species"         

One of the most useful column name cleaning functions is clean_names() from the janitor package. This function will make all of your column names consistent, based on your preferred naming convention (defaults to snake case).

library(janitor)

frogs_clean <- frogs |>
  clean_names() |>
  colnames()
frogs_clean
 [1] "record_id"          "scientific_name"    "taxon_concept_id"  
 [4] "decimal_latitude"   "decimal_longitude"  "event_date"        
 [7] "occurrence_status"  "data_resource_name" "genus"             
[10] "species"           

Now our names are in a consistent snake_case format.

frogs |>
  clean_names() |>
  rmarkdown::paged_table() # nice format
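clean_names() is not limited to snake case, and it also tidies names that are oddly formatted or internally inconsistent. The sketch below assumes a small made-up table called messy (not part of the frogs data) and uses the case argument of clean_names() to request lower camel case instead of the default.

```r
library(janitor)

# Hypothetical table whose column names mix several styles
# (check.names = FALSE stops data.frame() from altering the names)
messy <- data.frame(
  "How.Much.Soil.Is.In.This.Plot.." = 1.5,
  "species_name" = "Litoria ewingii",
  "scientificName" = "Litoria ewingii",
  check.names = FALSE
)

# Default behaviour: convert every name to snake case
snake <- messy |> clean_names()
colnames(snake)

# Same cleaning, but converting to lower camel case instead
camel <- messy |> clean_names(case = "lower_camel")
colnames(camel)
```

Whichever convention you choose, applying it to every column at once keeps names consistent across the whole table.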