3  Column classes & names

Each column in a dataset contains values of a specific type, or class. A class defines the type of data in a column and determines how those data are interpreted in R, and how we can modify those data. For instance, it doesn’t make sense to apply a mathematical equation to a word or a sentence, and the word “hello” doesn’t reveal whether something is true or false. Knowing what types, or classes, of data are in each column of your table will ensure those data behave as expected later on. Classes are important to understand because, generally, functions only work on compatible data types.

Column names can also cause compatibility issues when working with a dataset. Depending on the source of your dataset, existing column names may be uninformative (e.g. col1, tga42.D), oddly formatted once imported into R (e.g. How.Much.Soil.Is.In.This.Plot..), or internally inconsistent (e.g. species_name, scientificName). Modifying these can make it much easier to work with the data and avoid errors caused by mismatched or confusing column names.

This chapter explains how to check the class of each column and edit column names so that they are consistent and ready to use for analyses.

3.0.1 Prerequisites

In this chapter, we will use occurrence records of Litoria frogs in Tasmania since 2020, downloaded from the Atlas of Living Australia (ALA).

# packages
library(galah)
library(dplyr)
galah_config(email = "your-email-here") # ALA-registered email

frogs <- galah_call() |>
  filter(doi == "https://doi.org/10.26197/ala.dde3e89f-28bf-4515-9128-ab18bce19a08") |>
  atlas_occurrences()

Note: You don’t need to run the following code block to read this chapter. It can, however, be useful to see the original download query. This code will download the latest data from the ALA, which you are welcome to use instead, though the data might not exactly reproduce the results in this chapter.

library(galah)
galah_config(email = "your-email-here")

frogs <- galah_call() |>
  identify("Litoria") |>
  filter(year >= 2020, 
         cl22 == "Tasmania") |>
  select(group = "basic",
         genus, species) |>
  atlas_occurrences()
Note: We created a custom DOI for our download by using atlas_occurrences(mint_doi = TRUE).

3.1 Column classes

Each column has a class that defines the type of data it contains. It’s important to know what these classes are because R handles each class differently.

Viewing your data using functions we introduced in the Inspect chapter allows you to get a quick overview of each column’s class.

If you are using a tibble, the class is also displayed below each column name when you view your table. Depending on whether your output is in the console or inline, your tibble may be formatted as a paged table in RStudio.
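As a minimal sketch of checking column classes, the example below builds a small stand-in table and inspects it. The object frogs_demo and its values are hypothetical and made up for illustration; the same calls should work on the real frogs data.

```r
library(dplyr)

# Hypothetical stand-in for the frogs data (values are made up for illustration)
frogs_demo <- tibble(
  scientificName  = c("Litoria ewingii", "Litoria raniformis"),
  decimalLatitude = c(-42.88, -41.44),
  eventDate       = as.POSIXct(c("2021-03-01 10:00:00", "2022-07-15 18:30:00"),
                               tz = "UTC")
)

# Compact overview: one line per column, with its type next to the name
glimpse(frogs_demo)

# First class of every column, returned as a named character vector
# (date-time columns carry more than one class, so we keep only the first)
col_classes <- sapply(frogs_demo, function(x) class(x)[1])
col_classes
```

Keeping only the first class is a simplification for display. Use class() on a single column if you need the full class vector.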

Our data classes:

From these quick overviews of the data, we’ve learned:

  • Column scientificName is strings of text (type character)
  • Columns decimalLatitude and decimalLongitude are numbers with decimal points (type double)
  • The eventDate column contains a date + time (type POSIXct/dttm)
  • Columns like recordID and taxonConceptID contain both text and numbers, but are of type character because this type prevents any loss of data (see footnote 1).

Here, the column classes are what we’d expect given the types of data in each column. However, this is not always the case.

For instance, changing just one of the values in decimalLatitude from its assigned numeric value to a “degrees minutes seconds” format causes the entire column class to be changed to character to prevent loss of data.

# duplicate data
frogs_class <- frogs

# check class
class(frogs_class$decimalLatitude)
[1] "numeric"
# change one of the values to a degrees minutes seconds format
frogs_class$decimalLatitude[5] <- "40° 51' 59 N"

# check class
class(frogs_class$decimalLatitude)
[1] "character"

A simple typo in the dataset you import into R could be all it takes to change the class of an entire column, so be sure to keep an eye out for unexpected column classes!
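If a column you expected to be numeric turns out to be character, one way to track down the offending values is to attempt the conversion and see which entries fail. The sketch below uses a hypothetical vector, lat_raw, with made-up values. It is one possible approach, not the only one.

```r
# Hypothetical latitude values read in as character because of one bad entry
lat_raw <- c("-42.88", "-41.44", "40° 51' 59 N", "-43.10")

# Attempt the conversion; entries that can't be parsed become NA
# (as.numeric() warns about this, so we silence the warning here)
lat_num <- suppressWarnings(as.numeric(lat_raw))

# Positions that had a value originally but failed to convert
bad_rows <- which(is.na(lat_num) & !is.na(lat_raw))

bad_rows          # position of the problem value
lat_raw[bad_rows] # the problem value itself
```

Once the problem values are corrected or removed, converting the column with as.numeric() restores the expected class.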

3.2 Column names

There are many reasons why you might need to change the name of one or more columns in a table. We’ve outlined a few of the more common use cases here.

3.2.1 Make column names consistent

Column names should use consistent naming conventions. R is case sensitive, so two names with the same letters but different capitalisations are considered different names (e.g. event vs. Event). Using a naming convention which is both human- and machine-readable (e.g. camel case, snake case), and being consistent in your usage of it, makes it less likely that you will make these sorts of errors.

Camel case begins in lowercase and uses uppercase for the first letter of every subsequent word (e.g. scientificName, dataResourceName, eventDate).

Snake case uses lowercase letters only, with words separated by an underscore _ (e.g. scientific_name, data_resource_name, event_date).

Snake case is more popular in R, and is the naming convention we recommend. Data downloaded from the ALA is in camel case (see footnote 2).

colnames(frogs)
 [1] "recordID"         "scientificName"   "taxonConceptID"   "decimalLatitude" 
 [5] "decimalLongitude" "eventDate"        "occurrenceStatus" "dataResourceName"
 [9] "genus"            "species"         

One of the most useful column name cleaning functions is clean_names() from the janitor package. This function will make all of your column names consistent, based on your preferred naming convention (defaults to snake case).

library(janitor)

frogs_clean <- frogs |>
  clean_names() |>
  colnames()
frogs_clean
 [1] "record_id"          "scientific_name"    "taxon_concept_id"  
 [4] "decimal_latitude"   "decimal_longitude"  "event_date"        
 [7] "occurrence_status"  "data_resource_name" "genus"             
[10] "species"           

Now our names are in a consistent snake_case format.

frogs |>
  clean_names() |>
  rmarkdown::paged_table() # nice format
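If you prefer a convention other than snake case, clean_names() accepts a case argument. The sketch below applies it to a hypothetical table, messy, whose column names are made up to mimic the kinds of inconsistencies described at the start of this chapter.

```r
library(janitor)

# Hypothetical table with inconsistently formatted column names
messy <- data.frame(
  scientificName = "Litoria ewingii",
  `How Much Soil Is In This Plot?` = 2.5,
  species_name = "ewingii",
  check.names = FALSE # keep the messy names exactly as written
)

# Default: snake case
snake_names <- names(clean_names(messy))
snake_names

# Alternative: camel case starting with a lowercase letter
camel_names <- names(clean_names(messy, case = "lower_camel"))
camel_names
```

Whichever convention you choose, applying it to every table at the start of a workflow avoids mixing styles later on.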

3.2.2 Rename columns

Renaming columns is a common data cleaning task. It may be necessary to rename columns to clarify the data they contain or to ensure consistency with another dataset before merging them.

There are several ways to rename columns in R.
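Two common options are rename() from dplyr and direct assignment to names() in base R. The sketch below shows both on a hypothetical stand-in table, frogs_demo, with made-up values. The new column names are only examples.

```r
library(dplyr)

# Hypothetical stand-in for a few columns of the frogs data
frogs_demo <- tibble(
  scientificName  = c("Litoria ewingii", "Litoria raniformis"),
  decimalLatitude = c(-42.88, -41.44)
)

# dplyr: rename(new_name = old_name)
frogs_renamed <- frogs_demo |>
  rename(scientific_name = scientificName,
         latitude = decimalLatitude)

# base R: overwrite the matching element of names()
frogs_base <- frogs_demo
names(frogs_base)[names(frogs_base) == "decimalLatitude"] <- "latitude"

colnames(frogs_renamed)
colnames(frogs_base)
```

The dplyr version fits naturally into a piped workflow, while the base R version avoids loading an extra package.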

3.2.3 Separate columns

Sometimes it is useful to split information from one column into several columns. One good example is if genus and species names are contained in one column like scientificName. We can separate these names into two columns using separate() from the tidyr package.

library(tidyr)

frogs_separate <- frogs |>
  separate(scientificName, 
           c("genus", "species"), # new column names
           fill = "right",        # fill missing values in right column
           remove = FALSE         # keep input column
           ) |> 
  select(scientificName, genus, species)

frogs_separate |> rmarkdown::paged_table() # nice format
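Recent versions of tidyr (1.3.0 and later) mark separate() as superseded and offer separate_wider_delim() for the same task. The sketch below is an alternative to the code above rather than a replacement for it, and uses a hypothetical table, names_demo, with made-up values.

```r
library(tidyr)

# Hypothetical names, including one record identified only to genus
names_demo <- data.frame(
  scientificName = c("Litoria ewingii", "Litoria", "Litoria raniformis")
)

names_split <- names_demo |>
  separate_wider_delim(
    scientificName,
    delim = " ",                   # split on a single space
    names = c("genus", "species"), # new column names
    too_few = "align_start",       # fill missing pieces on the right with NA
    cols_remove = FALSE            # keep the input column
  )

names_split
```

By default, separate_wider_delim() raises an error when a value has more pieces than there are new names. The too_many argument ("drop" or "merge") controls what happens instead, which matters for names that include a subspecies.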

3.2.4 Join columns

Conversely, we might want to combine information from multiple columns into a single column. We can rejoin the genus and species columns we created in the previous section using unite() from the tidyr package.

frogs_united <- frogs_separate |>
  unite("single_name", 
        genus:species, # select columns to join
        sep = " ",     # separate with a space
        na.rm = TRUE,  # remove NA values
        remove = FALSE # keep input columns
        ) |>
  select(genus, species, single_name)

frogs_united |> rmarkdown::paged_table() # nice format
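To see what na.rm does in unite(), the sketch below joins the same two columns with and without it. The table taxa is hypothetical, with one made-up record that has no species name.

```r
library(tidyr)

# Hypothetical table where one record is identified only to genus
taxa <- data.frame(
  genus   = c("Litoria", "Litoria"),
  species = c("ewingii", NA)
)

# na.rm = FALSE: the missing value is pasted in as the text "NA"
with_na <- unite(taxa, "single_name", genus, species,
                 sep = " ", na.rm = FALSE)

# na.rm = TRUE: missing values are dropped before joining
without_na <- unite(taxa, "single_name", genus, species,
                    sep = " ", na.rm = TRUE)

with_na$single_name
without_na$single_name
```

Setting na.rm = TRUE avoids names like "Litoria NA", which would otherwise look like valid text in later steps.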

3.3 Summary

In this chapter, we explored different ways to check the class of each column in your table to make sure R is interpreting your data correctly. We also demonstrated how to rename columns for easier handling and how to split or combine columns to access data more conveniently.

In the next chapter, we will learn how to efficiently clean duplicate data. Duplicates can arise from errors in data collection or entry, or from merging data from multiple sources.


  1. R uses an internal coercion hierarchy to avoid data loss when values of different types are combined in one column. The rule of thumb is that if a value can’t be represented in a more specific data type, the whole column is converted to the most specific type in the hierarchy that can hold every value. The R coercion hierarchy is:
    logical -> integer -> numeric -> complex -> character

    You don’t need to memorise this, but it’s worth being aware of this hierarchy, as R might make decisions to prevent a class error and you might not know why! Learn more in this article.

  2. Queries to the ALA use other coding languages, namely solr and JSON, and column names in these languages are typically in camel case. To maintain consistency with what’s in the ALA and to avoid hidden name cleaning, galah also returns names in camel case.