3 Column classes & names

Each column in a dataset contains values of a specific type, or class. A class defines the type of data in a column and determines how those data are interpreted in R, and how we can modify those data. For instance, it doesn’t make sense to apply a mathematical equation to a word or a sentence. The phrase “hello” doesn’t reveal whether something is true or false. Knowing what types, or classes, of data are in each column of your table will ensure those data behave as expected later on. Classes are important to understand because, generally, functions only work on compatible data types.

Column names can also cause compatibility issues when working with a dataset. Depending on the source of your dataset, existing column names may be uninformative (e.g. col1, tga42.D), oddly formatted once imported into R (e.g. How.Much.Soil.Is.In.This.Plot..), or internally inconsistent (e.g. species_name, scientificName). Modifying these can make it much easier to work with the data and avoid errors caused by mismatched or confusing column names.
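Dot-separated names like the one above often come from base R's name checking on import. As a small sketch (the CSV text and the object names soil_default and soil_raw are made up for illustration), this is how read.csv() rewrites headers by default, and how check.names = FALSE keeps them as written:

```r
# A tiny, made-up CSV whose headers contain spaces and punctuation
csv_text <- "How Much Soil Is In This Plot?,species name\n1.2,Litoria ewingii"

# Default import: characters that are invalid in R names are replaced with "."
soil_default <- read.csv(text = csv_text)
names(soil_default)

# Keep headers exactly as written (these then need backticks when used in code)
soil_raw <- read.csv(text = csv_text, check.names = FALSE)
names(soil_raw)
```

Either way, the names are awkward to type, which is one reason to rename columns early.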

This chapter explains how to check the class of each column and edit column names so that they are consistent and ready to use for analyses.

3.0.1 Prerequisites

In this chapter, we will use occurrence records of Litoria frogs in Tasmania since 2020, downloaded from the Atlas of Living Australia (ALA).

# packages
library(galah)
library(dplyr)
galah_config(email = "your-email-here") # ALA-registered email

frogs <- galah_call() |>
  filter(doi == "https://doi.org /10.26197/ala.dde3e89f-28bf-4515-9128-ab18bce19a08") |>
  atlas_occurrences()

Litoria watjulumensis seated on a rock.
Photo by simono CC-BY-NC 4.0 (Int)

Original download queries

Note: You don’t need to run this code block to read this chapter. It can, however, be useful to see the original download query. This code will download the latest data from the ALA, which you are welcome to use instead, though the data might not exactly reproduce results in this chapter.

library(galah)
galah_config(email = "your-email-here")

frogs <- galah_call() |>
  identify("Litoria") |>
  filter(year >= 2020, 
         cl22 == "Tasmania") |>
  select(group = "basic",
         genus, species) |>
  atlas_occurrences()

Note: We created a custom DOI for our download by using atlas_occurrences(mint_doi = TRUE).

3.1 Column classes

Every column has a class that describes the type of data it contains. It’s important to know what these classes are because R handles each class differently.
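As a quick illustration of why class matters, the sketch below stores the same two latitudes as numbers and as text. The values and object names are illustrative only and are not taken from the frogs download.

```r
# Latitudes stored as numeric can be used in arithmetic
lat_numeric <- c(-42.9, -41.2)
lat_mean <- mean(lat_numeric)
lat_mean

# The same values stored as text cannot: sum() throws an error,
# which we catch here so the script keeps running
lat_text <- c("-42.9", "-41.2")
result <- tryCatch(
  sum(lat_text),
  error = function(e) "error: sum() needs numeric input"
)
result
```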

Viewing your data using functions we introduced in the Inspect chapter allows you to get a quick overview of each column’s class.

glimpse() displays the class beside each column name (e.g. <chr>).

library(dplyr)

glimpse(frogs)

Rows: 2,763
Columns: 10
$ recordID         <chr> "00052544-d943-42e9-bd85-83693c6dd824", "00168ca6-84d…
$ scientificName   <chr> "Litoria ewingii", "Litoria raniformis", "Litoria ewi…
$ taxonConceptID   <chr> "https://biodiversity.org.au/afd/taxa/1c89eedb-42b6-4…
$ decimalLatitude  <dbl> -42.87917, -41.19207, -42.98559, -41.15305, -42.85886…
$ decimalLongitude <dbl> 147.4754, 146.4331, 147.0589, 146.5241, 147.6137, 147…
$ eventDate        <dttm> 2022-09-19 00:00:00, 2023-12-20 23:20:19, 2021-08-07…
$ occurrenceStatus <chr> "PRESENT", "PRESENT", "PRESENT", "PRESENT", "PRESENT"…
$ dataResourceName <chr> "FrogID", "iNaturalist Australia", "FrogID", "iNatura…
$ genus            <chr> "Litoria", "Litoria", "Litoria", "Litoria", "Litoria"…
$ species          <chr> "Litoria ewingii", "Litoria raniformis", "Litoria ewi…

str() displays the class after the column name and before the index range of the column (e.g. chr [1:2763]).

str(frogs)

spc_tbl_ [2,763 × 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ recordID        : chr [1:2763] "00052544-d943-42e9-bd85-83693c6dd824" "00168ca6-84d0-4af1-8fa8-875fd69d25da" "001a43fe-8586-4064-9e76-7373a837a759" "00250163-ec50-4eda-a5d5-58ae98bc5834" ...
 $ scientificName  : chr [1:2763] "Litoria ewingii" "Litoria raniformis" "Litoria ewingii" "Litoria raniformis" ...
 $ taxonConceptID  : chr [1:2763] "https://biodiversity.org.au/afd/taxa/1c89eedb-42b6-4675-b688-2e0b59ea689e" "https://biodiversity.org.au/afd/taxa/89a7a289-bf04-40e0-aaef-7ec6bc968a9c" "https://biodiversity.org.au/afd/taxa/1c89eedb-42b6-4675-b688-2e0b59ea689e" "https://biodiversity.org.au/afd/taxa/89a7a289-bf04-40e0-aaef-7ec6bc968a9c" ...
 $ decimalLatitude : num [1:2763] -42.9 -41.2 -43 -41.2 -42.9 ...
 $ decimalLongitude: num [1:2763] 147 146 147 147 148 ...
 $ eventDate       : POSIXct[1:2763], format: "2022-09-19 00:00:00" "2023-12-20 23:20:19" ...
 $ occurrenceStatus: chr [1:2763] "PRESENT" "PRESENT" "PRESENT" "PRESENT" ...
 $ dataResourceName: chr [1:2763] "FrogID" "iNaturalist Australia" "FrogID" "iNaturalist Australia" ...
 $ genus           : chr [1:2763] "Litoria" "Litoria" "Litoria" "Litoria" ...
 $ species         : chr [1:2763] "Litoria ewingii" "Litoria raniformis" "Litoria ewingii" "Litoria raniformis" ...
 - attr(*, "spec")=
  .. cols(
  ..   recordID = col_character(),
  ..   scientificName = col_character(),
  ..   taxonConceptID = col_character(),
  ..   decimalLatitude = col_double(),
  ..   decimalLongitude = col_double(),
  ..   eventDate = col_datetime(format = ""),
  ..   occurrenceStatus = col_character(),
  ..   dataResourceName = col_character(),
  ..   genus = col_character(),
  ..   species = col_character()
  .. )
 - attr(*, "problems")=<externalptr> 
 - attr(*, "doi")= chr "https://doi.org /10.26197/ala.dde3e89f-28bf-4515-9128-ab18bce19a08"

The skim() function groups columns by their type/class.

library(skimr)

skim(frogs)

Data summary
Name	frogs
Number of rows	2763
Number of columns	10
_______________________
Column type frequency:
character	7
numeric	2
POSIXct	1
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
recordID	0	1	36	36	2763
scientificName	0	1	7	18	5
taxonConceptID	0	1	73	73	5
occurrenceStatus	0	1	7	7	1
dataResourceName	0	1	6	27	3
genus	0	1	7	7	1
species	1	1	15	18	4

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
decimalLatitude	0	1	-42.17	0.85	-43.49	-42.95	-42.50	-41.34	-39.64	▇▁▆▂▁
decimalLongitude	0	1	147.01	0.70	143.84	146.92	147.15	147.34	148.30	▁▁▁▇▂

Variable type: POSIXct

skim_variable	n_missing	complete_rate	min	max	median	n_unique
eventDate	0	1	2020-01-02	2024-06-05 12:06:00	2021-11-16	1181

You can return the class of every column using sapply() from base R.

sapply(frogs, class)

$recordID
[1] "character"

$scientificName
[1] "character"

$taxonConceptID
[1] "character"

$decimalLatitude
[1] "numeric"

$decimalLongitude
[1] "numeric"

$eventDate
[1] "POSIXct" "POSIXt" 

$occurrenceStatus
[1] "character"

$dataResourceName
[1] "character"

$genus
[1] "character"

$species
[1] "character"

You can return the class of every column using map() from the purrr package.

library(purrr)

frogs |>
  purrr::map(class)

$recordID
[1] "character"

$scientificName
[1] "character"

$taxonConceptID
[1] "character"

$decimalLatitude
[1] "numeric"

$decimalLongitude
[1] "numeric"

$eventDate
[1] "POSIXct" "POSIXt" 

$occurrenceStatus
[1] "character"

$dataResourceName
[1] "character"

$genus
[1] "character"

$species
[1] "character"
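Both approaches return a list, because a column such as eventDate can have more than one class. If you only need a compact overview, one option is to keep the first class of each column. The sketch below uses a small made-up data frame (toy) in place of frogs:

```r
# Toy data frame standing in for the frogs table (values are illustrative)
toy <- data.frame(
  scientificName = c("Litoria ewingii", "Litoria raniformis"),
  decimalLatitude = c(-42.9, -41.2),
  eventDate = as.POSIXct(c("2022-09-19", "2023-12-20"), tz = "UTC"),
  stringsAsFactors = FALSE
)

# Keep only the first class of each column so the result is a named character vector
col_classes <- vapply(toy, function(x) class(x)[1], character(1))
col_classes

# Tally how many columns there are of each class
table(col_classes)
```

Using vapply() rather than sapply() guarantees that the result is always a character vector, whatever the columns contain.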

If you are using a tibble, the class is also displayed below each column name when you view your table. Depending on whether your output is in the console or inline, your tibble may be formatted as a paged table in RStudio.

The console output is shown first, followed by the call that produces an inline paged table.

frogs

# A tibble: 2,763 × 10
   recordID       scientificName taxonConceptID decimalLatitude decimalLongitude
   <chr>          <chr>          <chr>                    <dbl>            <dbl>
 1 00052544-d943… Litoria ewing… https://biodi…           -42.9             147.
 2 00168ca6-84d0… Litoria ranif… https://biodi…           -41.2             146.
 3 001a43fe-8586… Litoria ewing… https://biodi…           -43.0             147.
 4 00250163-ec50… Litoria ranif… https://biodi…           -41.2             147.
 5 003e0f63-9f95… Litoria ewing… https://biodi…           -42.9             148.
 6 0070521f-bb45… Litoria ewing… https://biodi…           -43.1             147.
 7 00898021-7ad3… Litoria ewing… https://biodi…           -42.8             147.
 8 00a64bc4-f727… Litoria ewing… https://biodi…           -43.1             148.
 9 00ab3454-6e77… Litoria ewing… https://biodi…           -41.6             147.
10 00d33f5b-9cd7… Litoria ewing… https://biodi…           -43.0             147.
# ℹ 2,753 more rows
# ℹ 5 more variables: eventDate <dttm>, occurrenceStatus <chr>,
#   dataResourceName <chr>, genus <chr>, species <chr>

frogs |> rmarkdown::paged_table()

Our data classes:

From these quick overviews of the data, we’ve learned:

Column scientificName contains strings of text (type character)
Columns decimalLatitude and decimalLongitude are numbers with decimal points (type double)
The eventDate column contains a date + time (type POSIXct/dttm)
Columns like recordID and taxonConceptID contain both text and numbers, but are of type character because this type prevents any loss of data¹.

Here, the column classes are what we’d expect given the types of data in each column. However, this is not always the case.

For instance, changing just one of the values in decimalLatitude from its assigned numeric value to a “degrees minutes seconds” format causes the entire column class to be changed to character to prevent loss of data.

# duplicate data
frogs_class <- frogs

# check class
class(frogs_class$decimalLatitude)

[1] "numeric"

# change one of the values to a degrees minutes seconds format
frogs_class$decimalLatitude[5] <- "40° 51' 59 N"

# check class
class(frogs_class$decimalLatitude)

[1] "character"
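The same coercion happens whenever values of different types are combined in one vector. This minimal sketch, using made-up values, shows R moving to the more general type each time:

```r
# Combining types: R picks the most general type needed to hold every value
class_lgl_int <- class(c(TRUE, 1L))      # logical + integer
class_int_dbl <- class(c(1L, 2.5))       # integer + double
class_dbl_chr <- class(c(2.5, "text"))   # number + text

class_lgl_int
class_int_dbl
class_dbl_chr
```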

Look out for typos

A single typo in the dataset you import into R can be all it takes to change the class of an entire column, so keep an eye out for unexpected column classes!
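If a column you expected to be numeric turns out to be character, one way to track down the cause is to attempt the conversion and see which values fail. This is only a sketch with made-up values; lat_text stands in for a column such as decimalLatitude.

```r
# Illustrative latitudes read in as text because of one odd entry
lat_text <- c("-42.87917", "-41.19207", "40° 51' 59 N", "-41.15305")

# Try converting to numeric; entries that cannot be converted become NA
lat_numeric <- suppressWarnings(as.numeric(lat_text))

# Positions that failed conversion but were not missing to begin with
bad <- which(is.na(lat_numeric) & !is.na(lat_text))
bad
lat_text[bad]
```

Once the offending values are fixed or removed, as.numeric() can be applied to the whole column.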

3.2 Column names

There are many reasons why you might need to change the name of one or more columns in a table. We’ve outlined a few of the more common use cases here.

3.2.1 Make column names consistent

Column names should use consistent naming conventions. R is case sensitive, so two names with the same letters but different capitalisations are considered different names (e.g. event vs. Event). Using a naming convention which is both human- and machine-readable (e.g. camel case, snake case), and being consistent in your usage of it, makes it less likely that you will make these sorts of errors.

Camel case begins in lowercase and uses uppercase for the first letter of every subsequent word (e.g. scientificName, dataResourceName, eventDate).

Snake case uses lowercase letters only, with words separated by an underscore _ (e.g. scientific_name, data_resource_name, event_date).

Snake case is more popular in R, and is the naming convention we recommend. Data downloaded from the ALA is in camel case².

colnames(frogs)

 [1] "recordID"         "scientificName"   "taxonConceptID"   "decimalLatitude" 
 [5] "decimalLongitude" "eventDate"        "occurrenceStatus" "dataResourceName"
 [9] "genus"            "species"

One of the most useful column name cleaning functions is clean_names() from the janitor package. This function will make all of your column names consistent, based on your preferred naming convention (defaults to snake case).

library(janitor)

frogs_clean <- frogs |>
  clean_names() |>
  colnames()
frogs_clean

 [1] "record_id"          "scientific_name"    "taxon_concept_id"  
 [4] "decimal_latitude"   "decimal_longitude"  "event_date"        
 [7] "occurrence_status"  "data_resource_name" "genus"             
[10] "species"

Now our names are in a consistent snake_case format.

frogs |>
  clean_names() |>
  rmarkdown::paged_table() # nice format
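To get a feel for what the conversion from camel case to snake case involves, here is a rough base R sketch. It is a simplification that we assume is good enough for names like those in frogs; it is not how clean_names() is implemented, and clean_names() handles many more cases (spaces, punctuation, duplicated names).

```r
camel <- c("recordID", "scientificName", "taxonConceptID", "dataResourceName")

# Insert "_" wherever a lowercase letter or digit is followed by an
# uppercase letter, then convert everything to lowercase
snake <- tolower(gsub("([a-z0-9])([A-Z])", "\\1_\\2", camel))
snake
```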

Litoria ewingii nestled in the mud.
Photo by george_vaughan CC-BY-NC 4.0 (Int)

3.2.2 Rename columns

Renaming columns is a common data cleaning task. It may be necessary to rename columns to clarify the data they contain or to ensure consistency with another dataset before merging them.

There are several ways to rename columns in R. Here we show two: rename() from dplyr, and names() combined with the assignment operator <- in base R.

dplyr::rename() provides an easy way to rename one or more columns.

frogs |>
  select(decimalLatitude, decimalLongitude) |>
  rename(latitude = decimalLatitude,
         longitude = decimalLongitude)

# A tibble: 2,763 × 2
   latitude longitude
      <dbl>     <dbl>
 1    -42.9      147.
 2    -41.2      146.
 3    -43.0      147.
 4    -41.2      147.
 5    -42.9      148.
 6    -43.1      147.
 7    -42.8      147.
 8    -43.1      148.
 9    -41.6      147.
10    -43.0      147.
# ℹ 2,753 more rows

rename_with() is a more powerful version of rename(). It allows more advanced renaming by using functions to rename matching columns. Here we convert column names starting with “decimal” to uppercase.

frogs |>
  select(decimalLatitude, decimalLongitude) |>
  rename_with(toupper, starts_with("decimal"))

# A tibble: 2,763 × 2
   DECIMALLATITUDE DECIMALLONGITUDE
             <dbl>            <dbl>
 1           -42.9             147.
 2           -41.2             146.
 3           -43.0             147.
 4           -41.2             147.
 5           -42.9             148.
 6           -43.1             147.
 7           -42.8             147.
 8           -43.1             148.
 9           -41.6             147.
10           -43.0             147.
# ℹ 2,753 more rows

And here we replace “decimal” with the prefix “new_” and convert the names to lowercase. Because no columns are specified, the function is applied to every column name.

frogs |>
  select(decimalLatitude, decimalLongitude) |>
  rename_with( ~ tolower(gsub("decimal", "new_", .x, fixed = TRUE)))

Note: .x is shorthand for the value the function will be applied to. In this case, .x refers to the character vector of column names, not to the frogs data frame itself.

# A tibble: 2,763 × 2
   new_latitude new_longitude
          <dbl>         <dbl>
 1        -42.9          147.
 2        -41.2          146.
 3        -43.0          147.
 4        -41.2          147.
 5        -42.9          148.
 6        -43.1          147.
 7        -42.8          147.
 8        -43.1          148.
 9        -41.6          147.
10        -43.0          147.
# ℹ 2,753 more rows

Index a specific column name in base R with the help of names(). Assign a new column name to replace an old column name using the assignment operator <-. Note that, unlike the rename() examples above, this changes the names in frogs itself.

names(frogs)[names(frogs) == "decimalLatitude"] <- "latitude"
names(frogs)[names(frogs) == "decimalLongitude"] <- "longitude"

frogs[,c("latitude", "longitude")]

# A tibble: 2,763 × 2
   latitude longitude
      <dbl>     <dbl>
 1    -42.9      147.
 2    -41.2      146.
 3    -43.0      147.
 4    -41.2      147.
 5    -42.9      148.
 6    -43.1      147.
 7    -42.8      147.
 8    -43.1      148.
 9    -41.6      147.
10    -43.0      147.
# ℹ 2,753 more rows

3.2.3 Separate columns

Sometimes it is useful to split information from one column into several columns. One good example is if genus and species names are contained in one column like scientificName. We can separate these names into two columns using separate() from the tidyr package.

library(tidyr)

frogs_separate <- frogs |>
  separate(scientificName, 
           c("genus", "species"), # new column names
           fill = "right",        # fill missing values in right column
           remove = FALSE         # keep input column
           ) |> 
  select(scientificName, genus, species)

frogs_separate |> rmarkdown::paged_table() # nice format
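To see what fill = "right" does, here is the same call on a made-up two-row table (toy) where one name has a genus but no species:

```r
library(tidyr)

# Made-up names: the second has no species epithet
toy <- data.frame(scientificName = c("Litoria ewingii", "Litoria"))

toy_separate <- toy |>
  separate(scientificName,
           c("genus", "species"), # new column names
           fill = "right",        # pad with NA on the right when a piece is missing
           remove = FALSE)        # keep input column

toy_separate
```

The second row gets NA in species instead of triggering a warning about missing pieces.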

3.2.4 Join columns

Conversely, we might want to combine information from multiple columns into a single column. We can rejoin the genus and species columns we created in the previous section using unite() from the tidyr package.

frogs_united <- frogs_separate |>
  unite("single_name", 
        genus:species, # select columns to join
        sep = " ",     # separate with a space
        na.rm = TRUE,  # remove NA values
        remove = FALSE # keep input column
        ) |>
  select(genus, species, single_name)

frogs_united |> rmarkdown::paged_table() # nice format
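The effect of na.rm = TRUE is easiest to see on a small example. In this sketch, using a made-up table (toy) with one missing species, we unite the columns with and without removing NA values:

```r
library(tidyr)

# Made-up table: the second row has no species
toy <- data.frame(genus = c("Litoria", "Litoria"),
                  species = c("ewingii", NA))

# With na.rm = TRUE, missing values are dropped before joining
toy_united <- toy |>
  unite("single_name", genus:species, sep = " ", na.rm = TRUE, remove = FALSE)
toy_united$single_name

# Without it, NA is pasted in as the text "NA"
toy_na_kept <- toy |>
  unite("single_name", genus:species, sep = " ", remove = FALSE)
toy_na_kept$single_name
```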

Litoria raniformis close-up
Photo by sonyaf CC-BY-NC 4.0 (Int)

3.3 Summary

In this chapter, we explored different ways to check the class of each column in your table to make sure R is interpreting your data correctly. We also demonstrated how to rename columns for easier handling and how to split or combine columns to access data more conveniently.

In the next chapter, we will learn how to efficiently clean duplicate data. Duplicates can arise from errors in data collection or entry, or from merging data from multiple sources.

¹ When values of different types are combined, R follows an internal coercion hierarchy to avoid data loss. The rule of thumb is that if a value can’t be represented in the more specific type, the more general type is used for all values instead. The R coercion hierarchy is:
logical -> integer -> numeric -> complex -> character

You don’t need to memorise this, but it’s worth being aware of this hierarchy, as R might change a class to prevent an error and you might not know why! Learn more in this article.

² Queries to the ALA use other coding languages, namely Solr and JSON, and column names in these languages are typically in camel case. To maintain consistency with what’s in the ALA and to avoid hidden name cleaning, galah also returns names in camel case.