6 Strings

Strings are sequences of characters that make up spaces, letters, abbreviations, words or sentences. They can be formatted in many different ways within a single dataset. For example, these are just some of the ways the scientific name of a species could be recorded in a dataset:

"Dendrolagus lumholtzi"
"dendrolagus lumholtzi"
":Dendrolagus_lumholtzi34"
" Dendrolagus lumholtzi "

Terms might be capitalised (or not), have accidental spaces at the beginning or end of a word or sentence, contain typos, or include punctuation; all of these things can impact your ability to consolidate and analyse data accurately.

In this chapter, we focus on general data science techniques to clean strings in a dataset.
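As a preview, the four example spellings above can be brought to one consistent form by chaining a few stringr functions. This is only a minimal sketch: the order of the steps and the rule "drop everything that is not a letter or a space" are assumptions chosen for this example and will not suit every dataset.

```r
library(stringr)

# The four example spellings from the start of this chapter
raw_names <- c(
  "Dendrolagus lumholtzi",
  "dendrolagus lumholtzi",
  ":Dendrolagus_lumholtzi34",
  " Dendrolagus lumholtzi "
)

clean_names <- raw_names |>
  str_replace_all("_", " ") |>      # underscores become spaces
  str_remove_all("[^A-Za-z ]") |>   # drop punctuation and digits
  str_squish() |>                   # remove leading, trailing and repeated spaces
  str_to_sentence()                 # capitalise the first letter only

unique(clean_names)
```

All four inputs collapse to the single value "Dendrolagus lumholtzi".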

6.0.1 Prerequisites

In this chapter, we will use tree kangaroo (Dendrolagus) occurrence data from the ALA and a subset of bee (Apidae) data taken from the Curated Plant and Invertebrate Data for Bushfire Modelling data set, saved in the bees.parquet file.

Download the bees.parquet file from the Data in this book chapter.

# packages
library(galah)
library(dplyr)
library(tidyverse)
library(janitor)
library(here)
library(arrow)
galah_config(email = "your-email-here") # ALA-registered email

tree_kangaroo <- galah_call() |>
  filter(doi == "https://doi.org/10.26197/ala.ae089212-80fa-48d4-84cf-17ba3c4e8ba4") |>
  atlas_occurrences()

bees <- read_parquet(here("path", "to", "bees.parquet"))

Dendrolagus lumholtzi sitting on a branch.
Photo by matthewkwan CC-BY

Braunsapis species hovering in front of a ghost gum tree flower.
Photo by Zig Madycki CC-BY-NC-ND 4.0 (Int)

Original download queries

Note: You don’t need to run this code block to read this chapter. It can, however, be useful to see the original download query. This code will download the latest data from the ALA, which you are welcome to use instead, though the data might not exactly reproduce results in this chapter.

library(galah)
galah_config(email = "your-email-here")

tree_kangaroo <- galah_call() |>
  galah_identify("Dendrolagus") |>
  atlas_occurrences()

1: We created a custom DOI for our download by using atlas_occurrences(mint_doi = TRUE).

6.1 Basic string manipulation

The stringr package provides a number of useful functions for working with strings.

library(stringr)

Trim

Trim whitespace on either side of a string.

str_trim("  Genus specificus  ")

[1] "Genus specificus"

Or just one side.

str_trim("  Genus specificus  ", side = "left")

[1] "Genus specificus  "

Squish

Squish a string: remove whitespace at the start and end, and reduce repeated internal whitespace to a single space.

str_squish("  Genus   specificus  ")

[1] "Genus specificus"
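The difference between trimming and squishing is easiest to see on the same input. The sketch below reuses the example string from above; the variable names are just for illustration.

```r
library(stringr)

messy <- "  Genus   specificus  "

trimmed  <- str_trim(messy)    # only the ends are cleaned
squished <- str_squish(messy)  # ends cleaned and internal gaps reduced

trimmed   # "Genus   specificus"
squished  # "Genus specificus"
```

If internal spacing carries no meaning, as in most taxonomic names, str_squish() is usually the safer default.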

Truncate

Truncate a long string to a specified length.

str_trunc("Genus specificus", width = 10, side = "right")

[1] "Genus s..."

Split

Split a string into separate pieces based on a specified character.

str_split("Genus specificus", " ")

[[1]]
[1] "Genus"      "specificus"
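str_split() returns a list, which can be awkward when you want one piece per column. One option, sketched here with made-up names, is str_split_fixed(), which returns a character matrix with a fixed number of columns and pads missing pieces with empty strings.

```r
library(stringr)

# Hypothetical example names, including one with no species part
sci_names <- c("Genus specificus", "Othergenus otherspecies", "Genus")

parts <- str_split_fixed(sci_names, " ", n = 2)  # matrix with 2 columns

genus   <- parts[, 1]  # "Genus" "Othergenus" "Genus"
species <- parts[, 2]  # "specificus" "otherspecies" ""
```

Note that the missing species is an empty string rather than NA, so it may need converting before further analysis.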

Concatenate

Concatenate (i.e. join) separate strings into one string separated by a specified character.

str_c("Genus", "specificus", sep = "_")

[1] "Genus_specificus"
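str_c() is vectorised, so it can also join two columns element by element, and its collapse argument combines a whole vector into one string. A small sketch with illustrative vectors:

```r
library(stringr)

genus   <- c("Dendrolagus", "Dendrolagus", "Dendrolagus")
species <- c("lumholtzi", "bennettianus", NA)

# Join element by element; a missing piece makes the whole result NA
full_names <- str_c(genus, species, sep = " ")

# Combine the non-missing species into a single string
species_list <- str_c(species[!is.na(species)], collapse = ", ")

full_names    # "Dendrolagus lumholtzi" "Dendrolagus bennettianus" NA
species_list  # "lumholtzi, bennettianus"
```

Because NA propagates, it is worth checking for missing values before concatenating name columns.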

6.2 Matching

Matching strings using patterns can be a powerful way to identify or filter records during the data cleaning process.

6.2.1 Detect a pattern

Detect a pattern within a string.

# detect if a pattern exists
str_detect("Genus specificus", "Genus")

[1] TRUE

Use str_detect() to filter your data.frame. Here, we filter the species names to only those containing the pattern "lum".

# 3 possible names in scientificName column
tree_kangaroo |> distinct(scientificName)

# A tibble: 3 × 1
  scientificName          
  <chr>                   
1 Dendrolagus lumholtzi   
2 Dendrolagus             
3 Dendrolagus bennettianus

# detect names matching "lum"
tree_kangaroo |>
  filter(str_detect(scientificName, "lum")) |>
  select(scientificName)

# A tibble: 813 × 1
   scientificName       
   <chr>                
 1 Dendrolagus lumholtzi
 2 Dendrolagus lumholtzi
 3 Dendrolagus lumholtzi
 4 Dendrolagus lumholtzi
 5 Dendrolagus lumholtzi
 6 Dendrolagus lumholtzi
 7 Dendrolagus lumholtzi
 8 Dendrolagus lumholtzi
 9 Dendrolagus lumholtzi
10 Dendrolagus lumholtzi
# ℹ 803 more rows
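By default, str_detect() is case sensitive, which matters when capitalisation is inconsistent. The sketch below uses a small made-up vector to show two useful variations: ignoring case with regex(ignore_case = TRUE), and inverting the result with negate = TRUE.

```r
library(stringr)

# Hypothetical names with inconsistent capitalisation
sci <- c("Dendrolagus lumholtzi",
         "dendrolagus LUMHOLTZI",
         "Dendrolagus bennettianus")

plain    <- str_detect(sci, "lum")                            # case sensitive
any_case <- str_detect(sci, regex("lum", ignore_case = TRUE)) # ignores case
no_match <- str_detect(sci, "lum", negate = TRUE)             # TRUE where absent

plain     # TRUE FALSE FALSE
any_case  # TRUE  TRUE FALSE
no_match  # FALSE TRUE  TRUE
```

Either form can be placed inside filter() in the same way as the example above.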

6.2.2 Remove a pattern

Remove a pattern from a string.

# remove match for Genus (followed by a whitespace)
str_remove("Genus specificus", pattern = "Genus ")

[1] "specificus"

Use str_remove() to clean or extract names. Here, we remove the genus name from scientificName and save the result in a new species column.

tree_kangaroo |>
  mutate(
    species = ifelse(scientificName != "Dendrolagus",
                     str_remove(scientificName, "Dendrolagus "),
                     NA)
  ) |>
  select(scientificName, species)

# A tibble: 1,302 × 2
   scientificName           species     
   <chr>                    <chr>       
 1 Dendrolagus lumholtzi    lumholtzi   
 2 Dendrolagus lumholtzi    lumholtzi   
 3 Dendrolagus lumholtzi    lumholtzi   
 4 Dendrolagus lumholtzi    lumholtzi   
 5 Dendrolagus              <NA>        
 6 Dendrolagus bennettianus bennettianus
 7 Dendrolagus lumholtzi    lumholtzi   
 8 Dendrolagus lumholtzi    lumholtzi   
 9 Dendrolagus              <NA>        
10 Dendrolagus bennettianus bennettianus
# ℹ 1,292 more rows
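The ifelse() guard in the example above is needed because str_remove() returns a string unchanged when the pattern is not found. A minimal sketch on a two-element vector shows what would happen without it:

```r
library(stringr)

sci <- c("Dendrolagus lumholtzi", "Dendrolagus")

# "Dendrolagus " (with trailing space) is not found in the second element,
# so that element comes back untouched
removed <- str_remove(sci, "Dendrolagus ")

# Guarded version: genus-only records become NA instead
species <- ifelse(sci == "Dendrolagus", NA_character_, removed)

removed  # "lumholtzi" "Dendrolagus"
species  # "lumholtzi" NA
```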

6.2.3 Locate a pattern

Locate the position of a pattern within a string. We’ll create an example dataset below.

records <- c("Genus", 
             "species", 
             "ZZGenus species", 
             "Difgenus difspecies")

Find the start and end position of a pattern.

str_locate(records, "Genus")

     start end
[1,]     1   5
[2,]    NA  NA
[3,]     3   7
[4,]    NA  NA

Find which indices match a pattern. Here, the first and third strings in records contain the pattern "Genus".

str_which(records, "Genus")

[1] 1 3
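Positions returned by str_locate() can be passed to str_sub() to pull out or keep part of a string. This is a small sketch that combines the two functions above; restricting to the matching records first avoids passing NA positions.

```r
library(stringr)

records <- c("Genus",
             "species",
             "ZZGenus species",
             "Difgenus difspecies")

pos  <- str_locate(records, "Genus")  # matrix with "start" and "end" columns
hits <- str_which(records, "Genus")   # indices of matching records

# The matched text itself
matched <- str_sub(records[hits], pos[hits, "start"], pos[hits, "end"])

# Everything from the start of the match onwards (drops the "ZZ" prefix)
from_match <- str_sub(records[hits], pos[hits, "start"])

matched     # "Genus" "Genus"
from_match  # "Genus" "Genus species"
```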

Add pattern location information to a data.frame.

tree_kangaroo |>
  mutate(
    start = str_locate(scientificName, "lum")[, 1],
    end = str_locate(scientificName, "lum")[, 2]
  ) |>
  select(scientificName, start, end)

1: [, 1] returns column 1 (start) of the str_locate() output
2: [, 2] returns column 2 (end) of the str_locate() output

# A tibble: 1,302 × 3
   scientificName           start   end
   <chr>                    <int> <int>
 1 Dendrolagus lumholtzi       13    15
 2 Dendrolagus lumholtzi       13    15
 3 Dendrolagus lumholtzi       13    15
 4 Dendrolagus lumholtzi       13    15
 5 Dendrolagus                 NA    NA
 6 Dendrolagus bennettianus    NA    NA
 7 Dendrolagus lumholtzi       13    15
 8 Dendrolagus lumholtzi       13    15
 9 Dendrolagus                 NA    NA
10 Dendrolagus bennettianus    NA    NA
# ℹ 1,292 more rows

Dendrolagus bennettianus grasping a tree branch.
Photo by David White CC-BY

6.2.4 Regex matching

The examples above demonstrate the use of basic patterns. For cases that need more specific or advanced matching, we can use regular expressions (or “regex”). Regex is a powerful tool used to match patterns, replace characters, and extract text from strings. Regex can be complex and unintuitive, but websites such as Regex 101 are extremely helpful (note that snippets from this website need additional editing to work correctly in R). ChatGPT can also be helpful for building more complex regex snippets.
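One common edit when moving a pattern from a regex tester into R is escaping backslashes: inside a regular R string, a regex backslash must be written twice. The sketch below shows this for the whitespace class \s; raw strings (available from R 4.0) are an alternative that avoids the doubling.

```r
# Regular R string: each regex backslash is doubled
pattern_escaped <- "\\s+"

# Raw string (R >= 4.0): written exactly as it appears in a regex tester
pattern_raw <- r"(\s+)"

identical(pattern_escaped, pattern_raw)  # TRUE
writeLines(pattern_escaped)              # prints the pattern as the regex engine sees it: \s+
```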

Here we explore a few basic examples. Keep in mind that these methods can be applied to both column names and column values.

The str_view() function is a useful way to see what a regular expression will return. The results are shown in the console, and elements matched by the regex are surrounded with angle brackets < >.

# Match the first word in the string (the genus)
str_view(tree_kangaroo$scientificName, "^[A-Z][a-z]+")

1: This regex reads “At the start of the string, match one capital letter followed by one or more lowercase letters.”

 [1] │ <Dendrolagus> lumholtzi
 [2] │ <Dendrolagus> lumholtzi
 [3] │ <Dendrolagus> lumholtzi
 [4] │ <Dendrolagus> lumholtzi
 [5] │ <Dendrolagus>
 [6] │ <Dendrolagus> bennettianus
 [7] │ <Dendrolagus> lumholtzi
 [8] │ <Dendrolagus> lumholtzi
 [9] │ <Dendrolagus>
[10] │ <Dendrolagus> bennettianus
[11] │ <Dendrolagus> lumholtzi
[12] │ <Dendrolagus>
[13] │ <Dendrolagus>
[14] │ <Dendrolagus> lumholtzi
[15] │ <Dendrolagus> bennettianus
[16] │ <Dendrolagus> lumholtzi
[17] │ <Dendrolagus>
[18] │ <Dendrolagus> lumholtzi
[19] │ <Dendrolagus> lumholtzi
[20] │ <Dendrolagus> lumholtzi
... and 1282 more

# Match only the second word (species name)
str_view(tree_kangaroo$scientificName, "(?<=\\s)[a-z]+")

1: This regex reads “Match one or more lowercase letters that come directly after a whitespace character.” The lookbehind (?<=\\s) checks for the whitespace without including it in the match.

 [1] │ Dendrolagus <lumholtzi>
 [2] │ Dendrolagus <lumholtzi>
 [3] │ Dendrolagus <lumholtzi>
 [4] │ Dendrolagus <lumholtzi>
 [6] │ Dendrolagus <bennettianus>
 [7] │ Dendrolagus <lumholtzi>
 [8] │ Dendrolagus <lumholtzi>
[10] │ Dendrolagus <bennettianus>
[11] │ Dendrolagus <lumholtzi>
[14] │ Dendrolagus <lumholtzi>
[15] │ Dendrolagus <bennettianus>
[16] │ Dendrolagus <lumholtzi>
[18] │ Dendrolagus <lumholtzi>
[19] │ Dendrolagus <lumholtzi>
[20] │ Dendrolagus <lumholtzi>
[21] │ Dendrolagus <lumholtzi>
[24] │ Dendrolagus <lumholtzi>
[25] │ Dendrolagus <bennettianus>
[27] │ Dendrolagus <lumholtzi>
[28] │ Dendrolagus <lumholtzi>
... and 899 more
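str_view() only displays matches. To store them, the same patterns can be passed to str_extract(), which returns the first match in each string, or NA when there is none. A minimal sketch on a small vector of the three names found in scientificName:

```r
library(stringr)

sci <- c("Dendrolagus lumholtzi",
         "Dendrolagus",
         "Dendrolagus bennettianus")

# First word: a capital letter followed by lowercase letters, at the start
genus <- str_extract(sci, "^[A-Z][a-z]+")

# Second word: lowercase letters directly after a whitespace character
species <- str_extract(sci, "(?<=\\s)[a-z]+")

genus    # "Dendrolagus" "Dendrolagus" "Dendrolagus"
species  # "lumholtzi" NA "bennettianus"
```

Because unmatched strings return NA, genus-only records need no separate handling here.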

6.2.5 Replace

Another common way to clean strings is to match and replace specific patterns. Here are several examples using the stringr package and base R.

In stringr, the str_replace() function can be used to replace the first match of a string. The str_replace_all() function can be used to replace all matches.

records <- c("Genus", 
             "species", 
             "ZZGenus species", 
             "Difgenus difspecies")

str_replace(records, "[aeiou]", "-")     # first match

[1] "G-nus"               "sp-cies"             "ZZG-nus species"    
[4] "D-fgenus difspecies"

str_replace_all(records, "[aeiou]", "-") # all matches

[1] "G-n-s"               "sp-c--s"             "ZZG-n-s sp-c--s"    
[4] "D-fg-n-s d-fsp-c--s"
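Because stringr treats patterns as regex by default, characters such as "." have a special meaning (here, "any character"). When the aim is to replace a literal character, wrap the pattern in fixed() or escape it. The abbreviated name below is a made-up example.

```r
library(stringr)

abbrev <- "D. lumholtzi"

# "." as regex matches every character, so everything is replaced
as_regex <- str_replace_all(abbrev, ".", "-")

# Two ways to match a literal full stop
as_fixed   <- str_replace_all(abbrev, fixed("."), "")
as_escaped <- str_replace_all(abbrev, "\\.", "")

as_regex    # "------------"
as_fixed    # "D lumholtzi"
as_escaped  # "D lumholtzi"
```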

Replace a matched pattern in a dataframe.

tree_kangaroo |>
  mutate(
    name_updated = str_replace(
      scientificName, "^[A-Z][a-z]+", "new_name"
      )
  ) |>
  select(scientificName, name_updated)

1: This regex reads “At the start of the string, match one capital letter followed by one or more lowercase letters.” We then replace this match with “new_name”.

# A tibble: 1,302 × 2
   scientificName           name_updated         
   <chr>                    <chr>                
 1 Dendrolagus lumholtzi    new_name lumholtzi   
 2 Dendrolagus lumholtzi    new_name lumholtzi   
 3 Dendrolagus lumholtzi    new_name lumholtzi   
 4 Dendrolagus lumholtzi    new_name lumholtzi   
 5 Dendrolagus              new_name             
 6 Dendrolagus bennettianus new_name bennettianus
 7 Dendrolagus lumholtzi    new_name lumholtzi   
 8 Dendrolagus lumholtzi    new_name lumholtzi   
 9 Dendrolagus              new_name             
10 Dendrolagus bennettianus new_name bennettianus
# ℹ 1,292 more rows

In base R, the gsub() function can be used to replace all matches of a pattern.

records <- c("Genus", 
             "species", 
             "ZZGenus species", 
             "Difgenus difspecies")

gsub("[aeiou]", "-", records) # all matches

[1] "G-n-s"               "sp-c--s"             "ZZG-n-s sp-c--s"    
[4] "D-fg-n-s d-fsp-c--s"
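Base R also has sub(), which replaces only the first match, much like str_replace(). Note that in both base functions the pattern comes first and the strings come last. The sketch below also shows perl = TRUE, which is needed in base R for lookbehind patterns because the default regex engine does not support them.

```r
records <- c("Genus",
             "species",
             "ZZGenus species",
             "Difgenus difspecies")

# First match only
first_only <- sub("[aeiou]", "-", records)

# Lookbehind requires the Perl-compatible engine in base R
with_lookbehind <- sub("(?<=\\s)[a-z]+", "sp.",
                       "Dendrolagus lumholtzi", perl = TRUE)

first_only       # "G-nus" "sp-cies" "ZZG-nus species" "D-fgenus difspecies"
with_lookbehind  # "Dendrolagus sp."
```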

Replace a matched pattern in a dataframe.

tree_kangaroo$name_updated <- gsub(
  pattern = "Dendrolagus",
  replacement = "new_name",
  x = tree_kangaroo$scientificName
)

tree_kangaroo[,c("scientificName", "name_updated")]

# A tibble: 1,302 × 2
   scientificName           name_updated         
   <chr>                    <chr>                
 1 Dendrolagus lumholtzi    new_name lumholtzi   
 2 Dendrolagus lumholtzi    new_name lumholtzi   
 3 Dendrolagus lumholtzi    new_name lumholtzi   
 4 Dendrolagus lumholtzi    new_name lumholtzi   
 5 Dendrolagus              new_name             
 6 Dendrolagus bennettianus new_name bennettianus
 7 Dendrolagus lumholtzi    new_name lumholtzi   
 8 Dendrolagus lumholtzi    new_name lumholtzi   
 9 Dendrolagus              new_name             
10 Dendrolagus bennettianus new_name bennettianus
# ℹ 1,292 more rows

6.3 Capitalisation

Capitalisation (also called case style) can vary between data providers. Each data provider can have their own naming conventions, and even small differences in conventions must be standardised in order to use a dataset. There are some basic functions available to change the case of strings in stringr:

# example
tree_kangaroo$scientificName[1]

[1] "Dendrolagus lumholtzi"

str_to_lower(tree_kangaroo$scientificName[1])

[1] "dendrolagus lumholtzi"

str_to_upper(tree_kangaroo$scientificName[1])

[1] "DENDROLAGUS LUMHOLTZI"

str_to_title(tree_kangaroo$scientificName[1])

[1] "Dendrolagus Lumholtzi"

str_to_sentence(tree_kangaroo$scientificName[1])

[1] "Dendrolagus lumholtzi"
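One practical use of these functions is to check whether values differ only by case. In this sketch, three made-up spellings from different providers count as three distinct names until they are converted to a common case.

```r
library(stringr)

# Hypothetical spellings of the same name from three providers
provider_names <- c("Dendrolagus lumholtzi",
                    "dendrolagus lumholtzi",
                    "DENDROLAGUS LUMHOLTZI")

n_raw        <- length(unique(provider_names))                # 3
n_normalised <- length(unique(str_to_lower(provider_names)))  # 1
```

Converting to lower case is convenient for comparison; the final stored value can then use whichever case style the dataset requires.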

Normally, names of higher taxonomic ranks are capitalised (e.g. Myrtaceae, Aves). Capitalisation errors are usually easy to spot when you print the data object. Alternatively, you can use str_subset() to return the values that contain capital letters in columns where you expect them.

For example, in our bees dataset (downloaded at the start of this chapter), some higher taxonomy columns don’t capitalise names. The code below returns the unique values of the class column that contain uppercase letters. Notice that no matches are found.

str_subset(unique(bees$class), "[:upper:]")

character(0)

Apis (Apis) mellifera looking for some pollen.
Photo by Reiner Richter CC-BY

We can verify that there are no uppercase matches by looking at the unique values containing lowercase letters. This reveals that Insecta is entirely in lowercase.

str_subset(unique(bees$class), "[:lower:]")

[1] "insecta"
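A check for any uppercase letter would also pass a value written entirely in capitals. A stricter alternative, sketched here on made-up values rather than the bees data, is to test the whole string against the expected shape (one capital letter followed by lowercase letters) and use negate = TRUE to list the values that fail.

```r
library(stringr)

# Hypothetical values showing three case styles
class_values <- c("insecta", "Insecta", "INSECTA")

expected_shape <- "^[A-Z][a-z]+$"

well_formed <- str_subset(class_values, expected_shape)
needs_fix   <- str_subset(class_values, expected_shape, negate = TRUE)

well_formed  # "Insecta"
needs_fix    # "insecta" "INSECTA"
```

This pattern assumes single-word names made of unaccented letters, so it would need adjusting for other columns.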

We can correct the lowercase formatting as shown below. Remember to verify the correction before overwriting or removing the erroneous column(s).

bees |>
  mutate(class_corrected = str_to_sentence(class)) |>
  select(starts_with("class"))

# A tibble: 1,139 × 2
   class   class_corrected
   <chr>   <chr>          
 1 insecta Insecta        
 2 insecta Insecta        
 3 insecta Insecta        
 4 insecta Insecta        
 5 insecta Insecta        
 6 insecta Insecta        
 7 insecta Insecta        
 8 insecta Insecta        
 9 insecta Insecta        
10 insecta Insecta        
# ℹ 1,129 more rows

bees_corrected <- bees |>
  mutate(class_corrected = str_to_sentence(class)) |>
  select(-class) |>               # Remove erroneous column
  rename(class = class_corrected) # Rename new column to `class`
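Verifying a correction can also be done programmatically rather than by eye. The sketch below uses a small stand-in data frame (not the real bees data) to confirm that every corrected value has the expected shape and to count how many values changed.

```r
library(stringr)

# Stand-in for the bees data, with hypothetical values
bees_demo <- data.frame(class = c("insecta", "insecta", "Insecta"))
bees_demo$class_corrected <- str_to_sentence(bees_demo$class)

# Every corrected value should be one capital followed by lowercase letters
all_capitalised <- all(str_detect(bees_demo$class_corrected, "^[A-Z][a-z]+$"))

# How many values were actually altered by the correction
n_changed <- sum(bees_demo$class != bees_demo$class_corrected)

all_capitalised  # TRUE
n_changed        # 2
```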

6.4 Summary

In this chapter, we explored how to identify and clean strings and character pattern data. As you may have noticed, there are many ways in which strings could be formatted, which is why there are so many tools and functions for detecting and modifying them.

In the next chapter, we’ll look at how to clean date and time data.