# packages
library(galah)
library(dplyr)
library(tidyverse)
library(janitor)
library(here)
library(arrow)
galah_config(email = "your-email-here") # ALA-registered email
tree_kangaroo <- galah_call() |>
  filter(doi == "https://doi.org/10.26197/ala.ae089212-80fa-48d4-84cf-17ba3c4e8ba4") |>
  atlas_occurrences()
bees <- read_parquet(here("path", "to", "bees.parquet"))
6 Strings
Strings are sequences of characters, such as letters, spaces and punctuation, that make up abbreviations, words or sentences. They can be formatted in many different ways within a single dataset. For example, these are just some of the ways the scientific name of a species could be recorded:
"Dendrolagus lumholtzi"
"dendrolagus lumholtzi"
":Dendrolagus_lumholtzi34"
" Dendrolagus lumholtzi "
Terms might be capitalised (or not), have accidental spaces at the beginning or end of a word or sentence, contain typos, or include punctuation; all of these things can impact your ability to consolidate and analyse data accurately.
In this chapter, we focus on general data science techniques to clean strings in a dataset.
6.0.1 Prerequisites
In this chapter, we will use tree kangaroo (Dendrolagus) occurrence data from the ALA and a subset of bee (Apidae) data taken from the Curated Plant and Invertebrate Data for Bushfire Modelling data set, saved in the bees.parquet file.
Download the bees.parquet file from the Data in this book chapter.
6.1 Basic string manipulation
The stringr package provides a number of useful functions for working with strings.
library(stringr)
Trim
Trim whitespace on either side of a string.
str_trim(" Genus specificus ")
[1] "Genus specificus"
Or just one side.
str_trim(" Genus specificus ", side = "left")
[1] "Genus specificus "
Squish
Squish strings into sentence spacing: trim both ends and collapse repeated internal whitespace into single spaces.
str_squish("  Genus   specificus  ")
[1] "Genus specificus"
Truncate
Truncate a long string to a specified length.
str_trunc("Genus specificus", width = 10, side = "right")
[1] "Genus s..."
Split
Split a string into separate pieces based on a specified character.
str_split("Genus specificus", " ")
[[1]]
[1] "Genus" "specificus"
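Because str_split() returns a list, it can be awkward to use directly on a column of names. As a minimal sketch, str_split_fixed() returns a character matrix with a fixed number of pieces instead. The taxon_names vector below is made up for illustration.

```r
library(stringr)

# Hypothetical example names; the second has no species epithet
taxon_names <- c("Dendrolagus lumholtzi", "Dendrolagus")

# Split each name into exactly two pieces;
# a missing second piece is padded with an empty string ""
name_parts <- str_split_fixed(taxon_names, " ", n = 2)

genus   <- name_parts[, 1]
species <- name_parts[, 2]

genus
species
```

Note that the missing species is returned as "" rather than NA, so it may need converting before further analysis.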
Concatenate
Concatenate (i.e. join) separate strings into one string separated by a specified character.
str_c("Genus", "specificus", sep = "_")
[1] "Genus_specificus"
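These basic functions can be chained together. The sketch below is one possible way to standardise the four example name formats from the start of this chapter. It assumes a valid name contains only letters and spaces, which will not hold for every dataset; messy_names and clean_names are illustrative object names.

```r
library(stringr)

# The example formats from the start of the chapter
messy_names <- c(
  "Dendrolagus lumholtzi",
  "dendrolagus lumholtzi",
  ":Dendrolagus_lumholtzi34",
  " Dendrolagus lumholtzi "
)

clean_names <- messy_names |>
  str_replace_all("_", " ") |>      # underscores become spaces
  str_remove_all("[^A-Za-z ]") |>   # assumption: drop anything that is not a letter or space
  str_squish() |>                   # trim ends, collapse repeated spaces
  str_to_sentence()                 # capitalise the first letter only

unique(clean_names)
```

All four variants collapse to the single value "Dendrolagus lumholtzi".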
6.2 Matching
Matching strings using patterns can be a powerful way to identify or filter records during the data cleaning process.
6.2.1 Detect a pattern
Detect a pattern within a string.
# detect if a pattern exists
str_detect("Genus specificus", "Genus")
[1] TRUE
Use str_detect() to filter your data.frame. Here, we filter the species names to only those containing the pattern "lum".
# 3 possible names in scientificName column
tree_kangaroo |> distinct(scientificName)
# A tibble: 3 × 1
scientificName
<chr>
1 Dendrolagus lumholtzi
2 Dendrolagus
3 Dendrolagus bennettianus
# detect names matching "lum"
tree_kangaroo |>
  filter(str_detect(scientificName, "lum")) |>
  select(scientificName)
# A tibble: 813 × 1
scientificName
<chr>
1 Dendrolagus lumholtzi
2 Dendrolagus lumholtzi
3 Dendrolagus lumholtzi
4 Dendrolagus lumholtzi
5 Dendrolagus lumholtzi
6 Dendrolagus lumholtzi
7 Dendrolagus lumholtzi
8 Dendrolagus lumholtzi
9 Dendrolagus lumholtzi
10 Dendrolagus lumholtzi
# ℹ 803 more rows
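Because str_detect() returns a logical vector, it can also be used to count matches or to keep the records that do not match. The following is a small sketch on a made-up vector of names rather than the full dataset.

```r
library(stringr)

# Hypothetical vector holding the three distinct names seen above
taxon_names <- c("Dendrolagus lumholtzi", "Dendrolagus", "Dendrolagus bennettianus")

# TRUE/FALSE for each element
has_lum <- str_detect(taxon_names, "lum")

# Summing a logical vector counts the TRUE values
n_matches <- sum(has_lum)

# negate = TRUE returns TRUE where the pattern is NOT found
no_lum <- str_detect(taxon_names, "lum", negate = TRUE)
taxon_names[no_lum]
```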
6.2.2 Remove a pattern
Remove a pattern from a string.
# remove match for Genus (followed by a whitespace)
str_remove("Genus specificus", pattern = "Genus ")
[1] "specificus"
Use str_remove() to clean or extract names. Here, we remove the genus name from scientificName and save the result in a new species column.
tree_kangaroo |>
  mutate(
    species = ifelse(scientificName != "Dendrolagus",
                     str_remove(scientificName, "Dendrolagus "),
                     NA)
  ) |>
  select(scientificName, species)
# A tibble: 1,302 × 2
scientificName species
<chr> <chr>
1 Dendrolagus lumholtzi lumholtzi
2 Dendrolagus lumholtzi lumholtzi
3 Dendrolagus lumholtzi lumholtzi
4 Dendrolagus lumholtzi lumholtzi
5 Dendrolagus <NA>
6 Dendrolagus bennettianus bennettianus
7 Dendrolagus lumholtzi lumholtzi
8 Dendrolagus lumholtzi lumholtzi
9 Dendrolagus <NA>
10 Dendrolagus bennettianus bennettianus
# ℹ 1,292 more rows
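Keep in mind that str_remove() only removes the first match in each string. When a pattern can occur more than once, str_remove_all() removes every match. A short sketch with a made-up string:

```r
library(stringr)

# Hypothetical name containing two underscores
raw_name <- "Dendrolagus_lumholtzi_34"

# Removes the first underscore only
first_only <- str_remove(raw_name, "_")
first_only
# [1] "Dendrolaguslumholtzi_34"

# Removes every underscore
all_matches <- str_remove_all(raw_name, "_")
all_matches
# [1] "Dendrolaguslumholtzi34"
```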
6.2.3 Locate a pattern
Locate the position of a pattern within a string. We’ll create an example dataset below.
records <- c("Genus",
             "species",
             "ZZGenus species",
             "Difgenus difspecies")
Find the start and end position of a pattern.
str_locate(records, "Genus")
start end
[1,] 1 5
[2,] NA NA
[3,] 3 7
[4,] NA NA
Find which indices match a pattern. Here, the first and third strings in records contain the pattern "Genus".
str_which(records, "Genus")
[1] 1 3
Add pattern location information to a data.frame.
tree_kangaroo |>
  mutate(
    start = str_locate(scientificName, "lum")[, 1], # [, 1] returns column 1 of str_locate() output
    end = str_locate(scientificName, "lum")[, 2]    # [, 2] returns column 2 of str_locate() output
  ) |>
  select(scientificName, start, end)
# A tibble: 1,302 × 3
scientificName start end
<chr> <int> <int>
1 Dendrolagus lumholtzi 13 15
2 Dendrolagus lumholtzi 13 15
3 Dendrolagus lumholtzi 13 15
4 Dendrolagus lumholtzi 13 15
5 Dendrolagus NA NA
6 Dendrolagus bennettianus NA NA
7 Dendrolagus lumholtzi 13 15
8 Dendrolagus lumholtzi 13 15
9 Dendrolagus NA NA
10 Dendrolagus bennettianus NA NA
# ℹ 1,292 more rows
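One use for these positions is to pull out the matched text with str_sub(). The sketch below reuses the small records example; positions and matched_text are illustrative names. Strings without a match have NA positions, and str_sub() returns NA for them.

```r
library(stringr)

records <- c("Genus",
             "species",
             "ZZGenus species",
             "Difgenus difspecies")

# Integer matrix with columns "start" and "end"
positions <- str_locate(records, "Genus")

# Extract the characters between start and end for each string
matched_text <- str_sub(records, positions[, "start"], positions[, "end"])
matched_text
```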
6.2.4 Regex matching
The examples above demonstrate the use of basic patterns. But for cases that need more specific or advanced matching, we can use regular expressions (or “regex”). Regex is a powerful tool used to match patterns, replace characters, and extract text from strings. Regex can be complex and unintuitive, but there are websites available, such as Regex 101, that are extremely helpful (note that snippets from this website need additional editing to work correctly in R). ChatGPT is also great for building more complex regex snippets.
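One common edit needed when copying a regex into R is escaping backslashes, because a backslash is itself a special character inside an R string. As a minimal sketch, the regex \d+ (one or more digits) can be written either with a doubled backslash or, in R 4.0 and later, as a raw string. The raw_name value is one of the example names from the start of this chapter.

```r
library(stringr)

raw_name <- ":Dendrolagus_lumholtzi34"

# Regular R string: the backslash must be doubled
digits_escaped <- str_extract(raw_name, "\\d+")

# Raw string (R >= 4.0): the regex can be written as-is
digits_raw <- str_extract(raw_name, r"(\d+)")

digits_escaped
# [1] "34"
identical(digits_escaped, digits_raw)
# [1] TRUE
```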
Here we explore a few basic examples. Keep in mind that these methods can be applied to both column names and column values.
The str_view() function is a useful way to see what a regular expression will return. The results are shown in the console, and elements matched by the regex are surrounded with angle brackets < >.
# Match the first word in the string (the genus)
str_view(tree_kangaroo$scientificName, "^[A-Z][a-z]+")
This regex reads “At the start of the string, match one uppercase letter followed by one or more lowercase letters.”
[1] │ <Dendrolagus> lumholtzi
[2] │ <Dendrolagus> lumholtzi
[3] │ <Dendrolagus> lumholtzi
[4] │ <Dendrolagus> lumholtzi
[5] │ <Dendrolagus>
[6] │ <Dendrolagus> bennettianus
[7] │ <Dendrolagus> lumholtzi
[8] │ <Dendrolagus> lumholtzi
[9] │ <Dendrolagus>
[10] │ <Dendrolagus> bennettianus
[11] │ <Dendrolagus> lumholtzi
[12] │ <Dendrolagus>
[13] │ <Dendrolagus>
[14] │ <Dendrolagus> lumholtzi
[15] │ <Dendrolagus> bennettianus
[16] │ <Dendrolagus> lumholtzi
[17] │ <Dendrolagus>
[18] │ <Dendrolagus> lumholtzi
[19] │ <Dendrolagus> lumholtzi
[20] │ <Dendrolagus> lumholtzi
... and 1282 more
# Match only the second word (species name)
str_view(tree_kangaroo$scientificName, "(?<=\\s)[a-z]+")
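This regex reads “Match one or more lowercase letters that come directly after a whitespace character. The whitespace itself is not part of the match.” Strings with no match, such as those containing only a genus, are not shown.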
[1] │ Dendrolagus <lumholtzi>
[2] │ Dendrolagus <lumholtzi>
[3] │ Dendrolagus <lumholtzi>
[4] │ Dendrolagus <lumholtzi>
[6] │ Dendrolagus <bennettianus>
[7] │ Dendrolagus <lumholtzi>
[8] │ Dendrolagus <lumholtzi>
[10] │ Dendrolagus <bennettianus>
[11] │ Dendrolagus <lumholtzi>
[14] │ Dendrolagus <lumholtzi>
[15] │ Dendrolagus <bennettianus>
[16] │ Dendrolagus <lumholtzi>
[18] │ Dendrolagus <lumholtzi>
[19] │ Dendrolagus <lumholtzi>
[20] │ Dendrolagus <lumholtzi>
[21] │ Dendrolagus <lumholtzi>
[24] │ Dendrolagus <lumholtzi>
[25] │ Dendrolagus <bennettianus>
[27] │ Dendrolagus <lumholtzi>
[28] │ Dendrolagus <lumholtzi>
... and 899 more
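While str_view() only displays matches, str_extract() returns them as a vector that can be stored in a new column. The sketch below applies the same two regular expressions to a made-up vector of names.

```r
library(stringr)

# Hypothetical vector holding the three distinct names in the dataset
taxon_names <- c("Dendrolagus lumholtzi", "Dendrolagus", "Dendrolagus bennettianus")

# First word: one uppercase letter followed by lowercase letters
genus <- str_extract(taxon_names, "^[A-Z][a-z]+")

# Lowercase letters directly after a whitespace character;
# returns NA when there is no second word
species <- str_extract(taxon_names, "(?<=\\s)[a-z]+")

genus
species
```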
6.2.5 Replace
Another common way to clean strings is to match and replace specific patterns. Here are several examples using the stringr package and base R.
In stringr, the str_replace() function replaces the first match of a pattern in each string. The str_replace_all() function replaces all matches.
records <- c("Genus",
             "species",
             "ZZGenus species",
             "Difgenus difspecies")
str_replace(records, "[aeiou]", "-") # first match
[1] "G-nus" "sp-cies" "ZZG-nus species"
[4] "D-fgenus difspecies"
str_replace_all(records, "[aeiou]", "-") # all matches
[1] "G-n-s" "sp-c--s" "ZZG-n-s sp-c--s"
[4] "D-fg-n-s d-fsp-c--s"
Replace a matched pattern in a dataframe.
tree_kangaroo |>
  mutate(
    name_updated = str_replace(
      scientificName,
      "^[A-Z][a-z]+", "new_name"
    )
  ) |>
  select(scientificName, name_updated)
This regex reads “At the start of the string, match one uppercase letter followed by one or more lowercase letters.” We then replace this match with “new_name”.
# A tibble: 1,302 × 2
scientificName name_updated
<chr> <chr>
1 Dendrolagus lumholtzi new_name lumholtzi
2 Dendrolagus lumholtzi new_name lumholtzi
3 Dendrolagus lumholtzi new_name lumholtzi
4 Dendrolagus lumholtzi new_name lumholtzi
5 Dendrolagus new_name
6 Dendrolagus bennettianus new_name bennettianus
7 Dendrolagus lumholtzi new_name lumholtzi
8 Dendrolagus lumholtzi new_name lumholtzi
9 Dendrolagus new_name
10 Dendrolagus bennettianus new_name bennettianus
# ℹ 1,292 more rows
In base R, the gsub() function can be used for pattern replacement.
records <- c("Genus",
             "species",
             "ZZGenus species",
             "Difgenus difspecies")
gsub("[aeiou]", "-", records) # all matches
[1] "G-n-s" "sp-c--s" "ZZG-n-s sp-c--s"
[4] "D-fg-n-s d-fsp-c--s"
Replace a matched pattern in a dataframe.
tree_kangaroo$name_updated <- gsub(
  pattern = "Dendrolagus",
  replacement = "new_name",
  x = tree_kangaroo$scientificName
)
tree_kangaroo[, c("scientificName", "name_updated")]
# A tibble: 1,302 × 2
scientificName name_updated
<chr> <chr>
1 Dendrolagus lumholtzi new_name lumholtzi
2 Dendrolagus lumholtzi new_name lumholtzi
3 Dendrolagus lumholtzi new_name lumholtzi
4 Dendrolagus lumholtzi new_name lumholtzi
5 Dendrolagus new_name
6 Dendrolagus bennettianus new_name bennettianus
7 Dendrolagus lumholtzi new_name lumholtzi
8 Dendrolagus lumholtzi new_name lumholtzi
9 Dendrolagus new_name
10 Dendrolagus bennettianus new_name bennettianus
# ℹ 1,292 more rows
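Base R also has sub(), which replaces only the first match in each string, much like str_replace(). A short comparison using the same records example:

```r
records <- c("Genus",
             "species",
             "ZZGenus species",
             "Difgenus difspecies")

# sub() replaces the first vowel in each string
first_match <- sub("[aeiou]", "-", records)
first_match

# gsub() replaces every vowel in each string
all_matches <- gsub("[aeiou]", "-", records)
all_matches
```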
6.3 Capitalisation
Capitalisation (also called case style) can vary between data providers. Each data provider can have their own naming conventions, and even small differences in conventions must be standardised in order to use a dataset. There are some basic functions available to change the case of strings in stringr:
# example
tree_kangaroo$scientificName[1]
[1] "Dendrolagus lumholtzi"
str_to_lower(tree_kangaroo$scientificName[1])
[1] "dendrolagus lumholtzi"
str_to_upper(tree_kangaroo$scientificName[1])
[1] "DENDROLAGUS LUMHOLTZI"
str_to_title(tree_kangaroo$scientificName[1])
[1] "Dendrolagus Lumholtzi"
str_to_sentence(tree_kangaroo$scientificName[1])
[1] "Dendrolagus lumholtzi"
Normally, names of higher taxonomic ranks are capitalised (e.g. Myrtaceae, Aves). Capitalisation errors are usually easy to spot when you print the data object. Alternatively, you can use str_subset() to return values containing uppercase letters in columns you expect to be capitalised.
For example, in our bees dataset (downloaded at the start of this chapter) some higher taxonomy columns don’t capitalise names. The code below returns the unique values of the variable class that contain uppercase letters. Notice that no matches are found.
str_subset(unique(bees$class), "[:upper:]")
character(0)
We can verify that there are no uppercase matches by looking at the unique values containing lowercase letters. This reveals that Insecta is entirely in lowercase.
str_subset(unique(bees$class), "[:lower:]")
[1] "insecta"
We can correct the lowercase formatting as shown below. Remember to verify the correction before overwriting or removing the erroneous column(s).
bees |>
  mutate(class_corrected = str_to_sentence(class)) |>
  select(starts_with("class"))
# A tibble: 1,139 × 2
class class_corrected
<chr> <chr>
1 insecta Insecta
2 insecta Insecta
3 insecta Insecta
4 insecta Insecta
5 insecta Insecta
6 insecta Insecta
7 insecta Insecta
8 insecta Insecta
9 insecta Insecta
10 insecta Insecta
# ℹ 1,129 more rows
bees_corrected <- bees |>
  mutate(class_corrected = str_to_sentence(class)) |>
  select(-class) |> # Remove erroneous column
  rename(class = class_corrected) # Rename new column to `class`
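If several higher taxonomy columns share the same problem, the correction can be applied to all of them at once with dplyr's across(). The sketch below uses a small made-up data frame, bees_example, rather than the real bees data; the order and family columns and their values are assumptions for illustration.

```r
library(dplyr)
library(stringr)

# Hypothetical data with lowercase higher taxonomy columns
bees_example <- data.frame(
  class  = c("insecta", "insecta"),
  order  = c("hymenoptera", "hymenoptera"),
  family = c("apidae", "apidae")
)

# Apply str_to_sentence() to each listed column, overwriting in place
bees_fixed <- bees_example |>
  mutate(across(c(class, order, family), str_to_sentence))

bees_fixed
```

This overwrites the original columns directly, so it is best used after the correction has been verified on a single column.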
6.4 Summary
In this chapter, we explored how to identify and clean strings and character pattern data. As you may have noticed, there are many ways in which strings could be formatted, which is why there are so many tools and functions for detecting and modifying them.
In the next chapter, we’ll look at how to clean date and time data.