Preface
People working with ecological or biological data want to know how to clean data effectively, but data cleaning can be a daunting task because of the complexity inherent in these types of data. At the Atlas of Living Australia, questions about data cleaning come up so frequently that it made us wonder…is there a general ecological data cleaning workflow that is considered best-practice?
To find out, we undertook an informal literature review on ecological data cleaning processes using peer-reviewed and grey literature. The key themes we searched for were:
- Data cleaning for species distribution models
- Data cleaning open biodiversity data
- Australian and global naming authorities
- R packages for biodiversity data cleaning
We created a collection of papers that matched these themes, choosing recently published papers that used up-to-date workflows with recent packages and tools in R. We also added several frequently referenced papers for comprehensiveness, and we reviewed data cleaning workflows from our project partners (Marsh et al. 2021; Godfree et al. 2021) to understand their processes, issues, and needs.
From our informal search, we ended up with a list of 18 papers and resources (listed here) with relevant ecological data cleaning workflows. To determine any common, best-practice steps—and common sequences for ordering these steps—we read through their methods sections and collated their data cleaning protocols into a spreadsheet. Our hope was that collating these methods would reveal several clear data cleaning workflows that scientists and researchers use regularly to clean their data.
But, to our surprise, that’s not what we found. Instead, the steps to cleaning ecological data were even messier than we first thought. The diagrams below visualise the findings of our spreadsheet, showing the complexity of processes used to clean data, many of which are iterative.
In the end, our search showed us that there is no single, unified, step-by-step workflow to clean all types of ecological data. Instead, data are cleaned in a huge variety of ways, and the process can look completely different depending on the type of investigation. No wonder people frequently ask how to do it!
This book is our response to questions about how to clean ecological data, and to our discovery that (because workflows vary immensely) there don’t seem to be many resources that consolidate methods to clean ecological data in R. We are by no means experts in cleaning all types of data for all types of ecological analyses. We do, however, work with ecological data on a daily basis and encounter many data cleaning issues in our own work. We aspired for this book to be an up-to-date resource for a diverse range of data cleaning tasks in R. We hope that (at the very least) it is a resource that documents many common ecological data cleaning steps in one place!
Dax Kellie
27 May, 2024