Biodiversity Informatics

Developing tools and methodology for data-intensive biodiversity research
Tomer Gueta

The recent availability of massive volumes of species occurrence data from numerous sources, connected within a single portal, may facilitate answering fundamental ecological questions. Yet, these big biodiversity databases suffer from serious errors and biases, which may invalidate their use in research. Here, I explored three directions to mitigate this problem. First, I developed and evaluated tools to help end-users (ecologists) conduct their own advanced data-cleaning, tailored to the aims of their specific research. Towards that end, I developed bdclean, a novel R package that facilitates data-cleaning in a user-friendly workflow, specifically designed for inexperienced R users.
Second, I explored the value of case-specific user-level data-cleaning. I developed a new framework to quantify the effect of user-level data-cleaning on data quality using species distribution models (SDMs). My basic assumption here is that the change in SDM performance following data-cleaning reflects the change in data quality. Data on Australian mammals served to exemplify my approach. I constructed SDMs for various functional groups at six spatial scales. Data-cleaning resulted in a significant improvement in gain (an SDM performance index) of 5-25% for all functional groups and across all spatial scales.
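As a minimal sketch of the quantity reported above, the change in gain can be expressed as a percentage of the pre-cleaning gain. This assumes that the 5-25% figure is a relative change, which the text does not state explicitly; the function name and the gain values are hypothetical placeholders, not results from this study.

```python
def relative_gain_change(gain_raw: float, gain_clean: float) -> float:
    """Percent change in SDM gain after cleaning, relative to the uncleaned data.

    Assumption: improvement is measured relative to the pre-cleaning gain.
    """
    if gain_raw <= 0:
        raise ValueError("pre-cleaning gain must be positive")
    return 100.0 * (gain_clean - gain_raw) / gain_raw


# Hypothetical gains for one functional group at two spatial scales,
# stored as (gain before cleaning, gain after cleaning).
example_gains = {
    "scale_1": (1.00, 1.25),
    "scale_2": (2.00, 2.10),
}

for scale, (raw, clean) in example_gains.items():
    print(scale, round(relative_gain_change(raw, clean), 1))
```

Under the stated assumption, a positive value indicates that SDM performance, and by proxy data quality, improved after cleaning.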
Third, I proposed and evaluated a novel means of interpreting results, by binding data-cleaning to data analysis. I assumed that even the most advanced cleaning procedures are not perfect. I thus used the change in signal between the pre- and post-cleaning phases, in addition to the signal itself, to evaluate the research question. I explored this approach using a well-known community-ecology question. The case study concerns the debate over the role of environmental factors in determining species distribution (relative to the roles of stochasticity and dispersal).
I distinguished between three alternative hypotheses (niche, neutral and continuum), using SDM performance as a proxy for the strength of environmental factors over a gradient of species richness. I tested these hypotheses using data downloaded from the Global Biodiversity Information Facility (GBIF). I generated three corresponding datasets using virtual species, in order to validate my predictions and to test various aspects of the analysis. Analyses of the virtual species showed that the niche, continuum, and neutral communities resulted in clear positive, negative, and nonsignificant trends, respectively. In the GBIF data, negative correlations between species richness and the predictive power of environmental factors were more common than positive correlations. The signal was consistent across various thresholds, ensemble techniques, and spatial grids, and was supported by the virtual species results. Comparing the results before and after data-cleaning revealed a consistent trend, in which the signal became stronger and clearer after data-cleaning. The results, therefore, provide strong support for the continuum hypothesis.
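The decision rule described above (a positive trend for niche, a negative trend for continuum, and no significant trend for neutral communities) can be sketched as follows. This is an illustration only: the Pearson correlation, the permutation test, the 0.05 threshold, and all function names are assumptions made for the example, since the abstract does not specify the statistical procedure used.

```python
import random
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return sxy / (sxx * syy) ** 0.5


def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Two-sided permutation p-value for the correlation between x and y."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    observed = abs(pearson_r(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        # Small tolerance guards against floating-point ties.
        if abs(pearson_r(x, shuffled)) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)


def classify_trend(richness, performance, alpha=0.05):
    """Map the richness-performance trend to the hypothesis it would support."""
    r = pearson_r(richness, performance)
    p = permutation_p_value(richness, performance)
    if p >= alpha:
        return "neutral"  # no significant trend
    return "niche" if r > 0 else "continuum"


# Hypothetical grid cells: species richness and SDM performance per cell.
richness = list(range(1, 11))
performance = [1.0 - 0.05 * s for s in richness]  # declines with richness
print(classify_trend(richness, performance))
```

Running the same classification on the pre-cleaning and post-cleaning datasets, and comparing the correlation strength between the two, would correspond to the before-and-after comparison described in the text.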

My research reveals the merit of incorporating data-cleaning as part of the data analysis when working with biodiversity big data to answer macro-ecological questions, and builds tools towards best practice in user-level data-cleaning. The tools and methodology that I developed throughout this research can improve our ability to answer ecological questions, specifically in empirical analyses that build upon data available from large biodiversity databases.