Integrating data-cleaning with data analysis to enhance usability of big biodiversity data
The recent availability of species occurrence data from numerous sources, standardized and connected within a single portal, has the potential to answer fundamental ecological questions. However, these aggregated big biodiversity databases are prone to numerous data errors and biases, and the data user is responsible for identifying these errors and assessing whether the data are suitable for a given purpose. We propose to address this problem by binding data cleaning to data analysis: the change in signal between the pre- and post-cleaning phases will be used to evaluate the research question. An answer to the research question is considered robust only if it becomes consistently stronger and clearer following data cleaning.
I am exploring this approach using a well-known community-ecology question. The case study concerns the debate over the role of environmental factors in determining species distribution, relative to the roles of stochasticity and dispersal. One hypothesis, supported by ecological theory, suggests that in species-poor communities environmental factors have a strong impact on species distribution. This impact is expected to decline with increasing richness, as drift and biological interactions play an increasingly strong role. In contrast, a strict interpretation of niche theory predicts the opposite trend, i.e., a positive relationship between the relative role of environmental factors and species richness, since niches presumably become narrower when more species compete within a given environmental space. I am using more than one million records of Australian mammals to construct species distribution models for regions with different species richness values. The explanatory power of each distribution model, which indicates the effect of environmental factors, will be evaluated against the region-specific species richness for the respective guild. The trend in the results before and after cleaning will serve as a strong indication in favor of one hypothesis or the other, even in the presence of major data errors and biases.
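The comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the analysis used in the study: the function names `trend` and `signal_change` are hypothetical, the relationship between explanatory power and richness is summarized here by a simple least-squares slope and Pearson correlation (the abstract does not specify a statistic), and the numbers in the demonstration are made up.

```python
from math import sqrt


def trend(richness, power):
    """Least-squares slope and Pearson r of explanatory power against richness."""
    n = len(richness)
    mean_x = sum(richness) / n
    mean_y = sum(power) / n
    sxx = sum((x - mean_x) ** 2 for x in richness)
    syy = sum((y - mean_y) ** 2 for y in power)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(richness, power))
    return sxy / sxx, sxy / sqrt(sxx * syy)


def signal_change(richness, power_raw, power_clean):
    """Compare the richness trend before and after cleaning.

    The signal counts as strengthened only if the slope keeps its sign
    and the absolute correlation increases after cleaning (an assumed,
    simplified reading of "stronger and clearer").
    """
    slope_raw, r_raw = trend(richness, power_raw)
    slope_clean, r_clean = trend(richness, power_clean)
    same_direction = slope_raw * slope_clean > 0
    return {
        "slope_raw": slope_raw,
        "slope_clean": slope_clean,
        "r_raw": r_raw,
        "r_clean": r_clean,
        "strengthened": same_direction and abs(r_clean) > abs(r_raw),
    }


if __name__ == "__main__":
    # Made-up values for five hypothetical regions; NOT results from the study.
    richness = [5, 10, 15, 20, 25]
    power_raw = [0.50, 0.48, 0.40, 0.42, 0.30]    # model fit on uncleaned records
    power_clean = [0.60, 0.52, 0.45, 0.38, 0.30]  # model fit on cleaned records
    print(signal_change(richness, power_raw, power_clean))
```

In this sketch, a negative slope that strengthens after cleaning would favor the hypothesis that the role of environmental factors declines with richness, whereas a strengthening positive slope would favor the strict niche-theory interpretation.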
Adopting a more comprehensive approach that incorporates data cleaning as part of data analysis will not only improve the quality of biodiversity data but also promote more appropriate use of such data. This can greatly serve the scientific community and, consequently, our ability to address urgent conservation issues more accurately.