BIEN3.0 Range Methods Description

Cory Merow, Brian Enquist, Brad Boyle, Naia Morueta-Holme, Jens-Christian Svenning

This document provides a brief overview of the methods used to develop range models for the BIEN3.0 database so that users can judge their adequacy for their own applications.

Data used for range modeling - As part of the BIEN workflow, all occurrence records were filtered and standardized with the following protocols:

  1. Name standardization - Taxonomic spelling mistakes and known synonymy issues were resolved using the Taxonomic Name Resolution Service, with Tropicos (www.tropicos.org), The Plant List (www.theplantlist.org), and USDA Plants (http://plants.usda.gov) as the taxonomic authorities.
  2. Coordinate accuracy - Range models were built using only geographic coordinates verified as falling within the declared country and state/province (Canada, USA and Brazil only) of observation. We removed records where latitude/longitude was not available or could not be verified.
  3. Cultivated plants - Records that were known or suspected to represent observations of individual cultivated plants were removed.
  4. Non-native species - Records of species not known to be native to the collection locality were removed. The decision to exclude species was based on (a) known introduced status according to country or state checklists, (b) absence from country or state checklists, or (c) endemism elsewhere. As checklists were not available for many regions of the New World, this evaluation was incomplete.
  5. New World observations - Observations were filtered to include only presences from the New World.
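
The filters above can be read as a chain of per-record checks. The following is a minimal sketch of that reading, not the BIEN pipeline itself: the `Occurrence` record and all of its field names are hypothetical, and name standardization (step 1) is assumed to have been applied upstream.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Occurrence:
    """Hypothetical occurrence record; field names are illustrative only."""
    species: str              # name already standardized (step 1, done upstream)
    lat: Optional[float]
    lon: Optional[float]
    coords_verified: bool     # coordinates fall within the declared political unit (step 2)
    cultivated: bool          # known or suspected cultivated plant (step 3)
    native: bool              # not flagged as introduced at this locality (step 4)
    new_world: bool           # observation lies in the New World (step 5)


def passes_filters(rec: Occurrence) -> bool:
    """Return True only if the record survives filters 2-5."""
    has_coords = rec.lat is not None and rec.lon is not None
    return (has_coords and rec.coords_verified
            and not rec.cultivated
            and rec.native
            and rec.new_world)


def filter_occurrences(records: List[Occurrence]) -> List[Occurrence]:
    """Keep only the records that pass every filter."""
    return [r for r in records if passes_filters(r)]
```

A record is dropped as soon as any single check fails, which matches the removal wording of each step above.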

Environmental covariates - Range models were constructed for each species using environmental layers and spatial constraints. The environmental layers were obtained from WorldClim at 5 arc-minute resolution (Hijmans et al. 2005) and projected to a 10 km resolution. Predictors included mean annual temperature, mean diurnal temperature range, annual precipitation, precipitation seasonality, the ratio precipitation in warmest quarter / (precipitation in warmest quarter + precipitation in coldest quarter), and five spatial eigenvectors. The spatial eigenvectors corresponded to large-scale regional differences and primarily served to limit predictions far from known presence locations in geographic space (Diniz-Filho & Bini, 2005). Where a cell contained multiple occurrence records, only one record per cell was used for model building.
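
Two of the steps in this paragraph are simple enough to illustrate: the precipitation ratio predictor and the reduction to one record per cell. The sketch below assumes occurrence coordinates are already in a projected system measured in metres on the 10 km grid; the function names and the choice to keep the first record seen in each cell are assumptions, not details taken from the BIEN workflow.

```python
import math

CELL_SIZE_M = 10_000  # 10 km grid cells, as stated in the text


def warm_quarter_precip_fraction(p_warm, p_cold):
    """Precipitation in warmest quarter / (warmest + coldest quarter).

    Returns None when both quarters are dry, since the ratio is undefined.
    """
    total = p_warm + p_cold
    if total == 0:
        return None
    return p_warm / total


def cell_index(x_m, y_m, cell_size=CELL_SIZE_M):
    """Column/row index of the grid cell containing a projected point."""
    return (math.floor(x_m / cell_size), math.floor(y_m / cell_size))


def thin_to_one_per_cell(points, cell_size=CELL_SIZE_M):
    """Keep the first point encountered in each grid cell (assumed rule)."""
    kept = {}
    for x, y in points:
        kept.setdefault(cell_index(x, y, cell_size), (x, y))
    return list(kept.values())
```

For example, `thin_to_one_per_cell([(500, 500), (9000, 9000), (10500, 200)])` keeps two points, because the first two fall in the same 10 km cell.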

Range modeling decision tree - Different range estimation methods were used depending upon the sample size of (unique) presence locations.

  1. A species with a single record was assigned a range that included only the 100 km² cell where it was found.
  2. Ranges for species with 2-3 records were estimated as bounding boxes (area bounded by the minimum and maximum latitude and longitude of all occurrences).
  3. Ranges for species with 4-9 records were estimated as convex hulls (the smallest convex polygon that encompasses all occurrences of the species).
  4. For species with more than 9 records, we built species distribution models using the Maxent algorithm (Phillips et al. 2006). Model building generally followed the recommendations outlined in Merow et al. (2013), as well as those in Merow et al. (2014) for building relatively less complex models.
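
The decision tree above can be sketched as a function of the number of unique presence locations, together with the two geometric estimators. This is an illustrative sketch only: the function names and returned labels are hypothetical, and longitude/latitude are treated as planar coordinates, which is an assumption the source does not state. The hull uses Andrew's monotone chain algorithm.

```python
def choose_method(n_records):
    """Map the number of unique presence locations to a range method."""
    if n_records < 1:
        raise ValueError("at least one record is required")
    if n_records == 1:
        return "single_cell"
    if n_records <= 3:
        return "bounding_box"
    if n_records <= 9:
        return "convex_hull"
    return "maxent"


def bounding_box(points):
    """(min_lon, min_lat, max_lon, max_lat) for (lon, lat) points."""
    lons = [p[0] for p in points]
    lats = [p[1] for p in points]
    return (min(lons), min(lats), max(lons), max(lats))


def _cross(o, a, b):
    """Z-component of the cross product of vectors OA and OB."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Hull vertices in counter-clockwise order (planar assumption)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and _cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and _cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]
```

Interior points are discarded by the hull, so a species with five records arranged as a square plus its centre yields only the four corner vertices.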

Maxent model settings were chosen to balance overfitting (underestimating range sizes) against underfitting (excessively smooth models that overpredict range size), generally following recommendations in Merow et al. (2013, 2014). Only linear, quadratic, and product features were used, and regularization was set at the default value. Maxent's continuous predictions were converted to binary presence/absence predictions by choosing a threshold based on the 75th percentile of the cumulative output (based on analyses validated with 700 species for which expert maps were available; Morueta-Holme et al., in prep).

Automating range model building - All range models were run at the Texas Advanced Computing Center (TACC). Approximately 90,000 ranges were produced via TACC.

Caveats - Modeling ranges for ~90,000 species is not without potential flaws, and some caveats should be recognized. Notably, sample size remains small for the vast majority of species, with over 50% of the species represented by 5 or fewer occurrences. Consequently, many ranges are estimated using somewhat coarse methods (i.e. not from species distribution models). In addition, it is impossible to automatically detect all problematic, outlying, or non-natural occurrence records, and those that remain may influence range predictions.

Given our attempts to avoid overfitting, the species distribution models are more likely to underfit spatial distribution patterns and consequently may predict ranges larger than those realized for some species. That is, the models may predict suitable habitat in locations that are inaccessible to the species (but environmentally similar to where it occurs) or predict suitable habitat slightly beyond realized range edges because the fitted response curves are relatively smooth. To offset this, cells where Maxent predicted presence farther than 1000 km from any presence record were removed from the range. The modeling did not account for variation in sampling effort or detection probability.
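
The 1000 km rule described above amounts to a distance filter on predicted-presence cells. The sketch below is one way to express it, assuming cell centres and presence records are given as (latitude, longitude) in degrees and that distance is measured along a great circle with the haversine formula; the source does not say which distance measure was used, and the function names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0   # mean Earth radius (assumed spherical Earth)
MAX_DISTANCE_KM = 1000.0   # cutoff stated in the text


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def clip_by_distance(predicted_cells, presences, max_km=MAX_DISTANCE_KM):
    """Keep predicted cells within max_km of at least one presence record."""
    return [
        cell for cell in predicted_cells
        if any(haversine_km(cell[0], cell[1], p[0], p[1]) <= max_km
               for p in presences)
    ]
```

This brute-force version compares every cell with every presence record, which is adequate for illustration; a spatial index would be the natural choice at the scale of tens of thousands of species.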

As with any range map, our predictions represent hypotheses about spatial occurrence patterns. In spite of these caveats, predictions for the vast majority of species are reliable and are well-suited for macroecological analyses.

Updates - Our range modeling efforts are a dynamic enterprise and we are constantly exploring ways to improve predictions, leading to periodic updates in our database. Planned updates include addition of new occurrence data, addition of new information on native versus introduced range, choosing optimal model settings tuned specifically for each species, accounting for sampling variation, and improving occurrence data cleaning methods. We will employ version control to maintain accessibility of all past versions as updates are released.

References

  1. Diniz-Filho, J A F, & Bini, L M. 2005. Modelling geographical patterns in species richness using eigenvector based spatial filters. Global Ecology and Biogeography, 14(2), 177–185.
  2. Hijmans, R J, Cameron, S E, Parra, J L, Jones, P G, & Jarvis, A. 2005. Very high resolution interpolated climate surfaces for global land areas. International Journal of Climatology, 25, 1965–1978.
  3. Merow, C, Smith, M J, & Silander, J A. 2013. A practical guide to MaxEnt for modeling species' distributions: what it does, and why inputs and settings matter. Ecography, 36, 1058–1069.
  4. Merow, C, Smith, M J, Edwards, Jr, T C, Guisan, A, McMahon, S M, Normand, S, Thuiller, W, Wüest, R O, Zimmermann, N E, & Elith, J. 2014. What do we gain from simplicity versus complexity in species distribution models? Ecography, 37, 1267–1281.
  5. Phillips, S J, Anderson, R P, & Schapire, R E. 2006. Maximum entropy modeling of species geographic distributions. Ecological Modelling, 190, 231–259.