UltraGBIF is a high-performance R package designed to compile, validate, and standardize plant occurrence records from the Global Biodiversity Information Facility (GBIF) into analysis-ready datasets. The package's seven functions are organized into four stages (Figure 1).
UltraGBIF achieves outstanding performance through specific technical architectures:
C/C++ Backend Integration: Leverages data.table, stringi, and terra, all implemented in C/C++
Vectorization Over Explicit Loops: Bypasses R’s interpretive overhead by dispatching to pre-compiled routines
SIMD Exploitation: Vectorized operations enable compiler-level SIMD auto-vectorization
Memory-Efficient Design: In-place modification (set(), :=) eliminates intermediate copies
Chunk-Based Parallelization: Vectorized processing within chunks, parallel execution across chunks
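To illustrate the vectorization point above, here is a minimal base-R sketch (not UltraGBIF code; the function names are invented for this example): both functions compute the same sum of squares, but the vectorized form makes a single call into pre-compiled C routines instead of paying interpreter overhead on every element.

```r
# Illustrative only: the same computation as an interpreted loop vs. a
# single vectorized call that dispatches to compiled code.
slow_sum_sq <- function(v) {
  total <- 0
  for (xi in v) total <- total + xi^2   # one interpreted step per element
  total
}
fast_sum_sq <- function(v) sum(v^2)     # one dispatch to compiled routines

x <- runif(1e6)
stopifnot(isTRUE(all.equal(slow_sum_sq(x), fast_sum_sq(x))))
# Compare both with system.time(): the vectorized form is typically orders
# of magnitude faster on vectors of this size.
```

The same principle underlies data.table's set() and :=, which additionally modify columns in place rather than copying the whole table.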
Performance: On a standard laptop, UltraGBIF can compile one million occurrence records within 15 minutes.
Prerequisites:
R version: ≥ 4.1.0
Operating system: Windows (10+), macOS (11+), or Linux (Ubuntu 20.04+)
GBIF account: Required for data downloads (free registration at https://www.gbif.org)
System dependencies:
curl (for downloading data)
libssl (for secure connections)
UltraGBIF can be installed using the two commands below:
options(repos = c(getOption("repos"), "https://anonymous.4open.science/r/Repo-902F"))
install.packages("UltraGBIF")
Since UltraGBIF already meets the requirements of the Comprehensive R Archive Network (CRAN), you will in future be able to install it with the single command install.packages("UltraGBIF").
The following code demonstrates the complete UltraGBIF workflow from data import to dynamic mapping:
# Download Saxifraga occurrence records from GBIF using gbif download key or from gbif web
# data page: https://doi.org/10.15468/dl.bythb4
# Often, downloading files in R is very slow. You can open the data page link
# directly in an external browser and click the download button to download it
# manually. This is usually much faster, especially when downloading larger datasets.
if (!require(rgbif)) install.packages('rgbif'); library(rgbif)
download_key <- "0021523-250402121839773"
fpath <- getwd() # or your own directory, e.g. fpath <- "C:/"
occ_download_get(download_key, path = fpath, overwrite = TRUE)
# Step 1: Import GBIF occurrence records
library(UltraGBIF)
occ_import <- import_records(
GBIF_file = paste0(fpath,"/0021523-250402121839773.zip"),
only_PRESERVED_SPECIMEN = TRUE
)
# Step 2: Standardize taxonomic names using TNRS
taxa_checked <- check_occ_taxon(
occ_import = occ_import,
accuracy = 0.9
)
# Step 3: Standardize collector names
collectors_dictionary <- check_collectors(
occ_import = occ_import,
min_char = 2
)
# Step 4: Generate collection event keys
collection_key <- set_collection_mark(
occ_import = occ_import,
collectors_dictionary = collectors_dictionary
)
# Step 5: Select digital vouchers
voucher <- set_digital_voucher(
occ_import = occ_import,
taxa_checked = taxa_checked,
collection_key = collection_key
)
# Step 6: Refine records (coordinate validation + native status)
refined_records <- refine_records(
voucher = voucher,
threads = 4,
save_path = getwd()
)
# Optional: Visualize results
map_records(refined_records = refined_records, precision = 4, cex = 4)
After running the complete workflow, you will obtain:
| Object | Description |
|---|---|
| occ_import | Imported occurrence records with standardized fields |
| taxa_checked | Taxonomically validated names with confidence scores |
| collectors_dictionary | Cleaned collector name mappings |
| collection_key | Unique identifiers for each collection event |
| voucher | Filtered digital voucher records |
| refined_records | Final cleaned dataset with validated coordinates and native status |
Download the raw occurrence data by opening the GBIF occurrence search page at https://www.gbif.org/occurrence/search. Log in to GBIF with your account (if you do not have a GBIF account, you can register by following the GBIF instructions). Then return to filter records online (you should see a page like Fig2 below). For this example, select Basis of record = Preserved specimen (you can also choose others in addition), Occurrence status = present, and Scientific name = Saxifraga.
When done, click the DOWNLOAD button and then the DARWIN CORE ARCHIVE button; see the options marked in red in Fig1. Wait a few minutes for the GBIF server to process your query; the page will refresh automatically. For the Saxifraga data used in this tutorial, simply open https://doi.org/10.15468/dl.bythb4 and click the DOWNLOAD button marked in red in Fig3.
You now have the data in Darwin Core Archive (DwC-A) format.
Advanced: Using rgbif
This subsection demonstrates how to programmatically download biodiversity occurrence data from the Global Biodiversity Information Facility (GBIF) using the rgbif package in R. The workflow is specifically designed for retrieving Darwin Core Archive (DwC-A) formatted data for the plant genus Saxifraga, with occurrence.txt (interpreted data) and verbatim.txt (raw data) alongside metadata XML files included. You can customize your own search by modifying filter conditions using pred() within the occ_download() function.
Quick Start: Download Using an Existing Download Key
The example data can be retrieved using the download key: 0021523-250402121839773.
if (!require(rgbif)) install.packages('rgbif'); library(rgbif)
if (!require(dplyr)) install.packages('dplyr'); library(dplyr)
# Assign the download key
download_key <- "0021523-250402121839773"
# Check download status first
status <- occ_download_meta(download_key)
print(status$status) # Should show "SUCCEEDED"
# Set download path
fpath <- "C:/Users/YourName/Downloads/"
# Download if status is SUCCEEDED
if(status$status == "SUCCEEDED") {
occ_download_get(download_key, path = fpath, overwrite = TRUE) ## very slow
}
Full Workflow: Submit a New Download Request
The following code demonstrates the complete workflow for submitting a custom download request to GBIF.
if (!require(rgbif)) install.packages('rgbif'); library(rgbif);library(dplyr)
## Step 1: Set GBIF account credentials (recommended to write to .Renviron file)
user <- "your_username"
pwd <- "your_password"
email <- "your_email@example.com"
## Step 2: Get the genus key for Saxifraga
gkey <- name_backbone("Saxifraga")$usageKey # Returns the GBIF backbone taxonomy key
## Step 3: Submit download request (using Saxifraga genus data as an example)
download <- occ_download(
pred("taxonKey", gkey), # Taxon key for Saxifraga
pred("hasCoordinate", TRUE), # Only get records with coordinates
pred_and(pred_gte("year", 1990),
pred_lte("year", 2025)), # Filter data from year 1990 to 2025
user = user,
pwd = pwd,
email = email)
## Step 4: View download ID for subsequent tracking and download
download_key <- download$key
print(download_key)
## Step 5: Wait for the task to complete (periodically checks status)
occ_download_wait(download_key)
## Step 6: Download the ZIP file to local (default saves to current working directory)
occ_download_get(download_key,
path = "your_data_directory/",
overwrite = TRUE) ## very slow
After downloading the compressed DwC-A formatted data from GBIF, we can import and process the occurrence records using the UltraGBIF package.
library(UltraGBIF);library(data.table)
# Specify the path to your downloaded GBIF file
gbif_occurrence_file <- file.path(fpath, '0021523-250402121839773.zip')
# Alternatively, you can use an interactive file selection window by gbif_occurrence_file <- file.choose()
# Import example data: only_PRESERVED_SPECIMEN = TRUE keeps
# only physical specimen records (excludes human observations, camera traps, etc.)
occ_import <- import_records(
GBIF_file = gbif_occurrence_file,
only_PRESERVED_SPECIMEN = TRUE
)
Common components in occ_import:
occ: raw occurrence records with 55 fields (Ctrl-prefixed).
occ_gbif_issue: a binary matrix of GBIF issues and flags indicating common data quality problems, e.g. geospatial or taxonomic issues. See https://data-blog.gbif.org/post/issues-and-flags/ for details.
summary: the count of occurrences carrying each specific issue.
runtime: the elapsed time of import_records().
## [1] "occ" "occ_gbif_issue" "summary" "runtime"
# Data samples: view selected columns of the occurrence table
occ_import$occ[,.(Ctrl_gbifID, Ctrl_scientificName, Ctrl_basisOfRecord,
Ctrl_decimalLatitude, Ctrl_decimalLongitude)]%>%head()
# View the first five rows of the occ_gbif_issue table
occ_import$occ_gbif_issue[1:5]
## Time difference of 3.966486 secs
Summary statistics for raw occurrences:
library(data.table);library(dplyr)
# Initialize a list to store summary information
UltraGBIF_summary <- list()
UltraGBIF_summary$raw_records <- occ_import$occ[, .N]
UltraGBIF_summary$raw_taxa <- occ_import$occ[, uniqueN(Ctrl_scientificName)]
UltraGBIF_summary$families <- occ_import$occ[, uniqueN(Ctrl_family)]
UltraGBIF_summary$dataset <- occ_import$occ[, uniqueN(Ctrl_datasetName)]
UltraGBIF_summary$date_from <- occ_import$occ[, range(Ctrl_year, na.rm = TRUE)][1]
UltraGBIF_summary$date_to <- occ_import$occ[, range(Ctrl_year, na.rm = TRUE)][2]
UltraGBIF_summary%>%as.data.table()
Standardizes taxonomic names using the Taxonomic Name Resolution Service (TNRS) (Boyle et al. 2013) against the World Checklist of Vascular Plants (WCVP). TNRS corrects spelling errors, resolves alternative spellings, and converts outdated synonyms to their current accepted names, enabling consistent taxonomic identification across the dataset.
taxa_checked is a list with three components:
occ_wcvp_check_name: the resolved taxonomic name for each GBIF record. Unmatched names are NA.
summary: summary statistics of the taxonomic resolution for each unique searched name
runtime: the elapsed time of check_occ_taxon()
library(data.table);library(dplyr)
taxa_checked$summary[wcvp_searchedName=="Saxifraga adscendens subsp. discolor",]
Simplifies and extracts the primary collector’s name from the recordedBy field, then builds a standardized collector table. This module reduces inconsistencies that can fragment single collection events and improves the accuracy of subsequent duplicate detection.
library(UltraGBIF)
collectors_dictionary <- check_collectors(
occ_import = occ_import,
min_char = 2
)
# for more details use ?check_collectors
collectors_dictionary is a list with two components:
collectors_dictionary: the primary collector’s name and the corresponding standardized name
runtime: the elapsed time of check_collectors()
## Time difference of 10.74665 secs
Generates a standardized composite key for each unique collection event. This module identifies and consolidates duplicates into unique collection events, where a collection event represents a distinct sampling instance (a specific collector at a specific date or with a specific record number).
The collection event key is constructed by concatenating three fields: family, the standardized collector name generated by check_collectors(), and the collection number.
collection_key <- set_collection_mark(
occ_import = occ_import,
collectors_dictionary = collectors_dictionary
)
# for more details use ?set_collection_mark
collection_key is a list with four components:
collection_key: the cleaned family, collector name, and collection number for each record
full_keys: keys constructed from family, collector name, and collection number
incomplete_keys: keys lacking one or more of the required components (family, collector name, or collection number), which may indicate incomplete metadata in the original records.
runtime: the time consumed by set_collection_mark().
## Time difference of 0.7315838 secs
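To make the composite-key idea concrete, here is a toy base-R sketch (column names and normalization rules are invented for illustration; the actual logic in set_collection_mark() is more elaborate): records sharing a complete key are candidate duplicates, while records missing a component fall into the incomplete set.

```r
# Toy illustration of a composite collection-event key: family, primary
# collector surname, and collection number, upper-cased and concatenated.
records <- data.frame(
  family    = c("Saxifragaceae", "Saxifragaceae", "Saxifragaceae"),
  collector = c("Smith, J.",      "SMITH J",       "Jones, A."),
  number    = c("1042",           "1042",          NA)
)
# Keep only the surname (text before the first space or comma), normalized
primary_surname <- toupper(sub("[ ,].*$", "", records$collector))
key <- ifelse(
  is.na(records$number), NA,  # records lacking a component get no full key
  paste(toupper(records$family), primary_surname, records$number, sep = "_")
)
key
#> [1] "SAXIFRAGACEAE_SMITH_1042" "SAXIFRAGACEAE_SMITH_1042" NA
```

Note that the first two rows collapse to the same key despite different collector spellings, which is how duplicate detection survives inconsistent recordedBy strings.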
Summary statistics for collection_key:
library(data.table)
UltraGBIF_summary$full_keys <- collection_key$full_keys[,.N]
UltraGBIF_summary$incomplete_keys <- collection_key$incomplete_keys[,.N]
UltraGBIF_summary%>%as.matrix()
##                    [,1]
## raw_records 228494
## raw_taxa 1848
## families 1
## dataset 109
## date_from 1550
## date_to 2024
## raw_collectors 37691
## checked_collectors 14113
## full_keys 90186
## incomplete_keys 11736
Resolves duplicate records and selects the master digital voucher to represent each collection event. For records possessing complete collection event keys (full keys) from collection_key, duplicates are identified and grouped; within each group, a master digital voucher is selected based on a composite quality score. Resolved taxonomic names from taxa_checked are included.
library(UltraGBIF)
voucher <- set_digital_voucher(
occ_import = occ_import,
taxa_checked = taxa_checked,
collection_key = collection_key
)
gc() # Reports memory usage and triggers a garbage collection
# for more details use ?set_digital_voucher
voucher is a list with three components:
occ_digital_voucher: a table with 80 fields, formed by adding key issues and flags to the raw occurrences.
occ_results: a table containing only the quality-assessment fields for the raw occurrences.
runtime: the time consumed by set_digital_voucher().
Summary statistics for voucher:
## Time difference of 4.499273 secs
For usable vouchers, missing information in key fields is restored by consolidating data from duplicate records belonging to identical collection events. This creates one clean record per unique event. These validated records are automatically subjected to geospatial validation using CoordinateCleaner (Zizka et al. 2019) and terra. Each validated record is matched to its corresponding WGSRPD Level-3 Botanical Country classification with a distribution status (native, introduced, extinct, location_doubtful), minimizing misclassification risks in cross-regional analyses. When save_path is specified, native_records and other_records are exported as compressed CSV files (native_refined_records.csv.gz and usable_refined_records.csv.gz), which can be used for downstream analysis.
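The consolidation step can be sketched in miniature with base R (field and key names are invented here; refine_records() does this across many fields at once): within each collection event, a missing value is restored from a duplicate record of the same event.

```r
# Toy illustration: restore missing locality values from duplicates that
# share the same collection-event key.
dups <- data.frame(
  event_key = c("K1", "K1", "K2"),
  locality  = c(NA, "Alps, 2100 m", "Pyrenees")
)
fill_from_group <- function(v) {
  known <- v[!is.na(v)]
  if (length(known)) v[is.na(v)] <- known[1]
  v
}
# ave() applies the fill within each event_key group
dups$locality <- ave(dups$locality, dups$event_key, FUN = fill_from_group)
dups$locality
#> [1] "Alps, 2100 m" "Alps, 2100 m" "Pyrenees"
```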
library(UltraGBIF)
# Parallel Processing using foreach/doParallel to accelerate processing of large datasets.
refined_records <- refine_records(
voucher = voucher,
threads = 4,
save_path = dirname(gbif_occurrence_file),
tests = c("capitals", "centroids", "equal", "gbif",
"institutions", "outliers", "seas", "zeros"))
# for more details use ?refine_records
refined_records is a list with three components:
native_records: occurrences classified as native according to WCVP
other_records: occurrences classified as introduced, extinct, location_doubtful, or unknown
runtime: the time consumed by refine_records()
Summary statistics for refined_records:
## Time difference of 37.24072 secs
The summary list (UltraGBIF_summary) presents key quantitative metrics tracking data reduction and quality filtering across the UltraGBIF processing pipeline. Each statistic captures a distinct stage, from raw GBIF import to the final refined dataset. These statistics collectively illustrate how raw GBIF data are progressively filtered to produce a clean, reproducible, and analysis-ready dataset.
raw_records – Total raw occurrence records downloaded from GBIF prior to any filtering or processing.
families – Number of families represented in the raw records.
dataset – Number of source datasets of the raw records.
date_from – The earliest collection year in the raw records.
date_to – The latest collection year in the raw records.
raw_taxa – Number of unique scientific names present in the raw data before taxonomic resolution (e.g., synonym handling).
raw_collectors – Number of unique collector name strings before standardization (accounts for spelling variants, initials, etc.).
checked_collectors – Number of unique collector names after applying standardization rules (e.g., name normalization, abbreviation expansion).
full_keys – Number of collection event keys considered complete (i.e., containing all necessary components for grouping, such as collector, number, and date).
incomplete_keys – Number of collection event keys missing one or more critical components, making them non-groupable.
usable_records_before_refine – Number of records classified as usable after initial voucher selection (e.g., presence of a physical or image voucher).
usable_records_after_refine – Number of records retained after further refinement steps (e.g., removing duplicates, suspect georeferences, or invalid dates).
final_taxa – Number of unique accepted taxon names in the final refined dataset after taxonomic reconciliation and filtering.
## [,1]
## raw_records 228494
## raw_taxa 1848
## families 1
## dataset 109
## date_from 1550
## date_to 2024
## raw_collectors 37691
## checked_collectors 14113
## full_keys 90186
## incomplete_keys 11736
## usable_records_before_refine 66921
## usable_records_after_refine 59673
## final_taxa 496
Spatial visualization using map_records()
We render verified records onto customizable, dynamic interactive maps using mapview (Appelhans et al. 2023). This module employs geohashTools (Chirico 2023) to perform spatial clustering of geo-coordinates.
After executing map_records with the code in the block below, a browser will open to load the rendered interactive map (it might not automatically pop up in the foreground). In this case, please manually switch to the browser window.
map_records(
refined_records = refined_records, # Output of the UltraGBIF refine_records()
precision = 2, # Rendering precision (~1250 km resolution)
cex = 4 # Point size (scaling factor)
)
# for more details use ?map_records
The precision parameter determines the grid resolution:
Precision=2: approximately 1250 km resolution
Precision=3: approximately 156 km resolution
Precision=4: approximately 20 km resolution
The cex parameter adjusts symbol size.
Dynamic maps support multiple basemap layers (OpenStreetMap, Esri World Imagery, Stadia Stamen Watercolor), a native-status legend, and clickable popups with record metadata.
After [Refine Records], we have obtained compiled native records from refined_records; now we can leverage them for more downstream applications. letsR (Vilela and Villalobos 2015) is an R package designed for macroecological analyses based on species geographic distributions. The package transforms species range maps into presence-absence matrices (PAMs), where rows represent grid cells and columns represent species. It provides a comprehensive workflow for spatial biodiversity analyses, including tools to:
construct PAMs from shapefiles or occurrence points;
integrate environmental variables into species distribution data;
calculate species-level macroecological metrics such as range size, range overlap, and distributional midpoints;
map species richness across geographic, environmental, and trait spaces;
perform grid-level analyses of biodiversity patterns.
The package implements an S3 class structure (PresenceAbsence) that facilitates reproducible spatial analyses and supports integration with IUCN Red List and BirdLife data. All spatial operations are optimized for large-scale datasets, making letsR particularly suitable for comparative biogeography and conservation prioritization studies.
A Presence-Absence Matrix (PAM) is the fundamental data structure in macroecological research that represents species distributions in a discrete spatial framework.
In essence, a PAM is a two-dimensional matrix where:
Rows represent discrete spatial units (grid cells) covering a geographic domain
Columns represent species
Values are binary:
1 indicates presence,
0 indicates absence
The first two columns typically store the longitude (x) and latitude (y) coordinates of each grid cell’s centroid, allowing the matrix to be spatially referenced.
Why PAMs matter:
Standardization — Converting continuous range maps (shapefiles) or scattered occurrence points into a standardized grid allows meaningful comparisons across species with vastly different distributions
Computational efficiency — Binary matrices enable rapid calculations of biodiversity metrics (richness, turnover, endemism) across thousands of species and millions of cells
Multi-scale integration — PAMs serve as the bridge between species-level data (traits, phylogeny, conservation status) and environmental data (climate, land use, topography)
Reproducibility — Explicitly defined grid systems eliminate ambiguities in spatial resolution, projection, and extent
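The structure described above can be mimicked with a few lines of base R (species names and coordinates are made up here; letsR builds the real object): the first two columns hold cell centroids, the remaining binary columns hold species, so richness is just a row sum.

```r
# A miniature presence-absence matrix: grid-cell centroids in the first two
# columns, one binary column per species.
pam <- cbind(
  x    = c(-1.5, -0.5, 0.5),
  y    = c(45.5, 45.5, 45.5),
  sp_A = c(1, 1, 0),
  sp_B = c(0, 1, 1)
)
# Species richness per cell is the row sum over the species columns
richness <- rowSums(pam[, -(1:2)])
richness
#> [1] 1 2 1
```

This binary layout is why richness, turnover, and endemism calculations stay cheap even across thousands of species.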
In letsR’s implementation:
The package wraps PAMs in a PresenceAbsence S3 object that bundles:
The presence-absence matrix itself
A raster layer of species richness
Species names and metadata
Grid system parameters (resolution, projection, extent)
This design allows you to apply generic R functions (plot, summary, print) directly to the object while maintaining spatial integrity throughout analytical workflows.
How can I use native records from UltraGBIF to get the Presence-Absence Matrix? First, prepare input data.
library(data.table)
library(stringi)
library(dplyr)
## prepare
PAM_pre <- refined_records$native_records[,.(UltraGBIF_wcvp_taxon_name,
genus=stri_extract_first_regex(
UltraGBIF_wcvp_taxon_name, "^[^ ]+"),
UltraGBIF_wcvp_family,
UltraGBIF_decimalLongitude,
UltraGBIF_decimalLatitude)]
head(PAM_pre, 5)
# If you are interested in a specific family or genus, you can manually filter PAM_pre here.
# For example: PAM_pre <- PAM_pre[genus=='my_interested_genus',]
OK, now construct the Presence-Absence Matrix:
library(letsR)
PAM <- lets.presab.points(xy = PAM_pre[,.(UltraGBIF_decimalLongitude,
UltraGBIF_decimalLatitude)]%>%as.matrix(), ## remember as.matrix()
species = PAM_pre$UltraGBIF_wcvp_taxon_name,
resol = 1, ## 1 means one-degree grids
count = FALSE) ## if TRUE, show a progress bar
## See more details by ?lets.presab.points
I want to know basic information about my Presence-Absence object:
##
## Class: PresenceAbsence
## _ _
## Number of species: 462
## Number of cells: 3058
## Cells with presence: 3058
## Cells without presence: 0
## Species without presence: 0
## Species with the largest range: Saxifraga cernua
## _ _
## Grid parameters
## Resolution: 1, 1 (x, y)
## Extention: -179.78, 180.22, -54.8333, 83.1667 (xmin, xmax, ymin, ymax)
## Coord. Ref.: +proj=longlat +datum=WGS84 +no_defs
## [1] "Presence_and_Absence_Matrix" "Richness_Raster"
## [3] "Species_name"
## [1] "Micranthes integrifolia" "Micranthes manchuriensis"
## [3] "Micranthes merkii" "Micranthes occidentalis"
## [5] "Micranthes oregana" "Micranthes tenuis"
Which species has the widest distribution range (occupying the most grids)?
library(letsR)
lets.rangesize(x = PAM, units = "cell")%>%
as.data.frame() %>%
arrange(desc(Range_size)) %>% head(5)
What are the characteristics of the size of a species’ distribution range?
library(dplyr);library(letsR)
## What is the mean size of species distribution range?
lets.rangesize(x = PAM, units = "cell")%>%mean()%>%ceiling()## [1] 24
## What percentage of species have a distribution range exceeding the mean?
which(lets.rangesize(x = PAM, units = "cell")>
lets.rangesize(x = PAM, units = "cell")%>%
mean()%>%
ceiling())%>%
length()/nrow(lets.rangesize(x = PAM, units = "cell"))## [1] 0.1363636
## plot histogram
lets.rangesize(x = PAM, units = "cell")%>%
log()%>%
hist(main='Characteristics of the size of a species distribution range',
xlab='log(counts of grids)')
Figure 5 Characteristics of the size of a species distribution range
The results show that most species have small distribution ranges; species whose range exceeds the mean account for less than 14%.
It is easy to generate a one-degree species richness map with letsR; the plot() method of a PresenceAbsence object renders species richness by default:
Figure 6 Saxifraga richness map
I want to view an individual taxon occurrence map:
library(letsR)
selected_taxon <- 'Saxifraga cernua'
plot(PAM,
name=selected_taxon,
main=paste(selected_taxon,'occurrence'))
Figure 7 Saxifraga cernua occurrence map
Environmental variables can be integrated into the presence-absence matrix using the lets.addvar function from the letsR package.
Here we add global annual mean temperature data (°C) at 10-arcminute resolution (~340 km² grid cells) from the letsR built-in dataset. The function automatically handles spatial alignment by cropping the temperature raster to match the PAM extent, aggregating high-resolution pixels to the PAM grid resolution using mean values, and extracting the temperature at each grid cell centroid. The resulting matrix contains species occurrence data alongside mean temperature values for each spatial unit, enabling downstream analyses of species–environment relationships.
library(letsR)
temp <- terra::unwrap(letsR::temp)## wc2.1_10m_bio_1
PAM_env <- lets.addvar(PAM, temp, fun = mean)## add temp var
PAM_env[1:5, c(1, 2, ncol(PAM_env))] ## view the coordinate and temperature columns
##      Longitude(x) Latitude(y) wc2.1_10m_bio_1_mean
## [1,] -76.28 82.6667 -19.95376
## [2,] -74.28 82.6667 -20.00872
## [3,] -68.28 82.6667 -18.36175
## [4,] -67.28 82.6667 -18.96207
## [5,] -66.28 82.6667 -17.52159
Now summarize mean temperature per species:
# Summarize mean temperature per species
temp_mean <- lets.summarizer(PAM_env, ncol(PAM_env))
## Which five species have the highest annual average temperature in their
## distribution range and living environment?
temp_mean%>%arrange(desc(wc2.1_10m_bio_1_mean))%>%head(5)
## Which five species have the lowest annual average temperature in their
## distribution range and living environment?
temp_mean%>%arrange(wc2.1_10m_bio_1_mean)%>%head(5)
Appelhans, Tim, Florian Detsch, Christoph Reudenbach, and Stefan Woellauer. 2023. “Mapview: Interactive Viewing of Spatial Data in R.” https://CRAN.R-project.org/package=mapview.
Boyle, Brad, Nicole Hopkins, Zhenyuan Lu, Juan Antonio Raygoza Garay, Dmitry Mozzherin, Tony Rees, Naim Matasci, et al. 2013. “The Taxonomic Name Resolution Service: An Online Tool for Automated Standardization of Plant Names.” BMC Bioinformatics 14 (1): 16. https://doi.org/10.1186/1471-2105-14-16.
Chirico, Michael. 2023. “geohashTools: Tools for Working with Geohashes.” https://CRAN.R-project.org/package=geohashTools.
Vilela, Bruno, and Fabricio Villalobos. 2015. “letsR: A New R Package for Data Handling and Analysis in Macroecology.” Methods in Ecology and Evolution 6 (10): 1229–34. https://doi.org/10.1111/2041-210X.12401.
Zizka, Alexander, Daniele Silvestro, Tobias Andermann, Josué Azevedo, Camila Duarte Ritter, Daniel Edler, Harith Farooq, et al. 2019. “CoordinateCleaner: Standardized Cleaning of Occurrence Records from Biological Collection Databases.” Edited by Tiago Quental. Methods in Ecology and Evolution 10 (5): 744–51. https://doi.org/10.1111/2041-210X.13152.