This page will introduce the use of Neotoma APIs and describe some
situations when they might be preferable to the use of the
neotoma2
R package.
An API (application programming interface) is a set of rules that allows different computers or computer components to communicate. One set of APIs enable a user’s computer programs to access resources managed by the same computer’s operating system. For example, a program might request memory through an API known as a system call.
The APIs we’re concerned with here are Web APIs. That is, they’re APIs that use Web protocols like HTTP to enable communication between computers through the internet. Web APIs are fundamental to web-based information sharing.
There are a range of Neotoma APIs that can be accessed through this
page. When you call a
Web API, it follows the HTTP protocol, which means you issue a request
like GET, POST, or PUT, etc., and you receive a response. Let’s see how
we can make these API calls in R. We’ll need to download the library
httr
, which has those HTTP calls like GET()
as
well as a content()
function that helps decode the response
we receive.
We’ll make the following API call: https://api.neotomadb.org/v2.0/data/occurrences?taxonid=35619&limit=10. This call will return the first 10 occurrences within the Neotoma database of samples with taxon id 35619, which corresponds to Canis spp. (How do I know this taxon id is the right one for Canis? See below!)
We want to wrap our call in the GET()
function, and then
decode the contents.
firstAPI = GET("https://api.neotomadb.org/v2.0/data/occurrences?taxonid=35619&limit=10")
print(firstAPI)
## Response [https://api.neotomadb.org/v2.0/data/occurrences?taxonid=35619&limit=10]
## Date: 2025-03-20 12:23
## Status: 200
## Content-Type: application/json; charset=utf-8
## Size: 3.99 kB
Status: 200 is a good sign; that means the call issued successfully.
Now we use content()
to get the data. The output of
content()
, called insides, has three components: status,
data, and message. Most of the time, status will be “success”, and
message will be “retrieved all tables,” so we mostly care about the
data. But in case you’re ever running into issues and need to debug, it
can be helpful to consider what the status and message are.
## [[1]]
## [[1]]$occid
## [1] 5314651
##
## [[1]]$sample
## [[1]]$sample$taxonid
## [1] 35619
##
## [[1]]$sample$taxonname
## [1] "Canis spp."
##
## [[1]]$sample$value
## [1] 67
##
## [[1]]$sample$sampleunits
## [1] "NISP"
##
##
## [[1]]$age
## [[1]]$age$age
## NULL
##
## [[1]]$age$ageolder
## [1] 1180
##
## [[1]]$age$ageyounger
## [1] 560
##
##
## [[1]]$site
## [[1]]$site$datasetid
## [1] 37057
##
## [[1]]$site$siteid
## [1] 21575
##
## [[1]]$site$sitename
## [1] "Clachan [NaPi-2]"
##
## [[1]]$site$altitude
## [1] 3
##
## [[1]]$site$location
## [1] "{\"type\":\"Point\",\"crs\":{\"type\":\"name\",\"properties\":{\"name\":\"EPSG:4326\"}},\"coordinates\":[-114.83333,68.16667]}"
##
## [[1]]$site$datasettype
## [1] "vertebrate fauna"
##
## [[1]]$site$database
## [1] "FAUNMAP"
We just successfully ran our first API! But the format we received is a little hard to work with…
Web APIs return their responses in JSON (JavaScript Object Notation) format. JSON represents data as arrays of objects in which keys that define a property are assigned values. The value might be a number or string, or it could itself be an object or array of objects. Here’s a snippet of what JSON format looks like:
[{“occid”:[5314651],“sample”:{“taxonid”:[35619],“taxonname”:[“Canis spp.”],“value”:[67],“sampleunits”:[“NISP”]},“age”:{“age”:{},“ageolder”:[1180],“ageyounger”:[560]},“site”:{“datasetid”:[37057],“siteid”:[21575],“sitename”:[“Clachan [NaPi-2]”],“altitude”:[3],“location”:[“{"type":"Point","crs":{"type":"name","properties":{"name":"EPSG:4326"}},"coordinates":[-114.83333,68.16667]}”],“datasettype”:[“vertebrate fauna”],“database”:[“FAUNMAP”]}}
In R, it is natural to represent these JSON arrays as nested lists:
list(occid = 5314651, sample = list(taxonid = 35619, taxonname = “Canis spp.”, value = 67, sampleunits = “NISP”), age = list(age = NULL, ageolder = 1180, ageyounger = 560), site = list(datasetid = 37057, siteid = 21575, sitename = “Clachan [NaPi-2]”, altitude = 3, location = “{"type":"Point","crs":{"type":"name","properties":{"name":"EPSG:4326"}},"coordinates":[-114.83333,68.16667]}”, datasettype = “vertebrate fauna”, database = “FAUNMAP”))
However, it is often easier to visualize an API response as a table rather than a list, which requires some looping. First, we make a matrix with appropriate dimensions. Every row will represent a single observation, and every column will represent a single variable. Then we loop through our nested list and assign each value in the list to an appropriate place in the matrix. Finally, we convert the matrix to a dataframe and name the columns appropriately.
canis_mat = matrix(nrow=length(insides),ncol=12)
for (i in seq(length(insides))) {
if(!is.null(insides[[i]]$occid)) {
canis_mat[[i,1]] = insides[[i]]$occid
}
for (j in seq(4)) {
if(!is.null(insides[[i]]$sample[[j]])) {
canis_mat[[i,(1+j)]] = insides[[i]]$sample[[j]]
}
}
for (k in seq(7)) {
if(!is.null(insides[[i]]$site[[k]])) {
canis_mat[[i,(5+k)]] = insides[[i]]$site[[k]]
}
}
}
canis_df = as.data.frame(canis_mat)
names(canis_df) = c("occid","taxonid","taxonname","value","units","datasetid","siteid","sitename","altitude","location","datasettype","database")
datatable(canis_df, rownames=FALSE)
You may not have used an API explicitly before, but if you’ve used
the neotoma2
R package, then you’ve already used it
implicitly. All neotoma2
functions at some point require
use of the helper function neotoma2::parseURL()
. This
function is somewhat long, but we can use grep()
to search
through it for any mention of an API. When we do we get the following
result:
baseurl <- switch(use, dev = "https://api-dev.neotomadb.org/v2.0/", , neotoma = "https://api.neotomadb.org/v2.0/", local = "http://localhost:3005/v2.0/",
parseURL()
calls the Neotoma API for the
neotoma2
R package. In other words, any
neotoma2
function is ultimately using the Neotoma API to
communicate with the database. So why would we ever want to use the APIs
directly, rather than mediate through the neotoma2
R
package? There are at least two reasons:
Let’s examine each of these reasons in order.
In order to compare how the call times of the neotoma2
functions and the Neotoma API scale with the size of the data being
handled, we’ll download increasingly large number of sites using the
neotoma2
function get_sites()
as well as the
API call https://api.neotomadb.org/v2.0/data/sites?.
get_sites()
and the API both take as one of their inputs
the parameter loc, where we can supply a spatial extent in geoJSON
format. We’ll define increasingly large spatial extents as sf objects,
and use the R function system.time()
to measure how long
they take. (system.time()
has three outputs: user, system,
and elapsed. We mostly care about elapsed, which is the actual wall time
taken for the function to complete. User and system have to do with how
the computer operating system allocates resources to run the
function.)
lats = c(43, 50, 50, 43)
lons= c(-65, -65, -60, -60)
coordinates = data.frame(lat = lats, lon = lons)
coordinates_sf = coordinates %>%
st_as_sf(coords = c("lon", "lat"), crs = 4326) %>%
summarise(geometry = st_combine(geometry)) %>%
st_cast("POLYGON")
bbox_geojson = sf_geojson(coordinates_sf)
R_getsites_time = system.time(neotoma2::get_sites(loc = bbox_geojson, all_data = TRUE))
api_sites = content(GET(paste0("https://api.neotomadb.org/v2.0/data/sites?loc=",
bbox_geojson,
"&limit=9999&offset=0")))$data
api_getsites_time = system.time(content(GET(paste0("https://api.neotomadb.org/v2.0/data/sites?loc=",
bbox_geojson,
"&limit=99999&offset=0")))$data)
print(length(api_sites))
## [1] 153
The length of the api_sites object printed above is the number of sites we downloaded: around 150.
Let’s print the length of time required to fetch those 150 sites through R:
## user system elapsed
## 1.88 0.07 7.30
Only 5 or so seconds. With the API:
## user system elapsed
## 0.03 0.00 0.70
Faster than 5 seconds! Let’s repeat this a few times, varying the number of sites downloaded.
lats1 = c(43, 50, 50, 43)
lons1= c(-70, -70, -60, -60)
coordinates1 = data.frame(lat = lats1, lon = lons1)
coordinates1_sf = coordinates1 %>%
st_as_sf(coords = c("lon", "lat"), crs = 4326) %>%
summarise(geometry = st_combine(geometry)) %>%
st_cast("POLYGON")
bbox_geojson1 = sf_geojson(coordinates1_sf)
R_getsites_time1 = system.time(neotoma2::get_sites(loc = bbox_geojson1, all_data = TRUE))
api_sites1 = content(GET(paste0("https://api.neotomadb.org/v2.0/data/sites?loc=",
bbox_geojson1,
"&limit=9999&offset=0")))$data
api_getsites_time1 = system.time(content(GET(paste0("https://api.neotomadb.org/v2.0/data/sites?loc=",
bbox_geojson1,
"&limit=99999&offset=0")))$data)
print(length(api_sites1))
## [1] 486
For almost 500 sites, it takes about 15 seconds for the R package, and still just about 1 second for the API:
## user system elapsed
## 6.46 0.15 16.70
## user system elapsed
## 0.10 0.00 0.97
lats2 = c(33, 50, 50, 33)
lons2 = c(-75, -75, -60, -60)
coordinates2 = data.frame(lat = lats2, lon = lons2)
coordinates2_sf = coordinates2 %>%
st_as_sf(coords = c("lon", "lat"), crs = 4326) %>%
summarise(geometry = st_combine(geometry)) %>%
st_cast("POLYGON")
bbox_geojson2 = sf_geojson(coordinates2_sf)
R_getsites_time2 = system.time(neotoma2::get_sites(loc = bbox_geojson2, all_data = TRUE))
api_sites2 = content(GET(paste0("https://api.neotomadb.org/v2.0/data/sites?loc=",
bbox_geojson2,
"&limit=9999&offset=0")))$data
api_getsites_time2 = system.time(content(GET(paste0("https://api.neotomadb.org/v2.0/data/sites?loc=",
bbox_geojson2,
"&limit=99999&offset=0")))$data)
print(length(api_sites2))
## [1] 1676
For over 1600 sites, it takes almost 80 seconds with the R package, and just 2 seconds withe API:
## user system elapsed
## 24.15 0.33 64.22
## user system elapsed
## 0.32 0.00 1.64
The pattern continues below.
lats3 = c(23, 50, 50, 23)
lons3 = c(-80, -80, -60, -60)
coordinates3 = data.frame(lat = lats3, lon = lons3)
coordinates3_sf = coordinates3 %>%
st_as_sf(coords = c("lon", "lat"), crs = 4326) %>%
summarise(geometry = st_combine(geometry)) %>%
st_cast("POLYGON")
bbox_geojson3 = sf_geojson(coordinates3_sf)
R_getsites_time3 = system.time(neotoma2::get_sites(loc = bbox_geojson3, all_data = TRUE))
api_sites3 = content(GET(paste0("https://api.neotomadb.org/v2.0/data/sites?loc=",
bbox_geojson3,
"&limit=9999&offset=0")))$data
api_getsites_time3 = system.time(content(GET(paste0("https://api.neotomadb.org/v2.0/data/sites?loc=",
bbox_geojson3,
"&limit=99999&offset=0")))$data)
print(R_getsites_time3)
## user system elapsed
## 42.83 0.67 126.11
## user system elapsed
## 0.66 0.02 2.01
## [1] 3236
lats4 = c(23, 50, 50, 23)
lons4 = c(-90, -90, -60, -60) # Reordered for a rectangle
coordinates4 = data.frame(lat = lats4, lon = lons4)
coordinates4_sf = coordinates4 %>%
st_as_sf(coords = c("lon", "lat"), crs = 4326) %>%
summarise(geometry = st_combine(geometry)) %>%
st_cast("POLYGON")
bbox_geojson4 = sf_geojson(coordinates4_sf)
R_getsites_time4 = system.time(neotoma2::get_sites(loc = bbox_geojson4, all_data = TRUE))
api_sites4 = content(GET(paste0("https://api.neotomadb.org/v2.0/data/sites?loc=",
bbox_geojson4,
"&limit=9999&offset=0")))$data
api_getsites_time4 = system.time(content(GET(paste0("https://api.neotomadb.org/v2.0/data/sites?loc=",
bbox_geojson4,
"&limit=99999&offset=0")))$data)
print(R_getsites_time4)
## user system elapsed
## 86.95 1.35 318.62
## user system elapsed
## 1.26 0.05 4.15
## [1] 6283
Below you can see how the R package gets slower and slower the
greater the number of sites you’re trying to grab, while the API call is
always quick. (We added a few points that aren’t displayed above just to
fill out the curve without adding too much clutter to the page.) There’s
a clear lesson here: the more data you’re dealing with, the bigger the
payoff associated with using the APIs directly rather than mediating
through the neotoma2
package.
Rtimes = c(R_getsites_time[[3]],R_getsites_time1[[3]],R_getsites_time2[[3]],R_getsites_time3[[3]],
R_getsites_time4[[3]],R_getsites_timea[[3]],R_getsites_timeb[[3]],
R_getsites_timec[[3]],R_getsites_timed[[3]],R_getsites_timee[[3]],
R_getsites_timef[[3]],R_getsites_timeg[[3]])
apitimes = c(api_getsites_time[[3]],api_getsites_time1[[3]],api_getsites_time2[[3]],api_getsites_time3[[3]],
api_getsites_time4[[3]],api_getsites_timea[[3]],api_getsites_timeb[[3]],api_getsites_timec[[3]],
api_getsites_timed[[3]],api_getsites_timee[[3]],
api_getsites_timef[[3]],api_getsites_timeg[[3]])
site_num = c(length(api_sites),length(api_sites1),length(api_sites2),length(api_sites3),length(api_sites4),length(api_sitesa),length(api_sitesb),length(api_sitesc),length(api_sitesd),length(api_sitese),length(api_sitesf),length(api_sitesg))
time_df = data.frame(Rt = Rtimes,api_t = apitimes, sites = site_num)
ggplot(time_df) +
geom_point(mapping=aes(x=sites,y=Rt),color="red",alpha=0.7) +
geom_point(mapping=aes(x=sites,y=api_t),color="blue",alpha=0.7) +
theme_bw() +
scale_y_continuous(name="time (seconds)") +
scale_x_continuous(name = "number of sites")
The Neotoma database contains extensive metadata that are not all exposed through the R package. (See this tutorial for even more information about Neotoma metadata.) The API is useful for gathering these metadata. For example, you can see in the image below (from the Neotoma open schema), some of the metadata tables linked to samples:
Earlier in this tutorial, I claimed without any evidence that the taxonid which corresponds to Canis spp. is 35619. The way I figured that out was through the Neotoma metadata available by API call. I grabbed the taxa table using a particularly versatile API call, “/v2.0/data/dbtables/”, that downloads whichever table name you supply. I turned the returned list into a dataframe for easier analysis, and I grepped through the entire taxa table for any taxa name that included “Canis.”
taxa = content(GET("https://api.neotomadb.org/v2.0/data/dbtables/taxa?count=false&limit=99999&offset=0"))$data
taxa_mat = matrix(nrow=length(taxa),ncol=14)
for (i in seq(length(taxa))) {
for (j in seq(14)) {
if(!is.null(taxa[[i]][[j]])) {
taxa_mat[[i,j]] = taxa[[i]][[j]]
}
}
}
taxa_df = as.data.frame(taxa_mat)
names(taxa_df) = c("taxonid","taxoncode","taxonname","author",
"valid","highertaxonid","extinct","taxagroupid",
"publicationid","validatorid","validatedate",
"notes","recdatecreated","recdatemodified")
datatable(taxa_df[grep("Canis",taxa_df$taxonname),],rownames=FALSE)
In this tutorial, we briefly introduced the concept of an API: what
they are, how their outputs are formatted, and how the Neotoma APIs
relate to the Neotoma2 R
package. We used a few APIs, and
along the way we introduced two compelling reasons to use them: 1)
they’re much faster for large data downloads than R, and 2) they provide
access to more metadata than the Neotoma2
R package.
If you would like to provide feedback about this tutorial, please complete this form or reach out to Nick Hoffman at nicholashoffman@ucmerced.edu.