The Neotoma Paleoecology Database empowers users to generate new
paleobiological insights through its curation of thousands of paleo
datasets and associated metadata. The extent of the metadata that
Neotoma curates may not be obvious to its users, because it is not all
exposed in the Neotoma Explorer app, or even in the functions provided
by the neotoma2
R package. If you are unfamiliar with a
paleorecord, then it is especially important for you to grapple with its
metadata to ensure that you understand the data you are working
with.
This tutorial will demonstrate how to explore more of this extensive metadata, with a focus on site and core location. By the end of this tutorial, you should be able to retrieve site and collection unit metadata through the Neotoma API, map core locations alongside the more general site geometries, and use Neotoma’s data model to find further metadata tables.
(This tutorial will rely on a mix of neotoma2
R package
calls and Neotoma API calls. For a more comprehensive introduction to
Neotoma API calls, please see this page.)
Before we do any data wrangling, let’s define metadata. Metadata are data about data; they contextualize data, making it meaningful and usable. Among other things, they can concern provenance (who collected the data? From where?), method (how did they collect the data?), and policy and ethics (did the data collection process require a review process? How are the data licensed to be used?). It is important to notice that the distinction between data and metadata can be a little slippery - it depends on the question you’re asking. If you’re doing a single-site analysis of pollen proportions over time, then the pollen proportions are your data, and the site coordinates are metadata. But if you’re looking across multiple sites for any pollen surface samples with a Cyperaceae proportion greater than 10%, then your site coordinates are now themselves the data.
Neotoma has an extensive data model, with lots of opportunities for finding metadata in various tables. However, the extent of metadata for any particular dataset is uneven; it depends on the level of detail supplied when the data were originally uploaded to Neotoma. In what follows, we’ve selected sites for which the metadata are relatively complete. But if you were to reproduce the following workflow on a new dataset, you might run into problems if, say, the collection unit GPS coordinates are not provided in Neotoma.
First, let’s load up some packages we’ll need. The following code
installs the package management library pacman
if you don’t
have it installed already, and uses that library to install the other
packages we’ll need.
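A minimal sketch of that setup chunk is below; the exact package list is an assumption, inferred from the functions used later in this tutorial (httr for API calls, sf and geojsonsf for spatial data, leaflet for mapping, dplyr for wrangling, and DT for tables).
# Install pacman if needed, then use it to install and load the rest
# (package list inferred from the functions used in this tutorial)
if (!requireNamespace("pacman", quietly = TRUE)) install.packages("pacman")
pacman::p_load(neotoma2, httr, sf, geojsonsf, leaflet, dplyr, DT)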
Now that we have our packages loaded, we can grab some sites that we
care about. This tutorial assumes you already know how to get data
through the neotoma2
R package. If you need a refresher,
please see this tutorial.
We’ll look for sites from Lake Tanganyika. We grab the sites, and
then we use the neotoma2::plotLeaflet()
function to see
where those sites are.
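A minimal sketch of that search might look like the following; the "%Tanganyika%" wildcard pattern is an assumption about how the sites were matched, and the tanganyikaSites name matches the object used later in this tutorial.
# Search Neotoma for sites whose names match "Tanganyika", then map them
tanganyikaSites = get_sites(sitename = "%Tanganyika%")
plotLeaflet(tanganyikaSites)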
We found four sites, and they seem to be located along the middle of
the lake. But we need to be careful here. The plotLeaflet()
function will plot all sites in Neotoma as points, even if the site
geometries are actually stored in Neotoma as polygons. In the Neotoma
data model, sites refer to the general localities from which paleodata
are extracted, but not necessarily the precise locations at which cores
are taken. In the case of a small lake, this might not make a difference
to your analysis. But for a big lake like Lake Tanganyika, the distance
between the centroid of a polygon and the actual core locations may be
meaningful.
Fortunately, Neotoma often stores core location information in a separate object, the collection unit. Sites are the places where your data collection took place generally, and collection units are the particular materials you’re analyzing from a site - collection units could be cores, excavations, middens, sections, specimens, etc. More on the distinction between sites and collection units here.
Let’s check whether the Lake Tanganyika site objects are really points, or if they are in fact polygons.
The plotLeaflet()
function is a great tool for
understanding general site location, but it can sometimes mislead us
into thinking that a site is described with more precision in Neotoma
than it actually is. This isn’t always a problem - sometimes all we need
is general site location. But if we’re trying to be careful about site
location - if, for instance, we’re doing an analysis that is sensitive
to small-scale spatial variation - then we’re going to want to grab more
authoritative Neotoma metadata about site location. We can do this
through one of Neotoma’s
API calls.
In particular, we’re going to use an API that requires a Neotoma
siteid, or comma-separated string of siteids, as an input. The output of
this API will be site and dataset metadata. First we’ll format the
siteids from the Tanganyika sites we found earlier as a comma-separated
string, and then we’ll paste them into the particular API call we’d like
to make. We send a GET()
request and unpack the response
with the content()
function, focusing just on what’s under
the data header. And we can look at the beginning of our output
(which we call siteMetadata).
# Format the siteids from our Tanganyika sites as a comma-separated string
siteidString = paste0(as.data.frame(tanganyikaSites)$siteid,collapse=",")
# Build the API call, send a GET request, and unpack the "data" element
apiCall = paste0('https://api.neotomadb.org/v2.0/data/sites/',siteidString,'/datasets')
response = GET(apiCall)
siteMetadata = content(response)$data
# Peek at the first five fields of the first site's metadata
siteMetadata[[1]]$site[1:5]
## $siteid
## [1] 27348
##
## $sitename
## [1] "Lake Tanganyika"
##
## $sitedescription
## [1] "Open water offshore of coast, Lake Tanganyika is permanently anoxic below 150 m depth and is considered an oligotrophic system."
##
## $sitenotes
## NULL
##
## $geography
## [1] "{\"type\":\"Polygon\",\"crs\":{\"type\":\"name\",\"properties\":{\"name\":\"EPSG:4326\"}},\"coordinates\":[[[29.12625,-8.80904],[29.12625,-3.37081],[31.12576,-3.37081],[31.12576,-8.80904],[29.12625,-8.80904]]]}"
siteMetadata is a list of … well … site metadata. Even just looking
at the first few entries of this list, we can see that at least one site
that plotLeaflet()
rendered as a point is in fact a
polygon.
But the list structure is hard to read. Let’s reformat this list as a datatable, and for now, let’s only keep site metadata, not dataset metadata. Because we’re only keeping site metadata, some of our entries will be duplicated - we can keep just the distinct entries.
In the following code chunk, we first do a nested loop through all the datasets in all the sites objects listed in siteMetadata, just to count how many distinct datasets there are. We use that value to set the number of rows in a matrix object, which we use to store the site/dataset metadata when we loop through the same nested list again. Lastly, we turn that matrix object into a data.frame and name the columns appropriately, and keep just the distinct entries in our new data frame.
# First pass: count the datasets across all sites, so we know how many
# rows our matrix will need
idx = 0
for (i in seq(length(siteMetadata))) {
  for (j in seq(length(siteMetadata[[i]]$site$datasets))) {
    idx = idx + 1
  }
}
# Second pass: fill the matrix with one row per dataset, copying the
# first ten site-level metadata fields into the columns
siteMetadata_mat = matrix(nrow=idx,ncol=10)
idx2 = 0
for (i in seq(length(siteMetadata))) {
  for (j in seq(length(siteMetadata[[i]]$site$datasets))) {
    idx2 = idx2 + 1
    for (k in seq(10)) {
      if (!is.null(siteMetadata[[i]]$site[[k]])) {
        siteMetadata_mat[[idx2, k]] = siteMetadata[[i]]$site[[k]]
      }
    }
  }
}
siteMetadata_df = as.data.frame(siteMetadata_mat)
names(siteMetadata_df) = c("siteid","sitename","sitedescription","sitenotes",
"geography","altitude","collectionunitid",
"collectionunit","handle","unittype")
siteMetadata_df = distinct(siteMetadata_df)
#just first ten cols
datatable(siteMetadata_df[1:10],rownames=FALSE)
You’ll notice that even after removing duplicate entries, we have eleven rows, but only four sites. This is because the metadata under the site header also includes collectionunit information - more on the collectionunits soon - and each of these sites is associated with between one and four collection units.
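A quick sanity check, assuming the siteMetadata_df object from above, confirms these counts:
# Eleven rows in the deduplicated data frame...
nrow(siteMetadata_df)
# ...but only four distinct sites
length(unique(siteMetadata_df$siteid))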
Before talking about the collectionunits in more detail, let’s map
the new site location metadata, to check whether in fact they are points
or polygons. In order to do this, we’ll use the
geojson_sf()
function to convert the geography column of our
siteMetadata_df dataframe into an sf (simple features) object, and
rebind that sf object to the rest of the dataframe. Then we’ll split
apart those sites that have point geometry (pointSites) from those which
have polygon geometry (polySites), keeping only those rows with distinct
siteids, and plot both objects in a leaflet plot.
# Convert the geojson strings into sf geometries, then rebind the metadata
siteGeo_sf = geojson_sf(siteMetadata_df$geography)
siteMetadata_sf = cbind(siteGeo_sf,siteMetadata_df)
# Split point-geometry sites from polygon-geometry sites,
# keeping one row per siteid
pointSites = siteMetadata_sf[st_geometry_type(siteMetadata_sf) == "POINT",] %>%
  distinct(siteid, .keep_all = TRUE)
polySites = siteMetadata_sf[st_geometry_type(siteMetadata_sf) == "POLYGON",] %>%
  distinct(siteid, .keep_all = TRUE)
leaflet() %>%
addTiles() %>%
addPolygons(data = polySites,
color = "red",
weight = 2,
fillColor = "orange",
fillOpacity = 0.15,
popup = ~sitename) %>%
addCircleMarkers(data = pointSites,
radius = 5,
color = "blue",
fillColor = "blue",
fillOpacity = 0.7,
stroke = FALSE,
popup = ~sitename)
Aha: three of our four sites are in fact polygons, not points! The site object in Neotoma may bound the locality of an investigation only in the most general terms.
So we’ve learned that the site doesn’t give us precise information about the location of a pollen record. Where, then, do we find the information about core location? It turns out we need to look to the collectionunits table, accessible through an API.
The collectionunits table is the place to go to learn about core-specific metadata, including precise core location. Where a site is a lake or peat bog, a collection unit is usually a single core taken from that site. In order to access the collectionunits associated with our Tanganyika sites, we first need to download the entire collectionunits table. Then we’ll filter for the collection units we care about.
Below, we grab the collectionunits table as a list, reformat it into a dataframe, and then an sf object, and plot it on top of our map. Because we’re not grabbing metadata nested in collectionunits, we don’t need to do the loop twice as we did above to find the number of rows we need for our matrix object. Instead, we know that the number of rows will just be equal to the number of entries in the collunits list - that is, the length of collunits.
text="collectionunits"
collunits = content(GET(paste0('https://api.neotomadb.org/v2.0/data/dbtables/',
text,
'?count=false&limit=99999&offset=0')))$data
collunit_mat = matrix(nrow=length(collunits),ncol=20)
for (i in seq(length(collunits))) {
for (j in seq(20)) {
if (!is.null(collunits[[i]][[j]])) {
collunit_mat[[i,j]] = collunits[[i]][[j]]
}
}
}
collunit_df = collunit_mat %>% as.data.frame()
names(collunit_df) = c("collectionunitid","handle","siteid","colltypeid",
"depenvtid","collunitname","colldate","colldevice",
"gpslatitude","gpslongitude","gpsaltitude","gpserror",
"waterdepth","substrateid","slopeaspect","slopeangle",
"location","notes","recdaterecreated","recdatemodified")
# Keep only the collection units linked to our Tanganyika sites
filtered_colls = collunit_df %>%
  dplyr::filter(collectionunitid %in% siteMetadata_df$collectionunitid)
# The same filter, converted to an sf object using the GPS coordinates
filteredColl_sf = collunit_df %>%
  dplyr::filter(collectionunitid %in% siteMetadata_df$collectionunitid) %>%
  st_as_sf(coords=c("gpslongitude","gpslatitude"), crs="WGS84")
# show a subset of the columns
datatable(st_drop_geometry(filteredColl_sf)[c(1:3,5:10,15)],rownames=FALSE)
# Color sites and their cores consistently by siteid
pal <- colorFactor(palette = "Set1", domain = siteMetadata_df$siteid)
# Draw larger polygons first, so smaller polygons stay visible on top
polySites$arranger = 1/as.numeric(st_area(polySites))
polySites = polySites %>% arrange(arranger)
leaflet() %>%
addTiles() %>%
addPolygons(data = polySites,
color = "black",
weight = 1,
fillColor = ~pal(siteid),
fillOpacity = 0.55,
popup = ~sitename) %>%
addCircleMarkers(data = pointSites,
radius = 8,
color = "black",
fillColor = ~pal(siteid),
fillOpacity = 0.7,
stroke = TRUE,
weight = 1,
popup = ~sitename) %>%
addCircleMarkers(data = filteredColl_sf,
radius = 3,
color = "white",
fillColor = ~pal(siteid),
fillOpacity = 1,
stroke = TRUE,
weight = 1,
popup = ~collunitname)
In the above plot, sites and associated cores are filled with the same color. Sites have black outlines, cores have white outlines, and if you click on a site or core its name should pop up. And what do you know! With the exception of site “Lake Tanganyika [SD cores]” (our only point site), there are mismatches between general site location and core location.
This raises the question, at what spatial resolution, and for what sorts of analyses, do these mismatches start to matter? See this demonstration for one answer.
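One rough way to put numbers on the mismatch, sketched below assuming the polySites and filteredColl_sf objects from above, is to measure the distance between each core and the centroid of its site polygon.
# Distance (in meters) between each core and its site-polygon centroid.
# st_centroid() warns on lon/lat polygons, but is fine for a rough check.
centroids = st_centroid(polySites)
cores = filteredColl_sf %>%
  dplyr::filter(siteid %in% centroids$siteid) %>%
  st_transform(st_crs(centroids))  # ensure both layers share one CRS
st_distance(cores, centroids[match(cores$siteid, centroids$siteid), ],
            by_element = TRUE)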
We were able to find important location information by downloading the entire collectionunits table, and we could do that because we knew the table existed. But Neotoma has a sophisticated data model - there are tables with relevant information that we may not even know exist.
Luckily, Neotoma’s data model is open and available for exploration at Open Neotoma. We can use this resource to explore further metadata. If, for instance, we wanted to see what other information is connected to our collection units, we can navigate to the Open Neotoma page on the collection units table. This page provides the names and definitions of the fields in the collection units table, and if you scroll to the bottom of the page, you can also see which other tables the collection units table is connected to through primary/foreign key pairs.
If we wanted to know more about the substrate for our collection units, we could download the rocktypes table that we see referenced on this page (using the same download-table API call from earlier) and filter the table for the relevant substrate ids from our collection units.
text2="rocktypes"
rocktypes = content(GET(paste0('https://api.neotomadb.org/v2.0/data/dbtables/',
text2,
'?count=false&limit=99999&offset=0')))$data
rock_mat = matrix(nrow = length(rocktypes),ncol=6)
for (i in seq(length(rocktypes))) {
for (j in seq(6)) {
if (!is.null(rocktypes[[i]][[j]])) {
rock_mat[i,j] = rocktypes[[i]][[j]]
}
}
}
rock_df = as.data.frame(rock_mat)
names(rock_df) = c("rocktypeid","rocktype","higherrocktypeid","description",
"recdatecreated","recdatemodified")
# Join rock type names onto our collection units via their substrateid
filtered_colls = filtered_colls %>% left_join(rock_df,by=join_by("substrateid" == "rocktypeid"))
datatable(filtered_colls[c(1,3,6,14,15,16,21,22,23)],rownames=FALSE)
If we do that, we find the substrate is either clay or mud, silt, or unrecorded. I guess that’s not too surprising for sediment cores!
Over the course of this tutorial, we discovered that Neotoma sites from Lake Tanganyika are described mostly with polygons, not points, and we learned where to find more precise information about core location for those sites. Lastly, we briefly explored a resource for discovering even more Neotoma metadata.
We hope that this dip into Neotoma metadata motivates you to dive further into the Neotoma data model - it’s a rich resource! If you would like to provide feedback about this tutorial, please complete this form.