1 Introduction

When you are using and comparing pollen data from multiple sites, it is important to make sure pollen taxa are named consistently across sites. This process is called taxon harmonization. Throughout this tutorial we will be using Olea, the olive, as an example.

There are a few different reasons why taxon harmonization is important.

  1. Taxon names change over time. For example, the African olive once was thought to be its own species, Olea africana. However, with new botanical and molecular information, we now know that it is actually a subspecies, rather than its own species. So the accepted name has now been updated to Olea europaea subsp. cuspidata. Older datasets in Neotoma may refer to the same taxon by its older name.
  2. A particular taxon can be identified to varying resolution. For instance, Olea capensis pollen can be identified to a species-level morphotype, but sometimes analysts will only identify it to the genus level. Depending on the nature of your analysis, you may therefore want to aggregate all the Olea capensis pollen in your data to a broader Olea category.
  3. Many plant taxa are cosmopolitan. Across a regional or continental synthesis, it isn’t obvious how you should deal with the question of splitting and lumping. In the case of Olea, depending on your question, you may want to treat the population from southern Africa as distinct from the east African population - or you may not. This is something you should be intentional about.

There are multiple good ways to harmonize. It all depends on what best suits the analysis you intend.

Luckily for us, The African Pollen Database curates a valuable table for assisting in taxa harmonization across African pollen. This guide will walk you through use of the APD taxa harmonization table with a simple example.

1.1 Packages and Data

We’ll first load up some packages we’re going to be using, and then grab some pollen data to play with from Neotoma.

if (length(grep("pacman",as.data.frame(installed.packages())$Package)) ==0) {
  install.packages("pacman") 
  library(pacman)
  p_load(neotoma2,tidyverse,DT,geojsonsf,sf,leaflet,httr,stringr, ggplot2, tmap, rosm, osmdata,plotly)
} else {
  library(pacman)
  p_load(neotoma2,tidyverse,DT,geojsonsf,sf,leaflet,httr,stringr, ggplot2, tmap, rosm, osmdata,plotly)}

We make a bounding box that encompasses all of Africa. Then we grab all Neotoma sites from that box, and filter for just those datasetids that concern pollen. Then we use the Neotoma2 package to download all of those pollen data.

lats = c(38, 38, -36, -36)
lons = c(-18, 52, 52, -18) # Reordered for a rectangle

# Create a data frame with coordinates
coordinates = data.frame(lat = lats, lon = lons)

# Convert to sf object and create a polygon
coordinates_sf = coordinates %>%
  st_as_sf(coords = c("lon", "lat"), crs = 4326) %>%
  summarise(geometry = st_combine(geometry)) %>%
  st_cast("POLYGON")

# Plot to check
tm_shape(osm.raster(coordinates_sf)) +
  tm_rgb() +
  tm_shape(coordinates_sf) +
  tm_polygons(alpha = 0.5)

coord_json = sf_geojson(coordinates_sf)

sites = content(GET(paste0("https://api.neotomadb.org/v2.0/data/sites?loc=",
                           coord_json,
                           "&limit=9999&offset=0")))$data


idx = 0
for (i in seq(length(sites))) {
  for (j in seq(length(sites[[i]]$collectionunits))) {
    for (k in seq(length(sites[[i]]$collectionunits[[j]]$datasets))) {
    idx = idx + 1
    }
    }
}


sites_mat = matrix(nrow=idx,ncol=11)

idx2 = 0
for (i in seq(length(sites))) {
  for (j in seq(length(sites[[i]]$collectionunits))) {
    for (k in seq(length(sites[[i]]$collectionunits[[j]]$datasets))) {
    idx2 = idx2 + 1
    for (m in seq(5)) {
      if (!is.null(sites[[i]][[m]])) {
        sites_mat[[idx2, m]] = sites[[i]][[m]]
      }
    }
    
     if (!is.null(sites[[i]]$collectionunits[[j]]$handle)) {
        sites_mat[[idx2,6]] = sites[[i]]$collectionunits[[j]]$handle
     }
       if (!is.null(sites[[i]]$collectionunits[[j]]$collectionunit)) {
        sites_mat[[idx2,7]] = sites[[i]]$collectionunits[[j]]$collectionunit
       }
       if (!is.null(sites[[i]]$collectionunits[[j]]$collectionunitid)) {
        sites_mat[[idx2,8]] = sites[[i]]$collectionunits[[j]]$collectionunitid
       }
       if (!is.null(sites[[i]]$collectionunits[[j]]$collectionunittype)) {
        sites_mat[[idx2,9]] = sites[[i]]$collectionunits[[j]]$collectionunittype
       }
       if (!is.null(sites[[i]]$collectionunits[[j]]$dataset[[k]]$datasetid)) {
        sites_mat[[idx2,10]] = sites[[i]]$collectionunits[[j]]$dataset[[k]]$datasetid
       }
       if (!is.null(sites[[i]]$collectionunits[[j]]$dataset[[k]]$datasettype)) {
        sites_mat[[idx2,11]] = sites[[i]]$collectionunits[[j]]$dataset[[k]]$datasettype
       }
    }
  }
}

sites_df = as.data.frame(sites_mat)

names(sites_df) = c("siteid","sitename","sitedescription","geography",
                    "altitude","handle","collectionunit","collectionunitid",
                    "collectionunittype","datasetid","datasettype")


datasetids = sites_df %>% dplyr::filter(datasettype == "pollen") %>% dplyr::distinct(datasetid)

datasets_neo = get_datasets(as.numeric(datasetids$datasetid),all_data=TRUE)

data = samples(get_downloads(datasets_neo,all_data=TRUE))
## ..........................................................................................................................................................................................................................................................................................................................................................................................

Now that we have our data, let’s grab our harmonization table. That comes from the African Pollen Database’s website, and you can download it directly through the “download table [.csv]” button on this page.

We’ll manipulate the data little bit to make sums of pollen counts by site and age. Then we’ll divide every pollen count by the appropriate sum in order to get proportion data.

apd_harmTable = read.csv("APD_dictionnary_export.csv",row.names=NULL,sep=";")


poltots = data %>% left_join(apd_harmTable, by=join_by(variablename == Taxon..original.name.)) %>%
  group_by(siteid,age) %>% summarize(poltot = sum(value))

pollendata = data  %>% left_join(poltots) %>% group_by(siteid,age,variablename) %>%
  summarize(prop = value/poltot, lat=lat, elev = elev) %>% 
  left_join(apd_harmTable, by=join_by(variablename == Taxon..original.name.))

2 Olea capensis elevation shifts

Let’s say we’re interested in elevation shifts in Olea capensis over time. We can filter our data for just those instances where Olea capensis is greater than 5% of the pollen assemblage with the code below. We filter for when the variablename is “Olea capensis” and when its proportion is greater than 0.05. When I do that in November 2024, I get 127 instances of greater than 5% Olea capensis.

When we plot the elevation of those occurrences over time, we get a plot that shows no clear pattern of elevation change of Olea capensis throughout Africa over the last 20,000 years, since the last ice age.

oleacap_noHarm = pollendata %>%
  dplyr::filter(variablename == "Olea capensis") %>%
  dplyr::filter(!is.na(prop)) %>%
  
  dplyr::filter(prop > 0.05)


ggplot() +
  geom_point(mapping=aes(x=age,y=elev),alpha=0.8,color='red',data=oleacap_noHarm) +
  scale_x_reverse(limits=c(20000,0)) +
  scale_y_continuous(name = "elevation (meters)") +
  theme_bw()

The story we got from those Olea capensis data seems reliable. But we might be excluding some occurrences of Olea capensis that may have been named slightly differently by different analysts.

oleatable = pollendata %>%
  dplyr::filter(str_detect(variablename,"Olea")) %>%
  
  dplyr::filter(str_detect(variablename,"capensis")) %>%
  group_by(variablename) %>% count() %>%
  left_join(apd_harmTable, by=join_by(variablename == Taxon..original.name.)) %>%
  arrange(desc(n))
 
datatable(oleatable[c(1,3,2)],rownames=FALSE)

Consider the table above that shows all number of occurrences for all the different taxa with the words “Olea” and “capensis” somewhere in them from our dataset, alongside the recommended nomenclature for that taxon from the APD taxon harmonization table. “Olea capensis” is the most commonly used taxon - but there are other categories which we may want include in our analysis. In particular, the taxon is often referred to as “Olea capensis-type” to communicate that the identification is of a pollen morphology that is associated with Olea capensis but may not uniquely identify Olea capensis. The APD considers Olea capensis and Olea capensis-type to be equivalent taxa.

Let us therefore include this second taxon, Olea capensis-type, in our analysis and see if it makes a difference. We probably also want to include the other three categories here - all variations on the hochstetteri subspecies of Olea capensis. Because Olea capensis pollen cannot be reliably identified to the subspecies resolution, any identification as hochstetteri in pollen likely derives from local knowledge that given the age and location of the Olea capensis pollen, it must be hochstetteri. Since we’re doing a broad regional analysis of Olea capensis over time, that fine distinction doesn’t matter to us.

We include these new taxa below by filtering for any of the variablenames in our table where their proportion is greater than 0.05. As of November 2024, this search returns 336 occurrences - almost 200 more than our original search !

When we plot these extended data over our old data, we see a clear increase in elevation at the start of the Holocene.

oleacap_Harm = pollendata %>%
  dplyr::filter(variablename %in% oleatable$variablename) %>%
  dplyr::filter(!is.na(prop)) %>%
  
  dplyr::filter(prop > 0.05)




ggplot() +
    geom_point(mapping=aes(x=age,y=elev),alpha=0.8,color='blue',data=oleacap_Harm) +
    geom_point(mapping=aes(x=age,y=elev),alpha=0.8,color='red',data=oleacap_noHarm) +
  scale_x_reverse(limits=c(20000,0)) +
  theme_bw()
## Warning: Removed 122 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 10 rows containing missing values or values outside the scale range
## (`geom_point()`).

3 Pollen Analyst Comparison

Part of the reason there is so much variation in taxon naming is because it is difficult to be completely consistent from person to person and lab to lab. This is another key reason why harmonization matters In this section, we’ll compare the taxonomic systems of pollen analysts working near each other through a principal components analysis (PCA).

First we’ll grab the entire sampleanalysts table from Neotoma and filter for those pollen analysts who counted pollen from Africa. Then we’ll join our table of contacts to our data by siteid.

text="sampleanalysts"
analysts = content(GET(paste0('https://api.neotomadb.org/v2.0/data/dbtables/',
                              text,
                              '?count=false&limit=999999&offset=0')))$data

analyst_mat = matrix(nrow = length(analysts),ncol=6)
for (i in seq(length(analysts))) {
  for (j in seq(6)) {
    if (!is.null(analysts[[i]][[j]])) {
      analyst_mat[i,j] = analysts[[i]][[j]]
    }
  }
}

analyst_df = as.data.frame(analyst_mat)

names(analyst_df) = c("analystid","sampleid","contactid",
                      "analystorder","recdatecreated","recdatemodified")


distinct_samples = data %>% distinct(sampleid)

filtered_an = analyst_df %>% dplyr::filter(sampleid %in% distinct_samples$sampleid)
merger = filtered_an %>% left_join(data) %>% group_by(siteid) %>% summarize(siteid=siteid, lat=lat,long=long,contactid=contactid) %>% distinct() %>% st_as_sf(coords=c("long","lat"), crs="WGS84") %>% ungroup()

We’ll filter our new table for just those sites in eastern Africa, so that all the sites we look at will have relatively similar ecology, and any differences we among analysts will be more likely to derive from different ways of naming.

lats = c(-16, -16, 16, 16)
lons = c(26, 40, 40, 26) # Reordered for a rectangle

coordinates = data.frame(lat = lats, lon = lons)

coordinates_sf = coordinates %>%
  st_as_sf(coords = c("lon", "lat"), crs = 4326) %>%
  summarise(geometry = st_combine(geometry)) %>%
  st_cast("POLYGON")


merger_east = st_filter(merger,coordinates_sf)

tm_shape(osm.raster(coordinates_sf)) +
  tm_rgb() +
  tm_shape(coordinates_sf) +
  tm_borders(alpha = 0.5) +
  tm_shape(merger_east) +
  tm_dots(col="contactid",size=0.5)
## Zoom: 6
## stars object downsampled to 816 by 1224 cells. See tm_shape manual (argument raster.downsample)

We pivot our data table wider, making a table where each row corresponds to a particular analyst, and each column to a taxon name. We give a value of 1 to taxa names used by an analyst, and 0 to ones not used by an analyst. Then we use prcomp() to make a PCA. First we do this using the unharmonized names supplied by the original pollen analysts, and then we repeat the operation, this time using the recommended nomenclature from the APD.

In order to visualize all of these points in the same space of all the possible names an analyst might have applied or been recommended to apply by the APD, we need to combine our two tables (wide_data and wide_data_harm), which requires a bit of data wrangling. We have to add columns to each table of all the non-overlapping names from the other table, and fill those new columns with a value of 0, before we combine our two tables into wide_data_all. Finally, we can perform a pca on the resulting table without the contactids (justvals_all) using prcomp(). In our resulting table, we see positive evidence that harmonization leads to tighter grouping of the kinds of names people use to describe their pollen. We can also check out which taxa names have the most effect on causing a person’s vocabulary to be placed where it is along PCs 1 and 2.

wide_data = data %>%
  dplyr::filter(siteid %in% merger_east$siteid) %>%
  left_join(merger) %>%  left_join(apd_harmTable, by=join_by(variablename == Taxon..original.name.)) %>%
  dplyr::filter(!is.na(Taxon..revised.nomenclature.)) %>% 
  group_by(contactid,variablename) %>% count() %>%
  pivot_wider(id_cols = contactid, names_from = variablename, values_from = n, values_fill = 0) 
## Joining with `by = join_by(siteid)`
## Warning in left_join(., merger): Detected an unexpected many-to-many relationship between `x` and `y`.
## i Row 19791 of `x` matches multiple rows in `y`.
## i Row 49 of `y` matches multiple rows in `x`.
## i If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
wide_data[,-1][wide_data[,-1] !=0 ] = 1


wide_data_harm = data %>%
  dplyr::filter(siteid %in% merger_east$siteid) %>%
  left_join(merger) %>%
  left_join(apd_harmTable, by=join_by(variablename == Taxon..original.name.)) %>%
  dplyr::filter(!is.na(Taxon..revised.nomenclature.)) %>% group_by(contactid,Taxon..revised.nomenclature.) %>%
  count() %>% pivot_wider(id_cols = contactid, names_from = Taxon..revised.nomenclature., values_from = n, values_fill = 0) 
## Joining with `by = join_by(siteid)`
## Warning in left_join(., merger): Detected an unexpected many-to-many relationship between `x` and `y`.
## i Row 19791 of `x` matches multiple rows in `y`.
## i Row 49 of `y` matches multiple rows in `x`.
## i If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
wide_data_harm[,-1][wide_data_harm[,-1] !=0 ] = 1



extranames_harm = as.data.frame(names(wide_data))
names(extranames_harm) = c("extraharm")
extranames_harm = extranames_harm %>% dplyr::filter(!extraharm %in% names(wide_data_harm))

added_part_harm = as.data.frame(matrix(nrow=30,ncol=length(extranames_harm[[1]]),data=0))
names(added_part_harm) = extranames_harm[[1]]


extranames_noharm = as.data.frame(names(wide_data_harm))
names(extranames_noharm) = c("extraharm")
extranames_noharm = extranames_noharm %>%
  dplyr::filter(!extraharm %in% names(wide_data)) %>%
  dplyr::filter(!extraharm %in% c("pc1","pc2"))

added_part_noharm = as.data.frame(matrix(nrow=30,ncol=length(extranames_noharm[[1]]),data=0))
names(added_part_noharm) = extranames_noharm[[1]]

wide_data_harm = wide_data_harm %>% cbind(added_part_harm)
wide_data = wide_data %>% cbind(added_part_noharm)


wide_data = wide_data %>% mutate(status = "noharm")
wide_data_harm = wide_data_harm %>% mutate(status="harm")

wide_data_all = rbind(wide_data,wide_data_harm)

justvals_all = wide_data_all %>% ungroup() %>% dplyr::select(!c(contactid,status))

pca_all = prcomp(justvals_all)    

wide_data_all$pc1 = pca_all$x[,1]
wide_data_all$pc2 = pca_all$x[,2]


pcaplot = ggplot(wide_data_all) + 
  geom_point(mapping=aes(x=pc1,y=pc2,color=status)) +
  labs(x = "PC1 (18.1%)", y = "PC2 (12.2%)") +
  theme_bw()


ggplotly(pcaplot)
#pc1 biggest loadings
pca_all$rotation[,1][abs(pca_all$rotation[,1]) %in% tail(sort(abs(pca_all$rotation[,1])), 5)] 
##   Apodytes dimidiata       Hypoestes-type Prunus africana-type 
##           0.07518588           0.07624676           0.07638038 
##   Rhamnaceae undiff.          Teclea-type 
##           0.07645331           0.07559489
#pc2 biggest loadings
pca_all$rotation[,2][abs(pca_all$rotation[,2]) %in% tail(sort(abs(pca_all$rotation[,2])), 5)] 
## Amaranthaceae    Cyperaceae     Ericaceae       Poaceae    Podocarpus 
##    -0.1215094    -0.1259230    -0.1220127    -0.1261172    -0.1217881
#pc3 biggest loadings
pca_all$rotation[,3][abs(pca_all$rotation[,3]) %in% tail(sort(abs(pca_all$rotation[,3])), 5)] 
##            Hydrocotyle             Rubia-type                  Xyris 
##             -0.1075930             -0.1304211             -0.1023981 
## Plantago africana-type    Rubus pinnatus-type 
##             -0.1273211             -0.1155943

4 Conclusion

Taxon harmonization is a fundamental step in fossil pollen analysis. This tutorial has demonstrated the use of taxon harmonization in tracking Olea capensis elevation shifts over the last 20,000 years, and has shown how much more similar the vocabularies of African fossil pollen analysts become when the African Pollen Database’s taxon harmonization table is applied to them.