Tutorial 4 Data preparation

In the previous Tutorials, we used simulated data to introduce the concept of variability in activity patterns (Tutorial 2) and to illustrate the use of hierarchical models for estimating activity patterns (Tutorial 3). We now switch gears. In this and the following three Tutorials, we fit trigonometric and cyclic cubic spline hierarchical models to empirical camera-trap data to address common ecological questions related to the study of activity patterns.

In this Tutorial, we explain how to prepare camera-trap data for analyzing activity patterns using hierarchical models. This procedure is applied throughout these Tutorials every time we introduce a new empirical dataset. To minimize redundancy, we illustrate the procedure in detail only once here, and then refer back to this code (extending it when necessary).

We start by loading the data and previewing their structure. Here, we select camera-trap records of coyotes (Canis latrans). We also load the libraries used to wrangle the dataset into the desired format.

# Clean the workspace and set the seed for reproducibility
rm(list = ls())
set.seed(129)

# Load libraries
library(dplyr)
library(lubridate)

# Load data
coy <- read.csv("data_input/species_records.csv") %>% 
  filter(Species == "Coyote") %>%                       # keep coyote records only
  droplevels() %>%                                      # drop unused factor levels
  select(-X) %>%                                        # remove the row-index column
  mutate(DateTimeOriginal = ymd_hm(DateTimeOriginal))   # parse date-times

head(coy)
##   Station Species    DateTimeOriginal       Date     Time    Session
## 1     10E  Coyote 2016-05-20 20:30:00 2016-05-20 20:30:51 Spring2016
## 2     10E  Coyote 2016-05-20 20:30:00 2016-05-20 20:30:51 Spring2016
## 3     10E  Coyote 2016-05-20 20:30:00 2016-05-20 20:30:52 Spring2016
## 4     10E  Coyote 2016-05-20 20:30:00 2016-05-20 20:30:58 Spring2016
## 5     10E  Coyote 2016-05-20 20:30:00 2016-05-20 20:30:59 Spring2016
## 6     10E  Coyote 2016-05-20 20:30:00 2016-05-20 20:30:59 Spring2016

To estimate activity patterns while accounting for site-to-site variability, at a minimum we need the location id (or geographic coordinates) of each spatial sampling site (e.g., camera-trap site) and the time of day at which the encounter events for the species of interest occurred. In this dataset, each row corresponds to one image. The location id and the time of day of each record are stored in the columns named Station and Time, respectively. Because we set up our cameras to collect a burst of three pictures every time a camera was triggered, some images have the exact same time and location (see, for example, the first and second rows above).

Individuals or groups of animals often spend several minutes in front of a camera, triggering the sensor several times in a short period of time. This leads to highly autocorrelated encounter events (once a camera has been triggered, it is likely to be triggered again within the following few minutes). When using KDEs, this short-term autocorrelation affects the choice of the smoothing parameter. To reduce the level of autocorrelation, camera-trap data are often aggregated based on a time threshold (usually between 1 and 60 minutes, commonly 30; Burton et al. (2015); Iannarilli et al. (2021)): images collected at the same site within this threshold are combined into a single encounter event. This process, often referred to as data aggregation, is highly recommended (although not strictly required; see the discussion in Peral, Landman, and Kerley (2022)) as a preparatory step before applying KDEs. We compare KDEs with and without data aggregation in Tutorial 9.3.
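To make this step concrete, the sketch below shows one common implementation of aggregation using dplyr, in which each image is compared to the immediately preceding image at the same site (a stricter rule compares each image to the last retained record and requires an iterative pass). The object coy_agg and the 30-min threshold are illustrative; we do not use this aggregated dataset in these Tutorials.

# Illustrative 30-min aggregation (not used in our workflow): keep an
# image only if it occurred more than 30 minutes after the previous
# image recorded at the same site within the same session
coy_agg <- coy %>% 
  arrange(Session, Station, DateTimeOriginal) %>% 
  group_by(Session, Station) %>% 
  filter(is.na(lag(DateTimeOriginal)) | 
         difftime(DateTimeOriginal, lag(DateTimeOriginal), units = "mins") > 30) %>% 
  ungroup()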

Data aggregation is not necessary when using trigonometric and cyclic cubic spline hierarchical models. Thus, we use the raw (non-aggregated) dataset in all the empirical case studies presented in these tutorials, except Tutorial 9.4, where we apply data aggregation to facilitate comparison of estimated activity patterns using hierarchical models and KDEs.

Unlike KDEs, trigonometric and cyclic cubic spline hierarchical models allow us to leverage not only the times at which encounters were recorded, but also the times at which cameras were active at a location and no encounter occurred. This, in turn, allows us to estimate activity while accounting for variable sampling effort, and makes it possible to compare relative probabilities of activity across sites (or, e.g., across environmental and anthropogenic conditions).

To account for sampling effort, we have to build a matrix that stores, for each site, the periods in which a camera was active and in which of these days (or shorter occasions; see below) we recorded the target species.

We load a new dataframe that contains information on the start and the end of the sampling period at each site. The start of this period usually corresponds to the deployment date, while the end corresponds to the date of retrieval or, if the camera failed or malfunctioned (e.g., drained batteries, full SD card, or a camera displaced by an animal), the last day the camera was working at that location.

cov <- read.csv("data_input/CameraTrapProject_CT_data_for_analysis_MASTER.csv", as.is = TRUE) %>% 
  select(Session, Site, Date_setup, Date_retr, Problem1_from, Problem1_to) %>% 
  mutate(Date_setup = mdy(Date_setup),        # parse dates stored as month/day/year
         Date_retr = mdy(Date_retr),
         Problem1_from = mdy(Problem1_from),
         Problem1_to = mdy(Problem1_to)) 
head(cov)
##      Session Site Date_setup  Date_retr Problem1_from Problem1_to
## 1 Spring2016   1A 2016-05-12 2016-06-22          <NA>        <NA>
## 2 Spring2016   1B 2016-05-03 2016-06-22          <NA>        <NA>
## 3 Spring2016   1C 2016-05-03 2016-07-08          <NA>        <NA>
## 4 Spring2016   1D 2016-05-03 2016-06-24          <NA>        <NA>
## 5 Spring2016   1E 2016-05-03 2016-06-22          <NA>        <NA>
## 6 Spring2016   2A 2016-05-12 2016-06-23          <NA>        <NA>

This dataframe contains identifiers for the sampling session (Session) and the camera-trap site (Site), along with the dates a camera was deployed at (Date_setup) and retrieved from (Date_retr) the corresponding site. For cameras that were not functioning at the time of retrieval, we also have information on the period during which the camera was inactive (Problem1_from and Problem1_to). This structure follows the format used by the camtrapR package (Niedballa et al. 2016).

In this format, the information about the last day a camera was active is spread across two columns, Date_retr (for cameras that did not fail) and Problem1_from (for cameras that did fail). We bring this information together in one column called end.

# Merge time of deployment and retrieval + problems
site <- cov
site$end <- ymd("2000-01-01")  # placeholder date, overwritten below
for(i in 1:nrow(site)){
  # last active day: the earlier of the retrieval date and the date a
  # problem started (NA when the camera worked throughout)
  site$end[i] <- min(site$Date_retr[i], 
                     site$Problem1_from[i], 
                     na.rm = TRUE)
}
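For larger deployment tables, the same column can be created without a loop; the following sketch is equivalent under the assumption that Date_retr is never missing.

# Vectorized alternative: pmin() takes the element-wise earlier of the
# two dates, ignoring NAs
site <- cov %>% 
  mutate(end = pmin(Date_retr, Problem1_from, na.rm = TRUE))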

Then, we create a dataframe to store the information about when the target species was and was not encountered at each location within each temporal observation window. Here, we use 60 minutes to define the temporal sampling units, but different durations (from seconds to several hours) can be used. Shorter intervals increase model-fitting time and can lead to convergence problems when data are too sparse. In the code below, the argument by = '60 min' can be changed to accommodate other observation intervals (e.g., by = '30 min' leads to 30-min intervals).

For each site, we create a list of all the hourly occasions from the start to the end of the camera deployment at that site. We then pre-populate the column capt (which will track the encounter/non-encounter information) with zeros.

# Create dataframe to store captures 
# (with these data, the model does not converge if using 30 minutes)
occasions <- vector("list", length = nrow(site))
for(i in 1:nrow(site)){
  occasions[[i]] <- data.frame(Session = site$Session[i],
                               Site = site$Site[i],
                               # hourly interval start times spanning the deployment
                               start = seq(from = ymd_hms(paste(site$Date_setup[i], "00:00:00", sep = " ")), 
                                           to = ymd_hms(paste(site$end[i], "23:59:59", sep = " ")), 
                                           by = '60 min')
                               ) %>% 
    # each interval ends where the next one starts; close the last one manually
    mutate(end = c(start[2:length(start)], 
                   start[length(start)] + minutes(60))
           ) 
}
occasions <- do.call(rbind.data.frame, occasions)  # stack all sites
occasions$capt <- 0  # pre-populate with zeros (no encounter)
head(occasions)
##      Session Site               start                 end capt
## 1 Spring2016   1A 2016-05-12 00:00:00 2016-05-12 01:00:00    0
## 2 Spring2016   1A 2016-05-12 01:00:00 2016-05-12 02:00:00    0
## 3 Spring2016   1A 2016-05-12 02:00:00 2016-05-12 03:00:00    0
## 4 Spring2016   1A 2016-05-12 03:00:00 2016-05-12 04:00:00    0
## 5 Spring2016   1A 2016-05-12 04:00:00 2016-05-12 05:00:00    0
## 6 Spring2016   1A 2016-05-12 05:00:00 2016-05-12 06:00:00    0

For each coyote observation collected during the study, we assign the value 1 to the column capt for the row corresponding to the site and time interval of the encounter event.

# Store captures
for(i in 1:nrow(coy)){
  # flag the occasion matching the session, site, and time of each record
  occasions[occasions$Session == as.character(coy$Session[i]) 
            & occasions$Site == as.character(coy$Station[i]) 
            & occasions$start <= coy$DateTimeOriginal[i] 
            & occasions$end > coy$DateTimeOriginal[i], "capt"] <- 1
}
table(occasions$capt)
## 
##      0      1 
## 570925    251

We have 251 one-hour occasions with encounters of coyotes and 570,925 without.
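The loop above scans the full occasions dataframe once per image and can become slow for large datasets. Because our occasions start on whole hours, the same capt column can be built in a vectorized way by matching records and occasions on a session-site-hour key; the helper objects coy_key and occ_key below are illustrative.

# Vectorized alternative: match records to occasions on a
# session-site-hour key (valid because occasions start on whole hours)
coy_key <- paste(coy$Session, coy$Station, 
                 format(coy$DateTimeOriginal, "%Y-%m-%d %H"))
occ_key <- paste(occasions$Session, occasions$Site, 
                 format(occasions$start, "%Y-%m-%d %H"))
occasions$capt <- as.integer(occ_key %in% coy_key)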

Because we chose to use hourly occasions, we can extract the hour of each interval and use it as the Time variable in the models fit in the following Tutorials. If a different interval length is used (e.g., 1 minute), this step needs to be adjusted accordingly (see the sketch after the code below). We also code Site as a factor variable.

# Format data 
occasions$Time <- hour(occasions$start)      # hour of day (0-23)
occasions$Site <- as.factor(occasions$Site)
nrow(occasions)
## [1] 571176
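With sub-hourly occasions, the time of day can instead be expressed in decimal hours; the line below is a sketch for 1-min intervals (for our hourly data it would leave Time unchanged).

# For sub-hourly occasions (e.g., by = '1 min'): time of day in decimal hours
occasions$Time <- hour(occasions$start) + minute(occasions$start) / 60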

Our dataset, occasions, now contains more than 570,000 rows. Running a hierarchical model on such a large dataset might take a long time or require a computing cluster. We can speed up model fitting by first summarizing the encounter/non-encounter records across the sampling period, counting the number of successes (i.e. intervals with encounters) and failures (i.e. intervals without encounters) at each site and hourly occasion. We can then run the hierarchical models using the cbind(success, failure) approach (see the next Tutorial for examples). However, this approach will not work when we want to model variation in activity as a function of additional temporally varying covariates (e.g., Julian day); in those situations, we may be forced to work with the occasions dataset and its binary encounter records. Below, we provide code for creating a dataset containing the number of successes and failures for each observation window and site.

# format data for cbind(success, failure)
occasions_cbind <- occasions %>% 
  group_by(Site, Time) %>% 
  summarize(success = sum(capt),
            failure = n() - success
            )
## `summarise()` has grouped output by 'Site'. You can override using the `.groups` argument.
head(occasions_cbind)
## # A tibble: 6 × 4
## # Groups:   Site [1]
##   Site   Time success failure
##   <fct> <int>   <dbl>   <dbl>
## 1 10A       0       0     259
## 2 10A       1       0     259
## 3 10A       2       0     259
## 4 10A       3       0     259
## 5 10A       4       0     259
## 6 10A       5       0     259
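As a quick consistency check, the summarized counts should preserve the totals from the binary dataset.

# Sanity check: totals should match the binary occasions dataset
sum(occasions_cbind$success)                             # 251 occasions with encounters
sum(occasions_cbind$success + occasions_cbind$failure)   # 571176 occasions in total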

The first row indicates that, for Site 10A and Time interval 0 (i.e. from 00:00:00 to 00:59:59), we had 259 occasions (i.e. days, in this case) without detecting a coyote and 0 occasions in which we encountered one. In the next Tutorial (Tutorial 5), we show how to use this dataset to test hypotheses about coyotes’ diel activity patterns.