Week 2: Part 2 - Testing your Hypothesis

OVERVIEW

In the next couple of sections (Parts 2 & 3) you will use actual data from the Snapshot Serengeti project to test your hypothesis about wildebeest distributions. To do this, we need to know what the data look like and how to visualize and interpret them.
These sections are important because you will use the same data and visualizations to answer your own research question for the main part of the lab. So be sure to follow along carefully!

Introduction to the Snapshot Serengeti Dataset

The dataset that we will be using is derived from ~208,000 photographs taken over a 5 year period (2010–2015).
Based on these photographs, there are ~232,000 unique observations of 74 different animal species. Some photographs capture multiple species, which is why the number of observations is greater than the number of photographs.
This is a huge dataset! These types of datasets can be daunting to analyze. Where do you even start?
The next four tasks will walk you through what is in the dataset, how to think about it analytically, and how to test your hypothesis on wildebeest migration:
  • Task 1: What does the data look like? (10 mins)
  • Task 2: Linking the Serengeti with the spreadsheet (20 mins)
  • Task 3: How to use the Snapshot Serengeti dataset to answer a question (15 mins)
  • Task 4: Testing your wildebeest hypothesis (45 mins)

Task 1: What does the data look like?

Watch the Intro to Snapshot Serengeti Data video. This video will give you a brief description of what the raw data look like.
Next, explore the data using the dataset embedded below (Screenshot, CodePen). Use the scroll bars to explore the data.
Keep in mind that these are actual data from the Snapshot Serengeti study site. Moreover, the data you collected in Part 1 on the Snapshot Safari website will ultimately be added to a dataset just like this one!
Make sure you understand the dataset structure and variables before you move on to the next step.
Review the Data Description document for detailed descriptions of each variable in the dataset. As you formulate your own hypothesis later in the lab, you will want to consult this resource.
Intro to Snapshot Serengeti Data
vimeo.com ›
Preview of the interactive data table.
Interactive table of the first 200 entries in the Snapshot Serengeti dataset. Click the link at the bottom to open the table. Spend some time exploring the data. You can see the original image by clicking on the "CaptureEventID". Make sure that you understand what each row of data represents and how it relates to the associated image.
ocelots-rcn.github.io ›
Interactive table of the first 200 entries in the Snapshot Serengeti dataset. Click "Run Pen" to load the table. Spend some time exploring the data. You can see the original image by clicking on the "CaptureEventID". Make sure that you understand what each row of data represents and how it relates to the associated image.
codepen.io ›
Description of variables in the Snapshot Serengeti Dataset
docs.google.com ›

Stop & Reflect

Answer the following questions to assess your understanding:
  • What does each row in the dataset represent?
  • What does each column in the dataset represent?
  • For many of the positional and environmental variables in the dataset, why are there so many repeats in a row?
  • What variables will change with each unique observation and what variables will stay the same?

Task 2: Linking the Serengeti with the spreadsheet

For this task, we will link the data above to the photos from which they were collected.
You can explore the full Snapshot Serengeti Dataset including the images at the Search Serengeti website. These data are the same data that we will be using for this lab.
Here, we will explore these photos to get a better sense of what the data are actually telling us. To do this, we will be searching for specific photos based on their unique Capture Event ID. The CaptureEventID is listed in the first column of the dataset above.
Let’s start by looking up a specific photo. To do so, follow these steps:
  1. Go to the Search Serengeti website (link above)
  2. Search for the 2nd photo (row 2) in the data, CaptureEventID = ASG0002kjn
To search for a photo, enter the CaptureEventID in the search bar in the upper right-hand corner (it will say “Search Capture ID”).
Does the data listed in the dataset match the data you see in the photo?
Now, let’s compare different data types to see how the photos differ. Search for the photos listed below and note the differences.
Comparison of binary behavioral variables (present/absent):
  • An impala not Eating (ASG000000d) vs. an impala Eating (ASG000000e)
  • Elephants with no Babies (ASG000000t) vs. elephants with Babies (ASG0002kl6)
  • Zebras Interacting (ASG0002ldi) vs. zebras not Interacting (ASG0002kl2)
Comparison of Habitat types
  • Dense woodland (ASG0002kjt)
  • Open Woodland/Shrubs (ASG000000a)
  • Grassland w/Trees (ASG0002la4)
  • Open Grassland (ASG0002d6n)
Photo from the Search Serengeti website. Zebra at night.
Photo credit:
SnapshotSafari.org | CC BY-NC-SA 3.0
Photo from the Search Serengeti website. Click the link below to open the site and compare the listed images.
searchserengeti.umn.edu ›

Stop & Reflect

Answer the following questions to assess your understanding:
  • For each image, does the data coding match what you see in the image?
  • If you were looking at a new photo of a giraffe and there was a baby present, how would you enter this into the dataset?
  • Can you think of any issues with coding data in discrete categories like Habitat?
  • In addition to Habitat types, the data also include continuous measurements of Tree.Density.Measure. Why have both?

Task 3: How to use the Snapshot Serengeti dataset to answer a question

At this point, you have collected camera trap data, reviewed the raw camera trap dataset, and linked the data in the dataset to actual photos!
You should understand that the fundamental data we get from camera traps is the presence and number of individuals of a given species at a specific time and location.
As you can imagine, there are a lot of things we can do with this type of data! We are going to use your hypothesis on wildebeest as a case study to work through different ways of visualizing and analyzing these data.
Let’s start with an example: The graph below is a map that was made with data on wildebeest abundance taken from a larger Snapshot Serengeti dataset.
  • The X and Y-axes are longitude and latitude, respectively, presented as meters.
  • Each point represents the location of a camera trap
  • The size of each circle represents the relative number of wildebeest observed at each site.
  • Finally, we have split the data into observations from the Wet Season (Nov. – Apr.; blue) vs. the Dry Season (Jun. – Dec.; orange).
This graph allows us to compare the relative distribution of wildebeest across the park during different times of the year.
A distribution map of Wildebeest at the camera trap sites in the wet and dry seasons.
Points are camera traps at which wildebeest were observed. The size of each circle is the relative number of wildebeest seen at that site in each season. Orange circles represent dry season observations (Jun. – Dec.); blue circles represent wet season observations (Nov. – Apr.).
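To make the idea behind the map concrete, here is a minimal sketch in Python of how counts per site and season could be turned into relative circle sizes. It is not the code used to produce the map. The records, site names, and field order are invented for illustration, and the month-to-season rule (Nov. to Apr. counted as wet, everything else as dry) is an assumption that should be checked against the lab's own definition.

```python
from collections import defaultdict

# Hypothetical observation records: (site_id, month, species, count).
# These values are invented for illustration; they are not real data.
observations = [
    ("B04", 1, "wildebeest", 30),
    ("B04", 7, "wildebeest", 2),
    ("C07", 2, "wildebeest", 12),
    ("C07", 8, "wildebeest", 45),
    ("C07", 8, "zebra", 5),
]

# Assumed season split: Nov. to Apr. is "wet", all other months are "dry".
WET_MONTHS = {11, 12, 1, 2, 3, 4}


def season_of(month):
    """Return 'wet' or 'dry' for a month number (1-12)."""
    return "wet" if month in WET_MONTHS else "dry"


def totals_by_site_and_season(records, species):
    """Sum the counts of one species for each (site, season) pair."""
    totals = defaultdict(int)
    for site, month, observed_species, count in records:
        if observed_species == species:
            totals[(site, season_of(month))] += count
    return dict(totals)


def relative_sizes(totals):
    """Scale each total by the largest total, so the biggest circle is 1.0."""
    largest = max(totals.values())
    return {key: value / largest for key, value in totals.items()}


wildebeest_totals = totals_by_site_and_season(observations, "wildebeest")
circle_sizes = relative_sizes(wildebeest_totals)
```

Each key in `circle_sizes` is one circle on the map: a camera trap site in one season. Dividing by the largest total is just one possible way to express "relative number"; the real map may scale its circles differently.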

Stop & Reflect

Answer the following questions to assess your understanding:
  • Which corner of the map represents the North-West corner of the camera trap site?
  • What overall pattern do you see in the data?
  • Are wildebeest more concentrated in one season than the other?
  • Are the patterns you observe consistent?
  • Look back at the map of the camera trap locations in Introduction: Snapshot Serengeti Research Site. Is there any feature of the landscape that might explain the pattern shown above?
  • What new questions does this graph raise?

A note on Box-and-whisker Plots

Another way to answer a question with data is to compare the means of two or more groups.
Box-and-whisker plots are a powerful way of comparing groups because they show the center of each group along with the variation within it.
The “box” of the box-and-whisker plot spans the middle 50% of the data (the 25th to 75th percentiles), also known as the interquartile range. The solid line in the middle of the box is the median value of the data. Some box-and-whisker plots also indicate the mean of the data with a diamond or dot. The whiskers extend to the most extreme data points that lie within 1.5 times the interquartile range of the box edges (for roughly bell-shaped data this covers about 99% of the values, though it depends on the data’s distribution). Finally, the dots indicate outliers: data points that fall outside the boundary of the whiskers.
Check out the following diagram for a visual overview of a box-and-whisker plot.
A description of the various components of a box and whisker plot.
Photo credit:
USGS | Laura DeCicco
A description of the components of a box-and-whisker plot.
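The components described above can also be worked out by hand. The following Python sketch computes them for a small made-up list of counts. The quantile rule used here (linear interpolation between sorted values) is one common convention and is an assumption; plotting software may use a slightly different rule, so its box edges can differ a little.

```python
def quantile(sorted_values, q):
    """Linear-interpolation quantile of an already sorted list (0 <= q <= 1)."""
    position = (len(sorted_values) - 1) * q
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])


def box_plot_summary(values):
    """Return the pieces of a box-and-whisker plot for a list of numbers."""
    data = sorted(values)
    q1, median, q3 = (quantile(data, q) for q in (0.25, 0.5, 0.75))
    iqr = q3 - q1  # interquartile range: the height of the box
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    # Whiskers stop at the most extreme data points inside the fences.
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }


# Made-up counts for illustration only.
counts = [1, 2, 2, 3, 4, 5, 6, 7, 40]
summary = box_plot_summary(counts)
```

For these counts the box runs from 2 to 6 with the median at 4, the whiskers reach 1 and 7, and the value 40 is plotted as an outlier because it lies more than 1.5 times the interquartile range above the box.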

Task 4: Testing your wildebeest hypothesis

The map of wildebeest distribution in different seasons above is part of the answer to your question. But it is not the whole answer.
Next, we will add to this piece of the puzzle by building a box-and-whisker plot.
In combination, these two visualizations should provide a reasonable test of your hypothesis about the distribution of wildebeest.
We will be using R, a powerful, open-source statistical programming language, to analyze data from the Snapshot Serengeti camera traps. R is a common tool used by scientists and data analysts worldwide. If you want to learn more, check out the link in this citation.
But don’t worry! You won’t actually have to do any programming—at least directly. Instead, you will be using online, interactive graphing tools that analyze and plot the data for you!
To start, we will use the interactive graphing tool below to generate box-and-whisker plots. Remember, these plots are generated using real data from the Snapshot Serengeti site described above. We’ll learn more about these data next week.
For each camera trap, there is a set of environmental variables for that location. These data can be used to examine how animal behavior and abundance are associated with environmental factors, and to compare environmental factors across the site and between seasons. The graphing tool below uses box-and-whisker plots to summarize the environmental data associated with observations of wildebeest made in the wet and dry seasons.
Now let’s consider your hypothesis. The tool below shows a limited number of environmental variables and only allows for a comparison of months over the entire year. Take a few minutes to explore the tool and data. What happens when you select different variables? What does the resulting graph tell you? Make sure to think about the structure of the data.
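When thinking about the structure of the data, it can help to picture what a comparison across months involves: the observations are split into one group per month, and each group is then summarized. The Python sketch below illustrates that grouping step only. It is not the code behind the graphing tool, and the rows are invented values standing in for a variable such as Tree.Density.Measure.

```python
from collections import defaultdict
from statistics import median

# Hypothetical rows: (month, value of an environmental variable at the
# camera trap where wildebeest were observed). Values are invented.
rows = [(1, 0.2), (1, 0.4), (1, 0.3), (7, 1.5), (7, 2.5), (8, 1.0)]


def group_by_month(records):
    """Collect the values for each month into its own list."""
    groups = defaultdict(list)
    for month, value in records:
        groups[month].append(value)
    return dict(groups)


groups = group_by_month(rows)

# One summary value per month; a box-and-whisker plot would draw a full
# box for each of these groups instead of a single number.
monthly_medians = {month: median(values) for month, values in sorted(groups.items())}
```

Note that months with no wildebeest observations simply have no group, which is worth keeping in mind when a month is missing from a graph.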
Ideally, to test a hypothesis, you would want to generate a graph or set of graphs that tell a story that either supports your hypothesis or causes you to reject it. In many cases, you would create multiple graphs representing different aspects of the data to support a single conclusion.
Can your hypothesis be tested using the data tool below? If not, can some aspect of your hypothesis be tested? Use the tool to generate graphs and save them with the Download button.
  • Make sure your interpretation of the graphs is correct
  • If you submit multiple graphs, make sure that you are clear about how they relate to your hypothesis
  • If there is no graph that matches your hypothesis, describe a graph that would help you to interpret the data
Screenshot of an interactive box-and-whisker plot
An interactive box-and-whisker plot to test your wildebeest hypothesis. Click the link below to open the interactive graphing tool.
ocelots-rcn.github.io ›

ASSIGNMENTS

Share your graph(s) with your team by posting them to the Serengeti Lab.2 Graph Discussion board on Canvas. Write a caption for each graph.
Discuss the degree to which your hypothesis is supported by your graphs. Work with your team to settle on a final set of graphs that best tests your hypothesis.
Submit a single set of graphs with captions as a team to Serengeti Lab.2 Wildebeest Graph Assignment on Canvas.
internationalscienceediting.com ›

Continue to Part 3

Next: Week 2: Part 3 - Visualizing the Dataset