C Additional exercises

C.1 Introduction to R and RStudio

C.1.1 Rob’s account book

The young master's student Rob Stat is thinking seriously about his mother's advice to monitor his expenses. He begins by writing down what he spends during the week in the Mensa (the university canteen):


Table C.1: Rob’s account book
Day of week Amount spent (€)
Monday 2.57
Tuesday 2.90
Wednesday 2.73
Thursday 3.23
Friday 3.90
  1. Generate a vector of Rob’s expenses and assign it to the variable expenses. Use the function c() and use the numeric expenses only, not the days of the week.
  2. How much did Rob spend during this week? Use the function sum().
  3. Rob seems to have spent the smallest amount on Tuesday. How much would he have spent if he paid that much every day of the week? Use the array notation with the square brackets.
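
If you want to check your solution, a minimal sketch of the three steps could look like this:

# Rob's expenses from Monday to Friday (Table C.1)
expenses <- c(2.57, 2.90, 2.73, 3.23, 3.90)

# Total amount spent during the week
sum(expenses)

# Tuesday is the second element; five days at the Tuesday amount
5 * expenses[2]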

Unfortunately, Rob wrote down the wrong amount on Tuesday. He actually invited his girlfriend for lunch and paid 7.95 € instead of 2.90 €.

  1. Correct Rob’s typo.
  2. How does the result change?
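
A possible correction, assuming the vector expenses from above:

# Replace the wrong Tuesday value (second element) with the correct amount
expenses[2] <- 7.95

# Recompute the weekly total
sum(expenses)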

C.1.2 Missing values

R codes missing values as NA. Rob ate in the Mensa last Monday, but forgot to write down the amount.

Table C.2: Rob’s account book, cont.
Day of week Amount spent (€)
Monday, 9 March 2.57
Tuesday, 10 March 2.90
Wednesday, 11 March 2.73
Thursday, 12 March 3.23
Friday, 13 March 3.90
Monday, 16 March NA
  1. How does NA change the calculated sum?
  2. Read what happens when the data contains NAs by calling the help on sum, i.e. type ?sum in the R console.
  3. Correct your call to sum() accordingly.
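
A small sketch that illustrates the behaviour of sum() with missing values:

# Expenses including the forgotten Monday, coded as NA
expenses <- c(2.57, 2.90, 2.73, 3.23, 3.90, NA)

sum(expenses)                # returns NA, because one value is missing
sum(expenses, na.rm = TRUE)  # drops the NA before summing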

C.1.3 Your very first plot

Especially when you are just starting to learn R, you should not forget why you are doing it: R is a beautiful tool, and you want to analyse and learn from real data.

Even if you don't fully understand the following code yet, just copy and paste it into your .R file and run it!

library(tidyverse)
library(gapminder)

# keep only the observations for the year 2007
gapminder2007 <- gapminder %>% 
  filter(year == 2007)

ggplot(gapminder2007, aes(x = gdpPercap, y = lifeExp, color = continent, size = pop)) +
  geom_point() +
  scale_x_log10() +
  xlab('GDP per capita') +
  ylab('Life expectancy') +
  labs(title = 'Gapminder data for the year 2007')
  1. What is the data set about? Use the help like this: ?gapminder.
  2. What do the colours represent?
  3. What does the size of the circles represent?
  4. How would you describe the relationship between GDP per capita and life expectancy?

C.2 The big practical: importing, wrangling, summarising and plotting

C.2.1 Temperature along the Dutch coast

The file Temperatur.csv from the book by Zuur, Ieno, and Meesters (2009) contains measurements of temperature, salinity and chlorophyll a content at 31 locations along the Dutch coast. You can download the data set here. The data was provided by the Dutch institute RIKZ (monitoring program MWTL: Monitoring Waterstaatkundige Toestand des Lands) and was measured between 1990 and 2005, zero to four times per month depending on the season.

  1. Read the file Temperatur.csv into R.
  2. Convert the column Date to a proper date format. Use the library lubridate.
  3. Calculate the number of measurements, mean and standard deviations of temperature per monitoring station. Hint: use n() inside summarize() to get the number of measurements.
  4. Calculate the number of measurements, mean and standard deviations of temperature per month.
  5. Plot the mean monthly temperature as a line and add the standard deviation as a band around it.
  6. Label the axes appropriately.
  7. Save your graph as a pdf file.
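
If you get stuck, the following sketch shows one possible workflow. The column names Date, Station and Temperature and the day-month-year date format are assumptions; check them against the imported data (e.g. with glimpse()) and adjust if necessary.

library(tidyverse)
library(lubridate)

# column names and date format are assumptions; depending on the file you may
# need read_csv2() or read_delim() instead of read_csv()
temperature <- read_csv('Temperatur.csv') %>% 
  mutate(Date = dmy(Date),
         Month = month(Date))

# number of measurements, mean and standard deviation per monitoring station
temperature %>% 
  group_by(Station) %>% 
  summarize(n = n(),
            mean_temp = mean(Temperature, na.rm = TRUE),
            sd_temp = sd(Temperature, na.rm = TRUE))

# the same summary per month
temp_month <- temperature %>% 
  group_by(Month) %>% 
  summarize(n = n(),
            mean_temp = mean(Temperature, na.rm = TRUE),
            sd_temp = sd(Temperature, na.rm = TRUE))

# monthly mean as a line with the standard deviation as a band around it
ggplot(temp_month, aes(x = Month, y = mean_temp)) +
  geom_ribbon(aes(ymin = mean_temp - sd_temp, ymax = mean_temp + sd_temp),
              alpha = 0.3) +
  geom_line() +
  xlab('Month') +
  ylab('Temperature (°C)')

ggsave('temperature_per_month.pdf')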

C.2.2 Temperature along the Dutch coast, revisited

  1. Calculate the monthly means and standard deviations per monitoring station. Hint: group_by(Station, Month).
  2. Plot the means with an error band in separate panels, one per station. Hint: use facet_wrap().
  3. Save your graph as a pdf file.
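
A sketch under the same assumptions about column names as in the previous exercise:

# monthly means and standard deviations per monitoring station
temp_station_month <- temperature %>% 
  group_by(Station, Month) %>% 
  summarize(mean_temp = mean(Temperature, na.rm = TRUE),
            sd_temp = sd(Temperature, na.rm = TRUE))

# one panel per station, each with a mean line and a standard-deviation band
ggplot(temp_station_month, aes(x = Month, y = mean_temp)) +
  geom_ribbon(aes(ymin = mean_temp - sd_temp, ymax = mean_temp + sd_temp),
              alpha = 0.3) +
  geom_line() +
  facet_wrap(~ Station) +
  xlab('Month') +
  ylab('Temperature (°C)')

ggsave('temperature_per_station.pdf')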

C.2.3 Excel data turns tidy

We will import and tidy World Development Indicators data downloaded from the World Bank on 2021-06-09 for 20 countries. This is an extract only and more data is available.

This exercise will show you how to load Excel data directly, without converting it to a .csv file first. The format of the data is a typical non-tidy one, and you will wrangle it into a tidy tibble. The file is called Data_Extract_From_World_Development_Indicators.xlsx.

  1. The goal of this exercise is to learn how to read data from an Excel file using tidyverse functions. We will use the library readxl and the function read_xlsx() for reading such files. Read the help pages of read_xlsx(), then find the parameter that selects a particular sheet and set it correctly.
  2. Open the Excel file and look through the data carefully. How are NAs coded? Which sheet do you need to read?
  3. Read the Excel file into R. Call the resulting tibble wdi.
  4. This data set is not tidy. In particular, the year is coded in the column names. Those column names contain the year twice, once as a number and once as [YR NUMBER]. We rename the columns first.
wdi <- wdi %>% 
  rename_with(.fn = function(x) str_sub(x, start = 1, end = 4), .cols = starts_with('20'))

What does this code mean? Read the help pages for functions rename_with(), str_sub() and starts_with().
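
To see what str_sub() does, you can try it on a single string first. The example column name below is made up, but follows the pattern described above:

library(stringr)

# keep only the first four characters, i.e. the year
str_sub('2015 [YR2015]', start = 1, end = 4)
#> [1] "2015"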

  5. Pivot the data set to a tidy format: variables in columns and measurements in rows. Use pivot_longer().
wdi <- wdi %>%
  pivot_longer(names_to = 'year', values_to = 'indicator_value', cols = starts_with('20')) %>% 
  mutate(year = as.numeric(year))

wdi

What does this code mean? Read the help pages for functions pivot_longer() and as.numeric(). Why is it necessary to convert year with as.numeric()?

  6. Filter for the indicator GDP (current US$) and plot the data as a time series. Hint: You can also filter for the indicator's code; look it up in the Excel file. Label the axes appropriately.
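
A possible sketch for this last step. The column names `Series Name` and `Country Name` are assumptions based on the usual layout of World Bank extracts; check names(wdi) and adjust if your file uses different names.

wdi %>% 
  filter(`Series Name` == 'GDP (current US$)') %>% 
  ggplot(aes(x = year, y = indicator_value, color = `Country Name`)) +
  geom_line() +
  xlab('Year') +
  ylab('GDP (current US$)')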

C.3 The big practical: statistical inference

C.3.1 Species richness in grasslands

You will work with grassland species monitoring data from the Yellowstone National Park provided by Zuur, Ieno, and Meesters (2009) and Sikkink et al. (2007). You can download the data set here. The researchers monitored the changes in the grassland communities over time and related them to environmental factors. Biodiversity is expressed as the number of different species per site (variable R). Approximately 90 species were identified in 8 transects in monitoring campaigns repeated every 4 to 10 years, resulting in 58 observations. The data is saved in the file Vegetation2.xls.

  1. Read the data and explore its structure. Describe the type of variables. Does the type correspond to your expectation for the respective variable? Remember to set the name of the table sheet you want to read in read_xls().

  2. Short explorative data analysis: calculate the number of measurements, the mean and the standard deviations of species number R per transect.

  3. Plot the species number versus the variable BARESOIL (proportion of bare soil). Colour the dots by transect. Hint: convert the transect with as_factor().

  4. Add a smoothing line without a confidence band (geom_smooth(se = FALSE)) through all points independently of the transect. You might want to consult Section 4.2 in the book ggplot2 (Wickham 2020). Hint: set the colour aes in geom_point() instead of in ggplot().

  5. Add labels to your graph and assign it to an object.

  6. Plot the species number as a time series by transect. Add both points and lines. The size of the points should reflect the proportion of bare soil. You might want to consult Section 12.1 in the book ggplot2 (Wickham 2020). Think about where to set the aesthetic size in order to scale the points only.

  7. Add labels to your graph and assign it to an object.

  8. Put both graphs side-by-side and save them as a pdf (ggsave()).

  9. Calculate the linear Pearson correlation coefficient between the species number and the proportion of bare soil. Calculate the 95% confidence interval. Use the framework in infer.

  10. If you calculate the 90% confidence interval instead of the 95% confidence interval, does the confidence interval get wider or narrower? Why?
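
For the inference part (questions 9 and 10), a sketch with the infer workflow could look like the following. The sheet name is an assumption (list the available sheets with excel_sheets('Vegetation2.xls')); the column names R and BARESOIL are taken from the exercise text.

library(tidyverse)
library(readxl)
library(infer)

# read the data; the sheet name is an assumption, check excel_sheets() first
vegetation <- read_xls('Vegetation2.xls', sheet = 'Vegetation2')

# observed Pearson correlation between species number and bare soil
vegetation %>% 
  specify(R ~ BARESOIL) %>% 
  calculate(stat = 'correlation')

# bootstrap distribution of the correlation coefficient
boot_dist <- vegetation %>% 
  specify(R ~ BARESOIL) %>% 
  generate(reps = 1000, type = 'bootstrap') %>% 
  calculate(stat = 'correlation')

# 95% percentile confidence interval; use level = 0.90 for question 10
get_confidence_interval(boot_dist, level = 0.95, type = 'percentile')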

C.3.2 Soil compaction

Heavy agricultural machines can compact the soil. In a randomized field trial, plots (variable plots) on a homogeneous agricultural field were assigned randomly either to the control (control) or to the treatment in which a heavy agricultural machine was used (compacted). Bulk density in [g/cm³] (mass of dry soil divided by soil volume) was measured on every plot. It is a parameter for soil structure and can help to detect soil compaction. The data is stored in bd_compaction.csv.

  1. Read the data and do a short explorative data analysis.
  2. Did the bulk density increase due to heavy machinery or could the difference be due to chance? Use the framework in infer.
  3. Calculate the effect size (difference in means) and its 95% confidence interval.
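
A sketch of the inference steps with infer. The variable names (treatment with the levels control and compacted, and bulk_density for the measured bulk density) are assumptions; check the column names after importing and adjust.

library(tidyverse)
library(infer)

# variable names below are assumptions; check them after importing
bd <- read_csv('bd_compaction.csv')

# observed difference in mean bulk density (compacted minus control)
obs_diff <- bd %>% 
  specify(bulk_density ~ treatment) %>% 
  calculate(stat = 'diff in means', order = c('compacted', 'control'))

# null distribution under 'no treatment effect', generated by permutation
null_dist <- bd %>% 
  specify(bulk_density ~ treatment) %>% 
  hypothesize(null = 'independence') %>% 
  generate(reps = 1000, type = 'permute') %>% 
  calculate(stat = 'diff in means', order = c('compacted', 'control'))

# p-value for the question whether compaction increased bulk density
get_p_value(null_dist, obs_stat = obs_diff, direction = 'greater')

# bootstrap confidence interval for the difference in means
boot_dist <- bd %>% 
  specify(bulk_density ~ treatment) %>% 
  generate(reps = 1000, type = 'bootstrap') %>% 
  calculate(stat = 'diff in means', order = c('compacted', 'control'))

get_confidence_interval(boot_dist, level = 0.95, type = 'percentile')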