This exercise is designed as an introduction to tidyverse from the very basics. Tidyverse is a collection of R packages used for data manipulation and visualization. We are mainly focused on dplyr
and ggplot2
(two most important packages of tidyverse) although we can use functionalities from other packages as well.
Suppose you are an frequent traveler and weather details are important to you because of what to pack to wear in your travels. Our data consists of temperature (in Celsius) history of 4 popular travel destinations (NYC, Amsterdam, London and Venice) between November 2015 and October 2017. Raw data is gathered from Weather Underground and it is only for educational purposes. You are going to explore this data set using the most common tidyverse functions. You will be asked to fill the missing information.
Tip: You can always check the help files of the functions by writing ?
in front of the function name (e.g. ?select
) in the R Console, after you load the package.
First we are going to install tidyverse
and load it. Installing a package is a one time job, essentially equivalent to downloading from server. Though, in each session you need to load the package with either library
or require
functions. For this tutorial you also need to download the travel_weather.RData
file from below link.
# Install the package if you already haven't
install.packages("tidyverse", repos = "https://cran.r-project.org")
# Load the package to the session
library(tidyverse)
# Set your working directory (the directory which you keep
# the travel data (travel_weather.RData)
setwd("~/MyWorkingDirectory/")
# Load the data set file
load("travel_weather.RData")
Main data type of this tutorial is a data.frame
, or more properly a tibble
. Data frames are two dimensional, efficient data tables which every column can consist of different data types (i.e. characters, factors, numeric, logical). tibble
is a special data frame type that comes with tidyverse package but the functionality is very similar (no difference for this tutorial).
Now let’s take a look at our data.
travel_weather %>%
tbl_df()
## # A tibble: 731 x 7
## year month day Amsterdam London NYC Venice
## * <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2015 11.0 1.00 8.00 8.00 16.0 13.0
## 2 2015 11.0 2.00 10.0 11.0 15.0 10.0
## 3 2015 11.0 3.00 9.00 11.0 16.0 9.00
## 4 2015 11.0 4.00 12.0 11.0 17.0 10.0
## 5 2015 11.0 5.00 13.0 13.0 18.0 12.0
## 6 2015 11.0 6.00 16.0 14.0 21.0 13.0
## 7 2015 11.0 7.00 16.0 14.0 17.0 14.0
## 8 2015 11.0 8.00 12.0 12.0 11.0 13.0
## 9 2015 11.0 9.00 13.0 12.0 11.0 11.0
## 10 2015 11.0 10.0 14.0 14.0 12.0 11.0
## # ... with 721 more rows
Did you notice the %>%
? It is called the pipe operator. It starts with the data and connects the operations in the given order (top to bottom or left to right). (Tip: You can add line breaks between the operations but pipe operator should always be at the end of the line.)
There are some tibble
properties you should be aware of. At the first line number of rows and columns are reported (A tibble: 731x7). Also under each column, its data type is given. This way we can be notified of the essentials of this data frame.
A more proper check can be done with glimpse
function. glimpse
is especially useful if the number of columns is high.
glimpse(travel_weather)
## Observations: 731
## Variables: 7
## $ year <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015...
## $ month <dbl> 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, ...
## $ day <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1...
## $ Amsterdam <dbl> 8, 10, 9, 12, 13, 16, 16, 12, 13, 14, 13, 13, 11, 11...
## $ London <dbl> 8, 11, 11, 11, 13, 14, 14, 12, 12, 14, 13, 12, 10, 1...
## $ NYC <dbl> 16, 15, 16, 17, 18, 21, 17, 11, 11, 12, 12, 13, 11, ...
## $ Venice <dbl> 13, 10, 9, 10, 12, 13, 14, 13, 11, 11, 9, 11, 8, 11,...
Our data consists of 731 rows and 7 columns. Each row represents a day. First three columns (year
,month
and day
) define the date. Last four columns (Amsterdam
, London
, NYC
and Venice
) represent the average temperature of the cities in the given day.
Now let’s explore.
We are going to see the fundamental functions of dplyr and then some more. It would be very good for you if you follow this tutorial with the dplyr cheat sheet. You can download it from here.
Our fundamental functions are as follows.
select
/rename
filter
arrange
mutate
/transmute
group_by
/summarise
We will start really simple and build up.
select
function, as the name suggests, selects the columns. rename
just renames the columns.
1. Let’s start with only one city: Venice. Select the date components (year, month, day) and Venice column. Fill the YOURANSWERHERE
in your code in order to replicate the result.
travel_weather %>% select(year, month, day, YOURANSWERHERE)
## # A tibble: 731 x 4
## year month day Venice
## * <dbl> <dbl> <dbl> <dbl>
## 1 2015 11.0 1.00 13.0
## 2 2015 11.0 2.00 10.0
## 3 2015 11.0 3.00 9.00
## 4 2015 11.0 4.00 10.0
## 5 2015 11.0 5.00 12.0
## 6 2015 11.0 6.00 13.0
## 7 2015 11.0 7.00 14.0
## 8 2015 11.0 8.00 13.0
## 9 2015 11.0 9.00 11.0
## 10 2015 11.0 10.0 11.0
## # ... with 721 more rows
2. Now let’s say you want to have only the cities. You can either write the names of all cities or specify a range with :
.
travel_weather %>% select(YOURANSWERHERE1:YOURANSWERHERE2)
## # A tibble: 731 x 4
## Amsterdam London NYC Venice
## * <dbl> <dbl> <dbl> <dbl>
## 1 8.00 8.00 16.0 13.0
## 2 10.0 11.0 15.0 10.0
## 3 9.00 11.0 16.0 9.00
## 4 12.0 11.0 17.0 10.0
## 5 13.0 13.0 18.0 12.0
## 6 16.0 14.0 21.0 13.0
## 7 16.0 14.0 17.0 14.0
## 8 12.0 12.0 11.0 13.0
## 9 13.0 12.0 11.0 11.0
## 10 14.0 14.0 12.0 11.0
## # ... with 721 more rows
3. This time we are going to use (-
) to remove unwanted columns. Suppose we do not want NYC or London columns.
travel_weather %>% select(-YOURANSWERHERE1, -YOURANSWERHERE2)
## # A tibble: 731 x 5
## year month day Amsterdam Venice
## * <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2015 11.0 1.00 8.00 13.0
## 2 2015 11.0 2.00 10.0 10.0
## 3 2015 11.0 3.00 9.00 9.00
## 4 2015 11.0 4.00 12.0 10.0
## 5 2015 11.0 5.00 13.0 12.0
## 6 2015 11.0 6.00 16.0 13.0
## 7 2015 11.0 7.00 16.0 14.0
## 8 2015 11.0 8.00 12.0 13.0
## 9 2015 11.0 9.00 13.0 11.0
## 10 2015 11.0 10.0 14.0 11.0
## # ... with 721 more rows
4. Now we just want to rename NYC to New York. Although it is not advised to use spaces in your column names, you can do it by taking it between backticks. Remember rename
will not select any column, just change the name of the specified column.
travel_weather %>% rename(`YOUR ANSWER HERE` = NYC)
## # A tibble: 731 x 7
## year month day Amsterdam London `New York` Venice
## * <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2015 11.0 1.00 8.00 8.00 16.0 13.0
## 2 2015 11.0 2.00 10.0 11.0 15.0 10.0
## 3 2015 11.0 3.00 9.00 11.0 16.0 9.00
## 4 2015 11.0 4.00 12.0 11.0 17.0 10.0
## 5 2015 11.0 5.00 13.0 13.0 18.0 12.0
## 6 2015 11.0 6.00 16.0 14.0 21.0 13.0
## 7 2015 11.0 7.00 16.0 14.0 17.0 14.0
## 8 2015 11.0 8.00 12.0 12.0 11.0 13.0
## 9 2015 11.0 9.00 13.0 12.0 11.0 11.0
## 10 2015 11.0 10.0 14.0 14.0 12.0 11.0
## # ... with 721 more rows
Tip: You can also use rename functionality with select
.
Filter returns rows with the given criteria. You can define any criteria and combine conditions with the “and” (&
) and “or” (|
) operators. You can use other operators such as less than (or equal to) (<
,<=
), greater than (or equal to) (>
,>=
), equal to (not equal to) (=
, !=
) and several other operators which return TRUE/FALSE statements as well. You can combine the operations and ensure precedence with parentheses.
1. Suppose we are interested only the first three days of the month.
travel_weather %>%
filter(day <= YOURANSWERHERE)
## # A tibble: 72 x 7
## year month day Amsterdam London NYC Venice
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2015 11.0 1.00 8.00 8.00 16.0 13.0
## 2 2015 11.0 2.00 10.0 11.0 15.0 10.0
## 3 2015 11.0 3.00 9.00 11.0 16.0 9.00
## 4 2015 12.0 1.00 9.00 11.0 9.00 6.00
## 5 2015 12.0 2.00 10.0 12.0 11.0 8.00
## 6 2015 12.0 3.00 9.00 11.0 10.0 8.00
## 7 2016 1.00 1.00 4.00 3.00 3.00 2.00
## 8 2016 1.00 2.00 6.00 10.0 2.00 0
## 9 2016 1.00 3.00 7.00 8.00 4.00 3.00
## 10 2016 2.00 1.00 10.0 12.0 11.0 6.00
## # ... with 62 more rows
2. Suppose we are interested in only the dates in November (11th month) which Venice is warmer than NYC.
travel_weather %>%
filter(month == 11 & YOURANSWERHERE)
## # A tibble: 20 x 7
## year month day Amsterdam London NYC Venice
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2015 11.0 8.00 12.0 12.0 11.0 13.0
## 2 2015 11.0 14.0 11.0 10.0 8.00 11.0
## 3 2015 11.0 15.0 12.0 14.0 9.00 11.0
## 4 2015 11.0 17.0 13.0 13.0 8.00 9.00
## 5 2015 11.0 23.0 3.00 3.00 4.00 6.00
## 6 2015 11.0 24.0 5.00 8.00 4.00 6.00
## 7 2016 11.0 1.00 10.0 9.00 9.00 11.0
## 8 2016 11.0 6.00 7.00 4.00 11.0 12.0
## 9 2016 11.0 7.00 4.00 6.00 8.00 11.0
## 10 2016 11.0 12.0 1.00 8.00 7.00 9.00
## 11 2016 11.0 19.0 6.00 4.00 10.0 11.0
## 12 2016 11.0 20.0 7.00 7.00 3.00 11.0
## 13 2016 11.0 21.0 10.0 10.0 4.00 12.0
## 14 2016 11.0 22.0 10.0 9.00 4.00 14.0
## 15 2016 11.0 23.0 8.00 7.00 4.00 14.0
## 16 2016 11.0 24.0 6.00 9.00 6.00 13.0
## 17 2016 11.0 25.0 3.00 7.00 10.0 13.0
## 18 2016 11.0 26.0 3.00 6.00 7.00 12.0
## 19 2016 11.0 27.0 5.00 7.00 7.00 11.0
## 20 2016 11.0 28.0 1.00 6.00 7.00 8.00
3. Suppose we are interested in dates whether Amsterdam is warmer than either London or Venice in July (7th month).
travel_weather %>%
filter(month == 7 & (YOURANSWERHERE1 | YOURANSWERHERE2))
## # A tibble: 21 x 7
## year month day Amsterdam London NYC Venice
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2016 7.00 2.00 16.0 14.0 21.0 25.0
## 2 2016 7.00 11.0 19.0 18.0 23.0 27.0
## 3 2016 7.00 12.0 18.0 17.0 24.0 28.0
## 4 2016 7.00 13.0 16.0 14.0 26.0 27.0
## 5 2016 7.00 19.0 21.0 20.0 26.0 27.0
## 6 2016 7.00 20.0 27.0 24.0 25.0 26.0
## 7 2016 7.00 21.0 21.0 19.0 27.0 26.0
## 8 2016 7.00 22.0 21.0 19.0 29.0 26.0
## 9 2016 7.00 23.0 22.0 19.0 31.0 26.0
## 10 2016 7.00 24.0 21.0 19.0 29.0 25.0
## # ... with 11 more rows
4. Finally, let’s add some math. Suppose we are interested in dates which the absolute temperature difference between Amsterdam and Venice is greater than or equal to 12.
travel_weather %>%
filter(abs(YOURANSWERHERE) >= 12)
## # A tibble: 6 x 7
## year month day Amsterdam London NYC Venice
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2016 6.00 25.0 16.0 15.0 24.0 28.0
## 2 2017 7.00 13.0 14.0 14.0 29.0 27.0
## 3 2017 8.00 2.00 18.0 17.0 26.0 30.0
## 4 2017 8.00 4.00 19.0 18.0 25.0 31.0
## 5 2017 8.00 5.00 17.0 16.0 23.0 31.0
## 6 2017 8.00 6.00 16.0 17.0 21.0 29.0
arrange
is simply ordering of values from A to Z or from smallest to largest. Just write the column names in the order you want to arrange. To employ arrange
in a decreasing order wrap the column of interest between desc(column_name)
function.
1. Arrange the data by the temperature of NYC.
travel_weather %>%
arrange(YOURANSWERHERE)
## # A tibble: 731 x 7
## year month day Amsterdam London NYC Venice
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2016 2.00 14.0 2.00 3.00 -14.0 6.00
## 2 2016 2.00 13.0 1.00 2.00 -10.0 4.00
## 3 2016 1.00 5.00 6.00 8.00 - 7.00 2.00
## 4 2017 1.00 9.00 6.00 7.00 - 7.00 -2.00
## 5 2016 1.00 19.0 -2.00 0 - 6.00 1.00
## 6 2016 2.00 12.0 2.00 1.00 - 6.00 6.00
## 7 2016 12.0 16.0 6.00 6.00 - 6.00 4.00
## 8 2017 1.00 8.00 4.00 9.00 - 6.00 -2.00
## 9 2017 1.00 7.00 1.00 8.00 - 5.00 -3.00
## 10 2017 3.00 11.0 7.00 10.0 - 5.00 9.00
## # ... with 721 more rows
2. Arrange the data by the temperature of NYC increasing but Amsterdam decreasing.
travel_weather %>%
arrange(YOURANSWERHERE1,desc(YOURANSWERHERE2))
## # A tibble: 731 x 7
## year month day Amsterdam London NYC Venice
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2016 2.00 14.0 2.00 3.00 -14.0 6.00
## 2 2016 2.00 13.0 1.00 2.00 -10.0 4.00
## 3 2016 1.00 5.00 6.00 8.00 - 7.00 2.00
## 4 2017 1.00 9.00 6.00 7.00 - 7.00 - 2.00
## 5 2016 12.0 16.0 6.00 6.00 - 6.00 4.00
## 6 2017 1.00 8.00 4.00 9.00 - 6.00 - 2.00
## 7 2016 2.00 12.0 2.00 1.00 - 6.00 6.00
## 8 2016 1.00 19.0 -2.00 0 - 6.00 1.00
## 9 2017 3.00 15.0 9.00 11.0 - 5.00 10.0
## 10 2017 3.00 11.0 7.00 10.0 - 5.00 9.00
## # ... with 721 more rows
3. Arrange the data by the decreasing date.
travel_weather %>%
arrange(YOURANSWERHERE)
## # A tibble: 731 x 7
## year month day Amsterdam London NYC Venice
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2017 10.0 31.0 9.00 9.00 11.0 11.0
## 2 2017 10.0 30.0 8.00 6.00 12.0 13.0
## 3 2017 10.0 29.0 11.0 11.0 18.0 9.00
## 4 2017 10.0 28.0 12.0 10.0 17.0 10.0
## 5 2017 10.0 27.0 12.0 9.00 13.0 13.0
## 6 2017 10.0 26.0 13.0 10.0 13.0 13.0
## 7 2017 10.0 25.0 13.0 14.0 17.0 13.0
## 8 2017 10.0 24.0 13.0 16.0 21.0 13.0
## 9 2017 10.0 23.0 13.0 13.0 20.0 13.0
## 10 2017 10.0 22.0 11.0 11.0 19.0 13.0
## # ... with 721 more rows
4. Finally arrange the data by the temperature difference between London and Amsterdam, increasing.
travel_weather %>%
arrange(YOURANSWERHERE1 - YOURANSWERHERE2)
## # A tibble: 731 x 7
## year month day Amsterdam London NYC Venice
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2016 12.0 25.0 10.0 0 6.00 6.00
## 2 2015 12.0 25.0 9.00 0 17.0 4.00
## 3 2016 5.00 31.0 18.0 11.0 26.0 19.0
## 4 2016 6.00 1.00 19.0 12.0 24.0 17.0
## 5 2016 4.00 10.0 10.0 4.00 5.00 16.0
## 6 2016 6.00 7.00 20.0 14.0 24.0 22.0
## 7 2016 5.00 6.00 17.0 12.0 11.0 18.0
## 8 2016 5.00 8.00 21.0 16.0 14.0 17.0
## 9 2016 5.00 10.0 19.0 14.0 14.0 18.0
## 10 2016 6.00 3.00 16.0 11.0 19.0 19.0
## # ... with 721 more rows
mutate
function is used for calculations between columns. transmute
is similar but it adds the select
effect, therefore returning only the columns defined in the transmute
function.
1. Calculate the temperature difference between Venice and Amsterdam.
travel_weather %>%
mutate(VAdiff = YOURANSWERHERE1 - YOURANSWERHERE2)
## # A tibble: 731 x 8
## year month day Amsterdam London NYC Venice VAdiff
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2015 11.0 1.00 8.00 8.00 16.0 13.0 5.00
## 2 2015 11.0 2.00 10.0 11.0 15.0 10.0 0
## 3 2015 11.0 3.00 9.00 11.0 16.0 9.00 0
## 4 2015 11.0 4.00 12.0 11.0 17.0 10.0 -2.00
## 5 2015 11.0 5.00 13.0 13.0 18.0 12.0 -1.00
## 6 2015 11.0 6.00 16.0 14.0 21.0 13.0 -3.00
## 7 2015 11.0 7.00 16.0 14.0 17.0 14.0 -2.00
## 8 2015 11.0 8.00 12.0 12.0 11.0 13.0 1.00
## 9 2015 11.0 9.00 13.0 12.0 11.0 11.0 -2.00
## 10 2015 11.0 10.0 14.0 14.0 12.0 11.0 -3.00
## # ... with 721 more rows
2. Calculate if Venice is warmer than Amsterdam.
travel_weather %>%
mutate(VwarmerA = YOURANSWERHERE1 > YOURANSWERHERE2)
## # A tibble: 731 x 8
## year month day Amsterdam London NYC Venice VwarmerA
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl>
## 1 2015 11.0 1.00 8.00 8.00 16.0 13.0 T
## 2 2015 11.0 2.00 10.0 11.0 15.0 10.0 F
## 3 2015 11.0 3.00 9.00 11.0 16.0 9.00 F
## 4 2015 11.0 4.00 12.0 11.0 17.0 10.0 F
## 5 2015 11.0 5.00 13.0 13.0 18.0 12.0 F
## 6 2015 11.0 6.00 16.0 14.0 21.0 13.0 F
## 7 2015 11.0 7.00 16.0 14.0 17.0 14.0 F
## 8 2015 11.0 8.00 12.0 12.0 11.0 13.0 T
## 9 2015 11.0 9.00 13.0 12.0 11.0 11.0 F
## 10 2015 11.0 10.0 14.0 14.0 12.0 11.0 F
## # ... with 721 more rows
3. If Venice is warmer than Amsterdam write “warmer”, else “colder” and just return the date columns and warmer/colder info.
travel_weather %>%
transmute(year,month,day,
VwarmerA = ifelse(Venice > Amsterdam,YOURANSWERHERE1, YOURANSWERHERE2))
## # A tibble: 731 x 4
## year month day VwarmerA
## <dbl> <dbl> <dbl> <chr>
## 1 2015 11.0 1.00 warmer
## 2 2015 11.0 2.00 colder
## 3 2015 11.0 3.00 colder
## 4 2015 11.0 4.00 colder
## 5 2015 11.0 5.00 colder
## 6 2015 11.0 6.00 colder
## 7 2015 11.0 7.00 colder
## 8 2015 11.0 8.00 warmer
## 9 2015 11.0 9.00 colder
## 10 2015 11.0 10.0 colder
## # ... with 721 more rows
group_by
and summarise
are used for summary tables (sometimes referred to as pivot tables, especially for Excel users). Summarise can be used on its own or with the grouping function group_by
. This part is also the first part which you will use more than one pipe (%>%
).
Tip: If you want to break the grouping, just add the ungroup()
function at the end.
1. Calculate the mean temperatures of Venice and NYC of data period.
travel_weather %>%
summarise(Venice_mean=mean(YOURANSWERHERE1),NYC_mean=YOURANSWERHERE2)
## # A tibble: 1 x 2
## Venice_mean NYC_mean
## <dbl> <dbl>
## 1 14.3 14.4
2. Calculate the mean temperature of Amsterdam for each month. Round the value to two decimals.
travel_weather %>%
group_by(YOURANSWERHERE1) %>%
summarise(Amsterdam_mean=mean(YOURANSWERHERE2))
## # A tibble: 12 x 2
## month Amsterdam_mean
## <dbl> <dbl>
## 1 1.00 3.00
## 2 2.00 4.32
## 3 3.00 6.92
## 4 4.00 8.43
## 5 5.00 14.5
## 6 6.00 17.3
## 7 7.00 18.0
## 8 8.00 17.7
## 9 9.00 16.0
## 10 10.0 11.7
## 11 11.0 7.65
## 12 12.0 6.97
3. Calculate the number of days Amsterdam is warmer than NYC each year and each month.
travel_weather %>%
group_by(year,month) %>%
summarise(AwarmerN_n=sum(YOURANSWERHERE1 > YOURANSWERHERE2))
## # A tibble: 24 x 3
## # Groups: year [?]
## year month AwarmerN_n
## <dbl> <dbl> <int>
## 1 2015 11.0 11
## 2 2015 12.0 12
## 3 2016 1.00 23
## 4 2016 2.00 16
## 5 2016 3.00 5
## 6 2016 4.00 10
## 7 2016 5.00 8
## 8 2016 6.00 1
## 9 2016 7.00 1
## 10 2016 8.00 0
## # ... with 14 more rows
4. Calculate the maximum, minimum and median temperature values of London for each month and each year.
travel_weather %>%
group_by(year,month) %>%
summarise(London_min=YOURANSWERHERE1,London_median=median(London),London_max=YOURANSWERHERE2)
## # A tibble: 24 x 5
## # Groups: year [?]
## year month London_min London_median London_max
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2015 11.0 1.00 11.0 14.0
## 2 2015 12.0 0 10.0 14.0
## 3 2016 1.00 0 6.00 11.0
## 4 2016 2.00 1.00 4.00 12.0
## 5 2016 3.00 2.00 6.00 11.0
## 6 2016 4.00 4.00 8.00 11.0
## 7 2016 5.00 8.00 13.0 16.0
## 8 2016 6.00 11.0 16.0 19.0
## 9 2016 7.00 14.0 18.0 24.0
## 10 2016 8.00 14.0 18.0 24.0
## # ... with 14 more rows
Here is a showcase of some advanced examples of tidyverse data manipulation power.
Sometimes you want to have the differences between consecutive rows. Then you can use lag
and lead
functions. Suppose we want to calculate the
travel_weather %>%
transmute(year,month,day,Amsterdam,A_prev=lag(Amsterdam),A_next=lead(Amsterdam),
A_prev_diff=Amsterdam-A_prev,A_next_diff=Amsterdam-A_next)
## # A tibble: 731 x 8
## year month day Amsterdam A_prev A_next A_prev_diff A_next_diff
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2015 11.0 1.00 8.00 NA 10.0 NA -2.00
## 2 2015 11.0 2.00 10.0 8.00 9.00 2.00 1.00
## 3 2015 11.0 3.00 9.00 10.0 12.0 - 1.00 -3.00
## 4 2015 11.0 4.00 12.0 9.00 13.0 3.00 -1.00
## 5 2015 11.0 5.00 13.0 12.0 16.0 1.00 -3.00
## 6 2015 11.0 6.00 16.0 13.0 16.0 3.00 0
## 7 2015 11.0 7.00 16.0 16.0 12.0 0 4.00
## 8 2015 11.0 8.00 12.0 16.0 13.0 - 4.00 -1.00
## 9 2015 11.0 9.00 13.0 12.0 14.0 1.00 -1.00
## 10 2015 11.0 10.0 14.0 13.0 13.0 1.00 1.00
## # ... with 721 more rows
slice
Slice function returns the rows with the given indexes.
travel_weather %>%
slice(1:3)
## # A tibble: 3 x 7
## year month day Amsterdam London NYC Venice
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2015 11.0 1.00 8.00 8.00 16.0 13.0
## 2 2015 11.0 2.00 10.0 11.0 15.0 10.0
## 3 2015 11.0 3.00 9.00 11.0 16.0 9.00
It can also be combined with the group_by
function.
travel_weather %>%
group_by(year) %>%
slice(1:3)
## # A tibble: 9 x 7
## # Groups: year [3]
## year month day Amsterdam London NYC Venice
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2015 11.0 1.00 8.00 8.00 16.0 13.0
## 2 2015 11.0 2.00 10.0 11.0 15.0 10.0
## 3 2015 11.0 3.00 9.00 11.0 16.0 9.00
## 4 2016 1.00 1.00 4.00 3.00 3.00 2.00
## 5 2016 1.00 2.00 6.00 10.0 2.00 0
## 6 2016 1.00 3.00 7.00 8.00 4.00 3.00
## 7 2017 1.00 1.00 1.00 7.00 7.00 2.00
## 8 2017 1.00 2.00 3.00 2.00 3.00 1.00
## 9 2017 1.00 3.00 4.00 2.00 5.00 3.00
But be careful using the slice function as it only returns rows by the index value.
You might need to transform your data from wide (many columns) to long format (less columnns) or vice versa. They are also called melting and casting. Then you can use gather
and spread
functions respectively. They can be a bit confusing at first but you can quickly get used to them.
Suppose we want to see a summary table of average temperatures of each city for each month. But we want the cities as rows and months as columns.
#Transform to long format by melting the data
#Though you should not include date columns
travel_weather_long <-
travel_weather %>%
gather(key=City,value=Temperature,-year,-month,-day)
travel_weather_long
## # A tibble: 2,924 x 5
## year month day City Temperature
## <dbl> <dbl> <dbl> <chr> <dbl>
## 1 2015 11.0 1.00 Amsterdam 8.00
## 2 2015 11.0 2.00 Amsterdam 10.0
## 3 2015 11.0 3.00 Amsterdam 9.00
## 4 2015 11.0 4.00 Amsterdam 12.0
## 5 2015 11.0 5.00 Amsterdam 13.0
## 6 2015 11.0 6.00 Amsterdam 16.0
## 7 2015 11.0 7.00 Amsterdam 16.0
## 8 2015 11.0 8.00 Amsterdam 12.0
## 9 2015 11.0 9.00 Amsterdam 13.0
## 10 2015 11.0 10.0 Amsterdam 14.0
## # ... with 2,914 more rows
#Now group by and summarise to get average temperatures for each city and month
travel_weather_long %>%
group_by(month,City) %>%
summarise(temp_avg=round(mean(Temperature))) %>%
#Now spread the months to the columns
spread(month,temp_avg)
## # A tibble: 4 x 13
## City `1` `2` `3` `4` `5` `6` `7` `8` `9` `10`
## * <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Amsterdam 3.00 4.00 7.00 8.00 14.0 17.0 18.0 18.0 16.0 12.0
## 2 London 4.00 6.00 8.00 9.00 13.0 17.0 18.0 18.0 16.0 12.0
## 3 NYC 2.00 4.00 7.00 13.0 17.0 22.0 26.0 25.0 22.0 16.0
## 4 Venice 2.00 7.00 11.0 14.0 18.0 22.0 25.0 25.0 20.0 14.0
## `11` `12`
## * <dbl> <dbl>
## 1 8.00 7.00
## 2 9.00 8.00
## 3 11.0 7.00
## 4 9.00 5.00
Especially mutate
and summarise
has some special functions defined with “all” and “at” (in the previous versions “each”) suffixes.
Let’s get the average temperatures of all cities. We can do it in two ways. First select the cities and use summarise_all
or select cities in summarise_at
.
#Method 1
travel_weather %>%
select(Amsterdam:Venice) %>%
summarise_all(funs(round(mean(.))))
## # A tibble: 1 x 4
## Amsterdam London NYC Venice
## <dbl> <dbl> <dbl> <dbl>
## 1 11.0 12.0 14.0 14.0
#Method 2
travel_weather %>%
summarise_at(vars(Amsterdam:Venice),funs(round(mean(.))))
## # A tibble: 1 x 4
## Amsterdam London NYC Venice
## <dbl> <dbl> <dbl> <dbl>
## 1 11.0 12.0 14.0 14.0
We can use the mutate_at
function to see all other cities’ temperature differences from NYC.
#Method 2
travel_weather %>%
mutate_at(vars(Amsterdam,London,Venice),funs(diff_NYC=abs(NYC-.))) %>%
select(-Amsterdam,-London,-Venice)
## # A tibble: 731 x 7
## year month day NYC Amsterdam_diff_NYC London_diff_NYC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2015 11.0 1.00 16.0 8.00 8.00
## 2 2015 11.0 2.00 15.0 5.00 4.00
## 3 2015 11.0 3.00 16.0 7.00 5.00
## 4 2015 11.0 4.00 17.0 5.00 6.00
## 5 2015 11.0 5.00 18.0 5.00 5.00
## 6 2015 11.0 6.00 21.0 5.00 7.00
## 7 2015 11.0 7.00 17.0 1.00 3.00
## 8 2015 11.0 8.00 11.0 1.00 1.00
## 9 2015 11.0 9.00 11.0 2.00 1.00
## 10 2015 11.0 10.0 12.0 2.00 2.00
## Venice_diff_NYC
## <dbl>
## 1 3.00
## 2 5.00
## 3 7.00
## 4 7.00
## 5 6.00
## 6 8.00
## 7 3.00
## 8 2.00
## 9 0
## 10 1.00
## # ... with 721 more rows
These exercises are left to the students to test themselves. Try to write the code to replicate the results.
1. Return the dates which Amsterdam is strictly warmer than London but strictly colder than Venice
## # A tibble: 165 x 7
## year month day Amsterdam London NYC Venice
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2015 11.0 21.0 5.00 3.00 9.00 8.00
## 2 2015 11.0 22.0 3.00 1.00 9.00 8.00
## 3 2016 1.00 13.0 4.00 3.00 - 3.00 6.00
## 4 2016 1.00 16.0 2.00 1.00 8.00 4.00
## 5 2016 2.00 3.00 5.00 4.00 11.0 8.00
## 6 2016 2.00 11.0 4.00 3.00 - 4.00 7.00
## 7 2016 2.00 12.0 2.00 1.00 - 6.00 6.00
## 8 2016 2.00 23.0 4.00 3.00 3.00 11.0
## 9 2016 2.00 24.0 2.00 1.00 9.00 10.0
## 10 2016 2.00 25.0 2.00 1.00 9.00 8.00
## # ... with 155 more rows
2. For each month of each year calculate the average difference between NYC and Amsterdam for the days NYC is strictly warmer than Amsterdam, rounded by 1 decimal. Arrange from the highest difference to the lowest.
## # A tibble: 24 x 3
## # Groups: year [3]
## year month NYCwA_diff
## <dbl> <dbl> <dbl>
## 1 2016 8.00 8.40
## 2 2016 7.00 8.10
## 3 2017 9.00 7.90
## 4 2016 4.00 7.50
## 5 2017 4.00 7.40
## 6 2017 7.00 7.30
## 7 2017 8.00 6.50
## 8 2016 11.0 6.40
## 9 2016 3.00 6.30
## 10 2016 6.00 6.00
## # ... with 14 more rows
3. Return the warmest city and its temperature of each day.
## # A tibble: 731 x 5
## # Groups: year, month, day [731]
## year month day City Temperature
## <dbl> <dbl> <dbl> <chr> <dbl>
## 1 2015 11.0 1.00 NYC 16.0
## 2 2015 11.0 2.00 NYC 15.0
## 3 2015 11.0 3.00 NYC 16.0
## 4 2015 11.0 4.00 NYC 17.0
## 5 2015 11.0 5.00 NYC 18.0
## 6 2015 11.0 6.00 NYC 21.0
## 7 2015 11.0 7.00 NYC 17.0
## 8 2015 11.0 8.00 Venice 13.0
## 9 2015 11.0 9.00 Amsterdam 13.0
## 10 2015 11.0 10.0 Amsterdam 14.0
## # ... with 721 more rows