This homework consists of 3 Assignments:
Assignment 1: Esoph and Young People Survey Data
Assignment 2: Diamonds Data
Assignment 3: Spam Data
For this assignment, we are going to use built-in data set named “esoph”. We call this data by just typing “esoph” to the console.
esoph
## agegp alcgp tobgp ncases ncontrols
## 1 25-34 0-39g/day 0-9g/day 0 40
## 2 25-34 0-39g/day 10-19 0 10
## 3 25-34 0-39g/day 20-29 0 6
## 4 25-34 0-39g/day 30+ 0 5
## 5 25-34 40-79 0-9g/day 0 27
## 6 25-34 40-79 10-19 0 7
## 7 25-34 40-79 20-29 0 4
## 8 25-34 40-79 30+ 0 7
## 9 25-34 80-119 0-9g/day 0 2
## 10 25-34 80-119 10-19 0 1
## 11 25-34 80-119 30+ 0 2
## 12 25-34 120+ 0-9g/day 0 1
## 13 25-34 120+ 10-19 1 1
## 14 25-34 120+ 20-29 0 1
## 15 25-34 120+ 30+ 0 2
## 16 35-44 0-39g/day 0-9g/day 0 60
## 17 35-44 0-39g/day 10-19 1 14
## 18 35-44 0-39g/day 20-29 0 7
## 19 35-44 0-39g/day 30+ 0 8
## 20 35-44 40-79 0-9g/day 0 35
## 21 35-44 40-79 10-19 3 23
## 22 35-44 40-79 20-29 1 14
## 23 35-44 40-79 30+ 0 8
## 24 35-44 80-119 0-9g/day 0 11
## 25 35-44 80-119 10-19 0 6
## 26 35-44 80-119 20-29 0 2
## 27 35-44 80-119 30+ 0 1
## 28 35-44 120+ 0-9g/day 2 3
## 29 35-44 120+ 10-19 0 3
## 30 35-44 120+ 20-29 2 4
## 31 45-54 0-39g/day 0-9g/day 1 46
## 32 45-54 0-39g/day 10-19 0 18
## 33 45-54 0-39g/day 20-29 0 10
## 34 45-54 0-39g/day 30+ 0 4
## 35 45-54 40-79 0-9g/day 6 38
## 36 45-54 40-79 10-19 4 21
## 37 45-54 40-79 20-29 5 15
## 38 45-54 40-79 30+ 5 7
## 39 45-54 80-119 0-9g/day 3 16
## 40 45-54 80-119 10-19 6 14
## 41 45-54 80-119 20-29 1 5
## 42 45-54 80-119 30+ 2 4
## 43 45-54 120+ 0-9g/day 4 4
## 44 45-54 120+ 10-19 3 4
## 45 45-54 120+ 20-29 2 3
## 46 45-54 120+ 30+ 4 4
## 47 55-64 0-39g/day 0-9g/day 2 49
## 48 55-64 0-39g/day 10-19 3 22
## 49 55-64 0-39g/day 20-29 3 12
## 50 55-64 0-39g/day 30+ 4 6
## 51 55-64 40-79 0-9g/day 9 40
## 52 55-64 40-79 10-19 6 21
## 53 55-64 40-79 20-29 4 17
## 54 55-64 40-79 30+ 3 6
## 55 55-64 80-119 0-9g/day 9 18
## 56 55-64 80-119 10-19 8 15
## 57 55-64 80-119 20-29 3 6
## 58 55-64 80-119 30+ 4 4
## 59 55-64 120+ 0-9g/day 5 10
## 60 55-64 120+ 10-19 6 7
## 61 55-64 120+ 20-29 2 3
## 62 55-64 120+ 30+ 5 6
## 63 65-74 0-39g/day 0-9g/day 5 48
## 64 65-74 0-39g/day 10-19 4 14
## 65 65-74 0-39g/day 20-29 2 7
## 66 65-74 0-39g/day 30+ 0 2
## 67 65-74 40-79 0-9g/day 17 34
## 68 65-74 40-79 10-19 3 10
## 69 65-74 40-79 20-29 5 9
## 70 65-74 80-119 0-9g/day 6 13
## 71 65-74 80-119 10-19 4 12
## 72 65-74 80-119 20-29 2 3
## 73 65-74 80-119 30+ 1 1
## 74 65-74 120+ 0-9g/day 3 4
## 75 65-74 120+ 10-19 1 2
## 76 65-74 120+ 20-29 1 1
## 77 65-74 120+ 30+ 1 1
## 78 75+ 0-39g/day 0-9g/day 1 18
## 79 75+ 0-39g/day 10-19 2 6
## 80 75+ 0-39g/day 30+ 1 3
## 81 75+ 40-79 0-9g/day 2 5
## 82 75+ 40-79 10-19 1 3
## 83 75+ 40-79 20-29 0 3
## 84 75+ 40-79 30+ 1 1
## 85 75+ 80-119 0-9g/day 1 1
## 86 75+ 80-119 10-19 1 1
## 87 75+ 120+ 0-9g/day 2 2
## 88 75+ 120+ 10-19 1 1
We can also see a summary of this data with the “summary” function:
summary(esoph)
## agegp alcgp tobgp ncases ncontrols
## 25-34:15 0-39g/day:23 0-9g/day:24 Min. : 0.000 Min. : 1.00
## 35-44:15 40-79 :23 10-19 :24 1st Qu.: 0.000 1st Qu.: 3.00
## 45-54:16 80-119 :21 20-29 :20 Median : 1.000 Median : 6.00
## 55-64:16 120+ :21 30+ :20 Mean : 2.273 Mean :11.08
## 65-74:15 3rd Qu.: 4.000 3rd Qu.:14.00
## 75+ :11 Max. :17.000 Max. :60.00
This is the data that shows esophageal cancer rates depending on age, alcohol consumption and tobacco consumption. In this assignment, we want to see how each of these parameters effect the number of cancer cases.
The best way to do is to plot some graphs that show number of cases as a percentage of total number of control patients as per the data we want.
To do this, we need to group and summarize our data. It can be seen that the data itself is already grouped by the agegp column, so we can start from the effect of age.
Let’s load necessary packages:
library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1 ✔ purrr 0.2.4
## ✔ tibble 1.4.2 ✔ dplyr 0.7.4
## ✔ tidyr 0.8.0 ✔ stringr 1.3.0
## ✔ readr 1.1.1 ✔ forcats 0.3.0
## ── Conflicts ───────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(ggplot2)
Then we summarise the data and in “summarise” function we need to show the math of percentage:
age_effect <- esoph %>% group_by(agegp) %>%
summarise(tot_rows = n(), people_cancer = sum(ncases), people_total = sum(ncontrols),
percentage=people_cancer*100/people_total)
In this code chunk, we summarise esoph data such that, tot_rows is the count of total rows in dataset, people_cancer is the total number of cases with cancer, and people_total is the overall control case number. We get the percentage with the formula. Now we need to plot the graph according to each age group:
#Create the graph with ggplot
ageplot <- ggplot(age_effect, aes(x=age_effect$agegp, y=age_effect$percentage, fill=agegp)) + geom_bar(stat = "identity")
#Make the labelings and remove the legend which is not necessary:
ageplot + labs(x="Age Groups",y="Percentage of Cancer cases", title= "Cancer Cases as per Age Groups") + theme(legend.position = "none")
From this plot, we can observe that cancer risk is higher in older people, as most cancer cases are seen in age groups of 65-74, 55-64 and 75+.
Now, we can do the same application for alcohol consumption and tobacco consumption with small modifications:
alcohol_effect <- esoph %>% group_by(alcgp) %>%
summarise(tot_rows = n(), people_cancer = sum(ncases), people_total = sum(ncontrols),
percentage=people_cancer*100/people_total)
alcoholplot <- ggplot(alcohol_effect, aes(x=alcohol_effect$alcgp, y=alcohol_effect$percentage, fill=alcgp)) + geom_bar(stat = "identity")
alcoholplot + labs(x="Alcohol Consumption",y="Percentage of Cancer cases", title= "Cancer Cases as per Alcohol consumption") + theme(legend.position = "none")
tobacco_effect <- esoph %>% group_by(tobgp) %>%
summarise(tot_rows = n(), people_cancer = sum(ncases), people_total = sum(ncontrols),
percentage=people_cancer*100/people_total)
tobaccoplot <- ggplot(tobacco_effect, aes(x=tobacco_effect$tobgp, y=tobacco_effect$percentage, fill=tobgp)) + geom_bar(stat = "identity")
tobaccoplot + labs(x="Tobacco Consumption",y="Percentage of Cancer cases", title= "Cancer Cases as per Tobacco consumption") + theme(legend.position = "none")
By looking at these bar graphs, the negative effect of both alcohol and tobacco consumption on people’s health can be clearly seen. Cancer percentage increases with an increase in both tobacco consumption and alcohol consumption.
While searching on the internet, I have found an alternative way of approaching the esoph data, which I want to share here directly, taken from “https://rdrr.io/r/datasets/esoph.html”
This approach gives the output as a mosaic plot, which is a more complex multi dimensional plot. It shows more detailed information regarding correlations, however I found it more difficult to comprehend.
require(stats)
require(graphics) # for mosaicplot
summary(esoph)
## agegp alcgp tobgp ncases ncontrols
## 25-34:15 0-39g/day:23 0-9g/day:24 Min. : 0.000 Min. : 1.00
## 35-44:15 40-79 :23 10-19 :24 1st Qu.: 0.000 1st Qu.: 3.00
## 45-54:16 80-119 :21 20-29 :20 Median : 1.000 Median : 6.00
## 55-64:16 120+ :21 30+ :20 Mean : 2.273 Mean :11.08
## 65-74:15 3rd Qu.: 4.000 3rd Qu.:14.00
## 75+ :11 Max. :17.000 Max. :60.00
## effects of alcohol, tobacco and interaction, age-adjusted
model1 <- glm(cbind(ncases, ncontrols) ~ agegp + tobgp * alcgp,
data = esoph, family = binomial())
anova(model1)
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: cbind(ncases, ncontrols)
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev
## NULL 87 227.241
## agegp 5 88.128 82 139.112
## tobgp 3 19.085 79 120.028
## alcgp 3 66.054 76 53.973
## tobgp:alcgp 9 6.489 67 47.484
## Try a linear effect of alcohol and tobacco
model2 <- glm(cbind(ncases, ncontrols) ~ agegp + unclass(tobgp)
+ unclass(alcgp),
data = esoph, family = binomial())
summary(model2)
##
## Call:
## glm(formula = cbind(ncases, ncontrols) ~ agegp + unclass(tobgp) +
## unclass(alcgp), family = binomial(), data = esoph)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7628 -0.6426 -0.2709 0.3043 2.0421
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.01097 0.31224 -12.846 < 2e-16 ***
## agegp.L 2.96113 0.65092 4.549 5.39e-06 ***
## agegp.Q -1.33735 0.58918 -2.270 0.02322 *
## agegp.C 0.15292 0.44792 0.341 0.73281
## agegp^4 0.06668 0.30776 0.217 0.82848
## agegp^5 -0.20288 0.19523 -1.039 0.29872
## unclass(tobgp) 0.26162 0.08198 3.191 0.00142 **
## unclass(alcgp) 0.65308 0.08452 7.727 1.10e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 227.241 on 87 degrees of freedom
## Residual deviance: 59.277 on 80 degrees of freedom
## AIC: 222.76
##
## Number of Fisher Scoring iterations: 6
## Re-arrange data for a mosaic plot
ttt <- table(esoph$agegp, esoph$alcgp, esoph$tobgp)
o <- with(esoph, order(tobgp, alcgp, agegp))
ttt[ttt == 1] <- esoph$ncases[o]
tt1 <- table(esoph$agegp, esoph$alcgp, esoph$tobgp)
tt1[tt1 == 1] <- esoph$ncontrols[o]
tt <- array(c(ttt, tt1), c(dim(ttt),2),
c(dimnames(ttt), list(c("Cancer", "control"))))
mosaicplot(tt, main = "esoph data set", color = TRUE)
For this part, we are going to use the dataset “youth_responses.csv”
First, we import the data:
library(tidyverse)
library(readr)
youth_responses <- read_csv("youth_responses.csv")
## Parsed with column specification:
## cols(
## .default = col_integer(),
## Smoking = col_character(),
## Alcohol = col_character(),
## Punctuality = col_character(),
## Lying = col_character(),
## `Internet usage` = col_character(),
## Gender = col_character(),
## `Left - right handed` = col_character(),
## Education = col_character(),
## `Only child` = col_character(),
## `Village - town` = col_character(),
## `House - block of flats` = col_character()
## )
## See spec(...) for full column specifications.
print(youth_responses)
## # A tibble: 1,010 x 150
## Music `Slow songs or fast… Dance Folk Country `Classical musi… Musical
## <int> <int> <int> <int> <int> <int> <int>
## 1 5 3 2 1 2 2 1
## 2 4 4 2 1 1 1 2
## 3 5 5 2 2 3 4 5
## 4 5 3 2 1 1 1 1
## 5 5 3 4 3 2 4 3
## 6 5 3 2 3 2 3 3
## 7 5 5 5 3 1 2 2
## 8 5 3 3 2 1 2 2
## 9 5 3 3 1 1 2 4
## 10 5 3 2 5 2 2 5
## # ... with 1,000 more rows, and 143 more variables: Pop <int>, Rock <int>,
## # `Metal or Hardrock` <int>, Punk <int>, `Hiphop, Rap` <int>, `Reggae,
## # Ska` <int>, `Swing, Jazz` <int>, `Rock n roll` <int>,
## # Alternative <int>, Latino <int>, `Techno, Trance` <int>, Opera <int>,
## # Movies <int>, Horror <int>, Thriller <int>, Comedy <int>,
## # Romantic <int>, `Sci-fi` <int>, War <int>, `Fantasy/Fairy
## # tales` <int>, Animated <int>, Documentary <int>, Western <int>,
## # Action <int>, History <int>, Psychology <int>, Politics <int>,
## # Mathematics <int>, Physics <int>, Internet <int>, PC <int>, `Economy
## # Management` <int>, Biology <int>, Chemistry <int>, Reading <int>,
## # Geography <int>, `Foreign languages` <int>, Medicine <int>, Law <int>,
## # Cars <int>, `Art exhibitions` <int>, Religion <int>, `Countryside,
## # outdoors` <int>, Dancing <int>, `Musical instruments` <int>,
## # Writing <int>, `Passive sport` <int>, `Active sport` <int>,
## # Gardening <int>, Celebrities <int>, Shopping <int>, `Science and
## # technology` <int>, Theatre <int>, `Fun with friends` <int>,
## # `Adrenaline sports` <int>, Pets <int>, Flying <int>, Storm <int>,
## # Darkness <int>, Heights <int>, Spiders <int>, Snakes <int>,
## # Rats <int>, Ageing <int>, `Dangerous dogs` <int>, `Fear of public
## # speaking` <int>, Smoking <chr>, Alcohol <chr>, `Healthy eating` <int>,
## # `Daily events` <int>, `Prioritising workload` <int>, `Writing
## # notes` <int>, Workaholism <int>, `Thinking ahead` <int>, `Final
## # judgement` <int>, Reliability <int>, `Keeping promises` <int>, `Loss
## # of interest` <int>, `Friends versus money` <int>, Funniness <int>,
## # Fake <int>, `Criminal damage` <int>, `Decision making` <int>,
## # Elections <int>, `Self-criticism` <int>, `Judgment calls` <int>,
## # Hypochondria <int>, Empathy <int>, `Eating to survive` <int>,
## # Giving <int>, `Compassion to animals` <int>, `Borrowed stuff` <int>,
## # Loneliness <int>, `Cheating in school` <int>, Health <int>, `Changing
## # the past` <int>, God <int>, Dreams <int>, Charity <int>, `Number of
## # friends` <int>, …
We run principal component analysis:
yr_data <-
youth_responses %>%
filter(complete.cases(.)) %>%
tbl_df()
yr_pca<-
yr_data[,sapply(yr_data,class)=="integer"] %>%
select(Music:Age)
yr_pca_result<-princomp(yr_pca,cor=T)
ggplot(data=data.frame(PC=1:length(yr_pca_result$sdev),
var_exp=cumsum(yr_pca_result$sdev^2/sum(yr_pca_result$sdev^2))),
aes(x=PC,y=var_exp)) + geom_line() +
geom_point() +
scale_y_continuous(labels = scales::percent,breaks=seq(0,1,length.out=11)) +
scale_x_continuous(breaks=seq(0,135,by=5))
It can be observed that, approximately 90 principle components, we can explain 90% of the variation.
Then, we run multi-dimensional scaling for the data. I will run for the columns between “Movies” and “Action”:
mds_data <- yr_pca %>% select(Movies:Action)
print(head(as.data.frame(mds_data)))
## Movies Horror Thriller Comedy Romantic Sci-fi War Fantasy/Fairy tales
## 1 5 4 2 5 4 4 1 5
## 2 5 2 2 4 3 4 1 3
## 3 5 3 4 4 2 4 2 5
## 4 5 4 4 5 2 3 3 4
## 5 5 5 5 5 2 3 3 4
## 6 4 2 1 5 3 1 3 5
## Animated Documentary Western Action
## 1 5 3 1 2
## 2 5 4 1 4
## 3 5 2 2 1
## 4 4 3 1 4
## 5 3 3 2 4
## 6 5 3 1 2
yr_dist <- 1 - cor(mds_data)
yr_mds <- cmdscale(yr_dist,k=2)
colnames(yr_mds) <- c("x","y")
print(yr_mds)
## x y
## Movies -0.06220315 0.22244809
## Horror 0.22338890 0.44970946
## Thriller 0.34619908 0.37953838
## Comedy -0.32366415 0.19671062
## Romantic -0.63949811 0.12349794
## Sci-fi 0.26428812 -0.03621015
## War 0.41480453 -0.15520774
## Fantasy/Fairy tales -0.55654956 -0.18611282
## Animated -0.48175848 -0.15618915
## Documentary 0.10852095 -0.47845109
## Western 0.35783918 -0.31639985
## Action 0.34863267 -0.04333368
ggplot(data.frame(yr_mds),aes(x=x,y=y)) + geom_text(label=rownames(yr_mds),angle=45,size=4)
As per the MDS, we can see that Sci-fi, Action, War and Western movies are highly correlated. Similarly, young people who like horror movies are also seen to like thrillers. Another similar correlation is observed between animated movies and fantasy/fairy tales.