Assignment 1: Esoph and Young People Survey Data

Part-1: Esoph

For this assignment, we are going to use built-in data set named “esoph”. We call this data by just typing “esoph” to the console.

esoph

##    agegp     alcgp    tobgp ncases ncontrols
## 1  25-34 0-39g/day 0-9g/day      0        40
## 2  25-34 0-39g/day    10-19      0        10
## 3  25-34 0-39g/day    20-29      0         6
## 4  25-34 0-39g/day      30+      0         5
## 5  25-34     40-79 0-9g/day      0        27
## 6  25-34     40-79    10-19      0         7
## 7  25-34     40-79    20-29      0         4
## 8  25-34     40-79      30+      0         7
## 9  25-34    80-119 0-9g/day      0         2
## 10 25-34    80-119    10-19      0         1
## 11 25-34    80-119      30+      0         2
## 12 25-34      120+ 0-9g/day      0         1
## 13 25-34      120+    10-19      1         1
## 14 25-34      120+    20-29      0         1
## 15 25-34      120+      30+      0         2
## 16 35-44 0-39g/day 0-9g/day      0        60
## 17 35-44 0-39g/day    10-19      1        14
## 18 35-44 0-39g/day    20-29      0         7
## 19 35-44 0-39g/day      30+      0         8
## 20 35-44     40-79 0-9g/day      0        35
## 21 35-44     40-79    10-19      3        23
## 22 35-44     40-79    20-29      1        14
## 23 35-44     40-79      30+      0         8
## 24 35-44    80-119 0-9g/day      0        11
## 25 35-44    80-119    10-19      0         6
## 26 35-44    80-119    20-29      0         2
## 27 35-44    80-119      30+      0         1
## 28 35-44      120+ 0-9g/day      2         3
## 29 35-44      120+    10-19      0         3
## 30 35-44      120+    20-29      2         4
## 31 45-54 0-39g/day 0-9g/day      1        46
## 32 45-54 0-39g/day    10-19      0        18
## 33 45-54 0-39g/day    20-29      0        10
## 34 45-54 0-39g/day      30+      0         4
## 35 45-54     40-79 0-9g/day      6        38
## 36 45-54     40-79    10-19      4        21
## 37 45-54     40-79    20-29      5        15
## 38 45-54     40-79      30+      5         7
## 39 45-54    80-119 0-9g/day      3        16
## 40 45-54    80-119    10-19      6        14
## 41 45-54    80-119    20-29      1         5
## 42 45-54    80-119      30+      2         4
## 43 45-54      120+ 0-9g/day      4         4
## 44 45-54      120+    10-19      3         4
## 45 45-54      120+    20-29      2         3
## 46 45-54      120+      30+      4         4
## 47 55-64 0-39g/day 0-9g/day      2        49
## 48 55-64 0-39g/day    10-19      3        22
## 49 55-64 0-39g/day    20-29      3        12
## 50 55-64 0-39g/day      30+      4         6
## 51 55-64     40-79 0-9g/day      9        40
## 52 55-64     40-79    10-19      6        21
## 53 55-64     40-79    20-29      4        17
## 54 55-64     40-79      30+      3         6
## 55 55-64    80-119 0-9g/day      9        18
## 56 55-64    80-119    10-19      8        15
## 57 55-64    80-119    20-29      3         6
## 58 55-64    80-119      30+      4         4
## 59 55-64      120+ 0-9g/day      5        10
## 60 55-64      120+    10-19      6         7
## 61 55-64      120+    20-29      2         3
## 62 55-64      120+      30+      5         6
## 63 65-74 0-39g/day 0-9g/day      5        48
## 64 65-74 0-39g/day    10-19      4        14
## 65 65-74 0-39g/day    20-29      2         7
## 66 65-74 0-39g/day      30+      0         2
## 67 65-74     40-79 0-9g/day     17        34
## 68 65-74     40-79    10-19      3        10
## 69 65-74     40-79    20-29      5         9
## 70 65-74    80-119 0-9g/day      6        13
## 71 65-74    80-119    10-19      4        12
## 72 65-74    80-119    20-29      2         3
## 73 65-74    80-119      30+      1         1
## 74 65-74      120+ 0-9g/day      3         4
## 75 65-74      120+    10-19      1         2
## 76 65-74      120+    20-29      1         1
## 77 65-74      120+      30+      1         1
## 78   75+ 0-39g/day 0-9g/day      1        18
## 79   75+ 0-39g/day    10-19      2         6
## 80   75+ 0-39g/day      30+      1         3
## 81   75+     40-79 0-9g/day      2         5
## 82   75+     40-79    10-19      1         3
## 83   75+     40-79    20-29      0         3
## 84   75+     40-79      30+      1         1
## 85   75+    80-119 0-9g/day      1         1
## 86   75+    80-119    10-19      1         1
## 87   75+      120+ 0-9g/day      2         2
## 88   75+      120+    10-19      1         1

We can also see a summary of this data with the “summary” function:

summary(esoph)

##    agegp          alcgp         tobgp        ncases         ncontrols    
##  25-34:15   0-39g/day:23   0-9g/day:24   Min.   : 0.000   Min.   : 1.00  
##  35-44:15   40-79    :23   10-19   :24   1st Qu.: 0.000   1st Qu.: 3.00  
##  45-54:16   80-119   :21   20-29   :20   Median : 1.000   Median : 6.00  
##  55-64:16   120+     :21   30+     :20   Mean   : 2.273   Mean   :11.08  
##  65-74:15                                3rd Qu.: 4.000   3rd Qu.:14.00  
##  75+  :11                                Max.   :17.000   Max.   :60.00

This is the data that shows esophageal cancer rates depending on age, alcohol consumption and tobacco consumption. In this assignment, we want to see how each of these parameters effect the number of cancer cases.

The best way to do is to plot some graphs that show number of cases as a percentage of total number of control patients as per the data we want.

To do this, we need to group and summarize our data. It can be seen that the data itself is already grouped by the agegp column, so we can start from the effect of age.

Let’s load necessary packages:

library(tidyverse)

## ── Attaching packages ────────────────────────────────────────────────────────── tidyverse 1.2.1 ──

## ✔ ggplot2 2.2.1     ✔ purrr   0.2.4
## ✔ tibble  1.4.2     ✔ dplyr   0.7.4
## ✔ tidyr   0.8.0     ✔ stringr 1.3.0
## ✔ readr   1.1.1     ✔ forcats 0.3.0

## ── Conflicts ───────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(ggplot2)

Then we summarise the data and in “summarise” function we need to show the math of percentage:

age_effect <- esoph %>% group_by(agegp) %>%
    summarise(tot_rows = n(), people_cancer = sum(ncases), people_total = sum(ncontrols),
              percentage=people_cancer*100/people_total)

In this code chunk, we summarise esoph data such that, tot_rows is the count of total rows in dataset, people_cancer is the total number of cases with cancer, and people_total is the overall control case number. We get the percentage with the formula. Now we need to plot the graph according to each age group:

#Create the graph with ggplot
ageplot <- ggplot(age_effect, aes(x=age_effect$agegp, y=age_effect$percentage, fill=agegp)) + geom_bar(stat = "identity")

#Make the labelings and remove the legend which is not necessary:
ageplot + labs(x="Age Groups",y="Percentage of Cancer cases", title= "Cancer Cases as per Age Groups") + theme(legend.position = "none")

From this plot, we can observe that cancer risk is higher in older people, as most cancer cases are seen in age groups of 65-74, 55-64 and 75+.

Now, we can do the same application for alcohol consumption and tobacco consumption with small modifications:

alcohol_effect <- esoph %>% group_by(alcgp) %>%
    summarise(tot_rows = n(), people_cancer = sum(ncases), people_total = sum(ncontrols),
              percentage=people_cancer*100/people_total)

alcoholplot <- ggplot(alcohol_effect, aes(x=alcohol_effect$alcgp, y=alcohol_effect$percentage, fill=alcgp)) + geom_bar(stat = "identity")

alcoholplot + labs(x="Alcohol Consumption",y="Percentage of Cancer cases", title= "Cancer Cases as per Alcohol consumption") + theme(legend.position = "none")

tobacco_effect <- esoph %>% group_by(tobgp) %>%
    summarise(tot_rows = n(), people_cancer = sum(ncases), people_total = sum(ncontrols),
              percentage=people_cancer*100/people_total)

tobaccoplot <- ggplot(tobacco_effect, aes(x=tobacco_effect$tobgp, y=tobacco_effect$percentage, fill=tobgp)) + geom_bar(stat = "identity")

tobaccoplot + labs(x="Tobacco Consumption",y="Percentage of Cancer cases", title= "Cancer Cases as per Tobacco consumption") + theme(legend.position = "none")

By looking at these bar graphs, the negative effect of both alcohol and tobacco consumption on people’s health can be clearly seen. Cancer percentage increases with an increase in both tobacco consumption and alcohol consumption.

Second Alternative Approach:

While searching on the internet, I have found an alternative way of approaching the esoph data, which I want to share here directly, taken from “https://rdrr.io/r/datasets/esoph.html”

This approach gives the output as a mosaic plot, which is a more complex multi dimensional plot. It shows more detailed information regarding correlations, however I found it more difficult to comprehend.

require(stats)
require(graphics) # for mosaicplot
summary(esoph)

##    agegp          alcgp         tobgp        ncases         ncontrols    
##  25-34:15   0-39g/day:23   0-9g/day:24   Min.   : 0.000   Min.   : 1.00  
##  35-44:15   40-79    :23   10-19   :24   1st Qu.: 0.000   1st Qu.: 3.00  
##  45-54:16   80-119   :21   20-29   :20   Median : 1.000   Median : 6.00  
##  55-64:16   120+     :21   30+     :20   Mean   : 2.273   Mean   :11.08  
##  65-74:15                                3rd Qu.: 4.000   3rd Qu.:14.00  
##  75+  :11                                Max.   :17.000   Max.   :60.00

## effects of alcohol, tobacco and interaction, age-adjusted
model1 <- glm(cbind(ncases, ncontrols) ~ agegp + tobgp * alcgp,
              data = esoph, family = binomial())
anova(model1)

## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: cbind(ncases, ncontrols)
## 
## Terms added sequentially (first to last)
## 
## 
##             Df Deviance Resid. Df Resid. Dev
## NULL                           87    227.241
## agegp        5   88.128        82    139.112
## tobgp        3   19.085        79    120.028
## alcgp        3   66.054        76     53.973
## tobgp:alcgp  9    6.489        67     47.484

## Try a linear effect of alcohol and tobacco
model2 <- glm(cbind(ncases, ncontrols) ~ agegp + unclass(tobgp)
                                         + unclass(alcgp),
              data = esoph, family = binomial())
summary(model2)

## 
## Call:
## glm(formula = cbind(ncases, ncontrols) ~ agegp + unclass(tobgp) + 
##     unclass(alcgp), family = binomial(), data = esoph)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7628  -0.6426  -0.2709   0.3043   2.0421  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -4.01097    0.31224 -12.846  < 2e-16 ***
## agegp.L         2.96113    0.65092   4.549 5.39e-06 ***
## agegp.Q        -1.33735    0.58918  -2.270  0.02322 *  
## agegp.C         0.15292    0.44792   0.341  0.73281    
## agegp^4         0.06668    0.30776   0.217  0.82848    
## agegp^5        -0.20288    0.19523  -1.039  0.29872    
## unclass(tobgp)  0.26162    0.08198   3.191  0.00142 ** 
## unclass(alcgp)  0.65308    0.08452   7.727 1.10e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 227.241  on 87  degrees of freedom
## Residual deviance:  59.277  on 80  degrees of freedom
## AIC: 222.76
## 
## Number of Fisher Scoring iterations: 6

## Re-arrange data for a mosaic plot
ttt <- table(esoph$agegp, esoph$alcgp, esoph$tobgp)
o <- with(esoph, order(tobgp, alcgp, agegp))
ttt[ttt == 1] <- esoph$ncases[o]
tt1 <- table(esoph$agegp, esoph$alcgp, esoph$tobgp)
tt1[tt1 == 1] <- esoph$ncontrols[o]
tt <- array(c(ttt, tt1), c(dim(ttt),2),
            c(dimnames(ttt), list(c("Cancer", "control"))))
mosaicplot(tt, main = "esoph data set", color = TRUE)

Part-2: Young People Survey Data

For this part, we are going to use the dataset “youth_responses.csv”

First, we import the data:

library(tidyverse)
library(readr)
youth_responses <- read_csv("youth_responses.csv")

## Parsed with column specification:
## cols(
##   .default = col_integer(),
##   Smoking = col_character(),
##   Alcohol = col_character(),
##   Punctuality = col_character(),
##   Lying = col_character(),
##   `Internet usage` = col_character(),
##   Gender = col_character(),
##   `Left - right handed` = col_character(),
##   Education = col_character(),
##   `Only child` = col_character(),
##   `Village - town` = col_character(),
##   `House - block of flats` = col_character()
## )

## See spec(...) for full column specifications.

print(youth_responses)

## # A tibble: 1,010 x 150
##    Music `Slow songs or fast… Dance  Folk Country `Classical musi… Musical
##    <int>                <int> <int> <int>   <int>            <int>   <int>
##  1     5                    3     2     1       2                2       1
##  2     4                    4     2     1       1                1       2
##  3     5                    5     2     2       3                4       5
##  4     5                    3     2     1       1                1       1
##  5     5                    3     4     3       2                4       3
##  6     5                    3     2     3       2                3       3
##  7     5                    5     5     3       1                2       2
##  8     5                    3     3     2       1                2       2
##  9     5                    3     3     1       1                2       4
## 10     5                    3     2     5       2                2       5
## # ... with 1,000 more rows, and 143 more variables: Pop <int>, Rock <int>,
## #   `Metal or Hardrock` <int>, Punk <int>, `Hiphop, Rap` <int>, `Reggae,
## #   Ska` <int>, `Swing, Jazz` <int>, `Rock n roll` <int>,
## #   Alternative <int>, Latino <int>, `Techno, Trance` <int>, Opera <int>,
## #   Movies <int>, Horror <int>, Thriller <int>, Comedy <int>,
## #   Romantic <int>, `Sci-fi` <int>, War <int>, `Fantasy/Fairy
## #   tales` <int>, Animated <int>, Documentary <int>, Western <int>,
## #   Action <int>, History <int>, Psychology <int>, Politics <int>,
## #   Mathematics <int>, Physics <int>, Internet <int>, PC <int>, `Economy
## #   Management` <int>, Biology <int>, Chemistry <int>, Reading <int>,
## #   Geography <int>, `Foreign languages` <int>, Medicine <int>, Law <int>,
## #   Cars <int>, `Art exhibitions` <int>, Religion <int>, `Countryside,
## #   outdoors` <int>, Dancing <int>, `Musical instruments` <int>,
## #   Writing <int>, `Passive sport` <int>, `Active sport` <int>,
## #   Gardening <int>, Celebrities <int>, Shopping <int>, `Science and
## #   technology` <int>, Theatre <int>, `Fun with friends` <int>,
## #   `Adrenaline sports` <int>, Pets <int>, Flying <int>, Storm <int>,
## #   Darkness <int>, Heights <int>, Spiders <int>, Snakes <int>,
## #   Rats <int>, Ageing <int>, `Dangerous dogs` <int>, `Fear of public
## #   speaking` <int>, Smoking <chr>, Alcohol <chr>, `Healthy eating` <int>,
## #   `Daily events` <int>, `Prioritising workload` <int>, `Writing
## #   notes` <int>, Workaholism <int>, `Thinking ahead` <int>, `Final
## #   judgement` <int>, Reliability <int>, `Keeping promises` <int>, `Loss
## #   of interest` <int>, `Friends versus money` <int>, Funniness <int>,
## #   Fake <int>, `Criminal damage` <int>, `Decision making` <int>,
## #   Elections <int>, `Self-criticism` <int>, `Judgment calls` <int>,
## #   Hypochondria <int>, Empathy <int>, `Eating to survive` <int>,
## #   Giving <int>, `Compassion to animals` <int>, `Borrowed stuff` <int>,
## #   Loneliness <int>, `Cheating in school` <int>, Health <int>, `Changing
## #   the past` <int>, God <int>, Dreams <int>, Charity <int>, `Number of
## #   friends` <int>, …

We run principal component analysis:

yr_data <-
youth_responses %>%
filter(complete.cases(.)) %>%
tbl_df()
yr_pca<-
yr_data[,sapply(yr_data,class)=="integer"] %>%
select(Music:Age)
yr_pca_result<-princomp(yr_pca,cor=T)
ggplot(data=data.frame(PC=1:length(yr_pca_result$sdev),
var_exp=cumsum(yr_pca_result$sdev^2/sum(yr_pca_result$sdev^2))),
aes(x=PC,y=var_exp)) + geom_line() +
geom_point() +
scale_y_continuous(labels = scales::percent,breaks=seq(0,1,length.out=11)) +
scale_x_continuous(breaks=seq(0,135,by=5))

It can be observed that, approximately 90 principle components, we can explain 90% of the variation.

Then, we run multi-dimensional scaling for the data. I will run for the columns between “Movies” and “Action”:

mds_data <- yr_pca %>% select(Movies:Action)
print(head(as.data.frame(mds_data)))

##   Movies Horror Thriller Comedy Romantic Sci-fi War Fantasy/Fairy tales
## 1      5      4        2      5        4      4   1                   5
## 2      5      2        2      4        3      4   1                   3
## 3      5      3        4      4        2      4   2                   5
## 4      5      4        4      5        2      3   3                   4
## 5      5      5        5      5        2      3   3                   4
## 6      4      2        1      5        3      1   3                   5
##   Animated Documentary Western Action
## 1        5           3       1      2
## 2        5           4       1      4
## 3        5           2       2      1
## 4        4           3       1      4
## 5        3           3       2      4
## 6        5           3       1      2

yr_dist <- 1 - cor(mds_data)
yr_mds <- cmdscale(yr_dist,k=2)
colnames(yr_mds) <- c("x","y")
print(yr_mds)

##                               x           y
## Movies              -0.06220315  0.22244809
## Horror               0.22338890  0.44970946
## Thriller             0.34619908  0.37953838
## Comedy              -0.32366415  0.19671062
## Romantic            -0.63949811  0.12349794
## Sci-fi               0.26428812 -0.03621015
## War                  0.41480453 -0.15520774
## Fantasy/Fairy tales -0.55654956 -0.18611282
## Animated            -0.48175848 -0.15618915
## Documentary          0.10852095 -0.47845109
## Western              0.35783918 -0.31639985
## Action               0.34863267 -0.04333368

ggplot(data.frame(yr_mds),aes(x=x,y=y)) + geom_text(label=rownames(yr_mds),angle=45,size=4)

As per the MDS, we can see that Sci-fi, Action, War and Western movies are highly correlated. Similarly, young people who like horror movies are also seen to like thrillers. Another similar correlation is observed between animated movies and fantasy/fairy tales.

ML Part-1: Esoph and Young People Survey Data

Alp Biberoğlu

02 05 2018

Assignment 1: Esoph and Young People Survey Data

Part-1: Esoph

Second Alternative Approach:

Part-2: Young People Survey Data