Assignment 1

Esoph

First, we build a logistic regression model to see whether our variables are significant predictors of esophageal cancer.

library(tidyverse)
## -- Attaching packages ------------------------------------------------------------------------------ tidyverse 1.2.1 --
## √ ggplot2 2.2.1     √ purrr   0.2.4
## √ tibble  1.4.2     √ dplyr   0.7.4
## √ tidyr   0.8.0     √ stringr 1.2.0
## √ readr   1.1.1     √ forcats 0.2.0
## -- Conflicts --------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(stats)
library(ggplot2)
cancer_model <- glm(cbind(ncases,ncontrols) ~ agegp + tobgp + alcgp , data = esoph , family = binomial() )
anova(cancer_model)
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: cbind(ncases, ncontrols)
## 
## Terms added sequentially (first to last)
## 
## 
##       Df Deviance Resid. Df Resid. Dev
## NULL                     87    227.241
## agegp  5   88.128        82    139.112
## tobgp  3   19.085        79    120.028
## alcgp  3   66.054        76     53.973

We can clearly see that age is a key variable: adding the age-group predictor reduces the residual deviance by 88.128. The alcohol-consumption predictor also improves the fit substantially, which we can again see from the drop in residual deviance. Finally, tobacco consumption has an effect on cancer as well, although it is smaller by comparison. Together these variables give us a model with a much better fit.
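To back these deviance drops up with formal tests, a chi-square test can be attached to each sequential term. A minimal sketch (my own addition, not part of the original output):

# Sketch: the same sequential ANOVA with chi-square p-values for each term
anova(cancer_model, test = "Chisq")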

If we want to examine the joint effect of tobacco and alcohol, we should build a model that includes the interaction between alcohol and tobacco consumption.

cancer_model <- glm(cbind(ncases,ncontrols) ~ agegp + tobgp * alcgp , data = esoph , family = binomial() )
anova(cancer_model)
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: cbind(ncases, ncontrols)
## 
## Terms added sequentially (first to last)
## 
## 
##             Df Deviance Resid. Df Resid. Dev
## NULL                           87    227.241
## agegp        5   88.128        82    139.112
## tobgp        3   19.085        79    120.028
## alcgp        3   66.054        76     53.973
## tobgp:alcgp  9    6.489        67     47.484

We observe only a small further decrease in the residual deviance when the interaction between drinking and smoking is added. Still, it suggests that using alcohol and tobacco together is associated with an increased probability of cancer.
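The two models can also be compared directly with a likelihood-ratio test. A sketch (the model names main_model and int_model below are my own, not from the original code):

# Sketch: likelihood-ratio comparison of the main-effects and interaction models
main_model <- glm(cbind(ncases, ncontrols) ~ agegp + tobgp + alcgp,
                  data = esoph, family = binomial())
int_model <- glm(cbind(ncases, ncontrols) ~ agegp + tobgp * alcgp,
                 data = esoph, family = binomial())
anova(main_model, int_model, test = "Chisq")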

Reference: http://stat.ethz.ch/R-manual/R-devel/library/datasets/html/esoph.html

Young People Survey’s Hobbies & Interests

I applied the K-means clustering method with six cluster centers to see which hobbies fall into the same cluster and how people's interests group together.

# Load the survey data and keep only complete responses
yr_data <-
  read.csv("/Users/TCTAGUMUS/Documents/R işleri/youth_responses.csv", sep = ",") %>%
  filter(complete.cases(.)) %>%
  # mutate(id=row_number()) %>%
  tbl_df()
# Keep the integer-coded hobby/interest columns (History through Pets)
yr_pca <-
  yr_data[, sapply(yr_data, class) == "integer"] %>%
  select(History:Pets)
yr_mds_data <- yr_pca %>% select(History:Pets)
# Use 1 - correlation as a distance between hobbies
yr_dist <- 1 - cor(yr_mds_data)
# Apply MDS to place hobbies in two dimensions
yr_mds <- cmdscale(yr_dist, k = 2)
colnames(yr_mds) <- c("x", "y")
library(ggplot2)
#Set the seed because K-Means algorithm uses a search based method
set.seed(58)
#Apply K-Means
genre_cluster<-kmeans(yr_mds,centers=6)
##Get the clusters
mds_clusters<-
data.frame(genre=names(genre_cluster$cluster),
cluster_mds=genre_cluster$cluster) %>% arrange(cluster_mds,genre)
mds_clusters
##                     genre cluster_mds
## 1         Art.exhibitions           1
## 2                 Dancing           1
## 3               Gardening           1
## 4              Psychology           1
## 5                 Reading           1
## 6                 Theatre           1
## 7                 Writing           1
## 8             Mathematics           2
## 9                 Physics           2
## 10 Science.and.technology           2
## 11            Celebrities           3
## 12      Foreign.languages           3
## 13       Fun.with.friends           3
## 14                   Pets           3
## 15               Shopping           3
## 16                   Cars           4
## 17               Internet           4
## 18                     PC           4
## 19           Active.sport           5
## 20      Adrenaline.sports           5
## 21     Economy.Management           5
## 22              Geography           5
## 23                    Law           5
## 24          Passive.sport           5
## 25               Politics           5
## 26                Biology           6
## 27              Chemistry           6
## 28  Countryside..outdoors           6
## 29                History           6
## 30               Medicine           6
## 31    Musical.instruments           6
## 32               Religion           6
ggplot(
  data.frame(yr_mds) %>%
    mutate(clusters = as.factor(genre_cluster$cluster),
           genres = rownames(yr_mds)),
  aes(x = x, y = y)) +
  geom_text(aes(label = genres, color = clusters), angle = 45, size = 2) +
  geom_point(data = as.data.frame(genre_cluster$centers), aes(x = x, y = y))

From the clustering table and plot, we can observe how respondents' hobbies are correlated. Mathematics, physics and science & technology fall into cluster 2, chemistry, medicine and biology into cluster 6, and so on. That seems logical. By applying this algorithm, we can guess which hobbies a person might be interested in from some of their other hobbies. For example, if a young individual enjoys politics, we can expect that he or she probably also likes law, since the two are in the same cluster.
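The choice of six centers is not the only reasonable one. A minimal sketch (my own addition, not part of the original analysis) for sanity-checking the number of clusters with the total within-cluster sum of squares:

# Sketch: elbow check for k on the MDS coordinates
set.seed(58)
wss <- sapply(2:10, function(k) kmeans(yr_mds, centers = k, nstart = 10)$tot.withinss)
plot(2:10, wss, type = "b",
     xlab = "number of clusters k", ylab = "total within-cluster SS")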

Assignment 2 : Diamonds Data

I plotted carat vs. price, coloring the points by cut, to see how price changes with carat and to look at the relations among price, cut and carat. In the plot shown below, we can clearly see that price increases with carat, but every cut (color) appears throughout the price range, which implies that cut is not very informative on its own.

library(rpart) #To construct CART models
library(rpart.plot) # To plot rpart trees
## Warning: package 'rpart.plot' was built under R version 3.4.4
library(rattle) #For visualization
## Warning: package 'rattle' was built under R version 3.4.4
## Rattle: A free graphical interface for data science with R.
## Version 5.1.0 Copyright (c) 2006-2017 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
ggplot(diamonds) + geom_point(aes(x = carat, y = price, color = cut ))

I also plotted the dimensions (x, y, z) against price. Looking at the plot, we can see a positive trend, which implies that increases in x, y and z come with increases in price.

ggplot(diamonds) +
  geom_point(aes(x = x, y = price, color = "x")) +
  geom_point(aes(x = y, y = price, color = "y")) +
  geom_point(aes(x = z, y = price, color = "z")) +
  xlim(2, 12) +
  labs(color = "Dimension") +
  scale_color_manual(values = c("x" = "blue", "y" = "yellow", "z" = "red"))
## Warning: Removed 8 rows containing missing values (geom_point).
## Warning: Removed 9 rows containing missing values (geom_point).
## Warning: Removed 24 rows containing missing values (geom_point).

For the predictive model, I split the data into train and test sets.

set.seed(503)
library(tidyverse)
diamonds_test <- diamonds %>% mutate(diamond_id = row_number()) %>%
  group_by(cut, color, clarity) %>% sample_frac(0.2) %>% ungroup()
diamonds_train <- anti_join(diamonds %>% mutate(diamond_id = row_number()),
                            diamonds_test, by = "diamond_id")
diamonds_train
## # A tibble: 43,143 x 11
##    carat cut       color clarity depth table price     x     y     z
##    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1 0.230 Ideal     E     SI2      61.5  55.0   326  3.95  3.98  2.43
##  2 0.210 Premium   E     SI1      59.8  61.0   326  3.89  3.84  2.31
##  3 0.230 Good      E     VS1      56.9  65.0   327  4.05  4.07  2.31
##  4 0.290 Premium   I     VS2      62.4  58.0   334  4.20  4.23  2.63
##  5 0.240 Very Good J     VVS2     62.8  57.0   336  3.94  3.96  2.48
##  6 0.240 Very Good I     VVS1     62.3  57.0   336  3.95  3.98  2.47
##  7 0.260 Very Good H     SI1      61.9  55.0   337  4.07  4.11  2.53
##  8 0.220 Fair      E     VS2      65.1  61.0   337  3.87  3.78  2.49
##  9 0.230 Very Good H     VS1      59.4  61.0   338  4.00  4.05  2.39
## 10 0.300 Good      J     SI1      64.0  55.0   339  4.25  4.28  2.73
## # ... with 43,133 more rows, and 1 more variable: diamond_id <int>
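As a quick sanity check on the split (a sketch I added, not in the original output), the test set should hold roughly 20% of the rows:

# Sketch: proportion of rows that ended up in the test set (should be about 0.2)
nrow(diamonds_test) / nrow(diamonds)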

Then, I created a model via rpart (excluding diamond_id as a variable) and made predictions on the train data. To measure performance (how good the model is), I calculated the mean absolute error. It is about 890 in terms of price, which is acceptably good given a price range of roughly 0–15,000.

diamonds_model <- rpart(price ~.-diamond_id, data = diamonds_train)
fancyRpartPlot(diamonds_model)

diamonds_in_sample <- predict(diamonds_model)
diamonds_comp <- cbind(diamonds_train$price, diamonds_in_sample)
colnames(diamonds_comp) <- c("actual", "predicted")
diamonds_comp <- data.frame(diamonds_comp)
diamonds_comp$diff <- diamonds_comp$actual - diamonds_comp$predicted
mean(abs(diamonds_comp$diff))
## [1] 886.1441
print(head(diamonds_comp))
##   actual predicted      diff
## 1    326  1056.168 -730.1677
## 2    326  1056.168 -730.1677
## 3    327  1056.168 -729.1677
## 4    334  1056.168 -722.1677
## 5    336  1056.168 -720.1677
## 6    336  1056.168 -720.1677
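Beyond the mean absolute error, other in-sample summaries can be computed from the same comparison table. A hedged sketch (my own addition; the 500-dollar threshold is purely illustrative):

# Sketch: root mean squared error and share of predictions within 500 of actual
sqrt(mean(diamonds_comp$diff^2))
mean(abs(diamonds_comp$diff) < 500)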

After training, I made predictions on the test data and again calculated the error.

diamonds_predict <- predict(diamonds_model, newdata = diamonds_test)
diamonds_comp2 <- cbind(diamonds_test$price, diamonds_predict)
colnames(diamonds_comp2) <- c("actual", "predicted")
diamonds_comp2 <- data.frame(diamonds_comp2)
diamonds_comp2$diff <- diamonds_comp2$actual - diamonds_comp2$predicted
mean(abs(diamonds_comp2$diff))
## [1] 889.7092
print(head(diamonds_comp2))
##   actual predicted       diff
## 1  15964 14944.339  1019.6610
## 2   3205  3073.487   131.5130
## 3   3384  3073.487   310.5130
## 4   3634  5394.948 -1760.9476
## 5   2358  3073.487  -715.4870
## 6   4398  5394.948  -996.9476
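A quick visual check of the out-of-sample fit (a sketch, not part of the original report) is to plot predicted against actual test prices:

# Sketch: predicted vs. actual prices on the test set, with the y = x line
ggplot(diamonds_comp2, aes(x = actual, y = predicted)) +
  geom_point(alpha = 0.2) +
  geom_abline(slope = 1, intercept = 0, color = "red")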

Assignment 3 : Spam Data

This assignment is about analyzing UCI's Spambase data and building a model on it to detect further possible spam mails according to their features.

First, the data is loaded and split into test and train sets using the indicator provided in the dataset.

load("/Users/TCTAGUMUS/Documents/R işleri/spam_data.RData")
spam_test <- spam_data %>% filter(train_test == 1 )
spam_train <- spam_data %>% filter(train_test == 0)
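As a small sanity check on the split (my own sketch), the train_test indicator can be tabulated directly:

# Sketch: number of rows in the train (0) and test (1) partitions
table(spam_data$train_test)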

Then, I built a model on the train data using rpart and plotted the decision tree. After that, I made in-sample predictions with this model on the train data.

spam_model <- rpart(spam_or_not ~ . -train_test, data = spam_train)
fancyRpartPlot(spam_model)

spam_in_sample <- predict(spam_model)
print(head(spam_in_sample))
##          1          2          3          4          5          6 
## 0.86772487 0.94753747 0.94753747 0.90812721 0.90812721 0.05763952

To see how good the model is, I compared the actual and predicted values on the train data. The number of correct predictions divided by the total number of cases serves as the performance indicator here.

in_sample_prediction <- cbind(spam_train$spam_or_not, spam_in_sample)
in_sample_prediction <- data.frame(in_sample_prediction)
in_sample_prediction <- in_sample_prediction %>%
  mutate(V2 = ifelse(spam_in_sample >= 0.5, 1, 0)) %>%
  mutate(is_correct = V1 == V2) %>%
  group_by(is_correct) %>%
  summarise(count = n(), percentage = n() / nrow(.))
in_sample_prediction
## # A tibble: 2 x 3
##   is_correct count percentage
##   <lgl>      <int>      <dbl>
## 1 F            398     0.0970
## 2 T           3703     0.903

As the results above show, our model can help categorize mails as spam or not, with roughly 9 successful classifications out of every 10.
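For a fuller breakdown of the errors (a sketch I added, not in the original output), the predictions can be cross-tabulated against the actual labels:

# Sketch: in-sample confusion matrix at a 0.5 threshold
table(actual = spam_train$spam_or_not,
      predicted = as.integer(spam_in_sample >= 0.5))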

Finally, I repeated the same steps using the test data.

test_predict <- predict(spam_model, newdata = spam_test)
print(head(test_predict))
##         1         2         3         4         5         6 
## 0.9475375 0.9081272 0.4009662 0.9475375 0.3214286 0.9475375
test_prediction <- cbind(spam_test$spam_or_not, test_predict)
test_prediction <- data.frame(test_prediction)
test_prediction <- test_prediction %>%
  mutate(V2 = ifelse(test_predict >= 0.5, 1, 0)) %>%
  mutate(is_correct = V1 == V2) %>%
  group_by(is_correct) %>%
  summarise(count = n(), percentage = n() / nrow(.))
test_prediction
## # A tibble: 2 x 3
##   is_correct count percentage
##   <lgl>      <int>      <dbl>
## 1 F             50      0.100
## 2 T            450      0.900