First, we build a logistic regression model to see whether our variables significant or not for cancer.
library(tidyverse)
## -- Attaching packages ------------------------------------------------------------------------------ tidyverse 1.2.1 --
## <U+221A> ggplot2 2.2.1 <U+221A> purrr 0.2.4
## <U+221A> tibble 1.4.2 <U+221A> dplyr 0.7.4
## <U+221A> tidyr 0.8.0 <U+221A> stringr 1.2.0
## <U+221A> readr 1.1.1 <U+221A> forcats 0.2.0
## -- Conflicts --------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(stats)
library(ggplot2)
cancer_model <- glm(cbind(ncases,ncontrols) ~ agegp + tobgp + alcgp , data = esoph , family = binomial() )
anova(cancer_model)
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: cbind(ncases, ncontrols)
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev
## NULL 87 227.241
## agegp 5 88.128 82 139.112
## tobgp 3 19.085 79 120.028
## alcgp 3 66.054 76 53.973
We can clearly see age is key variable for cancer,since adding “Age” predictor causes a decrease on Residual Deviance by 88.128 . Then, alcohol consumption predictor helps our model fit better, again we can observe this by looking deviance residual. Finally, tobacco consumption also has effect on cancer which is smaller by comprasion to others. All these variables helps us to build a good model with better fit.
If we want to seek dependent effect of tobacco and alcohol, we should build a model loking at intersection effect of alcohol and tobacco consumption.
cancer_model <- glm(cbind(ncases,ncontrols) ~ agegp + tobgp * alcgp , data = esoph , family = binomial() )
anova(cancer_model)
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: cbind(ncases, ncontrols)
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev
## NULL 87 227.241
## agegp 5 88.128 82 139.112
## tobgp 3 19.085 79 120.028
## alcgp 3 66.054 76 53.973
## tobgp:alcgp 9 6.489 67 47.484
We can observe small decrease on residual deviance with adding intersection effect of drinking and smoking. We can claim that usage of alcohol and tobacco together increasing probability of being cancer.
Reference: http://stat.ethz.ch/R-manual/R-devel/library/datasets/html/esoph.html
I applied K-means clustering method with 5 cluster center points to see which hobbies are in same cluster and how people are interested in different hobbies.
yr_data <-
read.csv("/Users/TCTAGUMUS/Documents/R işleri/youth_responses.csv",sep=",") %>%
filter(complete.cases(.)) %>%
# mutate(id=row_number()) %>%
tbl_df()
yr_pca<-
yr_data[,sapply(yr_data,class)=="integer"] %>%
select(History:Pets)
yr_mds_data <- yr_pca %>% select(History:Pets)
yr_dist <- 1 - cor(yr_mds_data)
#Apply MDS
yr_mds <- cmdscale(yr_dist,k=2)
colnames(yr_mds) <- c("x","y")
library(ggplot2)
#Set the seed because K-Means algorithm uses a search based method
set.seed(58)
#Apply K-Means
genre_cluster<-kmeans(yr_mds,centers=6)
##Get the clusters
mds_clusters<-
data.frame(genre=names(genre_cluster$cluster),
cluster_mds=genre_cluster$cluster) %>% arrange(cluster_mds,genre)
mds_clusters
## genre cluster_mds
## 1 Art.exhibitions 1
## 2 Dancing 1
## 3 Gardening 1
## 4 Psychology 1
## 5 Reading 1
## 6 Theatre 1
## 7 Writing 1
## 8 Mathematics 2
## 9 Physics 2
## 10 Science.and.technology 2
## 11 Celebrities 3
## 12 Foreign.languages 3
## 13 Fun.with.friends 3
## 14 Pets 3
## 15 Shopping 3
## 16 Cars 4
## 17 Internet 4
## 18 PC 4
## 19 Active.sport 5
## 20 Adrenaline.sports 5
## 21 Economy.Management 5
## 22 Geography 5
## 23 Law 5
## 24 Passive.sport 5
## 25 Politics 5
## 26 Biology 6
## 27 Chemistry 6
## 28 Countryside..outdoors 6
## 29 History 6
## 30 Medicine 6
## 31 Musical.instruments 6
## 32 Religion 6
ggplot(
data.frame(yr_mds) %>% mutate(clusters=as.factor(genre_cluster$cluster),
genres=rownames(yr_mds)),aes(x=x,y=y)) +
geom_text(aes(label=genres,color=clusters),angle=45,size=2) +
geom_point(data=as.data.frame(genre_cluster$centers),aes(x=x,y=y)
)
From clustering table and plot, we can observe how respondants’ hobbies correlated. We can see mathematics, physics and science in 6th cluster, chemistry, medicine and biology in 2nd cluster and so on. That sounds logical. By appliyng this algorithm , we can guess which hobbies may people interest in by knowing some of their other hobbes.For example, if a young individual enjoy politics, we can conclude he/she probably like Law since these two in same cluster.
I plotted carat vs price to see how price is changing according to carat and choosing cut value as a color, I aimed to see relations among price, cut and carat. In plot shown below, we can clearly see an increase in price with increase in carat but we can see all cuts(colors) thorughout price_range and this implies that cut is not significant by its own.
library(rpart) #To construct CART models
library(rpart.plot) # It also includes titanic data
## Warning: package 'rpart.plot' was built under R version 3.4.4
library(rattle) #For visualization
## Warning: package 'rattle' was built under R version 3.4.4
## Rattle: A free graphical interface for data science with R.
## Version 5.1.0 Copyright (c) 2006-2017 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
ggplot(diamonds) + geom_point(aes(x = carat, y = price, color = cut ))
I also plotted dimensions vs price. Looking plot, we can conclude that there is a positive slope and this implies increasing x,y,z results increasing in price.
ggplot(diamonds) + geom_point(aes(x = x, y = price , color= "Blue")) + geom_point(aes(x = y, y = price , color = "Yellow")) + geom_point(aes(x = z, y = price , color = "Red" )) + xlim(2,12) + labs (color = "Dimension") + scale_color_manual( values = c("Blue","Yellow","Red"), labels = c("x","y","z"))
## Warning: Removed 8 rows containing missing values (geom_point).
## Warning: Removed 9 rows containing missing values (geom_point).
## Warning: Removed 24 rows containing missing values (geom_point).
For predictive model, I splitted data as train and test.
set.seed(503)
library(tidyverse)
diamonds_test <- diamonds %>% mutate(diamond_id = row_number()) %>%
group_by(cut, color, clarity) %>% sample_frac(0.2) %>% ungroup()
diamonds_train <- anti_join(diamonds %>% mutate(diamond_id = row_number()),
diamonds_test, by = "diamond_id")
diamonds_train
## # A tibble: 43,143 x 11
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.230 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43
## 2 0.210 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31
## 3 0.230 Good E VS1 56.9 65.0 327 4.05 4.07 2.31
## 4 0.290 Premium I VS2 62.4 58.0 334 4.20 4.23 2.63
## 5 0.240 Very Good J VVS2 62.8 57.0 336 3.94 3.96 2.48
## 6 0.240 Very Good I VVS1 62.3 57.0 336 3.95 3.98 2.47
## 7 0.260 Very Good H SI1 61.9 55.0 337 4.07 4.11 2.53
## 8 0.220 Fair E VS2 65.1 61.0 337 3.87 3.78 2.49
## 9 0.230 Very Good H VS1 59.4 61.0 338 4.00 4.05 2.39
## 10 0.300 Good J SI1 64.0 55.0 339 4.25 4.28 2.73
## # ... with 43,133 more rows, and 1 more variable: diamond_id <int>
Then, I created a model via rpart ( excluding diamond_id as an variable), and made prediction with train data. For measuring performance(How good is our model), I calculated mean error. It is about 800 in terms of price, which is acceptable good in the range of 0-15.000.
diamonds_model <- rpart(price ~.-diamond_id, data = diamonds_train)
fancyRpartPlot(diamonds_model)
diamonds_in_sample <- predict(diamonds_model)
diamonds_comp <- cbind(diamonds_train$price, diamonds_in_sample)
colnames(diamonds_comp) <- c("actual", "predicted")
diamonds_comp <- data.frame(diamonds_comp)
diamonds_comp$diff <- diamonds_comp$actual -diamonds_comp$predicted
mean(abs(diamonds_comp$diff))
## [1] 886.1441
print(head(diamonds_comp))
## actual predicted diff
## 1 326 1056.168 -730.1677
## 2 326 1056.168 -730.1677
## 3 327 1056.168 -729.1677
## 4 334 1056.168 -722.1677
## 5 336 1056.168 -720.1677
## 6 336 1056.168 -720.1677
After train, I added test data to prediction and again calculated error.
diamonds_predict <- predict(diamonds_model, newdata= diamonds_test)
diamonds_comp2 <- cbind(diamonds_test$price, diamonds_predict)
colnames(diamonds_comp2) <- c("actual", "predicted")
diamonds_comp2 <- data.frame(diamonds_comp2)
diamonds_comp2$diff <- diamonds_comp2$actual -diamonds_comp2$predicted
mean(abs(diamonds_comp2$diff))
## [1] 889.7092
print(head(diamonds_comp2))
## actual predicted diff
## 1 15964 14944.339 1019.6610
## 2 3205 3073.487 131.5130
## 3 3384 3073.487 310.5130
## 4 3634 5394.948 -1760.9476
## 5 2358 3073.487 -715.4870
## 6 4398 5394.948 -996.9476
This assignment is about analyzing UCT’s Spambase data and building a model on it to detect further possible spam mails according to it’s features.
Firstly, data is loaded and splitted into two parts as given in dataset as test data and train data .
load("/Users/TCTAGUMUS/Documents/R işleri/spam_data.RData")
spam_test <- spam_data %>% filter(train_test == 1 )
spam_train <- spam_data %>% filter(train_test == 0)
Then, I built a model with train data using rpart and plotted a decision tree. After, predicted according to our model with train data.
spam_model <- rpart(spam_or_not ~ . -train_test, data = spam_train)
fancyRpartPlot(spam_model)
spam_in_sample <- predict(spam_model)
print(head(spam_in_sample))
## 1 2 3 4 5 6
## 0.86772487 0.94753747 0.94753747 0.90812721 0.90812721 0.05763952
To see how good is my model, I compared actual and predicted values of train data. How many correct prediction that my model make divided by total case is a performance indicator for my case.
in_sample_prediction <- cbind(spam_train$spam_or_not , spam_in_sample)
in_sample_prediction = data.frame ( in_sample_prediction)
in_sample_prediction <- in_sample_prediction %>% mutate(V2= ifelse(spam_in_sample>= 0.5 ,1,0)) %>% mutate(is_correct = V1 == V2) %>% group_by(is_correct) %>% summarise(count = n(), percentage = n()/nrow(.))
in_sample_prediction
## # A tibble: 2 x 3
## is_correct count percentage
## <lgl> <int> <dbl>
## 1 F 398 0.0970
## 2 T 3703 0.903
As we see results above, our model can help to categorize mails into spam or not, with apprx. 9 successfull guess out of 10 tries.
Finally, I repeated steps using test_data.
test_predict <- predict(spam_model, newdata = spam_test)
print(head(test_predict))
## 1 2 3 4 5 6
## 0.9475375 0.9081272 0.4009662 0.9475375 0.3214286 0.9475375
test_prediction <- cbind(spam_test$spam_or_not, test_predict)
test_prediction <- data.frame(test_prediction)
test_prediction <- test_prediction %>% mutate(V2= ifelse(test_predict>= 0.5 ,1,0)) %>% mutate(is_correct = V1 == V2) %>% group_by(is_correct) %>% summarise(count = n(), percentage = n()/nrow(.))
test_prediction
## # A tibble: 2 x 3
## is_correct count percentage
## <lgl> <int> <dbl>
## 1 F 50 0.100
## 2 T 450 0.900