This assignment consists of building a CART (classification and regression tree) model to detect spam e-mail using UCI's Spambase data and analyzing the result.
First, I load the spam data.
setwd("C:/Users/yagizbarali/Desktop")
load("spam_data.RData")
head(spam_data)
## # A tibble: 6 x 59
## train_test spam_or_not V1 V2 V3 V4 V5 V6 V7 V8
## <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0. 1 0. 0.640 0.640 0. 0.320 0. 0. 0.
## 2 0. 1 0.210 0.280 0.500 0. 0.140 0.280 0.210 0.0700
## 3 0. 1 0.0600 0. 0.710 0. 1.23 0.190 0.190 0.120
## 4 0. 1 0. 0. 0. 0. 0.630 0. 0.310 0.630
## 5 0. 1 0. 0. 0. 0. 0.630 0. 0.310 0.630
## 6 0. 1 0. 0. 0. 0. 1.85 0. 0. 1.85
## # ... with 49 more variables: V9 <dbl>, V10 <dbl>, V11 <dbl>, V12 <dbl>,
## # V13 <dbl>, V14 <dbl>, V15 <dbl>, V16 <dbl>, V17 <dbl>, V18 <dbl>,
## # V19 <dbl>, V20 <dbl>, V21 <dbl>, V22 <dbl>, V23 <dbl>, V24 <dbl>,
## # V25 <dbl>, V26 <dbl>, V27 <dbl>, V28 <dbl>, V29 <dbl>, V30 <dbl>,
## # V31 <dbl>, V32 <dbl>, V33 <dbl>, V34 <dbl>, V35 <dbl>, V36 <dbl>,
## # V37 <dbl>, V38 <dbl>, V39 <dbl>, V40 <dbl>, V41 <dbl>, V42 <dbl>,
## # V43 <dbl>, V44 <dbl>, V45 <dbl>, V46 <dbl>, V47 <dbl>, V48 <dbl>,
## # V49 <dbl>, V50 <dbl>, V51 <dbl>, V52 <dbl>, V53 <dbl>, V54 <dbl>,
## # V55 <dbl>, V56 <int>, V57 <int>
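Before fitting, it can help to verify the train/test split and the class balance of the target. A quick sketch, using the column names shown in the tibble above (the coding 0 = train / 1 = test and 1 = spam follows how the data is used later):

```r
# How many rows fall in the training (0) vs. test (1) split?
table(spam_data$train_test)

# Class balance of the target (1 = spam).
table(spam_data$spam_or_not)
```

The root node error reported by printcp below (1605/4101, about 39%) is consistent with the spam share of the training rows.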
CART Model (with training data):
library(rpart)   # CART implementation
library(rattle)  # provides fancyRpartPlot
traindata <- subset(spam_data, train_test == 0)
testdata  <- subset(spam_data, train_test == 1)
model <- rpart(spam_or_not ~ ., data = traindata, method = "class")
fancyRpartPlot(model)
printcp(model)
##
## Classification tree:
## rpart(formula = spam_or_not ~ ., data = traindata, method = "class")
##
## Variables actually used in tree construction:
## [1] V16 V25 V52 V53 V57 V7
##
## Root node error: 1605/4101 = 0.39137
##
## n= 4101
##
## CP nsplit rel error xerror xstd
## 1 0.481620 0 1.00000 1.00000 0.019473
## 2 0.143925 1 0.51838 0.54330 0.016326
## 3 0.049221 2 0.37445 0.44174 0.015088
## 4 0.037383 3 0.32523 0.34268 0.013597
## 5 0.030530 4 0.28785 0.32461 0.013287
## 6 0.011838 5 0.25732 0.29034 0.012663
## 7 0.010000 6 0.24548 0.27227 0.012311
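printcp reports, for each complexity parameter (CP) value, the cross-validated error (xerror) and its standard deviation (xstd). Besides simply taking the CP with minimum xerror, a common alternative is the one-standard-error rule: choose the simplest tree whose xerror is within one xstd of the minimum. A sketch against the table above:

```r
cp_tab <- model$cptable
min_row <- which.min(cp_tab[, "xerror"])
threshold <- cp_tab[min_row, "xerror"] + cp_tab[min_row, "xstd"]
# First (i.e., simplest) tree whose xerror is at or under the threshold.
bestcp_1se <- cp_tab[which(cp_tab[, "xerror"] <= threshold)[1], "CP"]
```

With the numbers above, the threshold is 0.27227 + 0.012311 ≈ 0.2846; no earlier row falls below it, so both rules select CP = 0.01 here.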
Best Complexity Parameter: I pick the CP value with the lowest cross-validated error (xerror) and prune the tree at it. Note that here the minimum xerror occurs at the last CP value (0.01), so pruning leaves the tree unchanged, as the identical printcp output below confirms.
bestcp <- model$cptable[which.min(model$cptable[, "xerror"]), "CP"]
pruned_tree <- prune(model, cp = bestcp)
printcp(pruned_tree)
##
## Classification tree:
## rpart(formula = spam_or_not ~ ., data = traindata, method = "class")
##
## Variables actually used in tree construction:
## [1] V16 V25 V52 V53 V57 V7
##
## Root node error: 1605/4101 = 0.39137
##
## n= 4101
##
## CP nsplit rel error xerror xstd
## 1 0.481620 0 1.00000 1.00000 0.019473
## 2 0.143925 1 0.51838 0.54330 0.016326
## 3 0.049221 2 0.37445 0.44174 0.015088
## 4 0.037383 3 0.32523 0.34268 0.013597
## 5 0.030530 4 0.28785 0.32461 0.013287
## 6 0.011838 5 0.25732 0.29034 0.012663
## 7 0.010000 6 0.24548 0.27227 0.012311
I used the pruned tree to produce a confusion matrix on the training data:
conf.matrix <- table(traindata$spam_or_not, predict(pruned_tree, type = "class"))
Actual and Predicted Values:
rownames(conf.matrix) <- paste("Actual Value", rownames(conf.matrix), sep = ":")
colnames(conf.matrix) <- paste("Pred Value", colnames(conf.matrix), sep = ":")
print(conf.matrix)
##
## Pred Value:0 Pred Value:1
## Actual Value:0 2381 115
## Actual Value:1 279 1326
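The matrix above implies a training accuracy of (2381 + 1326) / 4101 ≈ 0.904, but it is computed on the same data the tree was fit on. A fairer check uses the held-out test rows (train_test == 1); a sketch reusing the testdata subset defined earlier:

```r
# Training accuracy from the confusion matrix above.
sum(diag(conf.matrix)) / sum(conf.matrix)

# Predict on the held-out test set and tabulate the results.
test_pred <- predict(pruned_tree, newdata = testdata, type = "class")
test_cm   <- table(Actual = testdata$spam_or_not, Predicted = test_pred)
test_cm
sum(diag(test_cm)) / sum(test_cm)   # test accuracy
```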
References:
- https://www.edureka.co/blog/implementation-of-decision-tree
- https://mef-bda503.github.io/pj-ferayece/files/Assignment3_SpamDataAnalysis.html
- https://rstudio-pubs-static.s3.amazonaws.com/27179_e64f0de316fc4f169d6ca300f18ee2aa.html