This assignment consists of building a CART (classification and regression tree) model to detect spam e-mail using UCI's Spambase data and analyzing the result.
First, I load the spam data.
setwd("C:/Users/yagizbarali/Desktop")
load("spam_data.RData")
head(spam_data)
## # A tibble: 6 x 59
## train_test spam_or_not V1 V2 V3 V4 V5 V6 V7 V8
## <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0. 1 0. 0.640 0.640 0. 0.320 0. 0. 0.
## 2 0. 1 0.210 0.280 0.500 0. 0.140 0.280 0.210 0.0700
## 3 0. 1 0.0600 0. 0.710 0. 1.23 0.190 0.190 0.120
## 4 0. 1 0. 0. 0. 0. 0.630 0. 0.310 0.630
## 5 0. 1 0. 0. 0. 0. 0.630 0. 0.310 0.630
## 6 0. 1 0. 0. 0. 0. 1.85 0. 0. 1.85
## # ... with 49 more variables: V9 <dbl>, V10 <dbl>, V11 <dbl>, V12 <dbl>,
## # V13 <dbl>, V14 <dbl>, V15 <dbl>, V16 <dbl>, V17 <dbl>, V18 <dbl>,
## # V19 <dbl>, V20 <dbl>, V21 <dbl>, V22 <dbl>, V23 <dbl>, V24 <dbl>,
## # V25 <dbl>, V26 <dbl>, V27 <dbl>, V28 <dbl>, V29 <dbl>, V30 <dbl>,
## # V31 <dbl>, V32 <dbl>, V33 <dbl>, V34 <dbl>, V35 <dbl>, V36 <dbl>,
## # V37 <dbl>, V38 <dbl>, V39 <dbl>, V40 <dbl>, V41 <dbl>, V42 <dbl>,
## # V43 <dbl>, V44 <dbl>, V45 <dbl>, V46 <dbl>, V47 <dbl>, V48 <dbl>,
## # V49 <dbl>, V50 <dbl>, V51 <dbl>, V52 <dbl>, V53 <dbl>, V54 <dbl>,
## # V55 <dbl>, V56 <int>, V57 <int>
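Before fitting, it can help to verify the train/test split and the class balance of the target. A quick sketch, using the column names shown in the tibble above (the coding 0 = train / 1 = test and 1 = spam follows how the data is used later):

```r
# How many rows fall in the training (0) vs. test (1) split?
table(spam_data$train_test)

# Class balance of the target (1 = spam).
table(spam_data$spam_or_not)
```

The root node error reported by printcp below (1605/4101, about 39%) is consistent with the spam share of the training rows.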
CART Model (with training data):
library(rpart)   # CART implementation
library(rattle)  # provides fancyRpartPlot
traindata <- subset(spam_data, train_test == 0)
testdata  <- subset(spam_data, train_test == 1)
model <- rpart(spam_or_not ~ ., data = traindata, method = "class")
fancyRpartPlot(model)
printcp(model)
##
## Classification tree:
## rpart(formula = spam_or_not ~ ., data = traindata, method = "class")
##
## Variables actually used in tree construction:
## [1] V16 V25 V52 V53 V57 V7
##
## Root node error: 1605/4101 = 0.39137
##
## n= 4101
##
## CP nsplit rel error xerror xstd
## 1 0.481620 0 1.00000 1.00000 0.019473
## 2 0.143925 1 0.51838 0.54330 0.016326
## 3 0.049221 2 0.37445 0.44174 0.015088
## 4 0.037383 3 0.32523 0.34268 0.013597
## 5 0.030530 4 0.28785 0.32461 0.013287
## 6 0.011838 5 0.25732 0.29034 0.012663
## 7 0.010000 6 0.24548 0.27227 0.012311
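printcp reports, for each complexity parameter (CP) value, the cross-validated error (xerror) and its standard deviation (xstd). Besides simply taking the CP with minimum xerror, a common alternative is the one-standard-error rule: choose the simplest tree whose xerror is within one xstd of the minimum. A sketch against the table above:

```r
cp_tab <- model$cptable
min_row <- which.min(cp_tab[, "xerror"])
threshold <- cp_tab[min_row, "xerror"] + cp_tab[min_row, "xstd"]
# First (i.e., simplest) tree whose xerror is at or under the threshold.
bestcp_1se <- cp_tab[which(cp_tab[, "xerror"] <= threshold)[1], "CP"]
```

With the numbers above, the threshold is 0.27227 + 0.012311 ≈ 0.2846; no earlier row falls below it, so both rules select CP = 0.01 here.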
Best Complexity Parameter: I pick the CP value with the lowest cross-validated error (xerror) and prune the tree at it. Note that here the minimum xerror occurs at the last CP value (0.01), so pruning leaves the tree unchanged, as the identical printcp output below confirms.
bestcp <- model$cptable[which.min(model$cptable[, "xerror"]), "CP"]
pruned_tree <- prune(model, cp = bestcp)
printcp(pruned_tree)
##
## Classification tree:
## rpart(formula = spam_or_not ~ ., data = traindata, method = "class")
##
## Variables actually used in tree construction:
## [1] V16 V25 V52 V53 V57 V7
##
## Root node error: 1605/4101 = 0.39137
##
## n= 4101
##
## CP nsplit rel error xerror xstd
## 1 0.481620 0 1.00000 1.00000 0.019473
## 2 0.143925 1 0.51838 0.54330 0.016326
## 3 0.049221 2 0.37445 0.44174 0.015088
## 4 0.037383 3 0.32523 0.34268 0.013597
## 5 0.030530 4 0.28785 0.32461 0.013287
## 6 0.011838 5 0.25732 0.29034 0.012663
## 7 0.010000 6 0.24548 0.27227 0.012311
I used the pruned tree to produce a confusion matrix on the training data:
conf.matrix <- table(traindata$spam_or_not, predict(pruned_tree, type = "class"))
Actual and Predicted Values:
rownames(conf.matrix) <- paste("Actual Value", rownames(conf.matrix), sep = ":")
colnames(conf.matrix) <- paste("Pred Value", colnames(conf.matrix), sep = ":")
print(conf.matrix)
##
## Pred Value:0 Pred Value:1
## Actual Value:0 2381 115
## Actual Value:1 279 1326
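The matrix above implies a training accuracy of (2381 + 1326) / 4101 ≈ 0.904, but it is computed on the same data the tree was fit on. A fairer check uses the held-out test rows (train_test == 1); a sketch reusing the testdata subset defined earlier:

```r
# Training accuracy from the confusion matrix above.
sum(diag(conf.matrix)) / sum(conf.matrix)

# Predict on the held-out test set and tabulate the results.
test_pred <- predict(pruned_tree, newdata = testdata, type = "class")
test_cm   <- table(Actual = testdata$spam_or_not, Predicted = test_pred)
test_cm
sum(diag(test_cm)) / sum(test_cm)   # test accuracy
```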
References:
- https://www.edureka.co/blog/implementation-of-decision-tree
- https://mef-bda503.github.io/pj-ferayece/files/Assignment3_SpamDataAnalysis.html
- https://rstudio-pubs-static.s3.amazonaws.com/27179_e64f0de316fc4f169d6ca300f18ee2aa.html