Loading Dependent Libraries and Data

library(tidyverse)
library(rpart)
library(rpart.plot)
library(rattle)
set.seed(111)

Load data

load("~/ETM58D/spam_data.RData")
head(spam_data)
## # A tibble: 6 x 59
##   train_test spam_or_not    V1    V2    V3    V4    V5    V6    V7    V8
##        <dbl>       <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1          0           1  0     0.64  0.64     0  0.32  0     0     0   
## 2          0           1  0.21  0.28  0.5      0  0.14  0.28  0.21  0.07
## 3          0           1  0.06  0     0.71     0  1.23  0.19  0.19  0.12
## 4          0           1  0     0     0        0  0.63  0     0.31  0.63
## 5          0           1  0     0     0        0  0.63  0     0.31  0.63
## 6          0           1  0     0     0        0  1.85  0     0     1.85
## # ... with 49 more variables: V9 <dbl>, V10 <dbl>, V11 <dbl>, V12 <dbl>,
## #   V13 <dbl>, V14 <dbl>, V15 <dbl>, V16 <dbl>, V17 <dbl>, V18 <dbl>,
## #   V19 <dbl>, V20 <dbl>, V21 <dbl>, V22 <dbl>, V23 <dbl>, V24 <dbl>,
## #   V25 <dbl>, V26 <dbl>, V27 <dbl>, V28 <dbl>, V29 <dbl>, V30 <dbl>,
## #   V31 <dbl>, V32 <dbl>, V33 <dbl>, V34 <dbl>, V35 <dbl>, V36 <dbl>,
## #   V37 <dbl>, V38 <dbl>, V39 <dbl>, V40 <dbl>, V41 <dbl>, V42 <dbl>,
## #   V43 <dbl>, V44 <dbl>, V45 <dbl>, V46 <dbl>, V47 <dbl>, V48 <dbl>,
## #   V49 <dbl>, V50 <dbl>, V51 <dbl>, V52 <dbl>, V53 <dbl>, V54 <dbl>,
## #   V55 <dbl>, V56 <int>, V57 <int>

Section 1 - Creating Model

I separated the test data from the training data and fit the model on the training data only. I then plotted the model to see how the decision tree is built. The spam_or_not column is the response variable.

trainData <- spam_data %>% filter(train_test == 0) %>% select(-train_test)
testData <- spam_data %>% filter(train_test == 1) %>% select(-train_test)

spamModel <- rpart(spam_or_not ~ ., data = trainData)
rpart.plot(spamModel)
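Because spam_or_not is stored as an integer, rpart treats it as a numeric response and grows a regression tree rather than a classification tree. The sketch below illustrates the difference on a small made-up data frame: `toy`, its columns, and `spam_factor` are hypothetical stand-ins, not part of spam_data. It is only one possible way to set this up.

```r
library(rpart)

set.seed(111)

# Toy stand-in for spam_data (hypothetical values, not the real data set)
n <- 200
toy <- data.frame(V1 = runif(n), V2 = runif(n))
toy$spam_or_not <- as.integer(toy$V1 + rnorm(n, sd = 0.1) > 0.5)

# Numeric 0/1 response: rpart picks method = "anova" (regression tree)
regTree <- rpart(spam_or_not ~ ., data = toy)

# Factor response with method = "class": classification tree
toy$spam_factor <- factor(toy$spam_or_not, levels = c(0, 1))
classTree <- rpart(spam_factor ~ V1 + V2, data = toy, method = "class")

# The fitted object records which method was used
regTree$method
classTree$method
```

With a classification tree, `predict(classTree, type = "prob")` returns one probability column per class, and `type = "class"` returns the predicted label directly.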

Section 2 - Test our model with test data

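Predict on the test data. Because spam_or_not is numeric, rpart fits a regression tree, so each prediction is the mean response of the leaf the e-mail falls into. That value can be read as an estimated probability of being spam.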

testPredictionResult <- predict(spamModel, newdata=testData)
head(testPredictionResult)
##         1         2         3         4         5         6 
## 0.9475375 0.9081272 0.4009662 0.9475375 0.3214286 0.9475375

Section 3 - Analyze the result

Append the predictions to the test data frame.

testData$predictionResult = testPredictionResult

Convert the spam_or_not column to logical (TRUE/FALSE) for easier comparison.

testData$spam_or_not = testData$spam_or_not > 0

Now the prediction result column contains the probability of being spam. Let's try to find the optimum threshold value.

testData$predictionResultConverted = testData$predictionResult > 0.5

I manually tested 10 different threshold values, and 0.5 gave the best result. A loop could automate this search, and R may offer a better built-in approach that I am not aware of.
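The loop mentioned above could look roughly like the following sketch. The vectors `actual` and `predicted` are small hypothetical stand-ins for testData$spam_or_not and testData$predictionResult, and the threshold grid is an arbitrary choice.

```r
# Hypothetical stand-ins for the true labels and the predicted probabilities
actual    <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE)
predicted <- c(0.947, 0.908, 0.401, 0.321, 0.12, 0.57, 0.83, 0.04)

# Candidate thresholds to try (assumed grid)
thresholds <- seq(0.1, 0.9, by = 0.1)

# Accuracy at each threshold: share of e-mails whose thresholded
# prediction matches the true label
accuracy <- sapply(thresholds, function(t) mean((predicted > t) == actual))

# One row per threshold, for inspection
results <- data.frame(threshold = thresholds, accuracy = accuracy)
print(results)

# First threshold that reaches the highest accuracy
bestThreshold <- thresholds[which.max(accuracy)]
```

On the real data, the same idea would use `testData$predictionResult` and `testData$spam_or_not` in place of the toy vectors. Note that `which.max` returns only the first maximum, so ties between thresholds are resolved in favour of the smallest one.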

testData %>% group_by(spam_or_not==predictionResultConverted) %>% summarise(count=n())
## # A tibble: 2 x 2
##   `spam_or_not == predictionResultConverted` count
##   <lgl>                                      <int>
## 1 FALSE                                         50
## 2 TRUE                                         450

Based on this result, the model classified 450 of the 500 test e-mails correctly as spam or not spam, an accuracy of 90%.
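The overall accuracy does not show which kind of error the model makes. A confusion matrix built with base R's `table()` separates spam marked as not spam from legitimate e-mail marked as spam. The following is a minimal sketch on hypothetical vectors, not on the actual test set.

```r
# Hypothetical true labels and thresholded predictions
actual    <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE)
predicted <- c(TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE)

# Rows are the true class, columns the predicted class
confusion <- table(actual = actual, predicted = predicted)
print(confusion)

# Accuracy is the diagonal (correct predictions) over the total
accuracy <- sum(diag(confusion)) / sum(confusion)
accuracy
```

For the real data, the equivalent call would be `table(testData$spam_or_not, testData$predictionResultConverted)`. Its diagonal should sum to the 450 correct predictions reported above.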