Diamond

The assignment is to predict the price of a diamond from its properties. I used the diamonds data set from the ggplot2 package (which is part of the tidyverse).

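# Packages used below: tidyverse (dplyr and ggplot2's diamonds), sqldf, rpart, rpart.plot
library(tidyverse); library(sqldf); library(rpart); library(rpart.plot)

set.seed(503)  # make the random train/test split reproducible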

data(diamonds)
diamonds

## # A tibble: 53,940 x 10
##    carat cut       color clarity depth table price     x     y     z
##    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1 0.230 Ideal     E     SI2      61.5   55.   326  3.95  3.98  2.43
##  2 0.210 Premium   E     SI1      59.8   61.   326  3.89  3.84  2.31
##  3 0.230 Good      E     VS1      56.9   65.   327  4.05  4.07  2.31
##  4 0.290 Premium   I     VS2      62.4   58.   334  4.20  4.23  2.63
##  5 0.310 Good      J     SI2      63.3   58.   335  4.34  4.35  2.75
##  6 0.240 Very Good J     VVS2     62.8   57.   336  3.94  3.96  2.48
##  7 0.240 Very Good I     VVS1     62.3   57.   336  3.95  3.98  2.47
##  8 0.260 Very Good H     SI1      61.9   55.   337  4.07  4.11  2.53
##  9 0.220 Fair      E     VS2      65.1   61.   337  3.87  3.78  2.49
## 10 0.230 Very Good H     VS1      59.4   61.   338  4.00  4.05  2.39
## # ... with 53,930 more rows

sqldf("select max(price),min(price),avg(price) from diamonds")

##   max(price) min(price) avg(price)
## 1      18823        326     3932.8

I added a row number (item_id) to the data set, then randomly assigned 80% of the rows to the training set and the remaining 20% to the test set:

# Add a sequential row identifier to every row
diamonds2 <- diamonds %>% mutate(item_id = row_number())

n <- nrow(diamonds2)

# Sample 80% of the row indices, without replacement, for training
train_id <- sample(1:n, size = round(0.8 * n), replace = FALSE)

train_data <- diamonds2[train_id, ]
test_data <- diamonds2[-train_id, ]  # the remaining 20%
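As a quick sanity check on a split like this, the sketch below repeats the same sampling on a small made-up data frame and confirms that the two parts are disjoint and together cover every row. The data frame toy and its ten rows are assumptions for illustration only, not part of the assignment.

```r
# Toy data frame standing in for diamonds2 (hypothetical, for illustration only)
set.seed(503)
toy <- data.frame(item_id = 1:10, price = seq(100, 1000, by = 100))

n <- nrow(toy)
train_id <- sample(1:n, size = round(0.8 * n), replace = FALSE)

toy_train <- toy[train_id, ]
toy_test <- toy[-train_id, ]

# 8 rows go to training and 2 to testing
nrow(toy_train)
nrow(toy_test)

# No row appears in both parts, and together they cover all ten rows
length(intersect(toy_train$item_id, toy_test$item_id))
setequal(c(toy_train$item_id, toy_test$item_id), toy$item_id)
```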

Using the training data, I fit a regression tree with rpart's anova method:

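# Columns 1 to 10 are the original variables; this leaves out item_id so it is not used as a predictor
model1 <- rpart(price ~ ., data = train_data[, 1:10], method = "anova")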

rpart.plot(model1, type=3, digits=3, fallen.leaves = TRUE)
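Restricting the training data to its first ten columns matters because the formula `price ~ .` treats every other column as a candidate predictor, including a row identifier. A minimal sketch of that behaviour, using the built-in mtcars data so that it runs without ggplot2 (the names toy, with_id and without_id are my own, and mtcars only stands in for the diamonds data):

```r
library(rpart)

# mtcars with an added row identifier, mimicking item_id (illustrative stand-in)
toy <- transform(mtcars, item_id = seq_len(nrow(mtcars)))

# With `~ .`, item_id is included among the model's predictor terms
with_id <- rpart(mpg ~ ., data = toy, method = "anova")
"item_id" %in% attr(with_id$terms, "term.labels")

# Dropping the column first keeps it out of the model
without_id <- rpart(mpg ~ ., data = toy[, names(toy) != "item_id"], method = "anova")
"item_id" %in% attr(without_id$terms, "term.labels")
```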

After fitting the model on the training set, I predicted prices for the test set:

predicted_test_data <- predict(model1,test_data)

print(head(predicted_test_data))

##        1        2        3        4        5        6 
## 1051.845 1051.845 1051.845 1051.845 1051.845 1051.845
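The six identical values are expected rather than an error: a regression tree predicts the mean training response of whichever leaf an observation falls into, so the predictions can take only as many distinct values as the tree has leaves. A small sketch of this on the built-in mtcars data, used here only so the example runs on its own (fit, n_leaves and leaf_means are names I chose):

```r
library(rpart)

fit <- rpart(mpg ~ ., data = mtcars, method = "anova")
pred <- predict(fit, mtcars)

# Number of leaves (terminal nodes) in the fitted tree
n_leaves <- sum(fit$frame$var == "<leaf>")

# Predictions take at most one distinct value per leaf
length(unique(pred)) <= n_leaves

# fit$where gives the leaf each training row landed in;
# the prediction for a row equals the mean response of the rows in its leaf
leaf_means <- tapply(mtcars$mpg, fit$where, mean)
all.equal(as.vector(pred), as.vector(leaf_means[as.character(fit$where)]))
```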

# Mean absolute error: the average of |actual - predicted|
Mean_Absolute_Error <- function(act, pred) {mean(abs(act - pred))}
Mean_Absolute_Error(test_data$price, predicted_test_data)

## [1] 897.2298
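An MAE of about 897 is easier to judge next to a reference point. One common choice, sketched below on made-up numbers, is the error of always predicting the mean price; RMSE is shown alongside as a second metric that penalises large misses more heavily. The vectors act and pred and the helper Root_Mean_Squared_Error are hypothetical and not part of the original analysis.

```r
Mean_Absolute_Error <- function(act, pred) {mean(abs(act - pred))}
Root_Mean_Squared_Error <- function(act, pred) {sqrt(mean((act - pred)^2))}

# Made-up actual and predicted prices, for illustration only
act <- c(300, 500, 1000, 4000)
pred <- c(400, 400, 1200, 3000)

Mean_Absolute_Error(act, pred)      # (100 + 100 + 200 + 1000) / 4 = 350
Root_Mean_Squared_Error(act, pred)  # sqrt((100^2 + 100^2 + 200^2 + 1000^2) / 4)

# Baseline: always predict the mean price
# (for simplicity the mean of act is used here; in practice it would be the mean of the training prices)
baseline <- rep(mean(act), length(act))
Mean_Absolute_Error(act, baseline)
```

On the diamonds data, the same comparison would use test_data$price as act and mean(train_data$price) as the baseline prediction.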

References

https://stackoverflow.com/questions/11996135/create-a-sequential-number-counter-for-rows-within-each-group-of-a-dataframe

https://mef-bda503.github.io/pj-nesipoglud/files/Diamonds.html

Zeynep Kavcıoğu

25 April 2018