Our assignment is to predict the price of a diamond from its properties. I used the diamonds data set from the ggplot2 package (which is part of the tidyverse).
library(tidyverse)   # ggplot2 (diamonds data) and dplyr (%>%, mutate, row_number)
library(sqldf)       # SQL queries on data frames
library(rpart)       # regression trees
library(rpart.plot)  # tree plotting
set.seed(503)
data(diamonds)
diamonds
## # A tibble: 53,940 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.230 Ideal E SI2 61.5 55. 326 3.95 3.98 2.43
## 2 0.210 Premium E SI1 59.8 61. 326 3.89 3.84 2.31
## 3 0.230 Good E VS1 56.9 65. 327 4.05 4.07 2.31
## 4 0.290 Premium I VS2 62.4 58. 334 4.20 4.23 2.63
## 5 0.310 Good J SI2 63.3 58. 335 4.34 4.35 2.75
## 6 0.240 Very Good J VVS2 62.8 57. 336 3.94 3.96 2.48
## 7 0.240 Very Good I VVS1 62.3 57. 336 3.95 3.98 2.47
## 8 0.260 Very Good H SI1 61.9 55. 337 4.07 4.11 2.53
## 9 0.220 Fair E VS2 65.1 61. 337 3.87 3.78 2.49
## 10 0.230 Very Good H VS1 59.4 61. 338 4.00 4.05 2.39
## # ... with 53,930 more rows
sqldf("select max(price),min(price),avg(price) from diamonds")
## max(price) min(price) avg(price)
## 1 18823 326 3932.8
I added a row number to the data set as an item ID. Then I split the data at random: 80% of the rows form the training set and the remaining 20% form the test set:
# row_number() on the ungrouped data gives each row a unique item_id
diamonds2<-diamonds %>% mutate(item_id = row_number())
n = nrow(diamonds2)
train_id = sample(1:n, size = round(0.8*n), replace=FALSE)
train_data = diamonds2[train_id ,]
test_data = diamonds2[-train_id ,]
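As a quick sanity check on the split described above, the sketch below rebuilds it on the raw diamonds data and verifies that the two parts are disjoint and together cover every row. It is only an illustration: it skips the item_id column, and the drawn indices need not match the ones used in this report.

```r
library(ggplot2)  # provides the diamonds data set

set.seed(503)
n <- nrow(diamonds)
train_id <- sample(1:n, size = round(0.8 * n), replace = FALSE)
train_data <- diamonds[train_id, ]
test_data  <- diamonds[-train_id, ]

# No row index is drawn twice, and the two parts add up to the full data set.
stopifnot(length(unique(train_id)) == length(train_id))
stopifnot(nrow(train_data) + nrow(test_data) == n)

# With 53,940 rows the split is 43,152 training rows and 10,788 test rows.
c(train = nrow(train_data), test = nrow(test_data))

# The price distributions of the two parts should look similar.
summary(train_data$price)
summary(test_data$price)
```

If the two price summaries differ a lot, the random split is probably unlucky and a different seed or a stratified split may be worth trying.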
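Using the anova method on the training data, I built a regression tree. Only the first ten columns are passed to rpart, so the item_id column is not used as a predictor: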
model1 <- rpart(price ~ ., data=train_data[1:10], method ="anova")
rpart.plot(model1, type=3, digits=3, fallen.leaves = TRUE)
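One way to look more closely at a tree like this is its complexity table and variable importance. The sketch below is an optional extra step, not part of the original analysis: it refits the same kind of model on a fresh split, prints the cross-validated error for each tree size, and prunes to the complexity parameter with the lowest error. The names best_cp and pruned are introduced here for illustration.

```r
library(ggplot2)  # diamonds data
library(rpart)    # rpart(), printcp(), prune()

set.seed(503)
train_id <- sample(nrow(diamonds), size = round(0.8 * nrow(diamonds)))
train_data <- diamonds[train_id, ]

model1 <- rpart(price ~ ., data = train_data, method = "anova")

# Cross-validated error (xerror) for each candidate tree size.
printcp(model1)

# Choose the complexity parameter with the lowest cross-validated error
# and prune the tree back to it.
cp_table <- model1$cptable
best_cp  <- cp_table[which.min(cp_table[, "xerror"]), "CP"]
pruned   <- prune(model1, cp = best_cp)

# Predictors ranked by how much they contributed to the splits.
pruned$variable.importance
```

With the default settings the tree is often already at its best size, in which case pruning leaves it unchanged.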
After building the model on the training set, I predicted prices for the test set:
predicted_test_data <- predict(model1,test_data)
print(head(predicted_test_data))
## 1 2 3 4 5 6
## 1051.845 1051.845 1051.845 1051.845 1051.845 1051.845
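The first six predictions are identical because a regression tree predicts the mean training price of the leaf that an observation falls into, so only a handful of distinct values can occur. A minimal sketch, assuming a tree fitted as above on a fresh split (leaf_means is a name introduced here), that makes this visible:

```r
library(ggplot2)  # diamonds data
library(rpart)

set.seed(503)
train_id <- sample(nrow(diamonds), size = round(0.8 * nrow(diamonds)))
train_data <- diamonds[train_id, ]
test_data  <- diamonds[-train_id, ]

model1 <- rpart(price ~ ., data = train_data, method = "anova")
predicted_test_data <- predict(model1, test_data)

# Mean training price stored in each leaf of the tree.
leaf_means <- model1$frame$yval[model1$frame$var == "<leaf>"]

# Every prediction is one of these leaf means.
stopifnot(all(predicted_test_data %in% leaf_means))

# How many test diamonds land in each leaf (labelled by predicted price).
table(round(predicted_test_data))
```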
Mean_Absolute_Error <- function(act, pred) {mean(abs(act - pred))}
Mean_Absolute_Error(test_data$price, predicted_test_data)
## [1] 897.2298
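The mean absolute error is the average of |actual - predicted| over the test set. The reported value of about 897 is roughly 23% of the average price of 3932.8 shown earlier. To judge such a number it helps to compare it with a naive baseline. The sketch below is an addition for illustration: it refits the tree on a fresh split, so its numbers need not match the ones above, and the baseline and the Root_Mean_Squared_Error helper are assumptions, not part of the original assignment.

```r
library(ggplot2)  # diamonds data
library(rpart)

Mean_Absolute_Error <- function(act, pred) {mean(abs(act - pred))}
# Hypothetical companion metric; penalises large errors more strongly.
Root_Mean_Squared_Error <- function(act, pred) {sqrt(mean((act - pred)^2))}

set.seed(503)
train_id <- sample(nrow(diamonds), size = round(0.8 * nrow(diamonds)))
train_data <- diamonds[train_id, ]
test_data  <- diamonds[-train_id, ]

model1 <- rpart(price ~ ., data = train_data, method = "anova")
predicted_test_data <- predict(model1, test_data)

# Naive baseline: always predict the mean training price.
baseline <- rep(mean(train_data$price), nrow(test_data))

mae_model    <- Mean_Absolute_Error(test_data$price, predicted_test_data)
mae_baseline <- Mean_Absolute_Error(test_data$price, baseline)
rmse_model   <- Root_Mean_Squared_Error(test_data$price, predicted_test_data)

c(mae_model = mae_model, mae_baseline = mae_baseline, rmse_model = rmse_model)
```

A model is only useful if its error is clearly below the baseline error. A large gap between RMSE and MAE would suggest that a few diamonds are predicted very badly.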
References
https://mef-bda503.github.io/pj-nesipoglud/files/Diamonds.html