library(readr)
library(dplyr)
library(rpart)
games <- read_csv("C:/Users/Eren/Desktop/tr_super_league_matches.csv")
games$sumOfGoals = (as.numeric(games$Home_Score) + as.numeric(games$Away_Score))
games$over25 = (games$sumOfGoals > 2.5)
games$over35 = (games$sumOfGoals > 3.5)
We calculated the total number of goals in each match so that we can analyze whether it exceeds 2.5 or 3.5 goals, and added the results to the table as TRUE/FALSE columns.
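To make the flag columns concrete, here is a minimal sketch on a made-up three-match data frame. The `toy` data is hypothetical; only the column names follow the ones used above.

```r
# Hypothetical toy data; scores are stored as text, as as.numeric() above suggests.
toy <- data.frame(
  Home_Score = c("1", "3", "2"),
  Away_Score = c("1", "1", "1"),
  stringsAsFactors = FALSE
)

# Total goals per match, then one TRUE/FALSE flag per threshold.
toy$sumOfGoals <- as.numeric(toy$Home_Score) + as.numeric(toy$Away_Score)
toy$over25 <- toy$sumOfGoals > 2.5
toy$over35 <- toy$sumOfGoals > 3.5

print(toy[, c("sumOfGoals", "over25", "over35")])
```

A 2-1 match (3 goals) is over 2.5 but not over 3.5, which is why the two flags are kept as separate columns.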
games$timeDistance = (2017 - as.numeric(games$season))
Since the teams may change from year to year, the season is meaningful for our analysis. We converted each season to its distance from 2017, so that, for example, the 2010 season became 7 and the 2016 season became 1.
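The same conversion can be written as a small helper. This is only a sketch: `time_distance` and its `reference` argument are hypothetical names, not part of the original analysis.

```r
# Hypothetical helper: how many seasons before the reference season a match was played.
time_distance <- function(season, reference = 2017) {
  reference - as.numeric(season)
}

# Seasons given as text, mirroring the as.numeric() call above.
print(time_distance(c("2010", "2016", "2017")))
```

With this encoding the test season gets distance 0 and older seasons get larger values.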
matchTrainData <- games %>% filter(season<2017) %>% select(Home,Away,timeDistance,over25,over35)
matchTestData <- games %>% filter(season==2017) %>% select(Home,Away,timeDistance,over25,over35)
Separate the training and test data and select only the required columns.
matchTestData <- matchTestData [matchTestData$Home %in% matchTrainData$Home,]
matchTestData <- matchTestData [matchTestData$Away %in% matchTrainData$Away,]
Since the teams “Göztepe” and “Yeni malatya” were promoted to the league in 2017, they appear only in the test data, so we need to remove them from it.
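The filtering step can be tried on a small example. The data frames below are made up, with team "C" playing the role of a newly promoted club; the likely motivation, as far as we can tell, is that a tree cannot give a sensible prediction for a team name it never saw during training.

```r
# Hypothetical training and test sets; "C" appears only in the test data.
train <- data.frame(Home = c("A", "B", "A"), Away = c("B", "A", "B"),
                    stringsAsFactors = FALSE)
test  <- data.frame(Home = c("A", "C", "B"), Away = c("B", "A", "C"),
                    stringsAsFactors = FALSE)

# Keep a test match only if both teams were seen in training.
keep <- test$Home %in% train$Home & test$Away %in% train$Away
test_seen <- test[keep, ]

print(test_seen)
```

Only the first match survives, because "C" is either the home or the away side in the other two.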
underOverModel25 <- rpart(over25 ~., data = (matchTrainData %>% select(-over35) ))
over25Probality <- predict(underOverModel25, newdata = matchTestData)
head(over25Probality)
## 1 2 3 4 5 6
## 0.5203938 0.4967462 0.4967462 0.4967462 0.6594488 0.5203938
To build a model that predicts whether a match ends with over 2.5 goals in total, we used that column as the outcome and excluded the over 3.5 goals column, since it is not needed for this prediction. Because over25 is a TRUE/FALSE column, each prediction is a single number that we read as the probability of the match ending over 2.5 goals.
table(over25Probality, matchTestData$over25)
##
## over25Probality FALSE TRUE
## 0.376623376623377 11 15
## 0.496746203904555 7 21
## 0.520393811533052 24 21
## 0.659448818897638 13 29
Based on the model, its predictions and the actual results, we can say that among the games we predicted to end over 2.5 goals in total with about 66% probability, 29 were guessed correctly and 13 incorrectly.
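The same reading can be extended to the whole table. The sketch below copies the counts from the table above and assumes a decision rule that is not part of the original analysis: call a match "over" whenever the predicted probability is above 0.5.

```r
# Counts copied from the table above, one row per distinct predicted probability.
leaf <- data.frame(
  prob    = c(0.3766, 0.4967, 0.5204, 0.6594),
  n_false = c(11, 7, 24, 13),   # matches that ended under 2.5
  n_true  = c(15, 21, 21, 29)   # matches that ended over 2.5
)

# Assumed rule: predict "over" when the probability exceeds 0.5.
leaf$predict_over <- leaf$prob > 0.5

# A prediction is correct if it matches the actual outcome.
correct  <- sum(ifelse(leaf$predict_over, leaf$n_true, leaf$n_false))
total    <- sum(leaf$n_false + leaf$n_true)
accuracy <- correct / total

print(c(correct = correct, total = total, accuracy = round(accuracy, 3)))
```

Under this assumed rule, 68 of the 141 test matches are classified correctly, or about 48%.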
underOverModel35 <- rpart(over35 ~., data = (matchTrainData %>% select(-over25) ))
over35Probality <- predict(underOverModel35, newdata = matchTestData)
table(over35Probality, matchTestData$over35)
##
## over35Probality FALSE TRUE
## 0.251885369532428 62 18
## 0.323671497584541 26 22
## 0.507692307692308 7 6
Above, we applied the same method to build a model and make predictions, this time for over 3.5 goals in total.
The difference here is that, among the games predicted to end over 3.5 goals in total with 25% probability, 62 ended under 3.5 goals and only 18 ended over.
On the other hand, we can read this as a 75% probability that these games end below 3.5 goals in total. With this logic, out of 80 games, 62 were predicted correctly and 18 incorrectly.
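The complement argument can be checked for every row of the table. This sketch copies the counts from the output above and compares the predicted probability of staying under 3.5 goals (one minus the predicted "over" probability) with the share of matches that actually stayed under.

```r
# Counts copied from the over 3.5 table above.
leaf <- data.frame(
  prob_over = c(0.2519, 0.3237, 0.5077),
  n_false   = c(62, 26, 7),   # matches that ended under 3.5
  n_true    = c(18, 22, 6)    # matches that ended over 3.5
)

# Predicted probability of "under" is the complement of "over".
leaf$prob_under <- 1 - leaf$prob_over

# Observed share of matches that ended under 3.5 in each row.
leaf$observed_under <- leaf$n_false / (leaf$n_false + leaf$n_true)

print(round(leaf[, c("prob_under", "observed_under")], 3))
```

For the first row the predicted 75% is close to the observed 62 out of 80 (77.5%); the agreement is weaker in the other two rows.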
gameResultTrainData <- games %>% filter(season < 2017 ) %>% select (Home,Away,timeDistance,Match_Result)
gameResultTestData <- games %>% filter(season == 2017 ) %>% select (Home,Away,timeDistance,Match_Result)
Separate the training and test data and select only the required columns.
gameResultTestData <- gameResultTestData [gameResultTestData$Home %in% gameResultTrainData$Home,]
gameResultTestData <- gameResultTestData [gameResultTestData$Away %in% gameResultTrainData$Away,]
Since the teams “Göztepe” and “Yeni malatya” were promoted to the league in 2017, they appear only in the test data, so we need to remove them from it.
gameResultModel <- rpart(Match_Result ~., data = (gameResultTrainData))
gameResultPrediction <- predict(gameResultModel, newdata = gameResultTestData)
head(gameResultPrediction)
## Away Home Tie
## 1 0.2243590 0.4858974 0.2897436
## 2 0.1539792 0.6228374 0.2231834
## 3 0.1539792 0.6228374 0.2231834
## 4 0.1539792 0.6228374 0.2231834
## 5 0.2243590 0.4858974 0.2897436
## 6 0.2243590 0.4858974 0.2897436
We built the model and made the predictions in the same way as before. The difference here is that the prediction result now has three columns, titled “Away”, “Home” and “Tie”. Each column contains the probability of that outcome.
resultTable = colnames(gameResultPrediction)[max.col(gameResultPrediction,ties.method="first")]
gameResultTestData$prediction = resultTable
So, for each row, we selected the name of the column with the largest value and appended it to our test table.
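To see what `max.col` does, here is a small sketch on a made-up probability matrix with the same column names. The values are hypothetical; the second row contains a deliberate tie.

```r
# Hypothetical class probabilities for two matches.
probs <- matrix(
  c(0.2, 0.5, 0.3,
    0.4, 0.4, 0.2),
  nrow = 2, byrow = TRUE,
  dimnames = list(NULL, c("Away", "Home", "Tie"))
)

# Index of the largest value per row; ties go to the leftmost column.
best <- max.col(probs, ties.method = "first")
labels <- colnames(probs)[best]

print(labels)
```

Passing `ties.method = "first"` matters because the default for `max.col` breaks ties at random, which would make the predictions differ between runs.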
gameResultTestData %>% group_by(prediction==Match_Result) %>% summarise(count=n())
## # A tibble: 2 x 2
## `prediction == Match_Result` count
## <lgl> <int>
## 1 FALSE 64
## 2 TRUE 77
Finally, we count the rows where our prediction equals the actual match result: 77 of the 141 test matches, or about 55%.
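A cross-tabulation shows which outcomes get confused with each other, not just how many predictions were right. The vectors below are made up for illustration and are not the real predictions.

```r
# Hypothetical actual and predicted outcomes for five matches.
actual    <- c("Home", "Away", "Tie", "Home", "Home")
predicted <- c("Home", "Home", "Home", "Home", "Away")

# Rows are predictions, columns are actual results.
confusion <- table(predicted = predicted, actual = actual)
print(confusion)

# Share of matches where the prediction matches the actual result.
accuracy <- mean(predicted == actual)
print(accuracy)
```

Applied to the real test table, the same `mean(prediction == Match_Result)` expression should reproduce the 77 out of 141 figure as a proportion.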