library(readr)
library(dplyr)
library(rpart)

Section 1 - Data Preparation

games <- read_csv("C:/Users/Eren/Desktop/tr_super_league_matches.csv")
games$sumOfGoals = (as.numeric(games$Home_Score) + as.numeric(games$Away_Score))
games$over25 = (games$sumOfGoals > 2.5)
games$over35 = (games$sumOfGoals > 3.5)

We’ve calculated the total number of goals in each match so that we can analyze whether the result went over 2.5 and over 3.5 goals, and added these flags to the table as TRUE/FALSE columns.
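
A quick look at the derived columns can serve as a sanity check (an optional step added here; the column names are the ones already used above):

# Inspect the derived goal total and the over/under flags
games %>% select(Home_Score, Away_Score, sumOfGoals, over25, over35) %>% head()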

games$timeDistance = (2017 - as.numeric(games$season))

Since the set of teams may differ from year to year, the “year” data is meaningful for our analysis. We’ve normalized the years into a distance from 2017, so that, for example, the 2010 season becomes 7 and the 2016 season becomes 1.
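
To verify the mapping, we can list each season together with its normalized distance (a small check, not part of the original pipeline):

# Show the season -> timeDistance mapping (e.g. 2010 -> 7, 2016 -> 1)
games %>% distinct(season, timeDistance) %>% arrange(season)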

matchTrainData <- games %>% filter(season<2017) %>% select(Home,Away,timeDistance,over25,over35)
matchTestData <- games %>% filter(season==2017) %>% select(Home,Away,timeDistance,over25,over35)

Separate the train and test data and select only the required columns.

matchTestData <- matchTestData [matchTestData$Home %in% matchTrainData$Home,]
matchTestData <- matchTestData [matchTestData$Away %in% matchTrainData$Away,]

Since “Göztepe” and “Yeni Malatya” were promoted to the league in 2017, they appear only in the test data, so we need to remove their matches from the test set.
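
One way to confirm which teams are new is to list the home sides that appear in 2017 but in none of the earlier seasons (an added check):

# Home teams present in the 2017 season but not before it
setdiff(unique(games$Home[games$season == 2017]),
        unique(games$Home[games$season < 2017]))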

Section 2 - Predicting Under/Over Results

underOverModel25 <- rpart(over25 ~., data = (matchTrainData %>% select(-over35) ))
over25Probality <- predict(underOverModel25, newdata = matchTestData)
head(over25Probality)
##         1         2         3         4         5         6 
## 0.5203938 0.4967462 0.4967462 0.4967462 0.6594488 0.5203938

To build a model that predicts whether a match ends with more than 2.5 goals in total, we use that column as the outcome and drop the over-3.5 column, since it is not needed for this particular prediction.
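
To see which predictors the tree actually splits on, the fitted rpart object can be inspected (an optional step; output omitted here):

# Print the learned splits and the complexity table of the over-2.5 tree
print(underOverModel25)
printcp(underOverModel25)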

table(over25Probality, matchTestData$over25)
##                    
## over25Probality     FALSE TRUE
##   0.376623376623377    11   15
##   0.496746203904555     7   21
##   0.520393811533052    24   21
##   0.659448818897638    13   29

Based on the model’s predictions and the actual results, we can say that among the games the model rated most likely (about a 66% probability) to end with more than 2.5 goals in total, 29 did go over and 13 did not.
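
To condense the table above into one number, we can apply an assumed 0.5 probability cutoff and measure overall accuracy (an added summary, not part of the original analysis). With the counts shown above, this works out to (11 + 7 + 21 + 29) / 141, i.e. roughly 48% of the 2017 test matches.

# Classify as "over 2.5" when the predicted probability exceeds 0.5
pred25 <- over25Probality > 0.5
table(pred25, matchTestData$over25)
mean(pred25 == matchTestData$over25)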

underOverModel35 <- rpart(over35 ~., data = (matchTrainData %>% select(-over25) ))
over35Probality <- predict(underOverModel35, newdata = matchTestData)
table(over35Probality, matchTestData$over35)
##                    
## over35Probality     FALSE TRUE
##   0.251885369532428    62   18
##   0.323671497584541    26   22
##   0.507692307692308     7    6

Above, we’ve applied the same method to build a model and make predictions, this time for more than 3.5 goals in total.

The difference here is that, for the games predicted to end with more than 3.5 goals in total with about a 25% probability, 62 of our guesses were wrong and 18 were right.

On the other hand, we can read this as: with about a 75% probability these games end below 3.5 goals in total. With that reading, out of those 80 games, 62 are predicted correctly and 18 are wrong.
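
The same 0.5-cutoff summary can be applied to the 3.5 model (again an added check); from the counts above it corresponds to (62 + 26 + 6) / 141, roughly 67% accuracy, driven mainly by the many under-3.5 games.

# Classify as "over 3.5" when the predicted probability exceeds 0.5
pred35 <- over35Probality > 0.5
table(pred35, matchTestData$over35)
mean(pred35 == matchTestData$over35)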

Section 3 - Game Result Prediction

gameResultTrainData <- games %>% filter(season < 2017 ) %>% select (Home,Away,timeDistance,Match_Result)
gameResultTestData <- games %>% filter(season == 2017 ) %>% select (Home,Away,timeDistance,Match_Result)

Separate the train and test data and select only the required columns.

gameResultTestData <- gameResultTestData [gameResultTestData$Home %in% gameResultTrainData$Home,]
gameResultTestData <- gameResultTestData [gameResultTestData$Away %in% gameResultTrainData$Away,] 

As before, since “Göztepe” and “Yeni Malatya” were promoted to the league in 2017, they appear only in the test data and need to be removed from the test set.

gameResultModel <- rpart(Match_Result ~., data = (gameResultTrainData))
gameResultPrediction <- predict(gameResultModel, newdata = gameResultTestData)
head(gameResultPrediction)
##        Away      Home       Tie
## 1 0.2243590 0.4858974 0.2897436
## 2 0.1539792 0.6228374 0.2231834
## 3 0.1539792 0.6228374 0.2231834
## 4 0.1539792 0.6228374 0.2231834
## 5 0.2243590 0.4858974 0.2897436
## 6 0.2243590 0.4858974 0.2897436

We have built the model and made the prediction in the same way as before. The difference is that the prediction result now has three columns, titled “Away”, “Home” and “Tie”, each containing the probability of that outcome.

resultTable = colnames(gameResultPrediction)[max.col(gameResultPrediction,ties.method="first")]
gameResultTestData$prediction = resultTable

So we selected the name of the column with the largest value in each row and appended it to our test table as the prediction.
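
Note that rpart can also return the winning class directly via predict(..., type = "class"); the snippet below is an equivalent shortcut to the max.col step, assuming the model was fitted as a classification tree (which the three probability columns above indicate):

# Equivalent: let rpart pick the most probable outcome itself
gameResultClass <- predict(gameResultModel, newdata = gameResultTestData, type = "class")
head(as.character(gameResultClass))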

gameResultTestData %>% group_by(prediction==Match_Result) %>% summarise(count=n())
## # A tibble: 2 x 2
##   `prediction == Match_Result` count
##   <lgl>                        <int>
## 1 FALSE                           64
## 2 TRUE                            77

And we get the count of rows where our prediction equals the original match result.
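
Expressed as a single proportion (an added one-liner), this is 77 correct out of 141 test matches, i.e. roughly 55% accuracy:

# Share of test matches where the predicted result equals the actual one
mean(gameResultTestData$prediction == gameResultTestData$Match_Result)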