This R programming solution demonstrates how to predict the mode of transportation using a decision tree classifier. Remember to preprocess the data (e.g., handle missing values, convert categorical variables to factors) before building the model. Additionally, you can further optimize the model, perform cross-validation, or explore other classification algorithms to improve prediction accuracy.
title: "Predicting Mode of Transportation"
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
options(warn = -1)
```
# Exploratory Data Analysis and Insights
We have the cars dataset with nine variables and 418 data points. The primary aim of this analysis to predict if an employee would use car as mode of transport.
```{r p1, echo=FALSE, eval=TRUE}
library(tidyverse)
library(ggplot2)
library(dplyr)
library(readr)
```
```{r p11}
Cars <- read_csv("Cars-dataset.csv")
Cars$Gender<- as.factor(Cars$Gender)
Cars$Engineer <- as.factor(Cars$Engineer)
Cars$MBA <- as.factor(Cars$MBA)
Cars$license <- as.factor(Cars$license)
Cars$Transport <- as.factor(Cars$Transport)
attach(Cars)
Transport_Car = as.factor(as.integer(Transport=='Car'))
Cars = cbind(Cars, Transport_Car)
```
We look at the exploratory analysis to get ideas for analysis and conclusions. We have used ggplot to create charts.
```{r p12}
ggplot(Cars, aes(x = Gender, y = ..count.., fill = Transport )) + geom_bar() + labs(title = "Gender by Transport modes")
ggplot(Cars, aes(x = Transport, y = ..count.., fill = license)) + geom_bar() + labs(title = "Transport mode by licence")
```
```
From the above plots, we can see that a large number of employees do not have license and they use public trnasport. There are very small section who comes with car without having the license. The plot for Gender indicates that very small proportion of females use Car for office commute.
```
```{r p13}
ggplot(Cars, aes(x = Transport, y = Salary)) + geom_boxplot() + labs(title = "Salaries against modes of transportation")
ggplot(Cars, aes(x = Transport, y = Distance)) + geom_boxplot() + labs(title = "Distance against modes of transportation")
```
```
Both the boxplots are very informative. The first boxplot indicares that employees using Cars for office commute have significantly higher salary than the other two groups. We do not see the overlap of boxes for Car and other two categories hence, the difference is statistically significant. The other boxplot outlines the effect of distance on trnsportation, espacially cars. The average distance is significantly higher for car users than other.
```
```{r p14}
temp = group_by(Cars, Transport) %>% summarise(Age = mean(Age))
ggplot(temp, aes(x = Transport, y = Age)) + geom_bar(stat = "Identity") + labs(title = "Average Age by Transport type")
```
```
The average age of employees using car as mode of transportation is higher than the other two modes.
```
## Challenge
The main challenge in the prediction is the class imbalance. There are only 35 observation in car category out of 418 total observation which is around $8\%$. The usual accuracy metric would not be much helpful in assessing model performances since it can just be increased by predicting everything in non-car group. The model estimations would also be biased as the standard loss functions do not consider class imbalance in the data.
We shall use F1 score as performance metric to compare and assess model performance which is harmonic mean of Sensitivity and Specificity.
# Data preparation
```{r p3}
Gender_1 = as.factor(as.numeric(Gender == 'Male'))
Cars = cbind(Cars, Gender_1)
Final_data = Cars[,-c(2,9)]
Final_data[is.na(Final_data)] = 0
set.seed(100)
tr_samp = sample(418, floor(0.7*418), F)
train = Final_data[tr_samp,]
test = Final_data[-tr_samp,]
```
We have prepared the data for modelling. The variables which are categorical in nature have been converted into factor data types. The response variable is transformed into 0-1 indicator with 1 indicating Car as mode of transport and 0 represents other. The gender variable have been has been converted to a numeric 0-1 factor variable. The data has been divided into training and testing set with 70% of data into training set. All the models will be trained on training set and test set would be used to asses model performance on out of sample observations.
# Modelling
## kNN
```{r p4}
library(class)
library(caret)
library(MLmetrics)
model1 = knn(train[,-8], test[,-8], train$Transport_Car, k=5)
confusionMatrix(model1, test$Transport_Car, positive = '1')
F1_Score(model1, test$Transport_Car, positive = 1)
```
The kNN with $k=5$ gives us F1 score of 0.857 which is good! Only 3 observations were wrongly classified.
## Naive Bayes
```{r p5}
library(e1071)
model2 = naiveBayes(Transport_Car ~., train)
pred_nb = predict(model2, test)
confusionMatrix(pred_nb, test$Transport_Car, positive = '1')
F1_Score(pred_nb, test$Transport_Car, positive = 1)
```
The naive Bayes model has performed worse than the kNN model. The prediction for class '1' was same although there were 2 observation from class 0 that was predicted in class 1. The F1 score for test data was 0.783
## Logistic Regression
```{r p6}
library(MASS)
model3= glm(Transport_Car ~., train, family = binomial)
pred_glm = predict(model3, test, type = "response")
pred_glm = as.factor(as.integer(pred_glm>0.5))
confusionMatrix(pred_glm, test$Transport_Car, positive = '1')
F1_Score(pred_glm, test$Transport_Car, positive = 1)
```
The model performance of logistic regression model is identical to that of kNN model. The F1 score is 0.857 and only 3 classifications.
## Bagging and Boosting
### Bagging
We shall use adabag package for bagging. Bagging refers to bootstrap aggregation which uses many bootstrap samples to train the model and combine the model output to give final decision.
```{r pbag}
library(ipred)
library(rpart)
library(randomForest)
library(adabag)
model_bag = bagging(Transport_Car ~., data = train, mfinal = 3)
pred_bag = as.factor(predict.bagging(model_bag, test)$class)
confusionMatrix(pred_bag, test$Transport_Car, positive = '1')
F1_Score(pred_bag, test$Transport_Car, positive = 1)
```
The model performance of bagging through decision trees was identical to that of naive Bayes model. The F1 score being 0.783. This model underperforms the kNN and logisictic regression.
### Boosting
```{r pboost}
model_boost = boosting(Transport_Car ~., data = train, mfinal = 3, boos = F)
pred_boost = as.factor(predict.boosting(model_boost, test)$class)
confusionMatrix(pred_boost, test$Transport_Car, positive = '1')
F1_Score(pred_boost, test$Transport_Car, positive = 1)
```
The model performance of boosting through decision trees was identical to that of kNN and logistic regression. The F1 score being 0.857.
# Summary
After implementing various machine learning models, we found out that all models except the naive Bayes and bootstrap aggregating has exactly same performance. All thees models had exactly 3 classifications on test set which were all in group "car". This may be because those three data points were outliers e.g. using car with no licence. The models were quite good in predicting the transportation mode as car for employees. The model performances may further improve by increasing sample.