Issue
This Content is from Stack Overflow. Question asked by mapleleaf
I am trying to understand why changing the reference level of a factor changes the results of a model. Consider this example:
library(liver)
library(caret)
library(glmnet)
library(dplyr)
data(churn)
head(churn)
# set reference levels
churn$state <- relevel(churn$state, ref = "NE")
churn$area.code <- relevel(churn$area.code, ref = "area_code_408")
churn$intl.plan <- relevel(churn$intl.plan, ref = "yes")
churn$voice.plan <- relevel(churn$voice.plan, ref = "no")
# split into train and test
set.seed(1)
train.index <- createDataPartition(churn$churn, p = 0.8, list = FALSE)
train_churn <- churn[train.index,]
test_churn <- churn[-train.index,]
# add class weights
my_weights = train_churn %>%
  select(churn) %>%
  group_by(churn) %>%
  count()
weight_for_yes = (1 / my_weights$n[1]) * ((my_weights$n[1] + my_weights$n[2]) / 2.0)
weight_for_yes
weight_for_no = (1 / my_weights$n[2]) * ((my_weights$n[1] + my_weights$n[2]) / 2.0)
weight_for_no
model_weights <- ifelse(train_churn$churn == "yes", weight_for_yes, weight_for_no)
# tuning grid
myGrid <- expand.grid(
  alpha = 0,
  lambda = seq(0, 1, 0.01)
)
set.seed(1)
mod_1 <- train(churn ~
                 state +
                 area.code +
                 intl.plan +
                 voice.plan,
               data = train_churn,
               method = "glmnet",
               tuneGrid = myGrid,
               weights = model_weights)
mod_1
Tuning parameter ‘alpha’ was held constant at a value of 0
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were alpha = 0 and lambda = 0.8
prediction <- predict(mod_1, newdata = test_churn)
confusionMatrix(prediction, test_churn$churn)
I also look at a single new prediction:
new_data = data.frame(state = c("CA"),
                      area.code = c("area_code_510"),
                      intl.plan = c("yes"),
                      voice.plan = c("no"))
predict(mod_1, newdata = new_data, type = "prob")
Now I restart R, set new reference levels, and rerun all the code. This is the output:
# set new reference levels
churn$state <- relevel(churn$state, ref = "OR")
churn$area.code <- relevel(churn$area.code, ref = "area_code_415")
churn$intl.plan <- relevel(churn$intl.plan, ref = "no")
churn$voice.plan <- relevel(churn$voice.plan, ref = "yes")
Tuning parameter ‘alpha’ was held constant at a value of 0
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were alpha = 0 and lambda = 0.62.
I expected the lambdas to change, but I did not expect the confusion matrix or classification probabilities to change. Is this normal?
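A smaller case may help isolate the effect. The sketch below is not the caret/glmnet model above. It is a linear ridge fit solved in closed form in base R, on made-up data, assuming treatment contrasts and an unpenalized intercept. The names f, f2, y and ridge_fitted are hypothetical and exist only for this illustration. It compares fitted values under two reference levels, once with no penalty and once with a penalty.

```r
# Toy data: one three-level factor and a numeric response (made up for illustration).
set.seed(1)
f <- factor(sample(c("a", "b", "c"), 60, replace = TRUE))
y <- rnorm(60, mean = c(a = 0, b = 1, c = 3)[as.character(f)])

# Hypothetical helper: ridge fit with an unpenalized intercept, solved in closed form.
ridge_fitted <- function(f, y, lambda) {
  X <- model.matrix(~ f)                 # treatment contrasts: first level is the reference
  P <- diag(c(0, rep(1, ncol(X) - 1)))   # penalize every column except the intercept
  beta <- solve(crossprod(X) + lambda * P, crossprod(X, y))
  drop(X %*% beta)
}

# Same data, different reference level.
f2 <- relevel(f, ref = "c")

# No penalty: the two codings span the same column space, so the fits should agree.
all.equal(ridge_fitted(f, y, 0), ridge_fitted(f2, y, 0))

# With a penalty: the dummy coefficients are shrunk toward the reference level,
# which is absorbed into the unpenalized intercept, so the fits can differ.
isTRUE(all.equal(ridge_fitted(f, y, 5), ridge_fitted(f2, y, 5)))
```

If the same pattern holds for the penalized logistic model, it would suggest the changed predictions come from the penalty interacting with the dummy coding rather than from the train/test split or the class weights, though that is an assumption this sketch does not verify for glmnet itself.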
This Question and Answer are collected from stackoverflow and tested by JTuto community, is licensed under the terms of CC BY-SA 2.5. - CC BY-SA 3.0. - CC BY-SA 4.0.