Predictive Modeling: FN Voters in France (Decision Trees and Random Forests)

My predictive modeling skills have been mostly theoretical as I rarely took the time to practice them on real data. That’s why I decided to run this analysis and familiarize myself with the different concepts.

I’ll be using data from the European Social Survey to build a decision tree and a random forest model of the Front National vote (now Rassemblement National) in France. These are the different variables that I’ll be including in the analysis:
fnvote: Respondent voted for FN (0/1)
lrscale: Left-right ideological scale (0:10)
stfgov: Satisfaction with the national government (0:10)
imwbcnt: Attitudes towards immigration (0:10)
rlgatnd: Religious services attendance (1:7)
pray: Prayer apart from religious services (1:7)
agea: Age in years
eduyrs: Education in years
hinctnta: Total household income (deciles)
gndr / female: Gender, recoded below as a female dummy (0/1)

Before moving on to cleaning and organizing data, I need to start with a caveat. When it comes to predictive analytics, the more data the better. However, the sample size that I’m using is considered small despite merging several waves together. That’s why I’ll be interpreting the results with caution.

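library(tidyverse)     # dplyr, tidyr, purrr, tibble
library(modelr)        # resample_partition()
library(randomForest)  # randomForest(), importance()

# ESS4 to ESS9 are the ESS round 4-9 data frames, assumed to be already loaded
fr08 = ESS4 %>% filter(cntry == "FR")
fr10 = ESS5 %>% filter(cntry == "FR")
fr12 = ESS6 %>% filter(cntry == "FR")
fr14 = ESS7 %>% filter(cntry == "FR")
fr16 = ESS8 %>% filter(cntry == "FR")
fr18 = ESS9 %>% filter(cntry == "FR")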

fr08 = fr08 %>% rename(votept = prtvtbfr) %>% mutate(fnvote = ifelse(votept==2, 1, 0))
fr10 = fr10 %>% rename(votept = prtvtbfr) %>% mutate(fnvote = ifelse(votept==2, 1, 0))
fr12 = fr12 %>% rename(votept = prtvtcfr) %>% mutate(fnvote = ifelse(votept==2, 1, 0))
fr14 = fr14 %>% rename(votept = prtvtcfr) %>% mutate(fnvote = ifelse(votept==2, 1, 0))
fr16 = fr16 %>% rename(votept = prtvtcfr) %>% mutate(fnvote = ifelse(votept==2, 1, 0))
fr18 = fr18 %>% rename(votept = prtvtdfr) %>% mutate(fnvote = ifelse(votept==11, 1, 0))

fr08 = fr08 %>% select(lrscale, stfgov, imwbcnt, rlgatnd, pray, gndr, agea, eduyrs, hinctnta, votept, fnvote)
fr10 = fr10 %>% select(lrscale, stfgov, imwbcnt, rlgatnd, pray, gndr, agea, eduyrs, hinctnta, votept, fnvote)
fr12 = fr12 %>% select(lrscale, stfgov, imwbcnt, rlgatnd, pray, gndr, agea, eduyrs, hinctnta, votept, fnvote)
fr14 = fr14 %>% select(lrscale, stfgov, imwbcnt, rlgatnd, pray, gndr, agea, eduyrs, hinctnta, votept, fnvote)
fr16 = fr16 %>% select(lrscale, stfgov, imwbcnt, rlgatnd, pray, gndr, agea, eduyrs, hinctnta, votept, fnvote)
fr18 = fr18 %>% select(lrscale, stfgov, imwbcnt, rlgatnd, pray, gndr, agea, eduyrs, hinctnta, votept, fnvote)

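# Stack the six waves into one data frame
fr = rbind(fr08, fr10, fr12, fr14, fr16, fr18)

# Recode gender as a female dummy (gndr == 2) and drop the raw vote variable.
# gndr itself is kept, so gndr and female both enter the models below.
fr = fr %>% mutate(female = ifelse(gndr == 2, 1, 0))
fr = select(fr, -votept)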
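
The six waves go through the same three steps (filter, recode, select), so the repetition above could be folded into one helper. What follows is only a sketch under the same column names as above: `prep_wave` and the toy data frame `toy_wave` are hypothetical names of mine, the values are made up, and it uses base R so that it runs without the ESS files.

```r
# Hypothetical helper: keep French respondents, flag FN voters, keep model columns.
# vote_var and fn_code differ across waves (e.g. "prtvtbfr"/2 in 2008, "prtvtdfr"/11 in 2018).
prep_wave <- function(wave, vote_var, fn_code,
                      keep = c("lrscale", "agea", "gndr")) {
  fr <- wave[wave$cntry == "FR", , drop = FALSE]
  fr$fnvote <- ifelse(fr[[vote_var]] == fn_code, 1, 0)  # missing votes stay NA
  fr[, c(keep, "fnvote"), drop = FALSE]
}

# Toy stand-in for one ESS wave (made-up values, not survey data)
toy_wave <- data.frame(
  cntry    = c("FR", "FR", "DE", "FR"),
  prtvtbfr = c(2, 5, 2, NA),
  lrscale  = c(8, 3, 5, 6),
  agea     = c(54, 31, 40, 67),
  gndr     = c(1, 2, 2, 1)
)

toy_fr <- prep_wave(toy_wave, vote_var = "prtvtbfr", fn_code = 2)
print(toy_fr)
```

With a helper like this, each wave would need one call, and the results could still be stacked with rbind as above.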

First, I’m going to split the data frame into a training set and a testing set. Some tools default to a 3/4 training share (rsample::initial_split, for example), but I’m going to use an 80/20 split for this analysis.

fr %>%
  resample_partition(p = c(test = 0.2, train = 0.8)) -> frpart

frpart %>%
  pluck("train") %>%
  as_tibble() -> training

frpart %>%
  pluck("test") %>%
  as_tibble() -> testing
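
Because the partition is random, the accuracy figures will shift slightly from run to run unless a seed is fixed first. Here is a minimal base-R sketch of the same 80/20 idea on a made-up data frame. The names `toy`, `train_idx`, `toy_train`, and `toy_test`, and the seed value, are my own illustration and not how modelr does it internally.

```r
set.seed(2019)  # arbitrary seed, only there to make the split reproducible

# Made-up data: 100 rows with a rare binary outcome
toy <- data.frame(id = 1:100, fnvote = rbinom(100, 1, 0.1))

# Draw 80% of the row indices for training; the remaining rows go to testing
train_idx <- sample(nrow(toy), size = floor(0.8 * nrow(toy)))
toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

nrow(toy_train)  # 80
nrow(toy_test)   # 20
```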

Decision Tree:

dtree <- rpart::rpart(as.factor(fnvote) ~ ., data = training)
plot(partykit::as.party(dtree))

Random Forest:

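rforest <- randomForest(as.factor(fnvote) ~ .,
                    data = training,
                    ntree = 500,             # number of trees
                    mtry = 3,                # variables tried at each split
                    na.action = na.exclude)  # rows with missing values are dropped
# Out-of-bag predictions are stored in rforest$predicted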

# Variable importance as a tibble, with variable names in the first column
importance(rforest) %>% as_tibble() %>%
  mutate(var = rownames(importance(rforest))) %>%
  select(var, MeanDecreaseGini)

Results:

# A tibble: 10 x 2
   var      MeanDecreaseGini
   <chr>               <dbl>
 1 lrscale              97.3
 2 stfgov               76.4
 3 imwbcnt             103. 
 4 rlgatnd              42.6
 5 pray                 45.2
 6 gndr                 13.3
 7 agea                138. 
 8 eduyrs               92.0
 9 hinctnta             76.2
10 female               13.2

The decision tree suggests that a respondent’s FN vote is mostly predicted by age, ideology, and attitudes towards immigration. The random forest points the same way: the output above ranks the variables by importance, and agea (138), imwbcnt (103), and lrscale (97.3) come out on top. MeanDecreaseGini is the total decrease in node impurity (measured by the Gini index) from splits on a variable, averaged over all trees in the forest. Note that gndr and female carry the same information, so their nearly identical scores (13.3 and 13.2) reflect one variable entered twice.
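
To make the impurity idea concrete, here is a small worked sketch of the Gini decrease for a single split. The data are made up, the helpers `gini` and `gini_decrease` are hypothetical names of mine, and the scaling is only meant to show the principle, not to reproduce randomForest’s internal numbers.

```r
# Gini impurity of a node, given the class labels of its observations
gini <- function(y) {
  p <- table(y) / length(y)
  1 - sum(p^2)
}

# Impurity decrease from splitting a parent node into two children,
# with each child weighted by its share of the parent's observations
gini_decrease <- function(y, go_left) {
  left  <- y[go_left]
  right <- y[!go_left]
  gini(y) -
    (length(left)  / length(y)) * gini(left) -
    (length(right) / length(y)) * gini(right)
}

# Made-up node: 10 respondents, 4 of them FN voters
y   <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
age <- c(60, 58, 71, 65, 30, 25, 41, 62, 35, 28)

gini(y)                      # 0.48
gini_decrease(y, age >= 50)  # 0.32
```

A variable that keeps producing large decreases like this across many trees ends up with a high importance score.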

Model Assessment

The simplest way to evaluate a classifier is its accuracy: the share of predictions that match the observed outcome.
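
As a small illustration of that definition, accuracy can be read off a confusion matrix. The vectors are made up, and `obs`, `pred`, and `conf_mat` are names I am introducing here, not objects from the analysis.

```r
# Made-up observed outcomes and predictions (1 = voted FN)
obs  <- c(0, 0, 0, 0, 0, 0, 0, 0, 1, 1)
pred <- c(0, 0, 0, 0, 0, 0, 0, 1, 1, 0)

# Cross-tabulate observations against predictions
conf_mat <- table(observed = obs, predicted = pred)
print(conf_mat)

# Accuracy: correct predictions (the diagonal) over all predictions
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
accuracy  # 0.8
```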

training %>% 
  mutate(pred = predict(dtree,type="class", newdata=.)) %>%
  select(fnvote, pred) %>%
  mutate(pred = as.numeric(pred)-1) %>%
  mutate(category = "Training: Decision Tree") -> traindtree

# rforest$predicted holds out-of-bag predictions for the complete cases used in fitting
training %>% na.omit %>%
  mutate(pred = rforest$predicted) %>%
  select(fnvote, pred) %>% 
  mutate(pred = as.numeric(pred)-1) %>%
  mutate(category = "Training: Random Forest") -> trainrforest

testing %>%
  mutate(pred = predict(dtree,type="class", newdata=.)) %>%
  select(fnvote, pred) %>%
  mutate(pred = as.numeric(pred)-1) %>%
  mutate(category = "Testing: Decision Tree") -> testdtree

testing %>% 
  mutate(pred = predict(rforest, newdata=.)) %>%
  select(fnvote, pred) %>%
  mutate(pred = as.numeric(pred)-1) %>%
  mutate(category = "Testing: Random Forest") -> testrforest

bind_rows(traindtree, trainrforest, testdtree, testrforest) %>%
  group_by(category, fnvote, pred) %>%
  count() %>% na.omit %>%
  group_by(category) %>%
  mutate(totaln = sum(n),
         correct = ifelse(fnvote == pred, 1, 0)) %>%
  filter(correct == 1) %>%
  summarize(accuracy = sum(n)/first(totaln)) %>%
  separate(category, c("Data", "Model"), sep=": ") %>%
  # Reorder the four rows for display
  slice(4, 2, 1, 3) -> assessmodel

Output:

# A tibble: 4 x 3
  Data     Model         accuracy
  <chr>    <chr>            <dbl>
1 Training Random Forest    0.920
2 Testing  Random Forest    0.911
3 Testing  Decision Tree    0.913
4 Training Decision Tree    0.922

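As a rough rule of thumb, a model is acceptable if its accuracy is over 80%, and both models clear that bar: roughly 91% on the testing set and 92% on the training set. The small gap between training and testing accuracy suggests little overfitting. These figures should still be read against the share of non-FN voters in the sample, because with a rare outcome a model that never predicts an FN vote can also score high.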
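
One way to put an accuracy figure in context is to compare it with a majority-class baseline. The sketch below uses a made-up test set in which 9% of respondents voted FN; that share is my assumption for illustration, not the share in the ESS data, and `pred_never` is a hypothetical do-nothing model.

```r
# Made-up test set: 100 respondents, 9 of them FN voters
obs <- c(rep(1, 9), rep(0, 91))

# A "model" that never predicts an FN vote
pred_never <- rep(0, 100)

# Accuracy looks strong even though no FN voter is found
accuracy <- mean(pred_never == obs)  # 0.91

# Sensitivity: share of actual FN voters that the model identifies
sensitivity <- sum(pred_never == 1 & obs == 1) / sum(obs == 1)  # 0

# Baseline any classifier should beat: share of the majority class
baseline <- max(table(obs)) / length(obs)  # 0.91
```

Running the same comparison on the real testing set would show how much the tree and the forest add over always predicting the majority class.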