Predictive Modeling: FN Voters in France (Decision Trees and Random Forests)
My predictive modeling skills have been mostly theoretical, as I rarely took the time to practice them on real data. That’s why I decided to run this analysis and put the concepts into practice.
I’ll be using data from the European Social Survey (ESS) to build a decision tree and a random forest model of the Front National (FN) vote, now Rassemblement National, in France. These are the variables included in the analysis:
fnvote: respondent voted for FN (0/1)
lrscale: Left-right ideological scale (0:10)
stfgov: Satisfaction with the national government (0:10)
imwbcnt: Attitudes towards immigration (0:10)
rlgatnd: Religious services attendance (1:7)
pray: Prayer apart from religious services (1:7)
gndr: Gender (recoded below into female, 1 = female)
agea: Age in years
eduyrs: Education in years
hinctnta: Total household income (deciles)
Before moving on to cleaning and organizing the data, I need to start with a caveat. When it comes to predictive analytics, the more data the better; the sample I’m using, however, is small even after merging several ESS waves. That’s why I’ll be interpreting the results with caution.
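The code below assumes the six ESS waves are already loaded as data frames named ESS4 through ESS9. A minimal loading sketch with haven (the file names are placeholders for wherever the SPSS files actually live), along with the packages used throughout:

# Packages used throughout the analysis
library(tidyverse)     # dplyr, tidyr, purrr, ...
library(modelr)        # resample_partition()
library(rpart)         # decision trees
library(partykit)      # tree plotting
library(randomForest)  # random forests

# Placeholder file names: point these at the actual ESS SPSS files
ESS4 <- haven::read_sav("ESS4.sav")
ESS5 <- haven::read_sav("ESS5.sav")
ESS6 <- haven::read_sav("ESS6.sav")
ESS7 <- haven::read_sav("ESS7.sav")
ESS8 <- haven::read_sav("ESS8.sav")
ESS9 <- haven::read_sav("ESS9.sav")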
# Keep only the French respondents from each wave (ESS rounds 4-9, 2008-2018)
fr08 <- ESS4 %>% filter(cntry == "FR")
fr10 <- ESS5 %>% filter(cntry == "FR")
fr12 <- ESS6 %>% filter(cntry == "FR")
fr14 <- ESS7 %>% filter(cntry == "FR")
fr16 <- ESS8 %>% filter(cntry == "FR")
fr18 <- ESS9 %>% filter(cntry == "FR")

# Harmonize the party-vote variable: its name and the FN party code differ across waves
fr08 <- fr08 %>% rename(votept = prtvtbfr) %>% mutate(fnvote = ifelse(votept == 2, 1, 0))
fr10 <- fr10 %>% rename(votept = prtvtbfr) %>% mutate(fnvote = ifelse(votept == 2, 1, 0))
fr12 <- fr12 %>% rename(votept = prtvtcfr) %>% mutate(fnvote = ifelse(votept == 2, 1, 0))
fr14 <- fr14 %>% rename(votept = prtvtcfr) %>% mutate(fnvote = ifelse(votept == 2, 1, 0))
fr16 <- fr16 %>% rename(votept = prtvtcfr) %>% mutate(fnvote = ifelse(votept == 2, 1, 0))
fr18 <- fr18 %>% rename(votept = prtvtdfr) %>% mutate(fnvote = ifelse(votept == 11, 1, 0))

# Keep only the variables used in the analysis
fr08 <- fr08 %>% select(lrscale, stfgov, imwbcnt, rlgatnd, pray, gndr, agea, eduyrs, hinctnta, votept, fnvote)
fr10 <- fr10 %>% select(lrscale, stfgov, imwbcnt, rlgatnd, pray, gndr, agea, eduyrs, hinctnta, votept, fnvote)
fr12 <- fr12 %>% select(lrscale, stfgov, imwbcnt, rlgatnd, pray, gndr, agea, eduyrs, hinctnta, votept, fnvote)
fr14 <- fr14 %>% select(lrscale, stfgov, imwbcnt, rlgatnd, pray, gndr, agea, eduyrs, hinctnta, votept, fnvote)
fr16 <- fr16 %>% select(lrscale, stfgov, imwbcnt, rlgatnd, pray, gndr, agea, eduyrs, hinctnta, votept, fnvote)
fr18 <- fr18 %>% select(lrscale, stfgov, imwbcnt, rlgatnd, pray, gndr, agea, eduyrs, hinctnta, votept, fnvote)

# Stack the waves, recode gender, and drop the raw vote variable
# (note that gndr itself is also kept, so both gndr and female enter the models)
fr <- rbind(fr08, fr10, fr12, fr14, fr16, fr18)
fr <- fr %>% mutate(female = ifelse(gndr == 2, 1, 0))
fr <- select(fr, -votept)
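Because FN voters are only a small share of the pooled sample, it helps to check the class balance of the outcome before modeling; the share of 1s is also the baseline any classifier has to beat. A quick check:

# Distribution of the outcome; the mean of a 0/1 variable is the share of FN voters
table(fr$fnvote, useNA = "ifany")
mean(fr$fnvote, na.rm = TRUE)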
First, I’m going to split the data frame into a training set and a testing set. A common default proportion is 3/4, but I’m going to use an 80/20 split for this analysis.
# 80/20 train/test split
frpart <- fr %>% resample_partition(p = c(test = 0.2, train = 0.8))
training <- frpart %>% pluck("train") %>% as_tibble()
testing <- frpart %>% pluck("test") %>% as_tibble()
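resample_partition() draws the split at random, so the exact numbers below will differ from run to run. Fixing a seed beforehand makes the partition reproducible (the seed value here is arbitrary):

set.seed(1234)  # arbitrary; any fixed value makes the split reproducible
frpart <- fr %>% resample_partition(p = c(test = 0.2, train = 0.8))

# Sanity check: the two pieces should be roughly 20/80
map_int(frpart, ~ nrow(as_tibble(.x)))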
Decision Tree:
# Fit a classification tree with rpart's defaults, then plot it
dtree <- rpart::rpart(as.factor(fnvote) ~ ., data = training)
plot(partykit::as.party(dtree))
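rpart controls tree size through the complexity parameter cp and reports a cross-validated error for each candidate size, which can be used to prune the tree. A sketch, using only rpart's built-in tools:

# Cross-validated error (xerror) for each candidate tree size
printcp(dtree)

# Prune back to the cp value with the lowest cross-validated error
bestcp <- dtree$cptable[which.min(dtree$cptable[, "xerror"]), "CP"]
dtree_pruned <- rpart::prune(dtree, cp = bestcp)
plot(partykit::as.party(dtree_pruned))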

Random Forest:
# Fit a 500-tree random forest; mtry = 3 predictors are tried at each split.
# randomForest estimates out-of-bag (OOB) error automatically, so no extra flag is needed.
rforest <- randomForest(as.factor(fnvote) ~ ., data = training,
                        ntree = 500, mtry = 3, na.action = na.exclude)

# Variable importance as mean decrease in Gini impurity
importance(rforest) %>%
  as_tibble() %>%
  mutate(var = rownames(importance(rforest))) %>%
  select(var, MeanDecreaseGini)
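Before looking at the results, one sanity check worth doing: mtry = 3 is close to the usual sqrt(p) default for classification, and the OOB error that randomForest tracks as trees are added can confirm that 500 trees are enough. A sketch:

# OOB error estimate after all 500 trees (first column of the last row)
tail(rforest$err.rate, 1)

# OOB error as trees accumulate; the curve should flatten well before 500
plot(rforest)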
Results:
# A tibble: 10 x 2
   var      MeanDecreaseGini
   <chr>               <dbl>
 1 lrscale              97.3
 2 stfgov               76.4
 3 imwbcnt             103. 
 4 rlgatnd              42.6
 5 pray                 45.2
 6 gndr                 13.3
 7 agea                138. 
 8 eduyrs               92.0
 9 hinctnta             76.2
10 female               13.2
The decision tree model suggests that the FN vote is mostly predicted by respondents’ age, ideology, and attitudes towards immigration, and the random forest model points to the same variables. The output above ranks the predictors by importance: MeanDecreaseGini is the total decrease in node impurity (measured by the Gini index) produced by splits on a given variable, averaged over all trees, so higher values indicate more influential predictors.
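randomForest also ships a built-in dotchart of these importance scores, which makes the ranking easier to scan than the raw table:

# Dotchart of mean decrease in Gini, sorted from most to least important
randomForest::varImpPlot(rforest, main = "Variable importance, FN vote")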
Model Assessment
A straightforward way to evaluate the performance of a model is its accuracy: the share of observations for which the predicted value matches the observed one.
# Predicted classes for each model, on the training and testing data.
# For the random forest, rforest$predicted holds the out-of-bag predictions.
traindtree <- training %>%
  mutate(pred = predict(dtree, type = "class", newdata = .)) %>%
  select(fnvote, pred) %>%
  mutate(pred = as.numeric(pred) - 1, category = "Training: Decision Tree")

trainrforest <- training %>%
  na.omit() %>%
  mutate(pred = rforest$predicted) %>%
  select(fnvote, pred) %>%
  mutate(pred = as.numeric(pred) - 1, category = "Training: Random Forest")

testdtree <- testing %>%
  mutate(pred = predict(dtree, type = "class", newdata = .)) %>%
  select(fnvote, pred) %>%
  mutate(pred = as.numeric(pred) - 1, category = "Testing: Decision Tree")

testrforest <- testing %>%
  mutate(pred = predict(rforest, newdata = .)) %>%
  select(fnvote, pred) %>%
  mutate(pred = as.numeric(pred) - 1, category = "Testing: Random Forest")

# Accuracy: the share of correct predictions per data/model combination
assessmodel <- bind_rows(traindtree, trainrforest, testdtree, testrforest) %>%
  group_by(category, fnvote, pred) %>%
  count() %>%
  na.omit() %>%
  group_by(category) %>%
  mutate(totaln = sum(n), correct = ifelse(fnvote == pred, 1, 0)) %>%
  filter(correct == 1) %>%
  summarize(accuracy = sum(n) / first(totaln)) %>%
  separate(category, c("Data", "Model"), sep = ": ") %>%
  slice(5, 4, 6, 2, 1, 3)  # reorder rows for display
Output:
# A tibble: 4 x 3
  Data     Model         accuracy
  <chr>    <chr>            <dbl>
1 Training Random Forest    0.920
2 Testing  Random Forest    0.911
3 Testing  Decision Tree    0.913
4 Training Decision Tree    0.922
As a rule of thumb, a model is often considered acceptable if its accuracy exceeds 80%, and by that standard both models predict the FN vote well. One caveat: since FN voters make up only a small share of the sample, a model that always predicted 0 would also reach a high accuracy, so these figures should be read against that baseline.
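A confusion matrix makes that caveat concrete by showing how many actual FN voters the model recovers, rather than just the overall hit rate. A quick sketch for the random forest on the testing set, reusing the testrforest object from above:

# Rows are observed votes, columns are predicted votes;
# the bottom-right cell counts correctly identified FN voters.
testrforest %>%
  with(table(observed = fnvote, predicted = pred))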