#### Giulia Vannucci <sup>a</sup> , Anna Gottard <sup>a</sup> , Leonardo Grilli <sup>a</sup> , Carla Rampichini <sup>a</sup> <sup>a</sup> Department of Statistics, Computer Science, Applications "G. Parenti" **Random effects regression trees for the analysis of INVALSI data**

Random effects regression trees for the analysis of INVALSI data

& Florence Center for Data Science, University of Florence, Florence, Italy Giulia Vannucci, Anna Gottard, Leonardo Grilli, Carla Rampichini

#### 1. Introduction

Multilevel data structures, where data are typically clustered in nested levels, are common in many fields. An emblematic example consists of students, that are grouped in classes and schools (individual cross-sectional data) or children growth evaluated at several time points (repeated measures). Multilevel data require specific models referred to as *multilevel*, *random effects* or *mixed* (Snijders and Bosker, 2012).

Model specification is a challenging task in mixed models. Typically, a linear model is assumed, although non-linearities and interaction effects are undeniably of interest. A worthwhile approach exploits regression trees and the CART algorithm (Breiman et al., 1984) to capture non-linearities and high-order interaction effects. In particular, regression trees are a statistical learning algorithm that shapes the regression function as piece-wise constant over a recursively found partition of the covariate space. The graphical display of the recursive partition provides an easy interpretation of this predictive algorithm. The procedure, however, assumes statistical units to be independent, which is not the case of clustered data.

Regression trees have been extended to clustered data by Hajjem et al. (2011), who proposed to model fixed effects with a decision tree while accounting for random effects via a linear mixed model in a separate, subsequent, step. In particular, they first apply the CART algorithm as if data were not clustered to estimate the fixed effects. It is shown that random effect regression trees are less sensitive to parametric assumptions and provide improved predictive power compared to linear models with random effects and regression trees without random effects. The literature has thereon grown with variants and extensions. Among others, see Sela (2012); Hajjem et al. (2014); Miller et al. (2017).

In this work, we propose a further variation of the mixed effects regression tree, where the fixed and the random part parameters are estimated jointly, using a backfitting algorithm. To ease the interpretation, our proposal incorporates a linear component additively to the regression trees. Consequently, the general trend of dependence is captured by the linear component, while the tree part captures interactions and non-linearities.

The proposed algorithm is then applied to data collected by the national institute for the evaluation of the educational system and training (INVALSI: Istituto Nazionale per la VALutazione del Sistema educativo di Istruzione e di formazione) in Italy. The study aims to compare schools' educational effectiveness impartially by measuring students' progress over their careers. We focus on test scores in Mathematics, given some characteristics of the school and the pupil. The proposed model is able to take into account the student clustering in schools and to capture interesting interactions between student-level covariates and school-level covariates.

The rest of the paper is organised as follows. Section 2 illustrates the model proposed, together with the backfitting algorithm. Section 3 describes the application of the proposal to INVALSI data. A brief section of final remarks concludes the paper.

21 Giulia Vannucci, University of Florence, Italy, giulia.vannucci@unifi.it, 0000-0003-3569-6274 Anna Gottard, University of Florence, Italy, anna.gottard@unifi.it, 0000-0002-8246-4962 Leonardo Grilli, University of Florence, Italy, leonardo.grilli@unifi.it, 0000-0002-3886-7705 Carla Rampichini, University of Florence, Italy, carla.rampichini@unifi.it, 0000-0002-8519-083X

FUP Best Practice in Scholarly Publishing (DOI 10.36253/fup\_best\_practice)

Giulia Vannucci, Anna Gottard, Leonardo Grilli, Carla Rampichini, *Random effects regression trees for the analysis of INVALSI data*, pp. 29-34, © 2021 Author(s), CC BY 4.0 International, DOI 10.36253/978-88-5518-304-8.07, in Bruno Bertaccini, Luigi Fabbris, Alessandra Petrucci, *ASA 2021 Statistics and Information Systems for Policy Evaluation. Book of short papers of the opening conference*, © 2021 Author(s), content CC BY 4.0 International, metadata CC0 1.0 Universal, published by Firenze University Press (www.fupress.com), ISSN 2704-5846 (online), ISBN 978-88-5518-304-8 (PDF), DOI 10.36253/978-88-5518-304-8

## 2. A tree embedded linear mixed model

We propose a random effect model, called *Tree Embedded Linear Mixed* (TELM) *model*, able to treat both non-linear and interaction effects and cluster mean dependencies. Motivated by the application of interest, we consider in particular a two-level random effect model. Hence, we will denote as *level 1 units* the statistical units (e.g. students) and *level 2 units* the groups (e.g. schools).

The model is a piecewise-linear regression function, consisting of the sum of a tree component and a mixed effect linear component. The proposal is the mixed effect version of the semi-linear regression trees (Vannucci, 2019). It can be ideally divided into three parts: a fixed effect linear part, a fixed effect non-linear part based on a tree and a random effect part. The resulting model can be formulated as

$$Y\_{ij} = \beta\_0 + \mathbf{X}\_{ij}\boldsymbol{\beta} + \mathbf{Z}\_j \boldsymbol{\gamma} + T(\mathbf{X}\_{ij}, \mathbf{Z}\_j) + U\_j + \epsilon\_{ij} \tag{1}$$

where Yij is the response variable for level 1 unit i belonging to level 2 unit j, β<sup>0</sup> is the (fixedeffect) regression intercept, Xij is the vector of the level 1 covariates, β the associated fixed effect coefficients, Z<sup>j</sup> is the vector of the level 2 covariates, γ the associated fixed effect coefficients. Here, T(Xij ,Z<sup>j</sup> ) is the tree based component depending on some or all the level 1 and the level 2 explanatory variables. Finally, U<sup>j</sup> ∼ N(0, σ<sup>2</sup> <sup>u</sup>) is the random intercept for level 2 unit j and ij ∼ N(0, σ<sup>2</sup> ) are the regression errors.

The model is additive in its components where the tree-component acts as a region-specific categorical variable. This can be seen in the following alternative specification

$$Y\_{ij} = \beta\_0 + \mathbf{X}\_{ij}\boldsymbol{\beta} + \mathbf{Z}\_j \gamma + \sum\_{m=1}^{M} \mu\_m \mathbb{I}\{ (\mathbf{X}\_{ij}, \mathbf{Z}\_j) \in R\_m \} + U\_j + \epsilon\_{ij},\tag{2}$$

where R1,...,R<sup>M</sup> is the partition of the predictor space corresponding to the tree-component. When the unknown regression function can be assumed to be quasi-linear (Wermuth and Cox, 1998), the number of leaf nodes M can be kept small to avoid overfitting.

To account for the contextual effects of level 1 predictors, we add the cluster mean W<sup>j</sup> = (1/n<sup>j</sup> ) <sup>n</sup><sup>j</sup> <sup>i</sup>=1 Wij to the set of level 2 predictors Z<sup>j</sup> (Snijders and Bosker, 2012).

An iterative, backfitting-like procedure obtains model fitting. First, the tree is initialised at the mean of the response variable and the partial residuals Y <sup>∗</sup> are computed by subtracting to Y the tree prediction. Secondly, a linear random intercept effect model is fitted on Y <sup>∗</sup> and explanatory variables at the individual and group level. The corresponding partial residuals Y ∗∗ are obtained by subtracting to Y model predictions. These partial residuals Y ∗∗ are employed in the next step to fit a new tree, using the CART algorithm (Breiman et al., 1984) with a short depth. We iterate alternating the two fitting steps until convergence is reached. At the end of the procedure, model (2) is fitted by a linear random effect model using the partition associated with the tree selected at convergence. The leaf node parameters µ<sup>m</sup> are estimated jointly with the other model parameters β0, β, γ, σ<sup>2</sup> <sup>u</sup>, σ<sup>2</sup> ε .

The main difference of our procedure with respect to previous proposals (Hajjem et al., 2011; Sela, 2012), is the inclusion of the linear component Xijβ + Zjγ in the random effect model (2). In the presence of quasi-linear relationships, this inclusion allows us to avoid overfitting and helps interpretation. Moreover, since the µ<sup>m</sup> are jointly estimated in the final step, standard hypothesis tests and confidence intervals can be used for model selection and evaluation, together with the mean squared error computed on a test data set for prediction accuracy evaluation.

## 3. Application: Invalsi tests in Italian schools

We apply the TELM model outlined in the previous section to data on students' achievement collected by INVALSI. The Institute yearly carries out standardised tests to assess students' achievement in mathematics and reading and evaluate the overall quality of the educational offering of schools and vocational training institutes. See Arpino et al. (2019) for a discussion on this set of data.

As an illustration, we are here focusing on data on students who participated in the Maths tests at 5th and 8th grades. Specifically, the dataset is obtained by linking data on students who attended the 5th grade in 2013-2014 with data on students who attended the 8th grade in 2016-2017. The number of students who participated on both occasions of the Maths test is 409 528. They are grouped into 5 773 schools. We aim to predict the Maths test score, while understanding which of the included variables may be associate to the final score. Table 1 lists the considered explanatory variables. As shown in the table, we include both student level and school-level covariates, denoted in (1) as Xij and Z<sup>j</sup> respectively. Among the school level variables, we consider, in addition, the average of 5th grade Maths test and the average of the Socio-economic status index for each school. We are denoting these variables CM MATH5 and CM SES.

Table 1: Student and school level variables (INVALSI data years 2014 and 2017).


The proposed model takes into account both linear and non-linear effects and can detect the presence of both within level and cross-level interaction effects. In particular, the tree component T(Xij ,Z<sup>j</sup> ) in (1) is modelling non-linearities and interactions at once via a piece-wise linear function. Estimates for model parameters are reported in Table 2, while the tree component is also illustrated in Figure 1. The two terminal nodes without label in the plot have been automatically set in the reference category.

Individual and school level covariates not selected by the algorithm in the tree component have the usual interpretation. For example, controlling for the model covariates, females have, on average, around 1.5 points less than males in the score of math at the 8th grade.

Besides the usual interpretation of the coefficients of the linear components, it seems here interesting to focus on the covariates selected in the tree component of the model, namely the math score at grade 5 (MATH5) and the geographical area of the school (AREA). In particular, the tree component algorithm splits the values of MATH5 into three intervals: below 33 (2% of the observations), between 33 and 72 (55%) and above 72 (43%). Moreover, the algorithm


Table 2: TELM model fitted on INVALSI data: parameter estimates, standard errors and t-test.

splits the schools into two groups depending on AREA: schools placed in North or Center Italy, and schools placed in South Italy and Islands. Thus, the algorithm suggests the presence of an interaction effect between these two variables, with the effect of AREA depending on the interval of MATH5 and vice versa. For example, for a pupil living in a region of NW of Italy, the expected difference with respect to a pupil with same characteristics living in the NE of Italy (baseline) is 2.4386 if MATH5< 33, it decreases to 0.4675 if 33 <MATH5< 73, and it rises up to 4.9494 if MATH5 ≥ 73.

Note that the ordinary mixed effect regression model, whose parameter estimates are reported in Table 3, is nested with the TELM model. The Likelihood Ratio test comparing these two models obtains a test statistic equal to 10168, with 4 degrees of freedom, in favour of the TELM model. The variation between the estimates in the two models is due to the inclusion of the tree component, that relaxes the assumption of linearity and includes interaction effects. An interesting variation concerns the AREA coefficients estimates. Ignoring the AREA and MATH5 interaction, and the MATH5 non-linearity, completely reverse the main effect of AREA for South and Islands.

Figure 1: Graphical representation of the tree component of TELM model in Table 2 (nodes with a label correspond to a parameter in the model; the proportions of level 1 observations at each node are: left white node 0.35, N1 0.20, N2 0.17, N3 0.26, N4 0.01, right blue node 0.01)

Table 3: Random intercept model fitted on INVALSI data: parameter estimates, standard errors and t-test.


# 4. Conclusions

Tree Embedded Linear Mixed (TELM) models extend random effect models by including both a linear component and tree component in the regression function. The proposal increases the flexibility and the predictive ability of ordinary random effects models by handling simultaneously linear and non-linear associations and interactions.

A TELM model has the following characteristics: (1) it can handle clusters with different numbers of observations (unbalanced clusters); (2) it allows the inclusion of level 1 and level 2 covariates in the splitting process; (3) it allows observation-level covariates to have random effects. Besides, our proposal extends random effect regression trees in two directions: (i) incorporating a linear component in the final random effect model, and (ii) allowing to take into account contextual effects of level 1 covariates.

The application on INVALSI data is an illustrative example of TELM models that shows how the inclusion of a tree component helps highlight cross-level interactions.

## References

