#### Giuseppe Bove **Measures of interrater agreement when each target is evaluated by a different group of raters**

> Dipartimento di Scienze della Formazione, Università Roma Tre Giuseppe Bove

# **1. Introduction**

Measures of interrater agreement such as Cohen's *kappa* (and its weighted versions) and intraclass correlations are usually defined for ratings regarding a group of targets (subjects or objects), each rated by the same group of raters. This happens when the agreement among clinical diagnoses provided by several physicians on the same set of patients is analysed to identify the best treatment for the patients, or when the agreement among ratings of educators who assess, on a new ordinal rating scale, the language proficiency of a corpus of argumentative (written or oral) texts is considered to test the reliability of the new scale.

In other situations, the agreement between ratings is analysed in a group of targets where each target is evaluated by a different group of raters, for instance when teachers in a school are evaluated by a questionnaire administered to all the pupils (students) in the classroom. In these situations, it is important to analyse the reliability of the judgments by a measure of agreement between ratings, but since the ordering of the ratings assigned to each target is irrelevant, the measure can only be defined starting from the single target level.

In this paper, an index is proposed to evaluate the agreement between raters for each single target rated on an ordinal scale, and also to obtain a global measure of the interrater agreement for the whole group of targets evaluated. The main features of the proposal will be illustrated in a study for the assessment of the behaviour of student teachers in the classroom. Data were collected in a study conducted in 2018 at Roma Tre University with students of the degree course in Formazione Primaria, during their experience of internship ("tirocinio").

# **2. Target-specific measures of interrater agreement**

When ratings provided on a quantitative (interval or ratio) scale are analysed in a group of targets where each target is evaluated by a different group of raters, a first approach available to measure the level of agreement for the whole group of targets is based on the one-way random-effects ANOVA model (e.g., Shrout & Fleiss, 1979; McGraw & Wong, 1996). The intraclass correlation (ICC) for this model is the between-target variance divided by the sum of the between-target variance and the error variance (this sum is the total variance of the ratings). A high value of ICC indicates good agreement among raters, because it is obtained when the between-target variance exceeds the error variance (which includes the within-target variance) by a wide margin. However, a low ICC value is not necessarily an indication of poor agreement, because a severe restriction in the range of ratings assigned in good agreement by the raters can cause low values of the between-target variance and low values of the ICC (the restriction of variance problem, LeBreton et al., 2003).
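The restriction of variance problem can be illustrated with a minimal Python sketch. It assumes the usual moment estimators of the one-way random-effects model for groups of unequal size; the function name `icc_one_way`, the effective group size `n0` and the toy ratings are illustrative assumptions, not part of the original study.

```python
from statistics import mean


def icc_one_way(groups):
    """One-way random-effects ICC; `groups` holds one list of ratings per target."""
    a = len(groups)                      # number of targets
    sizes = [len(g) for g in groups]     # raters per target (may differ)
    n_total = sum(sizes)
    grand_mean = sum(sum(g) for g in groups) / n_total

    ss_between = sum(n * (mean(g) - grand_mean) ** 2 for g, n in zip(groups, sizes))
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    ms_between = ss_between / (a - 1)
    ms_within = ss_within / (n_total - a)

    # Effective number of raters per target for unbalanced designs (assumption:
    # the usual ANOVA moment estimator, not a formula given in the text).
    n0 = (n_total - sum(n * n for n in sizes) / n_total) / (a - 1)
    var_between = (ms_between - ms_within) / n0
    return var_between / (var_between + ms_within)


# Toy data (invented): same within-target spread, different between-target spread.
wide = [[1, 1, 2], [3, 3, 3, 4], [5, 5, 4]]
narrow = [[4, 4, 5], [4, 5, 4, 4], [5, 4, 4]]
print(icc_one_way(wide), icc_one_way(narrow))
```

In this toy example the raters of each target disagree by the same amount in both data sets, yet the ICC is high for `wide` and low (here even negative) for `narrow`, only because the targets in `narrow` all receive similar ratings.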

To overcome this problem of the ICC, target-specific measures of interrater agreement were proposed to work separately with each target *i* in the corresponding row of ratings in the targets × raters data matrix. James et al. (1984) proposed the index

$$r\_{WG,i} = 1 - \frac{s\_i^2}{\sigma\_E^2}.$$

Giuseppe Bove, Roma Tre University, Italy, giuseppe.bove@uniroma3.it, 0000-0002-2736-5697 Referee List (DOI 10.36253/fup\_referee\_list)

Giuseppe Bove, *Measures of interrater agreement when each target is evaluated by a different group of raters*, © Author(s), CC BY 4.0, DOI 10.36253/979-12-215-0106-3.28, in Enrico di Bella, Luigi Fabbris, Corrado Lagazio (edited by), *ASA 2022 Data-Driven Decision Making. Book of short papers*, pp. 157-162, 2023, published by Firenze University Press and Genova University Press, ISBN 979-12-215-0106-3, DOI 10.36253/979-12-215-0106-3

where $s\_i^2$ is the observed variance of the ratings in profile *i*, and $\sigma\_E^2$ is the variance obtained from a theoretical null distribution representing a complete lack of agreement among raters (e.g., the uniform distribution). For raters in perfect agreement, we have $s\_i^2 = 0$, with a corresponding value $r\_{WG,i} = 1$. For a total lack of agreement, the observed variance approaches the variance obtained from the theoretical null distribution. This leads $r\_{WG,i}$ to approach 0.

A global measure of agreement for the whole group of targets can be defined as the arithmetic average of the $r\_{WG,i}$ values ($\bar{r}\_{WG} = \frac{1}{N} \sum\_{i=1}^{N} r\_{WG,i}$, where *N* is the number of targets). The accuracy of the index depends strongly on the specification of the null distribution, and negative values could be obtained. Other possible indices for quantitative scales are reviewed, for instance, in LeBreton & Senter (2008). Recently, Bove (2022) has considered the normalised standard deviation and the coefficient of variation as possible alternatives to ICC and $r\_{WG,i}$.
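As a minimal sketch of the index of James et al. (1984), the following Python code takes the uniform distribution on the levels 1, …, *K* as the null distribution, whose variance is (*K*² − 1)/12. The function name `r_wg`, the use of the sample variance with an *n* − 1 denominator, and the toy ratings are assumptions made for illustration.

```python
from statistics import mean, variance


def r_wg(ratings, n_levels):
    """r_WG for one target rated on a scale with levels 1..n_levels."""
    s2 = variance(ratings)                 # sample variance, n - 1 denominator (assumption)
    sigma2_e = (n_levels ** 2 - 1) / 12    # variance of a discrete uniform on 1..K
    return 1 - s2 / sigma2_e


# Invented ratings on a 5-level scale; each list is the profile of one target.
profiles = [[3, 3, 3, 3], [2, 3, 3, 4, 3], [1, 5, 1, 5]]
values = [r_wg(p, 5) for p in profiles]
print(values, mean(values))                # target-specific values and global average
```

The third profile, where the raters split between the two extremes, has an observed variance larger than the uniform null variance, which is one way the negative values mentioned above can arise.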

All the approaches described regard quantitative scales and are not appropriate for ordinal and nominal scales. Most of the indices of interrater agreement proposed for ratings on an ordinal scale (frequently averages of the weighted *kappa* of Cohen calculated for each of the possible pairs of raters) are not suitable for ratings regarding a group of targets, each rated by a different group of raters.

In order to propose a new index of interrater agreement for ordinal scales, the representation of the profile of the ratings for target *i* on a *K*-level ordinal scale in Table 1 is considered,


**Table 1** – Profile of the ratings for target *i* on a *K*-level ordinal scale

where $n\_{ik}$ is the number of raters assigning level *k* to target *i* and $n\_i$ is the number of raters that rate target *i*. We propose a general approach that defines target-specific interrater agreement indices as normalised indices of variability for the distribution in profile *i*, according to the measurement level of the scale. A global measure of agreement can be defined as the arithmetic average of the target-specific values of the indices.

 So, for ordinal scales, the following index of interrater agreement can be considered (analogous with the measure of dispersion for ordinal variables, e.g., Leti, 1983),

$$\delta\_i = 1 - \frac{D\_i}{D\_{\max}} = 1 - \frac{2 \sum\_{k=1}^{K-1} F\_{ik} (1 - F\_{ik})}{D\_{\max}}$$

where $F\_{ik}$ is the cumulative proportion associated with level *k* of the scale in the response profile *i*, for *k* = 1, 2, …, *K*, and $D\_{\max}$ is the maximum of $D\_i = 2 \sum\_{k=1}^{K-1} F\_{ik} (1 - F\_{ik})$, which is $D\_{\max} = \frac{K-1}{2}$ when the number of raters $n\_i$ is even, and $D\_{\max} = \frac{K-1}{2}\left(1 - \frac{1}{n\_i^2}\right)$ when $n\_i$ is odd.

The index $\delta\_i$ is always nonnegative: $\delta\_i = 1$ in the case of maximum agreement and $\delta\_i = 0$ in the case of maximum disagreement. Some simulations and experience with real applications suggest the following thresholds for the interpretation of the values assumed by the index: values lower than 0.6 indicate low to moderate agreement, values between 0.6 and 0.8 good agreement, and values above 0.8 excellent agreement. The index allows for the identification of particular targets for which agreement is low: this is not possible with measures like *kappa* or intraclass correlations. Besides, a global measure of agreement can be defined as the arithmetic average of the values obtained for the *N* targets ($\bar{\delta} = \frac{1}{N} \sum\_{i=1}^{N} \delta\_i$). The index is not affected by the possible concentration of ratings in a few levels of the scale, as happens for the measures based on the ANOVA approach or for the *kappa*-type indices, and, unlike $r\_{WG,i}$, it does not depend on the definition of a null distribution.
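The computation of the ordinal index for a single target, and of its average over targets, can be sketched in Python as follows. The function name `delta_index`, the coding of the levels as integers 1, …, *K*, and the toy profiles are assumptions made for illustration.

```python
from statistics import mean


def delta_index(ratings, n_levels):
    """Ordinal agreement index for one target; levels are coded 1..n_levels."""
    n = len(ratings)
    if n < 2:
        raise ValueError("at least two raters are needed")
    cumulative = 0
    d = 0.0
    for k in range(1, n_levels):                 # k = 1, ..., K - 1
        cumulative += sum(1 for r in ratings if r == k)
        f = cumulative / n                       # cumulative proportion F_ik
        d += 2 * f * (1 - f)
    d_max = (n_levels - 1) / 2                   # maximum for an even number of raters
    if n % 2 == 1:
        d_max *= 1 - 1 / n ** 2                  # correction for an odd number of raters
    return 1 - d / d_max


# Invented profiles on a 4-level scale, each with a different group of raters.
profiles = [[4, 4, 4, 4], [3, 3, 4, 4, 4], [1, 1, 4, 4]]
deltas = [delta_index(p, 4) for p in profiles]
print(deltas, mean(deltas))                      # target-specific values and global average
```

The first profile shows complete agreement and the last one splits the raters evenly between the two extreme levels, so they reproduce the two boundary cases of the index.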

In the next section, an application will be shown in which teachers in a school are evaluated by a questionnaire administered to all the pupils in the classrooms, so each teacher is evaluated by a different group of pupils. In this situation, it is interesting to analyse the level of dispersion of the ratings in the classrooms with respect to each question of the questionnaire, in order to investigate aspects of the reliability of the ratings. Then, a matrix $\Delta = (\delta\_{ij})$ is defined where each row corresponds to a teacher and each column to a question, and the entry $\delta\_{ij}$ is the value of $\delta\_i$ computed in the classroom of teacher *i* for question *j* (an example is provided in Table 2). Entries of matrix $\Delta$ can be considered as similarities between teachers and questions. The values $\delta\_{ij}$ can be depicted in a diagram by the *unfolding* model (originally proposed by Coombs (1964) for rectangular matrices of preference scores). The model is

$$f\left(\delta\_{ij}\right) = p\_{ij} = \sqrt{\Sigma\_{s=1}^{t} \left(a\_{is} - b\_{js}\right)^2} + \varepsilon\_{ij},\tag{1}$$

where $f$ is a monotone transformation, mapping the similarities $\delta\_{ij}$ into a set of dissimilarities $p\_{ij}$ (e.g., $p\_{ij} = 1 - \delta\_{ij}$), $a\_{is}$ and $b\_{js}$ are the coordinates respectively of row (teacher) *i* and column (question) *j* on dimension *s* in a *t*-dimensional space, and $\varepsilon\_{ij}$ is a residual term. It is worth noting that the Euclidean distance model usually used in multidimensional scaling for square dissimilarity matrices (e.g., Borg & Groenen, 2005) is a constrained version of model (1), because for each *j* it is required that $a\_{js} = b\_{js}$ on every dimension *s*.

So, a diagram for the pattern of relationships is obtained where each row (teacher) is represented as a point with coordinates $(a\_{i1}, \ldots, a\_{it})$ and each column (question) as a point with coordinates $(b\_{j1}, \ldots, b\_{jt})$. In the planar representation (*t* = 2), the distance between row (teacher) *i* and column (question) *j* approximates the corresponding dissimilarity $p\_{ij}$ (so, for instance, we can detect in the diagram both the teachers and the questions with low/high levels of agreement of ratings in the classrooms). Distances within each of the two sets of the row-points and the column-points are only implicitly defined and do not have corresponding observed entries in the data matrix. Parameters in model (1) are estimated by iterative algorithms that, starting from initial estimates $a\_{is}^0$, $b\_{js}^0$ (*initial configuration*), iteratively decrease a least squares loss function by moving the vectors $a\_i^0 = (a\_{i1}^0, a\_{i2}^0, \ldots, a\_{it}^0)$ and $b\_j^0 = (b\_{j1}^0, b\_{j2}^0, \ldots, b\_{jt}^0)$ until convergence to a minimum. An important point is picking a good initial configuration to avoid the problem of *local minima*.
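The idea of iteratively decreasing a least squares loss can be illustrated with a deliberately simple Python sketch. It is not the estimation algorithm used in the paper: it assumes a fixed transformation (dissimilarity equal to one minus the similarity), plain gradient descent with step halving on the raw sum of squared residuals, and a random initial configuration. The names `raw_stress` and `fit_unfolding`, the default settings and the toy matrix are all hypothetical.

```python
import math
import random


def raw_stress(P, A, B):
    """Sum of squared differences between dissimilarities and distances."""
    return sum((P[i][j] - math.dist(A[i], B[j])) ** 2
               for i in range(len(A)) for j in range(len(B)))


def fit_unfolding(P, t=2, n_iter=500, step=0.05, seed=0):
    """Gradient descent on raw stress with step halving; returns (A, B, loss)."""
    rng = random.Random(seed)
    n_rows, n_cols = len(P), len(P[0])
    # Random initial configuration (a poor start can end in a local minimum).
    A = [[rng.uniform(-1, 1) for _ in range(t)] for _ in range(n_rows)]
    B = [[rng.uniform(-1, 1) for _ in range(t)] for _ in range(n_cols)]
    loss = raw_stress(P, A, B)
    for _ in range(n_iter):
        grad_a = [[0.0] * t for _ in range(n_rows)]
        grad_b = [[0.0] * t for _ in range(n_cols)]
        for i in range(n_rows):
            for j in range(n_cols):
                d = max(math.dist(A[i], B[j]), 1e-12)   # avoid division by zero
                w = -2.0 * (P[i][j] - d) / d
                for s in range(t):
                    g = w * (A[i][s] - B[j][s])
                    grad_a[i][s] += g                   # derivative w.r.t. a_is
                    grad_b[j][s] -= g                   # derivative w.r.t. b_js
        new_a = [[A[i][s] - step * grad_a[i][s] for s in range(t)] for i in range(n_rows)]
        new_b = [[B[j][s] - step * grad_b[j][s] for s in range(t)] for j in range(n_cols)]
        new_loss = raw_stress(P, new_a, new_b)
        if new_loss < loss:          # accept only moves that decrease the loss
            A, B, loss = new_a, new_b, new_loss
        else:                        # otherwise retry with a smaller step
            step /= 2
    return A, B, loss


# Invented 3 x 2 similarity matrix (rows: teachers, columns: questions).
delta = [[0.9, 0.4], [0.8, 0.5], [0.3, 0.7]]
P = [[1 - v for v in row] for row in delta]
A, B, loss = fit_unfolding(P)
print(loss)
```

Because a move is accepted only when it lowers the loss, the returned value can never exceed the loss of the initial configuration; running the function with several values of `seed` is a simple way to see how the solution depends on the starting point.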

# **3. Application**

A reduced version for pupils of the Teachers' Educational Practices Questionnaire (TEP-Q, Catalano et al., 2014) was administered to evaluate a group of 24 female student teachers of Roma Tre University, during their training (internship) in several primary schools of the Italian region Lazio, in the 2018 school year. The questionnaire consists of the following 12 questions regarding teachers' behaviour in the classroom: "In the class she was relaxed" (Q1), "Before each activity, she clearly explained what we had to do" (Q2), "When someone approached her, she turn to look at him" (Q3), "She help us to repeat one thing better if we were not so clear" (Q4), "When someone of us was saying something, she interrupted him" (Q5), "When she talked to us, she also used gestures (for example, she moved her hands)" (Q6), "She yelled at the class when she get angry" (Q7), "If someone of us needed to be consoled, she has noticed it, even if he did not tell her" (Q8), "During the activities she told us we could help each other" (Q9), "When she was tired, she complained in class" (Q10), "She made us do group work" (Q11), "She praised us when we deserved it" (Q12). Answers were provided on a 4-level Likert scale (1 = almost never, 4 = almost always).

For each student teacher, ratings were obtained from the pupils in the classroom (24 school classrooms, 418 pupils, 204 females, 214 males, aged between 7 and 12 years). For each student teacher *i* and each question *j*, the value of the index $\delta\_{ij}$ was computed in order to analyse the reliability of the ratings provided by the pupils in the school classroom. Table 2 contains the matrix of the values $\delta\_{ij}$ and in addition, in the last row, the average $\bar{\delta}\_{.j}$ for each question.


**Table 2** – Values $\delta\_{ij}$ obtained for student teachers and questions in the twenty-four school classrooms.

Different levels of reliability characterize the twelve questions. Questions 2 and 10 have high values of the average index (0.86 and 0.79, respectively), which means that the pupils usually agree in their responses (in several classrooms $\delta\_{ij} = 1$). On the contrary, questions 6 and 9 have low values of the average index (0.39 and 0.43, respectively), which means that the pupils frequently have different opinions about the aspects of the teacher's behaviour considered in the two questions. The remaining questions show low to moderate levels of agreement in the pupils' responses (average values between 0.48 and 0.69).

It is also interesting to analyse the values of the index with respect to each student teacher (rows of the matrix in Table 2). For instance, student teachers 10, 14, 19 and 21 usually have high levels of agreement between the pupils' responses in the twelve questions, whereas student teacher 20 has low values of agreement except for questions 2 and 10.
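The column-wise and row-wise reading of the matrix can be sketched in a few lines of Python. The values below are invented and are not those of Table 2; the function name `agreement_label`, the treatment of values falling exactly on a threshold, and the use of row averages as a summary for each student teacher are assumptions made for illustration.

```python
from statistics import mean


def agreement_label(value):
    """Label based on the thresholds of Section 2 (boundary handling is an assumption)."""
    if value < 0.6:
        return "low to moderate"
    if value <= 0.8:
        return "good"
    return "excellent"


# Invented values, NOT those of Table 2: rows are teachers, columns are questions.
delta = [
    [1.00, 0.45, 0.70],
    [0.90, 0.30, 0.65],
    [0.80, 0.50, 0.55],
]
column_means = [mean(col) for col in zip(*delta)]   # one average per question
row_means = [mean(row) for row in delta]            # one average per teacher (added summary)
for j, m in enumerate(column_means, start=1):
    print(f"Q{j}: {m:.2f} ({agreement_label(m)})")
```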

Model (1) was applied to analyse in a diagram the relationships between student teachers and questions. It is assumed that $p\_{ij} = 1 - \delta\_{ij}$ in model (1), which means that the distances decrease as the values $\delta\_{ij}$ increase.

In Figure 1, the solution for *t* = 2 dimensions is provided (*Stress-I* = 0.29). Distances between student teachers and questions represent the level of agreement of the responses for the questions in the classroom (the lower the distance, the higher the agreement). Question 2, question 10 and, to a lesser extent, question 1 are located in the centre of the diagram, close to many points representing teachers, because they usually have high levels of agreement in the responses of the pupils in the school classrooms. Questions 6, 9 and 8 have high heterogeneity in many cases, so they are positioned far apart from many student teachers. Considering the student teachers, we observe that student teacher 20 is far from most questions because she usually has low values of agreement for the ratings obtained in her classroom. On the contrary, student teachers 10, 14 and 21 are near the centre of the diagram and close to many questions, a consequence of the homogeneity of ratings obtained on many questions.

**Figure 1:** Unfolding of the values $\delta\_{ij}$ for student teachers (empty circles) and questions (filled black circles) in Table 2 (the higher $\delta\_{ij}$, the smaller the distance)

# **4. Conclusion**

A descriptive approach has been presented for the analysis of the agreement in ratings given to a group of targets, where each target is evaluated by a different group of raters. An index of interrater agreement defined at the single target level is proposed for ratings given on an ordinal scale, in a manner similar to the definition of the $r\_{WG,i}$ index for ratings on a quantitative scale. Besides, a measure of agreement for the whole group of targets is obtained as the average of the target-specific values. The index presents some advantages with respect to the methods based on ANOVA mean squares, like the intraclass correlation, and with respect to many *kappa*-type indices. Moreover, when the index is computed for a group of targets and several questions, it is shown that an unfolding model makes it possible to analyse in a diagram the matrix of the values of the index obtained for each target-question pair.

The index proposed is mainly considered as a measure of size of the interrater agreement, therefore developments of this research may concern: 1) an accurate definition of reliable thresholds useful for the interpretation of the level of agreement in the applications; 2) the study of the sampling properties of the index.

# **References**

