#### Venera Tomaselli<sup>a</sup>, Giulio Giacomo Cantone<sup>b</sup>

**Multipoint** *vs* **slider: a protocol for experiments**

<sup>a</sup> Department of Political and Social Sciences, University of Catania, Italy
<sup>b</sup> Department of Physics and Astronomy "Ettore Majorana", University of Catania, Italy

### **1. Introduction**

Since the 1990s, in all fields employing survey tools to collect data from a sample of a target population, computer-assisted data-recording technologies have replaced the old *paper-&-pen*. The speed of this technological shift, however, was not matched by methodological innovation.

Multipoint scales, indeed, are still among the most widely employed numerical (or semantic) supports for variables in psychological, health, and socio-economic research, and even in engineering (e.g., user experience design). With the spread of 'Big Data', an old issue in statistical measurement has gained new relevance. It can be summarised briefly: vast amounts of Big Data from self-reports of taste and perception are recorded every day. While these data are reported through multipoint scales, almost all the relevant inferences are made through families of methods with parametric assumptions, for example, one of the best-known methodologies for inferring human preferences through the analysis of similarity, *collaborative filtering* (Kluver, Ekstrand, and Konstan 2018).

The debate about the plausibility of estimating a central value for ordinal variables (which is the core of the debate about parametric methods for the analysis of 'ratings') is well summarised by Velleman and Wilkinson (1993). Kampen and Swyngedouw (2000) expanded the issue, relating it to the consequent debate about derived measures of association and correlation among variables (see also Agresti 2010). Tomaselli and Cantone (2020) highlighted a more recent issue in data analysis: when the number of items compared (e.g., in a ranking) greatly exceeds the number of categories of the supporting ordinal scale, the comparison is made impossible by the large number of ties. Therefore, statistics constrained to the support scale (e.g., the median) are unfeasible for indexing distributions from very large samples, or populations. This problem of ranking statistics can be interpreted as an extreme case of 'ceiling effect' (Austin and Brunner 2003).

Slider scales, a technological advancement not previously available in *paper-&-pen* surveys but now enabled by web survey tools, can overcome the issues of ordinal scales. A slider scale ('slider') is a bar representing a visually continuous segment of numerical points from 1 to *m* (sometimes from 0 to *m*, or from -*m* to *m*). While the number of points is finite, for any analytical purpose this measurement is considered continuous, not ordinal; therefore *m* should not be a small number. A very common case is *m* = 100.

The respondent moves an indicator ('it slides') across the values of the bar. If the bar is drawn on paper, as in the case of Visual Analogue Scales (VAS), the respondent can only place a mark on the bar. The estimate from a VAS may be considered continuous, and more accurate than multipoint scales (Voutilainen *et al*. 2016), but the value is technically harder to record. For years the absence of proper computing, visualising, and recording technologies constrained the development of statistical science. Could multipoint and Likert scales be considered obsolete because they were designed for *paper-&-pen* data collection? Results from Fryer and Nakao (2020) support this thesis, while a web experiment by Funke (2015) criticises sliders. Other results (see Roster, Lucianetti, and Albaum 2015; Bosch *et al*. 2018) bring further arguments to the evaluation of sliders, in particular reporting longer task-completion times. A comprehensive review of the debate is provided by Chyung *et al*. (2018).

Matejka *et al*. (2016) performed an experiment testing the accuracy of sliders compared to a Likert scale and the impact of percentage marks ('ticks') on the bar of sliders. Participants (n = 2000) were recruited through Amazon's *Mechanical Turk* service. Participants were asked to estimate the blackness of a shade of grey through sliders or Likert scales. Results show that sliders without ticks perform better in both accuracy of judgement and bias reduction. Although the authors do not mention it directly, the bias observed in their results is consistent with the psychological phenomenon of *heaping*, a connection rarely made explicit (an exception: Couper *et al*. 2006).

Venera Tomaselli, Giulio Giacomo Cantone, *Multipoint vs slider: a protocol for experiments*, pp. 91-96, © 2021 Author(s), CC BY 4.0 International, DOI 10.36253/978-88-5518-304-8.19, in Bruno Bertaccini, Luigi Fabbris, Alessandra Petrucci, *ASA 2021 Statistics and Information Systems for Policy Evaluation. Book of short papers of the opening conference*, © 2021 Author(s), content CC BY 4.0 International, metadata CC0 1.0 Universal, published by Firenze University Press (www.fupress.com), ISSN 2704-5846 (online), ISBN 978-88-5518-304-8 (PDF), DOI 10.36253/978-88-5518-304-8

Venera Tomaselli, University of Catania, Italy, venera.tomaselli@unict.it, 0000-0002-2287-7343

Giulio Giacomo Cantone, University of Catania, Italy, giulio.cantone@phd.unict.it, 0000-0001-7149-5213

Monitoring heaping effects is important because, while in scales with ticks heaping is due to psychological attraction to the marked values, there is evidence that heaping is also related to fabricated data in data collections (Finn and Ranchos 2015).

### **2. Experimental protocol**

The sample of respondents is recruited through an open web procedure, such as the aforementioned Mechanical Turk. The survey tool is therefore a website. The data collection process is segmented into three phases. After completion of the 1st phase, a new record is added to a connected database, while the 2nd and 3rd phases add further data to the record.

In the 1st phase participants are randomly assigned to two treatment groups. Both groups are assigned a task or 'trial': they have to estimate the colour of a square. The trial is repeated 10 times. The treatment difference between the two groups is that the control group estimates the colour through a 0-10 multipoint scale, while the experimental group estimates it through a 0-100 slider bar.

As shown in Matejka *et al*. (2016), estimating shades of colours through a sequence of trials is among the best tasks for the objective evaluation of measurement tools (i.e., scales). Instead of presenting respondents with 50 fixed shades of grey squares, we propose a random generator of shades of Red and Blue. A Yellow square is superimposed with an opacity randomly distributed between 0% and 10%. Therefore, any randomly coloured square is a realisation of the combination of: (i) a randomly generated shade parameter ξ, uniformly distributed between 0% (full Red) and 100% (full Blue), and (ii) a randomly generated noise parameter ζ, uniformly distributed between 0% and 10%.
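As a minimal sketch, the stimulus generator described above could be implemented as follows. The linear Red-to-Blue RGB interpolation is our assumption for illustration, since the protocol does not prescribe a specific colour model:

```python
import random

def generate_stimulus():
    """Draw one random square for a trial: a shade parameter xi
    (0% = full Red, 100% = full Blue) and a yellow-overlay opacity
    zeta uniformly distributed in [0%, 10%]."""
    xi = random.uniform(0.0, 100.0)   # shade parameter (i)
    zeta = random.uniform(0.0, 10.0)  # controlled-noise parameter (ii)
    # Assumed linear Red-Blue interpolation as an RGB triple,
    # before the yellow overlay is applied client-side.
    base_rgb = (round(255 * (1 - xi / 100)), 0, round(255 * xi / 100))
    return {"xi": xi, "zeta": zeta, "rgb": base_rgb}
```

In use, the server would call `generate_stimulus()` once per trial and store ξ and ζ alongside the participant's later estimate.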

In the 1st phase participants are asked to estimate only the shade, with opacity acting as a possible factor of controlled noise. In the original experiment of Matejka *et al*. there was no mechanism to control noise in the estimation process, even though the authors acknowledged that differences in participants' devices were likely factors of noise outside experimental control. Another difference from Matejka *et al*. is that participants should be free to refuse to complete any trial. The default option in the multipoint scale, signalled through a button below (not adjacent to) the scale, is 'no answer'. The closest equivalent for a slider is to keep the indicator invisible on the bar until the first interaction, with a 'no answer' button to remove it again; this avoids inflating heaping bias towards the initial position of the indicator (Liu and Conrad 2018). In this case, if the respondent avoids interacting with the slider, a 'no answer' is recorded.

The software must record not only the final choice of the participant but also every single interaction with the tool, tracing the decisional process. Continuous sliders are very well suited to this tracing because there is a large support of values to pick from.

When a participant completes the 1st phase, the data recorded are: (i) the randomly generated shade parameter ξ for the 10 trials; (ii) the randomly generated opacity parameter ζ for the 10 trials; (iii) the participant's estimates *x* for the 10 shades; (iv) the time of completion *t*<sup>x</sup> for each of the 10 trials; (v) the number of clicks *k*<sup>x</sup> for each of the 10 trials.
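The record structure above can be sketched as a pair of dataclasses. The field and class names are our own illustrative choices, not part of the protocol; a 'no answer' is represented here as `None`:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class TrialRecord:
    """One 1st-phase trial: items (i)-(v) of the record."""
    xi: float            # (i) generated shade parameter
    zeta: float          # (ii) generated opacity parameter
    x: Optional[float]   # (iii) participant's estimate; None = 'no answer'
    t_x: float           # (iv) completion time, e.g. in seconds
    k_x: int             # (v) number of clicks

@dataclass
class ParticipantRecord:
    """The database record created after the 1st phase; the 2nd and
    3rd phases append further fields to the same record."""
    participant_id: str
    group: str           # 'multipoint' (control) or 'slider' (experimental)
    trials: List[TrialRecord] = field(default_factory=list)
```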

In the 2nd phase participants are asked to report their *taste* response to 10 well-known leisure products, rating them through the scale of their 1st-phase treatment group. When the participant completes the 2nd phase, further information is added to the record: (vi) the participant's rating *r* for each of the 10 products; (vii) the time of completion *t*<sup>r</sup> for each of the 10 ratings; (viii) the number of clicks *k*<sup>r</sup> for each of the 10 ratings. If the rating process is interrupted, no data is added to the record.

In the 3rd phase standard demographic variables are collected from participants, provided they give consent.

### **3. Methods of data analysis**

*Heaping* is a relevant bias in applied statistical studies on scales of measurement. Even though they do not mention heaping directly, the statistic adopted by Matejka *et al*. (2016) to measure it is a normalised score of the mean deviation of the observed frequency differences between adjacent values from their expected difference:

$$\sqrt{\frac{\sum_{x \in M} \left( (n_{x} - n_{x-1}) - \frac{\sum_{x \in M} (n_{x} - n_{x-1})}{|M|} \right)^{2}}{|M|}} \tag{1}$$

where |*M*| is the cardinality of the support, *x* is the observed value on the scale *M*, and *n* is the absolute frequency associated with *x*<sup>1</sup>. Matejka *et al*. reported a heaping score of ~ 2 (± 0.1 at 95% CI) for sliders, while the introduction of 'ticks' that imitate multipoint scales in the slider significantly increases the heaping bias (Figure 1, see "no ticks"). The relation is not linear in the number of ticks.
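A short sketch of Eq. (1), assuming the input is the vector of absolute frequencies over the support: the score is the root-mean-square deviation of the adjacent-frequency differences from their mean, normalised by |*M*| as in (1):

```python
import numpy as np

def heaping_score(freqs):
    """Eq. (1): RMS deviation of the adjacent-frequency differences
    n_x - n_{x-1} from their mean, with |M| the support cardinality."""
    n = np.asarray(freqs, dtype=float)
    diffs = np.diff(n)            # n_x - n_{x-1} over adjacent values
    m = len(n)                    # |M|, as in the paper's normalisation
    return float(np.sqrt(np.sum((diffs - diffs.sum() / m) ** 2) / m))
```

A perfectly flat frequency profile scores 0, while alternating spikes on 'attractive' values inflate the score.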

**Figure 1.** Mean heaping scores for varying number of tick marks. Error bars show 95% CIs. (Matejka *et al*., p. 5)

We hypothesise that the control group (multipoint) induces more heaping than the experimental group (sliders).

Since the values (*x* for estimates of shades, *r* for ratings of products) from sliders and multipoint scales are constrained to a finite support, they can be normalised into a [0,1] interval. The distribution of errors ξ – *x* is the main statistic and is assumed to be normally distributed. A Shapiro-Wilk test is performed on the sample of ξ – *x* values of all the trials *per* group to confirm this assumption. Since the noise factors ζ are all sampled from the same population, we expect no significant difference in their distribution between groups. This assumption is tested through a Kolmogorov-Smirnov test; if it is violated, ξ – *x* values will be controlled *per* ζ. Times of completion *t*<sup>x</sup> are also assumed to be normally distributed, and this assumption is tested through a Shapiro-Wilk test.
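The distributional checks above can be sketched with `scipy.stats`; the function name and the dictionary of boolean verdicts are our own illustrative framing:

```python
import numpy as np
from scipy import stats

def distributional_checks(errors_a, errors_b, zeta_a, zeta_b, alpha=0.05):
    """Shapiro-Wilk on the normalised errors xi - x of each group, and
    a two-sample Kolmogorov-Smirnov test on the noise draws zeta."""
    sw_a = stats.shapiro(errors_a)
    sw_b = stats.shapiro(errors_b)
    ks = stats.ks_2samp(zeta_a, zeta_b)
    return {
        "errors_normal_a": bool(sw_a.pvalue > alpha),
        "errors_normal_b": bool(sw_b.pvalue > alpha),
        "zeta_same_distribution": bool(ks.pvalue > alpha),
    }
```

If `zeta_same_distribution` is `False`, the protocol prescribes controlling the ξ – *x* values *per* ζ before group comparisons.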

Null hypotheses on the *objective task* of shade estimation with random noise are:

i. sliders induce a distribution of mean absolute errors (MAE) from the randomised parameters over the 10 trials that is not superior to the multipoint scales' MAE.

Absolute errors | ξ – *x* | are never assumed to be normally distributed: if ξ – *x* values were normally distributed, then their absolute values would follow a half-normal (Folded Normal) distribution. Given the structure of the hypothesis, a non-parametric one-tailed test (i.e., the Mann-Whitney test) on the samples of participants' MAE in the two groups (one MAE *per* participant) seems suited to checking the hypothesis.

 <sup>1</sup> Is there an implicit consensus in statistical science on this measure? Roberts and Brewer (2001) provide two different approaches to measuring heaping: (i) H1 is technically only a minor improvement over (1), while (ii) C2 is based on the probability of observing local modes. The second approach raises issues about the confidence threshold needed to assert that an observed local mode is *likely a true* local mode and not local noise. For a modern approach to heaping models, see Zinn and Würbach (2015).


Correlations between the degree of controlled noise ζ, errors ξ – *x*, times of completion *t*<sup>x</sup>, and clicks *k*<sup>x</sup> are represented graphically through scatterplots and, if the fit is sufficiently good, summarised through a generalised model. The effect of noise on ξ – *x* is expected to be non-linear and possibly not even symmetrical around ξ – *x* = 0, although it may be symmetrical around a different value. Noise can similarly affect *t*<sup>x</sup> and *k*<sup>x</sup>, too.
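As one simple non-linear specification consistent with the expectations above, a quadratic fit of the errors on the noise parameter can be sketched; the polynomial form and the R² goodness-of-fit summary are our assumptions, not the protocol's prescribed model:

```python
import numpy as np

def fit_noise_effect(zeta, errors, degree=2):
    """Polynomial fit of the error xi - x on the controlled noise zeta:
    a minimal non-linear, possibly asymmetric specification.
    Returns the coefficients (highest degree first) and R^2."""
    zeta = np.asarray(zeta, dtype=float)
    errors = np.asarray(errors, dtype=float)
    coeffs = np.polyfit(zeta, errors, deg=degree)
    fitted = np.polyval(coeffs, zeta)
    ss_res = np.sum((errors - fitted) ** 2)
    ss_tot = np.sum((errors - np.mean(errors)) ** 2)
    return coeffs, float(1 - ss_res / ss_tot)
```

The same sketch applies to *t*<sup>x</sup> and *k*<sup>x</sup> as response variables.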

Does the same structure of hypotheses A, B, and C hold for the measures collected in the 2nd phase? Since the 10 leisure products are chosen among well-known ones, a prior value *ρ* of expected taste can be elicited as an expected value computed from the rating statistics of online rating platforms. Although arguably biased for both small and large samples (Askalidis, Kim, and Malthouse 2017), these priors are likely the most reliable predictors of expected *taste*, at least for a population of subjects very interested in the product category<sup>2</sup>.

Even accounting for the aforementioned biases, the statistic *r* – *ρ* can be interpreted as a *deviation* of biased raters *vs.* randomised raters. Even if | *r* – *ρ* | and | ξ – *x* | are technically the same operation of *distance*, their arguments are conceptually distinct, as reflected in the order of the minuends and in the semantic difference between an *error* (there is always a true parameter ξ) and a *deviation* (two procedures evaluating the same *evaluando*). As a consequence, the hypotheses on *r* – *ρ* cannot be one-tailed. However, although *tastes* are not *objective*, hypotheses about differences in values, variances, and skewness among groups can still be asserted.
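The two-tailed group comparisons on *r* – *ρ* can be sketched with `scipy.stats`; the choice of Levene's test for variances is our assumption, as the protocol names the quantities to compare but not the specific tests:

```python
from scipy import stats

def compare_deviations(dev_slider, dev_multipoint):
    """Two-tailed comparisons of taste deviations r - rho between groups:
    location (Mann-Whitney), variance (Levene), and skewness per group."""
    location = stats.mannwhitneyu(dev_slider, dev_multipoint,
                                  alternative="two-sided")
    spread = stats.levene(dev_slider, dev_multipoint)
    return {
        "location_p": float(location.pvalue),
        "variance_p": float(spread.pvalue),
        "skew_slider": float(stats.skew(dev_slider)),
        "skew_multipoint": float(stats.skew(dev_multipoint)),
    }
```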

Moreover, means of *r* – *ρ* values can be both correlated and compared with paired (intra-participant) means of ξ – *x* values (controlled *per* ζ). Correlating and comparing times of completion (*t*<sup>x</sup> with *t*<sup>r</sup>) and clicks (*k*<sup>x</sup> with *k*<sup>r</sup>) is even less ambiguous, since each pair measures the same physical quantity. Differences and ratios between the two phases can be compared *per* group, too.

Finally, provided that the sample sizes of the demographics collected in the 3rd phase support it, associations between demographic variables and the aforementioned statistics can be assessed as a control procedure, although no causal explanation emerges from the literature on colour-perception trials.

### **4. Conclusions**

While this protocol partly replicates the experiment of Matejka *et al*. (2016), we propose some relevant improvements towards a general experimental protocol for web-based data collection and analysis of human perception and tastes.

 <sup>2</sup> For example, the rating platform *Letterboxd* reports that the movie *The Godfather* (directed by Francis F. Coppola, released in 1972) has received more than 300,000 ratings from raters all over the world. According to Lorenz (2006), even in the presence of local peaks, the best models to represent movies have only one location parameter, which the author interprets as "evidence of universality in processes of continuous opinion dynamics about taste" (p. 251).


So far, the major rationale for adopting sliders has stemmed from the theoretical debates mentioned in Section 1. For applied research, even in the absence of evidence of remarkable improvements (see hypotheses A, B, and C in Section 3) in the reduction of coarseness in data, inaccuracies of self-report, and biases through the adoption of sliders, the evidence that sliders reduce scale-induced heaping (Figure 1) is extremely insightful. Better measurement scales can minimise the confounding effect in research programmes that investigate data fabrication (i.e., fraud reports) through tests on heaping.

### **References**

Agresti A. (2010). *Analysis of Ordinal Categorical Data*, Wiley, Hoboken, (NJ).

