**Misinformation and Disinformation in Statistical Methodology for Social Sciences: causes, consequences, and remedies**

#### Giulio Giacomo Cantone <sup>a</sup>, Venera Tomaselli <sup>b</sup>

<sup>a</sup> Department of Physics and Astronomy "E. Majorana", University of Catania, Catania, IT;

<sup>b</sup> Department of Political and Social Sciences, University of Catania, Catania, IT

# 1 Introduction: the replicability of the Social Sciences

This paper concerns the prevalence and the causes of low replication rates in the Social Sciences. The aim is to frame unintentional errors as scientific misinformation, and questionable research practices as disinformation. Section 3 presents Multiverse Analysis, which helps to assess the uncertainty about scientific claims and to reduce false discoveries.

To introduce the topic of replication rates in Science, it is important to clarify the epistemological conditions under which a scientific result can be claimed to be replicated:


A replicated scientific theory is a collection of connected claims that are, for the most part, individually replicated (Lakatos, 1976; Schmidt, 2009). A replication rate is the rate of replicated results given a grouping variable: an author, an institution, or a scientific field. High replication rates are observed in the exact sciences. Often, these replications are implicit: after a few successful experiments, a scientific theory is applied to more complex theories or technologies. The application of a theory is an implicit process of scientific replication (Feigenbaum and Levy, 1996). The methods of the Social Sciences are not exact but probabilistic and harder to reproduce (e.g. due to changes in society), and their applications to social policies are more nuanced than the vertical integration of the natural sciences into technology.

Claimed causal effects in the Social Sciences are often just statistical artifacts. Even meta-analyses are biased by so-called 'publication bias' (Nissen et al., 2016). Indeed, it has been empirically demonstrated that non-significant estimates are less likely to be published in scientific venues (van Zwet and Cator, 2021). Prof. Breznau's research group provided the same dataset to 73 independent teams of quantitative social scientists, for a total of 161 people, and asked them to estimate the effect of immigration rates on public support for a welfare-oriented political agenda. A sample of n > 1,200 estimates of the effect was drawn through this survey. Of the estimates, 25% were significantly negative, 17% were significantly positive, and in 57.7% of cases the specified model failed to reject the null hypothesis (Breznau et al., 2022). Strikingly, based on this result, it is almost impossible not only to claim that a general effect exists but even to fully deny it, because it is always possible to assert that an effect holds under specific conditions.

The U.S. Defense Advanced Research Projects Agency (DARPA) recognised the problems of traditional approaches to Meta-Analysis and Causal Inference and launched the Systematizing Confidence in Open Research and Evidence (SCORE) Project to understand how to predict whether a study will fail to replicate. Preliminary findings have not been rosy: with the exception of Economics, social scientists believe that their own fields produce more non-replicable claims than replicable ones, i.e. there are more false discoveries than not. Economics seems to suffer from overconfidence in itself (Gordon et al., 2020). These results came after a large study led by Brian Nosek that attempted to replicate 100 claims in Psychology journals: less than half passed a replication attempt (Open Science Collaboration, 2015). Journals with high bibliometric scores do not perform better than other sources: the evidence points to zero or negative correlation between bibliometric performance (e.g. journal impact factor) and replication rates (Szucs and Ioannidis, 2017; Brembs, 2018; Camerer et al., 2018).

Giulio Giacomo Cantone, University of Catania, Italy, giulio.cantone@phd.unict.it, 0000-0001-7149-5213

Venera Tomaselli, University of Catania, Italy, venera.tomaselli@unict.it, 0000-0002-2287-7343

Referee List (DOI 10.36253/fup\_referee\_list)

FUP Best Practice in Scholarly Publishing (DOI 10.36253/fup\_best\_practice)

Giulio Giacomo Cantone, Venera Tomaselli, *Misinformation and Disinformation in Statistical Methodology for Social Sciences: causes, consequences, and remedies*, © Author(s), CC BY 4.0, DOI 10.36253/979-12-215-0106-3.10, in Enrico di Bella, Luigi Fabbris, Corrado Lagazio (edited by), *ASA 2022 Data-Driven Decision Making. Book of short papers*, pp. 53-58, 2023, published by Firenze University Press and Genova University Press, ISBN 979-12-215-0106-3, DOI 10.36253/979-12-215-0106-3

# 2 Misinformation and disinformation

Ioannidis (2005) summarised the predictors of low replication rates: small sample sizes, small effect sizes, and more than one hypothesis being tested on the same sample. On top of this, he stresses the incentives to look for novel findings instead of conducting replication studies. He claims that papers on new theories are always more cited than their replication attempts, even when replication is not attained! This is a case of misinformation: inaccurate claims spread more than their corrections. Disinformation is a distinct phenomenon, where false claims are justified through a process of fabrication (West and Bergstrom, 2021). It is not necessary to report *fake data* to fabricate a fake result. The insidious alternative is to *omit* observed results. This behaviour is called "hacking the science" in the scientific community, by analogy with the method of *brute-forcing* many random combinations of inputs until a single desired outcome is achieved by chance, e.g. hacking a password (Imbens, 2021).

#### 2.1 Misinformation: is the Dunning-Kruger effect a statistical artifact?

It is commonly observed that the correlation between performance and self-assessment of performance is significantly negative. Since performance depends on skill, the theory of the Dunning-Kruger effect (Kruger and Dunning, 1999), or DK, explains this correlation through the claim that unskilled people tend to overestimate their own skills. The original study, with more than 8,000 citations, is foundational for modern Pedagogy. A competitor to DK is the "better than average" theory (Krueger and Mueller, 2002), or BTA. It claims that all people tend to assess their own skills as above average, independently of their actual skill. These two theories can coexist, but if BTA is true, then the DK effect is overestimated.

Consider the conservative case of two actors: one with a true skill score x<sub>1</sub> = 40 and the other with a true skill score x<sub>2</sub> = 60. Their average is x̄ = 50. Assume the claim of BTA: actor 1 and actor 2 have exactly the same model of self-assessment, that is, they adopt the average plus an expected positive error ϵ<sup>+</sup>. In this case, it holds that

$$\left|x_1 - \left(\bar{x} + \epsilon^+\right)\right| > \left|x_2 - \left(\bar{x} + \epsilon^+\right)\right|, \quad \forall \epsilon^+ > 0 \tag{1}$$

where |x − (x̄ + ϵ<sup>+</sup>)| is the absolute error between true skill and self-assessed skill. It follows that, even with absolutely no cognitive differences between classes of actors (i.e. ϵ<sup>+</sup> is the same across actors), the less skilled actor has a larger absolute deviation. In this case, even if DK is not true, the parameter ϵ<sup>+</sup> would induce a negative correlation between true skill and the deviation of the self-assessment. With a few generalisations it can be shown that any model that sets the self-assessed score to µ<sub>X</sub> + ϵ<sup>+</sup>, for any X = {x<sub>1</sub>, x<sub>2</sub>, x<sub>3</sub>, ..., x<sub>n</sub>}, would lead to an artificial DK effect, even when DK is not true. The effect would hold even for a normally distributed positive ϵ<sup>+</sup><sub>actor</sub>.
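The argument above can be checked numerically. The following is a minimal sketch, not the analysis of any cited study: it assumes skills drawn from N(50, 10) and an error ϵ<sup>+</sup> drawn from N(5, 2) independently of skill. All of these numbers, and the names `simulate_bta` and `pearson`, are illustrative assumptions.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def simulate_bta(n=1000, eps_mean=5.0, eps_sd=2.0, seed=1):
    """Simulate a pure BTA world: nobody's bias depends on their own skill."""
    rng = random.Random(seed)
    skill = [rng.gauss(50, 10) for _ in range(n)]      # true scores x_i (assumed)
    mean_skill = statistics.fmean(skill)                # mu_X
    # Every actor reports the group mean plus a positive-mean error,
    # drawn independently of the actor's own skill (assumed N(5, 2)).
    self_score = [mean_skill + rng.gauss(eps_mean, eps_sd) for _ in range(n)]
    overestimate = [s - x for s, x in zip(self_score, skill)]
    return pearson(skill, overestimate)


# Two-actor case from the text: x1 = 40, x2 = 60, average 50.
for eps in (1, 5, 20):
    assert abs(40 - (50 + eps)) > abs(60 - (50 + eps))

r = simulate_bta()
print(f"correlation(skill, overestimation) = {r:.2f}")
```

Under these assumptions the correlation between skill and overestimation comes out strongly negative although no actor's bias depends on skill, which is the artificial DK pattern described in the text.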

A meta-analytical study that adopted advanced statistical techniques found that, given the observed scores in the literature, DK is likely to be a statistical artifact due to BTA (Gignac and Zajenkowski, 2020). Another study reports only partial support for a true DK effect while confirming BTA (Jansen et al., 2021). Here no information has been concealed or fabricated. The original authors did not adopt any questionable research practices. They lacked the correct specification of their null model.

#### 2.2 Disinformation: six degrees of separation and even more

The expression "small world" refers to a network where a part of the connections is formed with uniform probability, and another part is formed with a higher probability so as to create triadic closures (fully connected triangles of nodes). As an emergent property, small-world networks have a "characteristic average path length" L: from any given node in the network, any other node can be reached by crossing paths with an expected length equal to L, independently of the number of nodes in the network.

The formation and structure of small-world networks have been described in the Watts-Strogatz model (Watts and Strogatz, 1998), but the description of this kind of network goes back to Milgram (1967). Indeed, the implicit claim of Milgram is that in modern (pre-Internet) societies there is a characteristic path length L between human connections and that L is relatively short. Curiously, the paper with the experiment that originated the catchphrase "six degrees of separation" (Travers and Milgram, 1969) was published only 2 years after a theoretical paper (Milgram, 1967) claiming the emergence of L in human societies. Together, the two papers have collected more than 13,000 citations and, a rare case for a social science theory, they have inspired new ideas not only in business (marketing, etc.) but also in engineering (transport, etc.).
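To make the notion of a short characteristic path length concrete, here is a minimal sketch in the spirit of the Watts-Strogatz construction: a ring lattice in which a fraction of the edges is rewired at random. It is not the original authors' code; the network size (200 nodes, 4 neighbours each), the rewiring probability 0.1, and the helper names are illustrative assumptions.

```python
import random
from collections import deque


def ring_lattice(n, k):
    """Undirected ring where each node is linked to its k nearest neighbours."""
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for step in range(1, k // 2 + 1):
            j = (i + step) % n
            adj[i].add(j)
            adj[j].add(i)
    return adj


def rewire(adj, p, rng):
    """With probability p, replace edge (i, j) with (i, random other node)."""
    n = len(adj)
    edges = [(i, j) for i in adj for j in adj[i] if i < j]
    for i, j in edges:
        if rng.random() < p:
            candidates = [t for t in range(n) if t != i and t not in adj[i]]
            if candidates:
                new = rng.choice(candidates)
                adj[i].discard(j)
                adj[j].discard(i)
                adj[i].add(new)
                adj[new].add(i)
    return adj


def average_path_length(adj):
    """Mean shortest-path length over all reachable ordered pairs (BFS)."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


rng = random.Random(3)
l_lattice = average_path_length(ring_lattice(200, 4))
l_small_world = average_path_length(rewire(ring_lattice(200, 4), 0.1, rng))
print(f"ring lattice: {l_lattice:.2f}, after rewiring: {l_small_world:.2f}")
```

In this sketch a handful of random shortcuts is enough to pull the average path length well below that of the regular lattice, while most local triangles are left intact.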

It was a surprise for Judith Kleinfeld (2002) to discover that the paper presenting the actual report of the *in vivo* experiment on the theory (Travers and Milgram, 1969) is actually poor in terms of statistical results. 296 participants were recruited for the study. Their task was to send a document to one of their pre-existing social ties, with the final aim that this document would reach a specific male broker in Boston. These 296 participants were sampled across three populations: non-brokers in Nebraska, brokers in Nebraska, and brokers in Boston.

This stratification would have been helpful if enough documents had reached their final destination: only 214 of the original participants sent the document, and only 64 documents reached the broker in Boston, after s stages. Among these 64, the observed average path length was l = 5.2. The territorial variable was the only statistically significant one. The number 6 (degrees of separation) is never explicitly mentioned; however, in footnote 4 the authors mention that they adjusted l through a marginal distribution, not further specified, of the probabilities of reaching the final node at stage s + 1 (see parameter Q<sub>i</sub>). In the same footnote, they claim a confidence interval for L between 5 and 7. Is there sufficient evidence for claiming that L exists? From the sample of non-brokers from Nebraska, only 18 documents reached the destination, with l = 5.7. This result could be generalised to the U.S. population, but the sample size would be small.

Kleinfeld (2002) investigated Milgram's archives, looking for more. She only found concerning details:


#### 2.3 *p*-hacking

The first case study falls under the category of 'misinformation within science' because it concerns how the reputation of theories spreads within science even when a new model has been proven more consistent. The second case study is different: researchers concealed results from their own research because these were inconclusive with respect to their hypothesis. This is related to so-called p-hacking of the significance level α for the rejection of the null hypothesis in statistical testing. p-hacking is fraudulent because it omits the number of tests attempted before reaching a statistically significant result in data analysis (Simmons et al., 2011; Head et al., 2015). p-hacking is typically done in two ways:

1. Parallel p-hacking: many tests are run on different samples of the same population. Each sample has a minimal size but is large enough to be deemed credible by the typical reader. Once a positive outcome is seen, no further test is necessary. In the reported result of the study, the number of tested samples is omitted and only the one associated with p < α is reported. As a reference: if the true effect size is equal to 0, so that the null hypothesis of the test is true, then with α = .05, after 14 tests (Bernoulli trials of parameter α), the probability of seeing p < α in at least one test is

$$\sum_{k=1}^{14} \alpha \cdot (1 - \alpha)^{k-1} > .51 \tag{2}$$

following the geometric distribution of the Bernoulli trials<sup>1</sup>.

2. Sequential p-hacking: a multivariate dataset is collected and a hypothesis is formalised with a simple model. If the statistics of the model are not significant, then the specification of the model is trivially adjusted (e.g., control variables are added to the model, outliers are removed, data is pre-processed differently, etc.) until p < α is achieved by chance. None of these operations is reported. This is a fraudulent type of Hypothesising After Results are Known, or HARKing (Rubin, 2017).
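The probability in Eq. (2) can be verified both in closed form and by simulation. The sketch below covers only the parallel case, assuming independent tests. It uses a two-sided z-test with known variance on samples of size 30 drawn from N(0, 1). The test, the sample size, the number of simulated studies, and the function names are illustrative assumptions rather than part of the original argument.

```python
import random
from statistics import NormalDist, fmean

ALPHA = 0.05


def prob_at_least_one_hit(n_tests, alpha=ALPHA):
    """Closed form of the geometric sum in Eq. (2): 1 - (1 - alpha)^n_tests."""
    return 1 - (1 - alpha) ** n_tests


def z_test_p_value(sample, sigma=1.0):
    """Two-sided z-test of H0: mean = 0, with known standard deviation."""
    z = fmean(sample) / (sigma / len(sample) ** 0.5)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def parallel_p_hacking(n_tests=14, sample_size=30, n_studies=2000, seed=7):
    """Share of simulated studies that find p < alpha although the effect is 0."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        for _ in range(n_tests):
            sample = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
            if z_test_p_value(sample) < ALPHA:
                hits += 1          # stop at the first "significant" sample
                break
    return hits / n_studies


analytic = prob_at_least_one_hit(14)
simulated = parallel_p_hacking()
print(f"analytic: {analytic:.3f}, simulated: {simulated:.3f}")
```

The closed form follows because the sum in Eq. (2) is the complement of 14 consecutive non-rejections, (1 − α)<sup>14</sup>.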

# 3 Remedies: pre-registration and Multiverse Analysis

A possible remedy for *science hacking* is pre-registration, that is, recording in a dedicated electronic archive an anonymous manuscript that details all the research questions and the methods of upcoming research. This happens before the data collection, so that during peer review authors can certify that their analysis is coherent with the original research design and that hypotheses were not drawn after knowing the sample statistics (Nosek et al., 2018). Pre-registration has two problems: (i) nothing prevents p-hacking a result, pre-registering its specification, then submitting the complete manuscript for peer review (Yamada, 2018); (ii) it does not allow serendipitous discoveries that are incoherent with what is pre-registered (Simmons et al., 2021).

The crowd-sourced estimation in Breznau et al. (2022) is akin to a meta-analytical paradigm called Multiverse Analysis. Gelman and Loken (2014) popularised the assumption that the robustness of a scientific model can be estimated by trivially altering its specification. They call "degrees of freedom of the researcher" the analytical choices in data analysis, e.g. the choice of a link function in binomial regression between *logit* and *probit*. These degrees of freedom are a source of errors in estimation. Steegen et al. (2016) introduced the concept of the "multiverse" of a scientific claim.

In particular, claims are formalised into models. Assuming that a *true parameter* θ of the model exists, then, given a dataset, there exists a set Θ<sub>j</sub> = {θ̂<sub>j</sub>} of estimates from different j-specifications of the model such that each estimate θ̂<sub>j</sub> is sufficiently close to θ and E(θ̂<sub>j</sub>) = θ holds. How can one draw a sample that is representative of Θ<sub>j</sub> in order to ascertain the uncertainty associated with the error of misspecification (model error)? Crowd-sourced estimation (Breznau et al., 2022) draws a random sample of specifications and estimates just by surveying experts. Instead, Multiverse Analysis draws a systematic (not random) sample Ĵ of specifications by mapping all the degrees of freedom of the researcher (e.g. inclusion/exclusion of control variables, operations in data pre-processing, modelling choices for overdispersion, etc.) and combining them into Ĵ, that is, the multiversal sample of specifications or just the "multiverse".

<sup>1</sup>The equivalent command in the R language is pgeom(13,.05).

Multiverse Analysis assumes that measures of variability in the observed multiversal estimates θ̂<sub>j</sub>, j ∈ Ĵ, are as informative as, if not more informative than, parametric or bootstrapped standard errors or confidence intervals about the uncertainty involved in the estimation of θ (Young and Holsteen, 2017; Simonsohn et al., 2020). An interesting application of Multiverse Analysis is checking for the Janus effect (Patel et al., 2015), which occurs when statistically significant θ̂<sub>j</sub> with different signs co-exist in the same multiverse. The Janus effect is a red flag, in the sample, of the so-called parametric type S error (Gelman and Tuerlinckx, 2000).
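The construction of a multiverse and the check for a Janus effect can be sketched on synthetic data. This is a toy illustration, not the procedure of any cited paper. It assumes a data-generating process with one confounder `z`. It uses two researcher degrees of freedom, namely whether to control for `z` and the outlier-trimming threshold, and it takes p-values from a normal approximation. All names and numbers are hypothetical.

```python
import itertools
import random
from statistics import NormalDist, fmean, stdev


def ols(x, y):
    """Simple OLS of y on x: return slope, residuals and sum of squares of x."""
    mx, my = fmean(x), fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    resid = [(b - my) - slope * (a - mx) for a, b in zip(x, y)]
    return slope, resid, sxx


def estimate(data, use_control, trim_sd):
    """Estimate theta under one specification; return (theta_hat, p_value)."""
    x, y, z = data
    if trim_sd is not None:                     # pre-processing choice
        m, s = fmean(y), stdev(y)
        keep = [i for i, v in enumerate(y) if abs(v - m) <= trim_sd * s]
        x, y, z = ([v[i] for i in keep] for v in (x, y, z))
    n_params = 2
    if use_control:                             # modelling choice
        # Frisch-Waugh-Lovell: partial z out of both x and y.
        _, x, _ = ols(z, x)
        _, y, _ = ols(z, y)
        n_params = 3
    theta_hat, resid, sxx = ols(x, y)
    se = (sum(e * e for e in resid) / (len(y) - n_params) / sxx) ** 0.5
    p_value = 2 * (1 - NormalDist().cdf(abs(theta_hat / se)))  # normal approx.
    return theta_hat, p_value


# Synthetic data (assumed): true effect of x on y is -0.25, z confounds it.
rng = random.Random(42)
n = 1000
z = [rng.gauss(0, 1) for _ in range(n)]
x = [0.8 * zi + rng.gauss(0, 1) for zi in z]
y = [-0.25 * xi + zi + rng.gauss(0, 1) for xi, zi in zip(x, z)]

# The multiverse: every combination of the mapped degrees of freedom.
multiverse = list(itertools.product([False, True], [None, 2.0, 3.0]))
results = [estimate((x, y, z), c, t) for c, t in multiverse]
spread = stdev(theta for theta, _ in results)
alpha = 0.05
janus = (any(t > 0 and p < alpha for t, p in results)
         and any(t < 0 and p < alpha for t, p in results))

for (c, t), (theta, p) in zip(multiverse, results):
    print(f"control={c!s:5} trim={t!s:4} theta_hat={theta:+.3f} p={p:.3g}")
print(f"spread of estimates = {spread:.3f}, Janus effect: {janus}")
```

In this toy multiverse the sign of the significant estimate depends on whether the confounder is controlled for, so the Janus check flags the claim. The spread of the estimates across specifications is reported alongside it as a crude measure of model uncertainty.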

## References


Kleinfeld, J. S. (2002). The small world problem. *Society*, 39(2):61–66.

Krueger, J. and Mueller, R. A. (2002). Unskilled, unaware, or both? The better-than-average heuristic and statistical regression predict errors in estimates of own performance. *Journal of Personality and Social Psychology*, 82(2):180.

