# Decomposing tourists' sentiment from raw NL text to assess customer satisfaction

#### Maurizio Romano<sup>a</sup>, Francesco Mola<sup>a</sup>, Claudio Conversano<sup>a</sup>

<sup>a</sup> Department of Business and Economics, University of Cagliari, Cagliari, Italy

## 1. Introduction

Starting from natural language text corpora, and considering data related to the same context, we define a process to extract the sentiment component through a numeric transformation. Since the Naïve Bayes model, despite its simplicity, is particularly useful in related tasks such as spam/ham identification, we created an improved version of Naïve Bayes for NLP tasks: the Threshold-based Naïve Bayes classifier (Romano et al. (2018); Conversano et al. (2019)).

The new version of the Naïve Bayes classifier has proven to be superior to the standard version and to the other most common classifiers. In the original Naïve Bayes classifier, we face two main problems:


## 2. The data

For this study, we collected two separate, but related, datasets from Booking.com and TripAdvisor.com. In more detail, with an ad hoc web-scraping Python program, we obtained from Booking.com data about:


Furthermore, for comparison purposes, we downloaded additional data from TripAdvisor.com:


## 3. The framework

Since the downloaded raw data is certainly not immediately usable for the analysis, we start with a data cleaning process.


We begin with a basic filtering of the words to remove the meaningless ones (i.e., stopwords). Next, we convert emoticons and emojis into words, and we reduce words to their root or base form (e.g., "fishing", "fished", "fisher" are all reduced to the stem "fish").
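A minimal sketch of this cleaning step, assuming NLTK's English stopword list and Porter stemmer as stand-ins for the tools actually used (the paper does not name them); the emoticon/emoji conversion is omitted here, but a library such as `emoji` (with its `demojize` function) can map emojis to words.

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download("stopwords", quiet=True)
nltk.download("punkt", quiet=True)

STOP_WORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def clean_text(text: str) -> list[str]:
    """Tokenize, drop stopwords and non-alphabetic tokens, then stem."""
    tokens = word_tokenize(text.lower())
    tokens = [t for t in tokens if t.isalpha() and t not in STOP_WORDS]
    return [STEMMER.stem(t) for t in tokens]

print(clean_text("The staff was friendly and we fished all day"))
```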

We use Word Embeddings to reduce the dimensionality of text data.

We recall a few fundamental concepts and terms, mostly related to the lexical database WordNet (Miller (1995)), to better understand the next steps:


Moreover, exploiting the hypernym properties, we adopt the word embeddings pre-trained by Google on news text with Word2Vec Skip-gram (Mikolov et al. (2013)) to obtain the vector representation of all the words in the dataset (after the data cleaning process). Finally, to complete the "merging words by their meaning" step, we use K-Means clustering.

As a result, a number λ of clusters is produced, and the centroid-word, i.e., the word closest to the centroid, replaces all the other words present in its cluster. In this way the model is trained on a Bag-of-Centroids (built from the clusters produced over the word-embedding representation of the dataset) in place of a general Bag-of-Words, as sketched below.
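A sketch of this step, assuming the publicly distributed `GoogleNews-vectors-negative300.bin` file, `gensim` for loading it, and scikit-learn's K-Means; the toy vocabulary and the value λ = 3 are illustrative only.

```python
import numpy as np
from gensim.models import KeyedVectors
from sklearn.cluster import KMeans

# Pre-trained Google News Skip-gram vectors (Mikolov et al., 2013)
wv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

# Toy cleaned vocabulary; in the framework this is the dataset's vocabulary
vocab = ["room", "bedroom", "suite", "staff", "receptionist",
         "waiter", "breakfast", "dinner", "meal"]
vocab = [w for w in vocab if w in wv]
X = np.stack([wv[w] for w in vocab])

lam = 3  # lambda: the number of clusters, estimated by CV (see below)
km = KMeans(n_clusters=lam, n_init=10, random_state=0).fit(X)

# The word nearest to each centroid replaces all the words in its cluster
centroid_word = {}
for k in range(lam):
    members = np.where(km.labels_ == k)[0]
    dists = np.linalg.norm(X[members] - km.cluster_centers_[k], axis=1)
    centroid_word[k] = vocab[members[np.argmin(dists)]]

# Bag-of-Centroids mapping: every word -> the centroid-word of its cluster
word_to_centroid = {w: centroid_word[km.labels_[i]] for i, w in enumerate(vocab)}
```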

The value of λ is estimated by cross-validation, selecting the value that yields the best accuracy (or other performance metrics) on a labelled dataset (e.g., the Booking.com or TripAdvisor data).
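A hypothetical sketch of this selection, where `pipeline_cv_accuracy` is a placeholder (not from the paper) for a routine that builds the Bag-of-Centroids with a given λ, trains the classifier, and returns the cross-validated accuracy on the labelled data:

```python
# `pipeline_cv_accuracy` is hypothetical: it should run the whole framework
# (cleaning -> Bag-of-Centroids with `lam` clusters -> classifier) and
# return, e.g., the 10-fold CV accuracy on a labelled dataset.
candidate_lambdas = [100, 250, 500, 1000, 2000]
best_lam = max(candidate_lambdas, key=lambda lam: pipeline_cv_accuracy(lam))
```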

Once the data is correctly cleaned and all the words with the same meaning are merged into a single one, it is finally possible to compute the overall sentiment score of each observation.

For this purpose, the lexical database SentiWordNet (Esuli and Sebastiani (2006)) provides the positive and the negative score of a given word. The sentiment score ($neg\_score - pos\_score$) determines the polarity of each word, so the overall score of a specific text (i.e., a comment, a review, a tweet) is defined as the average of the scores of all the words included in the parsed text.
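A sketch of this scoring step using SentiWordNet as exposed by NLTK; averaging a word's score over its synsets is our simplifying assumption, since the paper does not specify how a single synset is chosen.

```python
import nltk
from nltk.corpus import sentiwordnet as swn

nltk.download("sentiwordnet", quiet=True)
nltk.download("wordnet", quiet=True)

def word_score(word: str) -> float:
    """Sentiment score of a word: neg_score - pos_score, averaged over
    the word's synsets (the averaging is our assumption)."""
    synsets = list(swn.senti_synsets(word))
    if not synsets:
        return 0.0
    return sum(s.neg_score() - s.pos_score() for s in synsets) / len(synsets)

def overall_score(words: list[str]) -> float:
    """Overall score of a parsed text: the average of its word scores."""
    return sum(word_score(w) for w in words) / max(len(words), 1)

print(overall_score(["terrible", "room", "great", "staff"]))
```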

In this way, with this framework (Fig. 1) we create a temporary sentiment label by applying a simple threshold to the overall score produced above. This temporary label is the basis for training the Threshold-based Naïve Bayes classifier.

Figure 1: General Sentiment Decomposition framework

## 4. Threshold-based Naïve Bayes Classifier

Consider a natural language text corpus as a set of reviews *r* such that:

$$r_i = comment_{pos_i} \cup comment_{neg_i}$$

where $comment_{pos}$ ($comment_{neg}$) is the set of words (a.k.a. comments) composed of only positive (negative) sentences, and one of the two can be equal to $\emptyset$. The basic features of the Threshold-based Naïve Bayes classifier applied to the reviews' content are as follows. For a specific review *r* and for each word *w* ($w \in$ *Bag-of-Words*), we consider the log-odds ratio of *w*,

$$\begin{aligned} LOR(w) &= \log\left[\frac{P(c_{neg} \mid w)}{P(c_{pos} \mid w)}\right] \approx \\ &\approx \log\left[\frac{P(w \mid c_{neg})}{P(w \mid c_{pos})} \cdot \frac{P(\bar{w} \mid c_{neg})}{P(\bar{w} \mid c_{pos})} \cdot \frac{P(c_{neg})}{P(c_{pos})}\right] = \dots = \\ &\approx pres_w + abs_w \end{aligned}$$

where $c_{pos}$ ($c_{neg}$) is the proportion of observed positive (negative) comments, whilst $pres_w$ and $abs_w$ are the log-likelihood ratios of the events $(w \in r)$ and $(w \notin r)$, respectively.

Calculating those values for every word $w \in$ *Bag-of-Words*, we obtain an output such as the one reported in Table 1, which contains $c_{pos}$, $c_{neg}$, $pres_w$ and $abs_w$ for each word in the considered *Bag-of-Words*.


Table 1: Threshold-based Naïve Bayes output
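A sketch of how the Table 1 quantities can be estimated from the temporarily labelled reviews; the Laplace (+1) smoothing is our assumption, not a detail taken from the paper.

```python
import math
from collections import Counter

def tnb_table(reviews, bag_of_words):
    """Estimate c_pos, c_neg, pres_w and abs_w for each word.

    `reviews` is a list of (set_of_words, label) pairs, where the labels
    are the temporary sentiment labels of Sec. 3.
    """
    n_pos = sum(1 for _, y in reviews if y == "positive")
    n_neg = len(reviews) - n_pos
    c_pos, c_neg = n_pos / len(reviews), n_neg / len(reviews)

    pos_docs = Counter(w for words, y in reviews if y == "positive" for w in words)
    neg_docs = Counter(w for words, y in reviews if y == "negative" for w in words)

    table = {}
    for w in bag_of_words:
        p_w_pos = (pos_docs[w] + 1) / (n_pos + 2)        # P(w | c_pos), smoothed
        p_w_neg = (neg_docs[w] + 1) / (n_neg + 2)        # P(w | c_neg), smoothed
        pres_w = math.log(p_w_neg / p_w_pos)             # LLR of (w in r)
        abs_w = math.log((1 - p_w_neg) / (1 - p_w_pos))  # LLR of (w not in r)
        table[w] = {"c_pos": c_pos, "c_neg": c_neg,
                    "pres_w": pres_w, "abs_w": abs_w}
    return table
```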

We then used cross-validation to estimate a parameter τ such that a comment *c* is classified as "negative" if LOR(c) > τ, or as "positive" if LOR(c) ≤ τ.
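A sketch of the resulting classification rule; summing $pres_w$ over the words present in a comment and $abs_w$ over the absent ones is our reading of the LOR decomposition above, and τ would be chosen by scanning candidate values in cross-validation.

```python
def lor(comment_words, table):
    """Log-odds of a comment: pres_w for present words, abs_w for absent ones."""
    return sum(row["pres_w"] if w in comment_words else row["abs_w"]
               for w, row in table.items())

def classify(comment_words, table, tau):
    """Threshold rule: "negative" if LOR(c) > tau, "positive" otherwise."""
    return "negative" if lor(comment_words, table) > tau else "positive"
```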

Comparing the performances in Table 2 and Table 3, we can conclude that using the Threshold-based Naïve Bayes classifier in this framework leads to more precise predictions.


Table 2: Performance metrics obtained using the temporary sentiment label to predict the "real" label. Notice that only text data is used to estimate the temporary sentiment label, and the "real" label is not provided in the training phase.


Table 3: Performance metrics obtained with the Threshold-based Naïve Bayes classifier and 10-fold CV when predicting the real label, trained with the temporary sentiment label.

## 5. Conclusions

Compared to other kinds of approaches, the log-odds values obtained from the Threshold-based Naïve Bayes estimates are able to effectively classify new instances. These values also have a "versatile nature": they allow us to produce plots like those in Fig. 2a and Fig. 2b, where customer satisfaction with different dimensions of the hotel service is observed over time.

Figure 2: Category scores observed in time (overall sentiment in black).

## References

