Researchers at the University of Chieti-Pescara are developing statistical methodologies to analyse Italian hate speech data, which is increasingly present on social networks (not least thanks to an Italian political party). I’ve asked Alice Tontodimamma, a PhD student there, to introduce the subject, which is not only interesting but also highly relevant today. Here’s her contribution.
The exponential growth of social media has brought an increasing propagation of hate speech and hate-based propaganda. Hate speech is commonly defined as any communication that disparages a person or a group on the basis of some characteristic such as race, colour, ethnicity, gender, sexual orientation, nationality, or religion.
It is natural, in societies where freedom of speech is recognised, for people to express their opinions on certain subjects. The development of social media has created new means for people to communicate their ideas and share them with others: we have moved from an era in which individuals could communicate their ideas only to a small number of other people, usually orally and in a meeting place such as the town square, to one in which they can make free use of a variety of diffusion channels to communicate, instantaneously, with people at a great distance. Moreover, more and more users take advantage of these platforms not only to interact with others but also to share news.
The detachment created by being able to write without any obligation to reveal oneself means that this new medium of virtual communication allows people to feel greater freedom in how they express themselves. Unfortunately, this system also has a dark side: social media have become fertile ground for heated discussions that frequently result in insulting and offensive language.
The hate that spreads with such ease is no longer a phenomenon confined to the internet: it influences real society, can affect individual behaviour, and is increasingly recognised by countries as a serious problem. This has led to a number of international initiatives aimed at qualifying the problem and developing effective counter-measures. In this context, it is not surprising that most existing efforts are motivated by the impulse to detect and eliminate hateful messages.
This is why the research area of hate speech is receiving increased attention and, accordingly, shows a continuously growing publication rate. A wide variety of disciplines, among them Social Science, Psychology, Statistics, and Computer Science, are engaged in research into hate speech.
Until 2011, publications on the topic remained limited, at fewer than fifty per year. Since then, the number of publications has increased every year, peaking in 2018.
The question remains: will this growing trend continue in the coming years?
Price’s law states that the development of a science goes through four phases. In the first phase (the precursor phase), a small group of scientists begins to publish research in a new field. The second phase is one of exponential growth: the expanding field attracts an increasing number of scientists, since many aspects of the subject still have to be explored. In the third phase, the body of knowledge consolidates and the number of publications declines; the growth of scientific production becomes linear, so the curve ultimately changes from exponential to logistic. The fourth phase corresponds to the collapse of the domain and a sharp reduction in publications.
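The exponential-to-logistic transition described above can be sketched numerically: a logistic curve looks exponential far below its midpoint and flattens towards a ceiling afterwards. A minimal sketch in Python, where the carrying capacity K, growth rate r, and midpoint year t0 are illustrative assumptions, not values fitted to real publication data:

```python
import math

def logistic(t, K=1000.0, r=0.5, t0=2015.0):
    """Logistic curve: roughly exponential for t << t0, saturating at K."""
    return K / (1.0 + math.exp(-r * (t - t0)))

# Early years: growth looks exponential (successive ratios are about e^r)
early = [logistic(y) for y in (2000, 2001, 2002)]
# Late years: growth flattens as the curve approaches the ceiling K
late = [logistic(y) for y in (2030, 2031, 2032)]

print([round(v, 3) for v in early])
print([round(v, 3) for v in late])
```

Fitting such a curve to the yearly publication counts would be one way to check which phase the field is in; here the parameters are made up purely to show the shape.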
It can be said that research about hate speech has probably now entered the second phase of development: an increasing amount of research is being published, but there is still room for improvement in many aspects, among them the need for statistical methodologies and software tools to enable hate speech to be identified automatically, and then enable effective counter-measures to be created.
Institutions and companies agree on the importance of automatic detection of hate speech. In recent years, the European Union has developed a number of programs for preventing the appearance of hate speech online, and various companies and platforms have a clear interest in the detection and removal of hate speech: for instance, newspapers need to attract advertisers and therefore cannot risk becoming known as platforms for hate speech; social media companies wish to maximise the quality of communication service that they offer to their users.
There is, in general, and especially in Italy, a lack of systematic monitoring, documentation, and data collection for online hate speech. Furthermore, published work rarely comes with open source code, and no open-source tools are available for the automatic detection of hate speech.
Regarding the main aspects of previous research, we can say that:
– research generally focuses on datasets containing messages collected from social networks: the most commonly used source is Twitter;
– the most frequent approach consists of building a Machine Learning model for the classification of hate speech: the most widely used algorithms are SVM, Random Forests, and Decision Trees;
– the most widely used language is English;
– researchers tend to begin by collecting and classifying new messages; often those datasets remain private.
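The classification approach described above can be illustrated in miniature. The literature favours SVM, Random Forests, and Decision Trees over bag-of-words features; as a dependency-free stand-in, the sketch below uses a tiny Naive Bayes text classifier instead, with made-up training examples (a real system would need a properly annotated corpus):

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (text, label). Returns the counts needed for prediction."""
    word_counts = defaultdict(Counter)   # label -> word frequencies
    label_counts = Counter()
    vocab = set()
    for text, label in docs:
        label_counts[label] += 1
        for w in text.lower().split():
            word_counts[label][w] += 1
            vocab.add(w)
    return word_counts, label_counts, vocab

def predict_nb(model, text):
    """Pick the label maximising log prior + log word likelihoods."""
    word_counts, label_counts, vocab = model
    n_docs = sum(label_counts.values())
    best, best_lp = None, -math.inf
    for label in label_counts:
        lp = math.log(label_counts[label] / n_docs)  # class prior
        total = sum(word_counts[label].values())
        for w in text.lower().split():
            # Laplace smoothing so unseen words don't zero out the score
            lp += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

# Toy, invented training data -- far too small for real use
train = [
    ("you people are vermin", "hate"),
    ("go back where you came from", "hate"),
    ("what a lovely day today", "ok"),
    ("great match last night", "ok"),
]
model = train_nb(train)
print(predict_nb(model, "you are vermin"))  # -> hate
```

The point of the sketch is the pipeline shape (annotated texts in, word features, a trained classifier out), which is shared by the SVM and tree-based models used in the literature.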
It is evident that the detection of hate speech involves much more than simple keyword spotting. For this reason, we list the main difficulties:
– authors do not use public datasets, and do not publish the new ones they collect: this makes it very difficult to compare results and conclusions;
– humans show a low rate of agreement (33%) when classifying hate speech, indicating that such classification is an even harder task for machines;
– the task of annotating a dataset is also more difficult because it requires expertise about culture and social structure, and the evolution of social phenomena and language makes it difficult to track all racial and minority insults.
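The agreement problem mentioned above is usually quantified with Cohen's kappa, which corrects raw agreement between two annotators for the agreement expected by chance. A minimal sketch, with invented annotations (not real data from any hate speech corpus):

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two annotators labelling the same items."""
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n       # observed agreement
    ca, cb = Counter(a), Counter(b)
    # chance agreement: both annotators pick the same label independently
    p_exp = sum((ca[l] / n) * (cb[l] / n) for l in set(ca) | set(cb))
    return (p_obs - p_exp) / (1 - p_exp)

# Made-up annotations: the two annotators agree only half the time
ann1 = ["hate", "hate", "ok", "ok", "hate", "ok"]
ann2 = ["hate", "ok", "ok", "hate", "ok", "ok"]
print(round(cohens_kappa(ann1, ann2), 3))  # -> 0.0, chance-level agreement
```

In this toy example the 50% raw agreement is exactly what chance would predict, so kappa is zero, which is why raw percentages like the 33% figure understate how hard the annotation task really is.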
This is undoubtedly an area that has profound societal impact and which presents many research challenges.