A Hybrid Machine Learning Approach: The Use of Human in the Loop to Enhance the Moderation of Abusive Language

Although the internet was originally intended to enhance information exchange and discussion, the menace of abusive language has been lurking since its inception [1]. Having first plagued the traditionally user-centred social media sites, abusive language has since reached the comment sections of news media providers, which have become a target for extremists, trolls and other individuals with evil intentions. Overwhelmed by the massive amounts of unacceptable or even illegal user content, many outlets have resorted to closing their forums and comment sections, severely inhibiting public discourse.

The traditional approaches published since 2012 work with fixed, closed data sets. While a certain convergence is visible in the field, these established methods share one severe disadvantage: they fail to accommodate the fact that language changes over time (in more technical terms: concept drift [e.g., 3 and 4]).

To account for these changes, this thesis should evaluate the applicability of machine learning algorithms for data streams [for an introduction see, e.g., 3 and 4]. The thesis should cover both the foundational conceptual work and a prototypical implementation. However, the exact balance between the two parts is negotiable.
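As a rough illustration of what learning from a data stream can look like, the following minimal sketch trains a classifier one comment at a time and evaluates it in a test-then-train (prequential) fashion, so that it can adapt when the meaning of a word shifts. Everything in it is an assumption made for illustration: the class `OnlineLogisticRegression`, the function `prequential`, the made-up tokens such as "zorp" and the toy stream are hypothetical and are not taken from the project or from the cited surveys.

```python
import math


class OnlineLogisticRegression:
    """Hypothetical bag-of-words logistic regression, updated one comment at a time."""

    def __init__(self, learning_rate=0.5):
        self.learning_rate = learning_rate
        self.weights = {}  # one weight per token seen so far
        self.bias = 0.0

    @staticmethod
    def _tokens(text):
        # Deliberately naive tokenisation: lower-case, split on whitespace.
        return set(text.lower().split())

    def predict_proba(self, text):
        """Return the estimated probability that the comment is abusive."""
        z = self.bias + sum(self.weights.get(t, 0.0) for t in self._tokens(text))
        z = max(-30.0, min(30.0, z))  # clip to keep exp() well-behaved
        return 1.0 / (1.0 + math.exp(-z))

    def learn_one(self, text, label):
        """Single stochastic gradient step on one labelled comment (label is 0 or 1)."""
        step = self.learning_rate * (label - self.predict_proba(text))
        self.bias += step
        for t in self._tokens(text):
            self.weights[t] = self.weights.get(t, 0.0) + step


def prequential(stream, model):
    """Test-then-train: predict each comment first, then learn from its label."""
    outcomes = []
    for text, label in stream:
        predicted = model.predict_proba(text) >= 0.5
        outcomes.append(predicted == bool(label))
        model.learn_one(text, label)
    return outcomes


if __name__ == "__main__":
    # Toy stream with simulated concept drift: the invented word "zorp" is
    # abusive in the first phase and harmless in the second.
    before = [("you zorp", 1), ("nice day", 0)] * 20
    after = [("zorp is fine", 0), ("you blip", 1)] * 20
    model = OnlineLogisticRegression()
    outcomes = prequential(before + after, model)
    print("prequential accuracy:", sum(outcomes) / len(outcomes))
```

Because the model is updated after every labelled comment, no fixed training set is needed. A real prototype would more likely build on one of the stream processing frameworks and would need proper text preprocessing and explicit drift detection.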

This thesis allows you to apply and broaden your data analytics skills and to work with cutting-edge analytics software (e.g., Apache Spark [8], Apache Flink [6], Spark ML [5], Apache Storm [7], …). Since you will be working in a very dynamic field of research, you will find plenty of recent publications but still have the chance to make your very own contribution to this highly relevant field.

The thesis is embedded into the Cyberhate-Mining research project (www.hatemining.de, [2]).

Literature

[1] A. Bastidas, E. Dixon, C. Loo, and J. Ryan, “Harassment detection: a benchmark on the #HackHarassment dataset,” in Proceedings of the Collaborative European Research Conference, 2016, pp. 76–79.

[2] S. Köffer, D. M. Riehle, S. Höhenberger, and J. Becker, “Discussing the Value of Automatic Hate Speech Detection in Online Debates,” in Tagungsband Multikonferenz Wirtschaftsinformatik 2018, 2018.

[3] V. Lemaire, C. Salperwyck, and A. Bondu, “A Survey on Supervised Classification on Data Streams,” in Lecture Notes in Business Information Processing, vol. 205, E. Zimányi and R.-D. Kutsche, Eds. Berlin, Germany: Springer International Publishing, 2015, pp. 88–125.

[4] A. A. Benczúr, L. Kocsis, and R. Pálovics, “Online Machine Learning in Big Data Streams,” Feb. 2018.

[5] The Apache Software Foundation, “MLlib: Main Guide - Spark 2.3.0 Documentation,” spark.apache.org, 2018. [Online]. Available: https://spark.apache.org/docs/latest/ml-guide.html. [Accessed: 22-May-2018].

[6] The Apache Software Foundation, “Apache Flink: Scalable Stream and Batch Data Processing,” flink.apache.org, 2017. [Online]. Available: https://flink.apache.org/. [Accessed: 22-May-2018].

[7] The Apache Software Foundation, “Apache Storm,” Apache Software Foundation, 2015. [Online]. Available: http://storm.apache.org/. [Accessed: 22-May-2018].

[8] The Apache Software Foundation, “Apache Spark™ - Unified Analytics Engine for Big Data,” spark.apache.org, 2018. [Online]. Available: https://spark.apache.org/. [Accessed: 22-May-2018].