Language Drift Detector
General Idea and Background
While abusive language, both offline and online, is no new phenomenon, recent years have seen a massive rise in problematic textual content. Many operators of comment sections and social media sites struggle to strike a balance between allowing heated discourse and preventing abuse from taking over and driving their readers away. Besides the inherent problem of losing money through deteriorating engagement levels, operators also increasingly face legal pressure. As the option to contribute and comment typically comes free of charge, the outlets have to pay moderators and community managers without generating any immediate returns. This has pushed many of them to deactivate their discussion features in order to keep operating economically. Since the disappearance of discussion spaces is not desirable in an open society, researchers have set out to develop NLP- and ML-based solutions that could assist community managers in their daily work. However, the field is still at an early stage and faces a multitude of problems.
Some of the major challenges that automated approaches face when exposed to textual data are related to the versatility and dynamics of natural language:
- topical changes – linked to continuously changing names, terms, and concepts
- general development of languages (language as a living construct)
- introduction of new concepts/expressions (hashtags, emojis, @-tags, …)
- obfuscations to hide typically black-listed words or to circumvent terms and words that are commonly perceived as suspicious
These spontaneous as well as gradual changes in language frequently invalidate existing ML/NLP models that assist moderators in detecting abusive language. However, retraining models comes at a cost, especially if analysts have to adjust parameters individually. Hence, it is important to observe and track changes in language in order to find an “optimal” point in time to update existing language and learning models.
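As a rough illustration of what tracking changes in language could mean in practice, the following minimal sketch compares the word distribution of a reference window of comments (for example, the data a model was trained on) with that of a recent window and flags drift once their Jensen-Shannon divergence exceeds a threshold. This is only one possible approach and not the method prescribed for the thesis. The function names, the whitespace tokenization, and the threshold of 0.3 are assumptions made for this example.

```python
import math
from collections import Counter


def word_distribution(comments):
    """Relative unigram frequencies over a list of comment strings."""
    # Naive lowercase/whitespace tokenization; an assumption for this sketch.
    counts = Counter(tok for text in comments for tok in text.lower().split())
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence (log base 2) between two word distributions.

    Returns 0 for identical distributions and 1 for distributions
    that share no words at all.
    """
    jsd = 0.0
    for word in set(p) | set(q):
        pw, qw = p.get(word, 0.0), q.get(word, 0.0)
        mw = 0.5 * (pw + qw)  # mixture distribution
        if pw > 0:
            jsd += 0.5 * pw * math.log2(pw / mw)
        if qw > 0:
            jsd += 0.5 * qw * math.log2(qw / mw)
    return jsd


def drift_detected(reference, recent, threshold=0.3):
    """Flag drift when the two windows diverge by more than `threshold`.

    The threshold is a hypothetical value and would need tuning on real data.
    """
    divergence = jensen_shannon_divergence(
        word_distribution(reference), word_distribution(recent)
    )
    return divergence > threshold


if __name__ == "__main__":
    reference_window = ["das spiel war gut", "gutes spiel gestern"]
    recent_window = ["die wahl war knapp", "wahl ergebnis heute"]
    print(drift_detected(reference_window, recent_window))
```

In a real setting the windows would slide over the comment stream, and the threshold would be calibrated against the observed loss in model quality, so that retraining is only triggered when it is likely to pay off.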
The expected outcome of this thesis is a software artifact that assists platform operators in detecting changes in the language used. This artifact should be integrated into an existing web service (API documentation and source code will be made available). Ideally, the developed artifact provides further information regarding the occurring changes (overview of recent topics, trending words/concepts, …). The exact phrasing of the research goal is subject to discussion with the supervisor and can be adjusted to accommodate individual ideas of the candidate.
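To give an idea of the kind of additional information such an artifact might provide, the sketch below ranks words by how much more frequent they are in a recent window of comments than in a reference window, using add-one smoothing so that previously unseen words do not cause a division by zero. It is a simplified illustration and not part of the existing web service. The function name `trending_words` and the parameters `top_k` and `min_count` are hypothetical.

```python
from collections import Counter


def trending_words(reference, recent, top_k=5, min_count=2):
    """Return words whose relative frequency rose most from reference to recent.

    `reference` and `recent` are lists of comment strings. Words occurring
    fewer than `min_count` times in the recent window are ignored to reduce noise.
    """
    # Naive lowercase/whitespace tokenization; an assumption for this sketch.
    ref_counts = Counter(tok for text in reference for tok in text.lower().split())
    new_counts = Counter(tok for text in recent for tok in text.lower().split())
    ref_total = sum(ref_counts.values())
    new_total = sum(new_counts.values())
    vocab_size = len(set(ref_counts) | set(new_counts))

    scores = {}
    for word, count in new_counts.items():
        if count < min_count:
            continue
        # Add-one smoothed relative frequencies in both windows.
        new_rate = (count + 1) / (new_total + vocab_size)
        ref_rate = (ref_counts[word] + 1) / (ref_total + vocab_size)
        scores[word] = new_rate / ref_rate
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


if __name__ == "__main__":
    reference_window = ["das spiel war gut", "gutes spiel gestern"]
    recent_window = ["die wahl war knapp", "wahl wahl ergebnis", "das spiel"]
    print(trending_words(reference_window, recent_window))
```

A frequency ratio is deliberately simple. More robust alternatives, such as comparing word embeddings across time windows, would be candidates for the thesis itself.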
What We Offer
Throughout the thesis the candidate will work on a socially relevant and technically interesting topic. The thesis will be written in the context of the MODERAT! project, which is an ongoing 3-year research project at the Department of Information Systems in collaboration with our practice partner, the German newspaper “Rheinische Post”. Furthermore, for computational tasks we provide access to our HPC cluster (see: https://www.moderat.nrw/news/24) so that you can build and work with complex architectures – and train yourself in working with remote compute clusters.
The candidate should be proficient in Python or willing to gain the required knowledge. Experience with packages and tools such as scikit-learn, nltk, pandas, scipy, word2vec and tensorflow is helpful. As all textual data provided for this thesis will be in German, German reading skills at B2 level (ideally C level) are required; writing skills are optional.
Literature Abusive Language (Excerpt)
Brunk, J., Niemann, M., and Riehle, D. M. 2019. “Can Analytics as a Service Save the Online Discussion Culture? – The Case of Comment Moderation in the Media Industry,” in Proceedings of the 21st IEEE Conference on Business Informatics, CBI 2019, Moscow, Russia: IEEE, pp. 472–481. (https://doi.org/10.1109/CBI.2019.00061).
Niemann, M. 2019. “Abusiveness Is Non-Binary: Five Shades of Gray in German Online News-Comments,” in Proceedings of the 21st IEEE Conference on Business Informatics, CBI 2019, Moscow, Russia: IEEE, pp. 11–20. (https://doi.org/10.1109/CBI.2019.00009).
Niemann, M., Riehle, D. M., Brunk, J., and Becker, J. 2020. “What Is Abusive Language? Integrating Different Views on Abusive Language for Machine Learning,” in Disinformation in Open Online Media (Vol. 12021), Lecture Notes in Computer Science, C. Grimme, M. Preuss, F. W. Takes, and A. Waldherr (eds.), Hamburg, Germany: Springer, Cham, pp. 59–73. (https://doi.org/10.1007/978-3-030-39627-5_6).
Nobata, C., Tetreault, J., Thomas, A., Mehdad, Y., and Chang, Y. 2016. “Abusive Language Detection in Online User Content,” in Proceedings of the 25th International Conference on World Wide Web, WWW ’16, Montreal, Canada: ACM Press, pp. 145–153. (https://doi.org/10.1145/2872427.2883062).
Riehle, D. M., Niemann, M., Brunk, J., Assenmacher, D., Trautmann, H., and Becker, J. 2020. “Building an Integrated Comment Moderation System – Towards a Semi-Automatic Moderation Tool,” in Proceedings of the 12th International Conference on Social Computing and Social Media, SCSM 2020, G. Meiselwitz (ed.), Copenhagen, Denmark: Springer, Cham, pp. 71–86. (https://doi.org/10.1007/978-3-030-49576-3).
Literature Drift Detection (Excerpt)
Gama, J., Žliobaitė, I., Bifet, A., Pechenizkiy, M., and Bouchachia, A. 2014. “A Survey on Concept Drift Adaptation,” ACM Computing Surveys (46:4), pp. 1–37. (https://doi.org/10.1145/2523813).
Gao, J., Fan, W., Han, J., and Yu, P. S. 2007. “A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions,” in Proceedings of the 7th SIAM International Conference on Data Mining, SDM 2007, C. Apte, D. Skillicorn, B. Liu, and S. Parthasarathy (eds.), Minneapolis, MN, USA: SIAM, pp. 3–14.
Lazarescu, M. M., Venkatesh, S., and Bui, H. H. 2004. “Using Multiple Windows to Track Concept Drift,” Intelligent Data Analysis (8:1), pp. 29–59. (https://doi.org/10.3233/ida-2004-8103).
Lu, J., Liu, A., Dong, F., Gu, F., Gama, J., and Zhang, G. 2018. “Learning under Concept Drift: A Review,” IEEE Transactions on Knowledge and Data Engineering (31:12), pp. 2346–2363. (https://doi.org/10.1109/TKDE.2018.2876857).
Ramírez-Gallego, S., Krawczyk, B., García, S., Woźniak, M., and Herrera, F. 2017. “A Survey on Data Preprocessing for Data Stream Mining: Current Status and Future Directions,” Neurocomputing (239), pp. 39–57. (https://doi.org/10.1016/j.neucom.2017.01.078).