Towards Detection of Abusive Language in German Online Media - Concept and Implementation of a Machine-Learning Approach

Online debates have gone off the rails. The amount of hateful content on social media websites has sparked a national debate on how to deal with it. Commenters likely try to influence public opinion and media coverage on controversial topics. Social networks promote such behavior through their (often purely economically oriented) design: for instance, emotionality is rewarded with more shares and attention. As a result, many comments appear highly ideological and favor extreme opinions, while balanced arguments are rare. Methods to analyze online news comments are therefore required, not to apply automatic censorship, but to reduce the enormous effort of manual moderation needed to curate the best of the web.

In this master's thesis, the student shall evaluate and implement analytical methods to detect toxic content in online debates. To this end, state-of-the-art methods from natural language processing and supervised learning will be combined, using technologies such as PHP and Python together with web scraping techniques. The thesis may address the following research questions:

  • Which text mining techniques are best suited to detect toxic content in online debates?
  • What requirements must algorithmic hate speech detection meet with respect to algorithmic transparency and user acceptance?
  • How should the design of online debate portals be adapted, and which automatic methods can help community managers do their job well?
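
As a rough illustration of the supervised-learning component mentioned above, the following minimal sketch trains a bag-of-words multinomial Naive Bayes classifier in plain Python. It is only one possible baseline, not the method prescribed for the thesis. The names `tokenize` and `NaiveBayes`, the labels `toxic` and `ok`, and the four toy comments are hypothetical and invented for illustration; they are not taken from the Cyberhate-Mining project.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and extract runs of letters (including German umlauts)."""
    return re.findall(r"[a-zäöüß]+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with Laplace (add-alpha) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.word_counts = defaultdict(Counter)  # label -> token frequencies
        self.doc_counts = Counter()              # label -> number of documents
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.doc_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def log_scores(self, text):
        """Return the unnormalized log posterior for each label."""
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label in self.doc_counts:
            counts = self.word_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(self.doc_counts[label] / total_docs)
            for tok in tokens:
                if tok in self.vocab:  # tokens never seen in training are ignored
                    score += math.log((counts[tok] + self.alpha) / denom)
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Toy training data, invented purely for illustration (not a real corpus).
train_texts = [
    "Du bist ein Idiot",
    "Ihr seid alle dumm und widerlich",
    "Danke für den sachlichen Artikel",
    "Ich sehe das anders, aber guter Punkt",
]
train_labels = ["toxic", "toxic", "ok", "ok"]

model = NaiveBayes().fit(train_texts, train_labels)
print(model.predict("Du Idiot"))
print(model.predict("Danke, guter Artikel"))
```

A model of this kind exposes per-word log probabilities, which may be relevant to the transparency question: a moderator could inspect which tokens pushed a comment toward a label. A realistic system would need a labeled German corpus and a proper evaluation, neither of which is shown here.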

This thesis builds upon the student research project Cyberhate-Mining (www.hatemining.de) and will focus on a German text corpus. Good German reading skills will be helpful.