Identifying Image-Based Campaigns in Social Media: A Stream Clustering Approach

The objective of this thesis is to extend the methodology proposed by Assenmacher et al. (2020) by applying a stream clustering approach to image data collected from the social media platform Reddit. Established pre-trained models are used to generate embeddings from the images. These embeddings serve as input for the clustering process because distances in their vector space capture semantic similarity between images, which traditional features may not capture as effectively. The stream of social media image data will be clustered sequentially to mimic real-world production scenarios, where data arrives continuously and requires real-time processing. Each cluster corresponds to a topic discussed in the stream and contains images that share semantic similarities. Cluster weights are updated dynamically: when an image is assigned to a cluster, that cluster’s weight increases; conversely, the weight decreases over time if no new images are assigned to it. This approach enables the identification of suspicious trends over time, such as those indicative of potential campaigns disseminated through coordinated activities or individual spam accounts. For instance, a sudden and steep increase in a cluster’s weight may signify a spam attack, characterized by the rapid dissemination of similar content within a short timeframe (Duan et al., 2007). Subsequent metadata analysis of such a conspicuous cluster can provide insights into the nature of the campaign, e.g. by analyzing the number of distinct accounts responsible for the cluster or checking the average age of those accounts (Assenmacher et al., 2020). The quality of the cluster assignments will be evaluated quantitatively using both internal and external evaluation metrics. Additionally, the results will undergo qualitative evaluations to provide further insights into the performance of the clustering approach and the quality of the embeddings.
The qualitative evaluations include: examining semantic similarities among images within clusters, detecting simulated campaigns by observing stereotypical campaign patterns with peaks in cluster weights, and visualizing high-dimensional embeddings in lower-dimensional space to gain insights into their structure and relationships.
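The weight dynamics described above (a cluster's weight grows when an image is assigned to it and fades while it receives none) can be illustrated with a minimal sketch. It is not the implementation used in the thesis or in Assenmacher et al. (2020). The exponential fading function, the cosine-distance threshold `radius`, the `decay_rate` value, and all names (`MicroCluster`, `fade`, `process`) are assumptions chosen for illustration, and short lists of floats stand in for real image embeddings.

```python
import math
from dataclasses import dataclass


@dataclass
class MicroCluster:
    centroid: list[float]  # weighted running mean of the assigned embeddings
    weight: float          # decayed count of assigned images
    last_update: float     # timestamp of the last weight update


def cosine_distance(a, b):
    # Assumes non-zero vectors; a zero vector would divide by zero here.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def fade(cluster, now, decay_rate):
    """Decay the weight exponentially for the time elapsed since the last update."""
    cluster.weight *= 2.0 ** (-decay_rate * (now - cluster.last_update))
    cluster.last_update = now


def process(clusters, embedding, now, radius=0.3, decay_rate=0.1):
    """Assign one embedding to the nearest cluster, or open a new cluster."""
    # Hypothetical parameters: radius and decay_rate would need tuning on real data.
    for c in clusters:
        fade(c, now, decay_rate)

    nearest = min(
        clusters,
        key=lambda c: cosine_distance(c.centroid, embedding),
        default=None,
    )
    if nearest is None or cosine_distance(nearest.centroid, embedding) > radius:
        # No cluster is close enough: start a new one at this embedding.
        nearest = MicroCluster(list(embedding), 0.0, now)
        clusters.append(nearest)
    else:
        # Move the centroid towards the new point, weighted by the cluster weight.
        w = nearest.weight
        nearest.centroid = [
            (w * m + x) / (w + 1.0) for m, x in zip(nearest.centroid, embedding)
        ]

    nearest.weight += 1.0  # the assignment increases the cluster weight
    return nearest


clusters = []
process(clusters, [1.0, 0.0], now=0.0)   # opens the first cluster
process(clusters, [0.9, 0.1], now=0.0)   # similar image, same cluster
process(clusters, [0.0, 1.0], now=10.0)  # dissimilar image, new cluster
```

Under these assumptions, recording each cluster's weight after every call yields one time series per cluster, and a steep rise in one series within a short window would mark that cluster as a candidate for the metadata analysis described above.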