Textual One-Pass Stream Clustering with Automated Distance Threshold Adaption

Assenmacher D, Trautmann H


Abstract

Stream clustering is a technique for identifying homogeneous groups of observations that arrive continuously in a digital stream. In this work, we refine a TF-IDF-based text stream clustering algorithm by introducing an automated distance threshold adaption technique for document insertion and cluster merging, which improves performance during distributional changes in the data stream. In a thorough evaluation study, we show that our new, fast approach outperforms state-of-the-art one-pass and batch-based stream clustering algorithms on various existing benchmark datasets as well as on a newly introduced dataset that poses additional challenges to the community. Moreover, we find that current evaluation approaches in the field of textual stream clustering are not adequate for a sound assessment of clustering performance on evolving distributions. We therefore call for the use of time-based evaluation.

Keywords
Stream Clustering; Text Mining; Concept Drift



Publication type
Research article in proceedings (conference)

Peer reviewed
Yes

Publication status
Published

Year
2022

Conference
ACIIDS 2022

Venue
Ho Chi Minh City

Book title
Intelligent Information and Database Systems

Editor
Tran, T et al.

Start page
3

End page
16

Publisher
Springer International Publishing

Place
Cham

Language
English

DOI