Textual One-Pass Stream Clustering with Automated Distance Threshold Adaption

Assenmacher D, Trautmann H


Abstract

Stream clustering is a technique capable of identifying homogeneous groups of observations that continuously arrive in a digital stream. In this work, we inherently refine a TF-IDF-based text stream clustering algorithm by the introduction of an automated distance threshold adaption technique for document insertion and cluster merging, improving the performance during distributional changes in the data stream. By conducting a thorough evaluation study, we show that our new fast approach outperforms state-of-the-art one-pass and batch-based stream clustering algorithms on various existing benchmarking datasets as well as a newly introduced dataset that poses additional challenges to the community. Moreover, we find that current evaluation approaches in the field of textual stream clustering are not adequate for a sound clustering performance assessment of evolving distributions. We thus demand the utilization of time-based evaluation.

Keywords
Stream Clustering; Text Mining; Concept Drift



Publication type
Research article in edited proceedings (conference)

Peer reviewed
Yes

Publication status
Published

Year
2022

Conference
ACIIDS 2022

Conference location
Ho Chi Minh City

Book title
Intelligent Information and Database Systems

Editors
Tran, T et al.

First page
3

Last page
16

Publisher
Springer International Publishing

Place of publication
Cham

Language
English

DOI