Automation of Data Preprocessing Tasks


Today, data is captured, stored, analyzed, and used in a wide variety of scenarios. This information is typically produced by machines in a factory, generated within applications, or entered manually by the responsible user. In the latter case especially, mistakes occur: values are entered with typos (for example, 1 000 instead of 100), or entire values are skipped out of laziness. From a scientific viewpoint, the typical problems encountered when analyzing a given data set are:

  • Missing values
  • Outliers
  • Duplicates 

Identifying the affected values and then imputing or curtailing them is a tedious and time-consuming process. Hence, the objective of this thesis topic is to implement an automated processing pipeline that addresses the issues listed above.
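
To make the idea of such a pipeline more concrete, the following is a minimal sketch in Python using only the standard library. It is not the method prescribed for the thesis: the function names (drop_duplicates, clip_outliers, impute_median, preprocess), the record layout, the 1.5 x IQR fence for outliers, and median imputation are all assumptions chosen purely for illustration.

```python
from statistics import median, quantiles


def drop_duplicates(rows):
    """Remove exact duplicate records, keeping the first occurrence."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))  # hashable fingerprint of the record
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def clip_outliers(rows, column, k=1.5):
    """Curtail values outside [Q1 - k*IQR, Q3 + k*IQR] to the nearest fence.

    The IQR rule is an illustrative assumption, not a requirement of the topic.
    Missing values (None) are left untouched here.
    """
    observed = [r[column] for r in rows if r[column] is not None]
    q1, _, q3 = quantiles(observed, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [
        r if r[column] is None else {**r, column: min(max(r[column], low), high)}
        for r in rows
    ]


def impute_median(rows, column):
    """Replace missing values (None) with the median of the observed values."""
    fill = median(r[column] for r in rows if r[column] is not None)
    return [{**r, column: fill} if r[column] is None else r for r in rows]


def preprocess(rows, column):
    """Run the three steps in a fixed order: deduplicate, curtail, impute."""
    rows = drop_duplicates(rows)
    rows = clip_outliers(rows, column)
    return impute_median(rows, column)


# Hypothetical manually entered readings: one duplicate record (id 2),
# one missing value (id 3), and one typo (1000 instead of 100, id 4).
readings = [
    {"id": 1, "value": 100}, {"id": 2, "value": 98}, {"id": 2, "value": 98},
    {"id": 3, "value": None}, {"id": 4, "value": 1000}, {"id": 5, "value": 102},
    {"id": 6, "value": 101}, {"id": 7, "value": 99}, {"id": 8, "value": 100},
    {"id": 9, "value": 101}, {"id": 10, "value": 103},
]

for record in preprocess(readings, "value"):
    print(record)
```

In this sketch the order of the steps matters: duplicates are removed first so that they do not distort the quartiles, and outliers are curtailed before imputation so that the typo does not influence the value used to fill the gap. An actual pipeline would have to justify or automate such choices per column and data type.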


This thesis topic requires you to apply, develop, and enhance existing imputation, outlier detection, and deduplication techniques at a conceptual level. You should therefore be familiar with the content of the following lectures:

  • Data Analytics I
  • Data Analytics II

Furthermore, the devised concept has to be translated into a practical scenario. In other words, you have to implement your ideas in either Python or R. Thus, profound knowledge of at least one of these two programming languages is required as well.