Data Warehousing and Hadoop

The era of Big Data is dominated by tools that do not originate from the traditional SQL-based database and Data Warehouse (DWH) area. DWHs are used for high-value, structured, and integrated data in OLAP scenarios. Big Data tools cover a wider range of uses: they are not restricted to structured data and can often handle larger quantities of more heterogeneous data, e.g., XML files, JSON, or pictures and video. Besides NoSQL systems, Apache Hadoop has gained widespread attention in both academia and practice. It includes processing with the MapReduce paradigm, which distributes a task across many simple workers and thereby lets MapReduce applications scale horizontally on commodity machines. Because MapReduce jobs are programmed imperatively, they are more flexible than the declarative SQL of traditional systems, but this flexibility also increases complexity. The latest Apache Hadoop 2.0 stack contains a variety of tools, e.g., a distributed file system (HDFS) and cluster management tools. This opens up several options for using Hadoop with a DWH (e.g., Hadoop as an ETL tool, or Hadoop as a complement to the DWH) to create a Big Data-ready DWH that enriches the BI approach of a company.

The goal of this thesis is to analyze if and how Hadoop, with or without a DWH, can be used as part of a Big Data-ready BI initiative of a company, and which limitations and potentials exist. A system that consists of multiple physical data storage entities, so that the data is no longer stored in one single central DWH, can also be called a Logical Data Warehouse (LDWH). The thesis will provide an overview of relevant use cases from practice, which should be gathered, analyzed, and compared with an appropriate method to arrive at a structured comparison.
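
To make the MapReduce paradigm mentioned above more concrete, the following is a minimal, single-process sketch of the map, shuffle, and reduce phases, using word counting as the task. It does not use the Hadoop API; the function names (map_phase, shuffle, reduce_phase, run_job) are hypothetical and chosen only for illustration. In a real Hadoop cluster the map and reduce calls would run on different worker nodes, and the framework would perform the shuffle.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(record: str) -> List[Tuple[str, int]]:
    """Map: emit one (word, 1) pair for every word in a single input record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by their key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def run_job(records: List[str]) -> Dict[str, int]:
    """Run the three phases in sequence (illustrative, not distributed)."""
    # Each record is mapped independently, so this step could be
    # spread across many workers without coordination.
    mapped = [pair for record in records for pair in map_phase(record)]
    grouped = shuffle(mapped)
    # Each key is reduced independently, so this step parallelizes as well.
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

Because every record is mapped and every key is reduced independently of the others, the same logic can be spread over additional machines as the data grows, which is the horizontal scaling property described above.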