Analysis of public microblogging and protocol data using Spark: The TTIP case

The era of Big Data is dominated by tools that do not originate from the traditional SQL-based database and Data Warehouse (DWH) field. DWHs are used for high-value, structured, and integrated data in OLAP scenarios. Big Data tools have a wider range of use: they are not restricted to structured data and can often handle larger quantities of more heterogeneous data, e.g., XML files, JSON documents, or pictures and video. Besides NoSQL systems, Apache Hadoop has gained widespread attention in both academia and practice. It processes data with the MapReduce paradigm, which splits a task across many simple workers and thereby lets MapReduce applications scale horizontally on commodity machines. Because MapReduce jobs are written imperatively, Hadoop is more flexible than the declarative SQL of traditional systems, but this flexibility also increases complexity. The Apache Hadoop 2.0 stack comprises a variety of further tools, e.g., the distributed file system HDFS and cluster management with YARN. This opens up several options for combining Hadoop with a DWH (e.g., Hadoop as an ETL tool or Hadoop as a DWH complement) to create a Big Data-ready DWH that enriches a company's BI approach. The goal of this thesis is to analyze if and how Hadoop, with or without a DWH, can be used as part of a Big Data-ready BI initiative. The thesis analyzes relevant use cases and technologies and implements an exemplary Big Data use case with typical Big Data-ready technologies, providing insights that traditional technology could not deliver.
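
To make the map-and-reduce pattern described above concrete, the following minimal sketch counts word frequencies with the Spark Scala API, the engine named in the title. The input and output HDFS paths and the application name are hypothetical placeholders, not part of the thesis setup; the sketch only illustrates how a task is expressed as a map step (emit a count of 1 per word) and a reduce step (sum the counts per word), both of which the cluster distributes across its workers.

```scala
import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    // Obtain a Spark session; on a cluster, master and resources come from spark-submit
    val spark = SparkSession.builder()
      .appName("WordCount")            // hypothetical application name
      .getOrCreate()
    val sc = spark.sparkContext

    // Read text files from HDFS (placeholder path)
    val lines = sc.textFile("hdfs:///data/input/*.txt")

    // Map step: split each line into words and emit (word, 1) pairs
    // Reduce step: sum the counts per word across all partitions
    val counts = lines
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // Write the aggregated counts back to HDFS (placeholder path)
    counts.saveAsTextFile("hdfs:///data/output/word-counts")
    spark.stop()
  }
}
```

The chained flatMap/map/reduceByKey calls also show the imperative flavor mentioned above: the programmer spells out each transformation step, whereas a declarative SQL system would be handed only the desired result set.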