Building a Data Pipeline for a Temporal Knowledge Graph for Financial Sentiment Analysis

Text-based sentiment measures derived from financial news have long been used as predictors of stock returns. However, large language models (LLMs) combined with standard Retrieval-Augmented Generation (RAG) reach their limits when analyzing complex financial dynamics: they model the evolving relationships between companies, supply chains, and market events only inadequately and therefore overlook the heterogeneous effects of news on different companies. Temporal Knowledge Graphs (TKGs) address this problem by enabling multi-hop reasoning and by explicitly modeling causal relationships in financial markets. This requires a robust, quality-assured data foundation. Building such a foundation is a methodological challenge in its own right and forms the basis for future research in this area.

The thesis aims to design and implement a scalable end-to-end pipeline that transforms unstructured financial news into a temporal knowledge graph. The focus lies on the ingestion and preprocessing of news texts, the LLM-based extraction of entities, relationships, and sentiment values, and the conversion of the results into a graph-compatible format. Building on this, a Neo4j-based knowledge graph is to be constructed that captures entity relationships together with timestamps, sentiment values, and confidence levels. The underlying schema should be designed to support subsequent temporal queries and multi-hop analyses, so that it can serve as a solid foundation for follow-up work, particularly in the area of predictive modeling. A key component of the thesis is the quality assurance of the pipeline. To this end, suitable methods for evaluating the quality of each extraction step (entity linking, relationship extraction, and sentiment assignment) are to be developed and applied, and mechanisms for monitoring the pipeline during ongoing operation are to be integrated.
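To make the intended data flow more concrete, the following is a minimal sketch of what a single extraction result and its evaluation could look like. It is not a prescribed design: the record type `ExtractedRelation`, its field names, the value ranges for sentiment and confidence, the `Company` label, the `RELATES` relationship type, and the example entities are all assumptions chosen for illustration. The Cypher statement is held as a plain string and is not executed here.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass(frozen=True)
class ExtractedRelation:
    """One hypothetical extraction result: a timestamped, scored edge."""
    source: str            # canonical entity id after entity linking (assumed)
    relation: str          # relation type, e.g. "SUPPLIES" (assumed label)
    target: str
    published_at: datetime  # publication time of the news item
    sentiment: float       # assumed range [-1.0, 1.0]
    confidence: float      # assumed range [0.0, 1.0]

    def __post_init__(self):
        # Reject out-of-range scores before they reach the graph.
        if not -1.0 <= self.sentiment <= 1.0:
            raise ValueError("sentiment must lie in [-1, 1]")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")


def to_graph_row(rel):
    """Flatten a record into a plain dict usable as query parameters."""
    row = asdict(rel)
    row["published_at"] = rel.published_at.isoformat()
    return row


# Illustrative Cypher statement (an assumption, not the thesis schema).
# The keys of to_graph_row() match the $parameters used here.
UPSERT_RELATION = """
MERGE (a:Company {id: $source})
MERGE (b:Company {id: $target})
CREATE (a)-[:RELATES {type: $relation,
                      published_at: datetime($published_at),
                      sentiment: $sentiment,
                      confidence: $confidence}]->(b)
"""


def triple_scores(predicted, gold):
    """Precision, recall and F1 over (source, relation, target) triples."""
    pred = {(r.source, r.relation, r.target) for r in predicted}
    ref = set(gold)
    tp = len(pred & ref)  # triples that are both predicted and in the gold set
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


# Made-up example data, for illustration only.
example = ExtractedRelation(
    "ACME", "SUPPLIES", "GLOBEX",
    datetime(2024, 1, 15, tzinfo=timezone.utc),
    sentiment=-0.4, confidence=0.9,
)
row = to_graph_row(example)
scores = triple_scores([example], [("ACME", "SUPPLIES", "GLOBEX")])
```

Storing the timestamp, sentiment, and confidence on the relationship rather than on the nodes is one possible choice that keeps each news-derived statement separately queryable by time. Exact-match scoring on triples is a deliberately strict baseline, and a real evaluation would likely also need separate measures for entity linking and sentiment assignment.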

If the topic is worked on as a bachelor's thesis, the scope is reduced: the transformation into a knowledge graph is not included, and the work focuses on the extraction pipeline instead.