Evaluation of Text Extraction Methods in Relation to Privacy Policies

Text-from-HTML extraction tools are designed to remove boilerplate and extract the plain or Markdown text of websites and blogs [1,2]. However, recent experiments on privacy policies showed that, measured against texts extracted manually by humans, only a small fraction of such tools extract the text of scraped privacy policies adequately, and even those suffer from shortcomings and limitations [3].
This thesis aims to develop a specialized, automated text-from-HTML extraction tool for multilingual privacy policies that outperforms the current benchmark [3].
The tool is expected to be written in Python. In addition, a good working knowledge of front-end development is essential for this thesis.
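To give a rough idea of the task, the following is a minimal sketch of boilerplate removal using only the Python standard library. It is not the method of any of the cited tools: the class name `PolicyTextExtractor`, the function `extract_text`, and the tag sets `SKIP_TAGS` and `BLOCK_TAGS` are hypothetical choices made for illustration, and a tool for real privacy policies would need far more robust heuristics.

```python
from html.parser import HTMLParser

# Assumption: these elements are treated as boilerplate and dropped entirely.
SKIP_TAGS = {"script", "style", "noscript", "nav", "header", "footer", "aside"}
# Assumption: these elements delimit separate blocks of text.
BLOCK_TAGS = {"p", "div", "li", "br", "tr", "section", "article",
              "h1", "h2", "h3", "h4", "h5", "h6"}


class PolicyTextExtractor(HTMLParser):
    """Collects visible text block by block, ignoring boilerplate elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside a boilerplate element
        self._buffer = []      # text fragments of the current block
        self.blocks = []       # finished, whitespace-normalized blocks

    def _flush(self):
        # Join the fragments and collapse runs of whitespace.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.blocks.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._flush()
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        # Keep text only when not inside a boilerplate element.
        if self._skip_depth == 0:
            self._buffer.append(data)

    def close(self):
        super().close()
        self._flush()


def extract_text(html: str) -> str:
    """Return the non-boilerplate text of an HTML document, one block per line."""
    parser = PolicyTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.blocks)
```

Calling `extract_text` on a scraped page returns one line per text block, with navigation, footer, and script content removed. Tag-based filtering like this is only a baseline; it does not handle cookie banners, content rendered by JavaScript, or boilerplate placed in generic `div` elements.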


References:

[1] Out-of-the-Box and Into the Ditch? Multilingual Evaluation of Generic Text Extraction Tools
[2] Evaluating scraping and text extraction tools for Python
[3] Unifying Privacy Policy Detection