Extraction of structured information from unstructured documents

Large amounts of valuable information are hidden in unstructured documents (e.g., Web pages, e-mails, office documents). This thesis should start with a survey of information extraction approaches based on parsing domain-specific, restricted vocabularies. Afterwards, a tool should be designed and implemented that allows users (a) to describe the relevant information at a conceptual level and (b) to generate an extractor from that conceptual description. A proof of concept should be developed for the domain of scientific conference announcements (distributed via Web pages and e-mails).
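
To make the two-step idea (conceptual description, then generated extractor) more concrete, the following is a minimal sketch in Python and not a proposed design. It assumes that a concept can be described by a name, a small restricted vocabulary of cue phrases, and a pattern for the expected value. The names Concept, generate_extractor and CONFERENCE_CONCEPTS, the cue phrases, and the sample announcement are all hypothetical, and plain regular-expression matching stands in for the parsing techniques the survey would cover.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class Concept:
    """Hypothetical conceptual description of one piece of information."""
    name: str   # field name in the extracted record
    cue: str    # regex alternation of cue phrases (the restricted vocabulary)
    value: str  # regex describing what the value looks like


def generate_extractor(
    concepts: List[Concept],
) -> Callable[[str], Dict[str, Optional[str]]]:
    """Build an extractor function from a list of concept descriptions."""
    compiled = [
        (c.name,
         re.compile(rf"(?:{c.cue})\s*:?\s*(?P<value>{c.value})", re.IGNORECASE))
        for c in concepts
    ]

    def extract(text: str) -> Dict[str, Optional[str]]:
        record: Dict[str, Optional[str]] = {}
        for name, pattern in compiled:
            match = pattern.search(text)
            # A concept that is not found is reported as None, not guessed.
            record[name] = match.group("value").strip() if match else None
        return record

    return extract


# Assumed date format for illustration only, e.g. "March 15, 2099".
MONTHS = (r"(?:January|February|March|April|May|June|July|August|"
          r"September|October|November|December)")
DATE = rf"{MONTHS}\s+\d{{1,2}},\s+\d{{4}}"

# Hypothetical conceptual description for conference announcements.
CONFERENCE_CONCEPTS = [
    Concept("submission_deadline",
            r"submission deadline|paper submission|papers due", DATE),
    Concept("notification", r"notification(?: of acceptance)?", DATE),
    Concept("location", r"location|venue", r"[^\n]+"),
]

if __name__ == "__main__":
    announcement = (
        "Call for Papers: Example Workshop 2099\n"
        "Location: Vienna, Austria\n"
        "Submission deadline: March 15, 2099\n"
        "Notification of acceptance: May 1, 2099\n"
    )
    extract = generate_extractor(CONFERENCE_CONCEPTS)
    print(extract(announcement))
```

In this sketch the description of what to extract (the list of concepts) is kept separate from how extraction is done (the generated function), so the same generator could in principle be reused for another domain by supplying a different list of concepts.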