Information Extraction

Contact: Danuta Ploch, Michael Meder


The "Information Extraction" cluster develops methods and tools that support information and data services. These methods and tools cover the collection of data from different sources, its enrichment with typed metadata, and the identification of relationships between items. The extracted content can also be validated.

The CC IRML focusses on the tasks of named entity disambiguation and smart spidering of dynamic websites.

Named Entity Disambiguation: This task deals with identifying which person or location is meant when a given name has multiple referents. For example, there are more than 60 places in the world named 'San José', not counting universities, islands or music bands. Finding out which real-world entity a document (or a user search query) refers to is an important step toward truly semantic representations of knowledge.
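
To illustrate the idea, the following is a minimal sketch of context-based disambiguation, not the method used by the CC IRML. It assumes that each candidate entity comes with a short textual description and picks the candidate whose description is most similar (by cosine similarity over word counts) to the text surrounding the mention. The candidate list, the descriptions, and the function names are hypothetical and chosen for illustration only.

```python
import re
from collections import Counter
from math import sqrt

# Hypothetical candidate entities for the ambiguous name "San José",
# each with a short, made-up description used as its context profile.
CANDIDATES = {
    "San José, Costa Rica": "capital city of Costa Rica in Central America",
    "San Jose, California": "city in California United States Silicon Valley technology",
    "San José, Uruguay": "department and city in southern Uruguay",
}


def tokens(text):
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def disambiguate(context, candidates=CANDIDATES):
    """Return the candidate whose description best matches the context."""
    ctx = tokens(context)
    return max(candidates, key=lambda name: cosine(ctx, tokens(candidates[name])))


if __name__ == "__main__":
    print(disambiguate("The startup opened an office in San José, "
                       "close to other Silicon Valley technology firms"))
```

A realistic system would draw its candidates and descriptions from a knowledge base and use richer features than raw word overlap; the sketch only shows how surrounding context can separate referents that share a name.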

Smart Spidering: The “Smart Spider” further automates crawling. In addition to traditional content analysis, the spidering process performs a visual analysis of the web pages. Stable visual or textual structures are identified, and classifiers are trained to map them onto the predefined content types.
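
The mapping from page structures to content types can be sketched as a small classification problem. The example below is only an assumption about how such a step might look, not the Smart Spider's actual implementation: each page block is described by three hypothetical features (relative font size, vertical position on the page, link density), and a nearest-centroid classifier assigns one of three made-up content types. The training data is invented for illustration.

```python
from collections import defaultdict
from math import dist

# Hypothetical training blocks: (relative font size, vertical position 0..1,
# link density 0..1) paired with an invented content type label.
TRAINING = [
    ((2.0, 0.05, 0.0), "title"),
    ((1.8, 0.10, 0.0), "title"),
    ((1.0, 0.50, 0.1), "body"),
    ((1.0, 0.60, 0.0), "body"),
    ((0.8, 0.95, 0.9), "navigation"),
    ((0.9, 0.02, 0.8), "navigation"),
]


def train(examples):
    """Compute one centroid (mean feature vector) per content type."""
    grouped = defaultdict(list)
    for features, label in examples:
        grouped[label].append(features)
    return {
        label: tuple(sum(column) / len(rows) for column in zip(*rows))
        for label, rows in grouped.items()
    }


def classify(centroids, features):
    """Assign the content type whose centroid is closest to the block."""
    return min(centroids, key=lambda label: dist(centroids[label], features))


if __name__ == "__main__":
    centroids = train(TRAINING)
    # A large block near the top of the page without links.
    print(classify(centroids, (1.9, 0.08, 0.0)))
```

Nearest-centroid classification is used here only because it fits in a few lines; any supervised classifier could take its place once stable visual or textual features have been extracted from the pages.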