XSLT-based Web-Content Extraction

Authors: Thomas Strecker, Danuta Ploch, Martin Kurze, Sahin Albayrak

In this paper, we describe the Semantic Contents Acquisition Framework (SCAF), an approach to retrieving content from the web and from other information or data sources. SCAF extends standard XSLT with elements for accessing and storing data. Embedded in a management and storage infrastructure, these extensions provide an entry point for general and personalized harvesting of dedicated sources. In addition to presenting the extensions we have implemented, we report the results of experiments on the automatic retrieval of typed data from semi-structured web sources.
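To make the underlying idea concrete, the following minimal sketch shows how a plain XSLT 1.0 stylesheet can turn a semi-structured page into typed records. It is an illustration under stated assumptions, not SCAF itself: it uses only standard XSLT through the JDK's built-in javax.xml.transform API, and it does not reproduce the access and storage extension elements, which are not specified in this section. The class name, the sample page, the element names `events` and `event`, and all sample values are hypothetical. The sketch also assumes that the page has already been fetched and is well-formed XML.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class XsltExtractSketch {

    // Hypothetical stand-in for a fetched page; assumed to be well-formed XML.
    static final String SAMPLE_PAGE =
        "<html><body><table id='events'>"
      + "<tr><td class='title'>Jazz Night</td><td class='date'>2008-05-17</td></tr>"
      + "<tr><td class='title'>Film Festival</td><td class='date'>2008-06-02</td></tr>"
      + "</table></body></html>";

    // Standard XSLT 1.0 only: one template selects the table rows,
    // a second one maps each row to an <event> record.
    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='/'>"
      + "<events>"
      + "<xsl:apply-templates select=\"//table[@id='events']/tr\"/>"
      + "</events>"
      + "</xsl:template>"
      + "<xsl:template match='tr'>"
      + "<event date=\"{td[@class='date']}\">"
      + "<xsl:value-of select=\"td[@class='title']\"/>"
      + "</event>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Applies the stylesheet to the given page and returns the serialized result.
    static String extract(String page) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
        StringWriter out = new StringWriter();
        transformer.transform(
            new StreamSource(new StringReader(page)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws TransformerException {
        // Prints one <event> element per table row, wrapped in <events>.
        System.out.println(extract(SAMPLE_PAGE));
    }
}
```

In this sketch the fetching of the page and the handling of the result happen in the surrounding Java code. The approach described above instead moves access and storage into the stylesheet through additional elements, so that a single stylesheet can describe the whole harvesting step.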