Identifying Sentence-Level Semantic Content Units with Topic Models

Authors: Leonhard Hennig, Thomas Strecker, Sascha Narr, Ernesto William De Luca, Sahin Albayrak
Source: 21st International Conference on Database and Expert Systems Applications (DEXA '10), 7th International Workshop on Text-based Information Retrieval (TIR '10)

Statistical approaches to document content modelling typically focus either on broad topics or on discourse-level subtopics of a text. We present an analysis of the performance of probabilistic topic models on the task of learning sentence-level topics that are similar to facts. Identifying sentential content with the same meaning is an important task in multi-document summarization and in the evaluation of multi-document summaries. In our approach, each sentence is represented as a distribution over topics, and each topic is a distribution over words. We compare the topic-sentence assignments learnt by a topic model to gold-standard assignments that were manually annotated on a set of closely related pairs of news articles. We observe a clear correspondence between automatically identified and annotated topics. The high accuracy of automatically derived topic-sentence assignments suggests that topic models can be utilized to identify (sub-)sentential semantic content units.
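The setup described in the abstract, where each sentence has a distribution over topics and each topic a distribution over words, can be sketched with an off-the-shelf LDA implementation. This is a minimal illustration, not the authors' implementation: the toy sentences, the choice of scikit-learn's `LatentDirichletAllocation`, and all hyperparameters are assumptions for demonstration only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus (illustrative only): each sentence is treated as a "document"
sentences = [
    "The earthquake struck the coastal city at dawn.",
    "Rescue teams searched the coastal city for survivors.",
    "The president announced a new economic stimulus package.",
    "Lawmakers debated the economic stimulus in parliament.",
]

# Bag-of-words representation: one row per sentence
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(sentences)

# Fit LDA; n_components is an illustrative choice of topic count
lda = LatentDirichletAllocation(n_components=2, random_state=0)

# Each row of doc_topic is a sentence's distribution over topics;
# lda.components_ holds each topic's (unnormalized) distribution over words
doc_topic = lda.fit_transform(X)

# A hard topic-sentence assignment takes the most probable topic per sentence
assignments = doc_topic.argmax(axis=1)
```

In the paper's setting, such topic-sentence assignments would then be compared against manually annotated gold-standard assignments; the hard argmax assignment above is one simple way to obtain a comparable labelling from the soft distributions.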