Technical Reports

Creation of a German Corpus for Internet News Sentiment Analysis

Authors: Florian Bütow, Andreas Lommatzsch, Danuta Ploch

Fully automated sentiment analysis of large text collections is an important task in many applications. It is often solved by applying supervised machine learning algorithms. Powerful sentiment classifiers are learned from annotated datasets, but for many domains and for non-English texts few or no such datasets exist. To support the development of sentiment classifiers for German news articles, we create a new corpus of annotated German news articles related to the Berlin Institute of Technology. Although news articles should be objective, they often evoke subjective emotions. In this paper we describe the process of creating a corpus for news documents and discuss our approach to tracing sentiment values back to their hotspots, defining clear rules for assigning polarity scores, and handling the imbalance of class labels. The created corpus consists of sentences, each labeled as NEUTRAL, POSITIVE, or NEGATIVE. Given the corpus, we train a classifier that yields good classification results and establishes a valuable baseline for sentiment analysis on German news articles.
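The abstract mentions handling the imbalance of class labels but does not say which technique is used. As an illustration only, the following minimal sketch shows one common option, inverse-frequency ("balanced") class weights, which is the same heuristic as scikit-learn's `class_weight="balanced"`. The example sentences, their counts, and the helper name `balanced_class_weights` are hypothetical and are not taken from the corpus or the report.

```python
from collections import Counter

# Hypothetical toy data: the labels mirror the three corpus classes,
# but the sentences and their distribution are invented for illustration.
LABELED_SENTENCES = [
    ("Die Universität eröffnet ein neues Labor.", "NEUTRAL"),
    ("Die Vorlesung beginnt im Oktober.", "NEUTRAL"),
    ("Der Campus liegt in Berlin.", "NEUTRAL"),
    ("Die Forscher gewinnen einen renommierten Preis.", "POSITIVE"),
    ("Das Projekt verliert seine Finanzierung.", "NEGATIVE"),
]


def balanced_class_weights(labels):
    """Return n_samples / (n_classes * class_count) for every class.

    Rare classes receive weights above 1 and frequent classes weights
    below 1, so each class contributes equally to a weighted loss.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in counts.items()
    }


weights = balanced_class_weights([label for _, label in LABELED_SENTENCES])
for label, weight in sorted(weights.items()):
    print(f"{label}: {weight:.3f}")
```

Such weights could be passed to a classifier's loss function so that errors on the rarer POSITIVE and NEGATIVE sentences count more than errors on the dominant NEUTRAL class. Resampling the training set is an alternative with a similar goal.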