Available Data Sets

This page lists datasets we have used in recent publications. All datasets are publicly available; to obtain one, send an email to corpora@dai-labor.de.

GerNED: German NED Dataset

The GerNED corpus contains resources for the evaluation of German-language Named Entity Disambiguation (NED) systems. The task of NED is to link proper names of persons, locations or organizations to a predefined knowledge base or to recognize that a proper name has no corresponding entry in the knowledge base. 

The dataset includes over 2,400 annotated proper names that were found in a large corpus of German-language news articles and linked to a knowledge base derived from Wikipedia. The news articles and the knowledge base (a German Wikipedia dump) are also part of the GerNED corpus.

The full corpus is described and analyzed in: 

GerNED: A German Corpus for Named Entity Disambiguation. Danuta Ploch, Leonhard Hennig, Angelina Duka, Ernesto William De Luca and Sahin Albayrak. In: Proceedings of the 8th International Conference on Language Resources and Evaluation, 2012. [pdf]

plista News Recommendation Dataset

In the context of the ACM RecSys 2013 News Recommender Systems Workshop and Challenge, we released a dataset on 1 July 2013 consisting of 84 million interaction records that plista processed in June 2013 for recommending news articles in real time. 

The full corpus is described and analyzed in: 

The plista Dataset. Benjamin Kille, Frank Hopfgartner, Torben Brodt, Tobias Heintz. In Proc. News Recommendation Workshop and Challenge, ACM ICPS, October 2013.

Delicious

This dataset contains all public bookmarks of about 950,000 users retrieved from delicious.com between December 2007 and April 2008. The retrieval process resulted in about 132 million bookmarks or 420 million tag assignments that were posted between September 2003 and December 2007. No spam filtering was applied. Usernames have been anonymized to protect data privacy. The final dataset is around 7 GB in size (compressed).

The full corpus is described and analyzed in: 

Analyzing social bookmarking systems: A del.icio.us cookbook. Robert Wetzker, Carsten Zimmermann, and Christian Bauckhage. In Mining Social Data (MSoDa) Workshop Proceedings, pp. 26–30. ECAI 2008 (July 2008). [pdf]

The Slashdot Zoo

This dataset represents the social network of the technology news web site slashdot.org. The network contains 78,000 users and 510,000 relationships of the types friend and foe. The dataset was extracted from Slashdot between May 2008 and February 2009 and contains only the giant connected component that includes the user CmdrTaco (Rob Malda, moderator and founder of Slashdot). The relationship types friend and foe correspond to positive and negative endorsements, respectively.
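A network with friend and foe links can be handled as a signed directed graph. The following is a minimal sketch only: the edge list, the user names, and the +1/-1 encoding are hypothetical illustrations and do not reflect the actual file format of the dataset.

```python
from collections import defaultdict

# Hypothetical edge list (not the dataset's real format):
# (source_user, target_user, sign), where +1 = friend and -1 = foe.
edges = [
    ("alice", "bob", +1),
    ("alice", "carol", -1),
    ("bob", "carol", -1),
    ("carol", "alice", +1),
]


def build_signed_graph(edge_list):
    """Store each directed edge as graph[source][target] = sign."""
    graph = defaultdict(dict)
    for source, target, sign in edge_list:
        if sign not in (1, -1):
            raise ValueError(f"sign must be +1 or -1, got {sign!r}")
        graph[source][target] = sign
    return graph


def endorsement_counts(graph):
    """Count incoming friend (positive) and foe (negative) links per user."""
    counts = defaultdict(lambda: {"friend": 0, "foe": 0})
    for targets in graph.values():
        for target, sign in targets.items():
            counts[target]["friend" if sign > 0 else "foe"] += 1
    return counts


graph = build_signed_graph(edges)
counts = endorsement_counts(graph)
print(dict(counts))
```

Keeping the sign on the edge, rather than in two separate graphs, makes it straightforward to compute measures that mix positive and negative endorsements.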

An analysis of the dataset was presented at WWW 2009: 

The Slashdot Zoo: Mining a Social Network with Negative Edges. Jérôme Kunegis, Andreas Lommatzsch, and Christian Bauckhage. In Proceedings of the International Conference on World Wide Web, pp. 741–750, 2009. [pdf]

DUC 2007 Document-Topic Annotations

This data set contains the annotations for 11 document pairs selected from the DUC 2007 multi-document summarization data set. Three annotators identified semantically similar content units (clauses in sentences) in these document pairs. The content units are similar to 'facts'. The goal of the annotation was to identify semantically similar sentence parts that occur in both documents of a document pair, possibly expressing the same fact with different wording or using synonyms. The data set is 250 KB in size and contains only the annotations, not the source documents. For legal reasons, the source documents must be obtained separately from http://www-nlpir.nist.gov/projects/duc/data.html.

A description of the annotation procedure and an analysis of the annotations can be found in:

Identifying Sentence-Level Semantic Content Units with Topic Models. Leonhard Hennig, Thomas Strecker, Sascha Narr, Ernesto William De Luca, Sahin Albayrak. In 7th International Workshop on Text-based Information Retrieval (TIR'10), DEXA 2010. [pdf]

Annotated Twitter Sentiment Dataset

This dataset contains tweets that have each been annotated with sentiment labels by three Mechanical Turk workers. There are 12,597 tweets in four languages: English, German, French and Portuguese.

The labels annotated are: positive, neutral, negative and n/a. 
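Because each tweet carries labels from three workers, users of the data will often want a single label per tweet. The sketch below shows one common option, a strict majority vote. It is an assumption for illustration, not necessarily the aggregation used in the publication, and the function name and input layout are hypothetical.

```python
from collections import Counter

# The four labels listed above.
LABELS = {"positive", "neutral", "negative", "n/a"}


def majority_label(annotations):
    """Return the label chosen by more than half of the annotators.

    Returns None when the list is empty or no label has a strict majority.
    """
    unknown = set(annotations) - LABELS
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    if not annotations:
        return None
    label, votes = Counter(annotations).most_common(1)[0]
    return label if votes > len(annotations) / 2 else None


# Two of three workers agree, so the tweet is labelled "positive".
print(majority_label(["positive", "positive", "neutral"]))
# All three workers disagree, so no majority label is assigned.
print(majority_label(["positive", "neutral", "negative"]))
```

Tweets without a majority label can be dropped or kept for a separate analysis of annotator disagreement, depending on the experiment.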

The dataset is analyzed in:

Language-Independent Twitter Sentiment Analysis. Sascha Narr, Michael Hülfenhaus and Sahin Albayrak. In Knowledge Discovery and Machine Learning (KDML), LWA (2012). [pdf]