Available Data Sets

This page lists datasets we have used in recent publications. All datasets are publicly available on request: to obtain one, send an email to corpora@dai-labor.de.

Delicious

This dataset contains all public bookmarks of about 950,000 users retrieved from http://delicious.com between December 2007 and April 2008. The retrieval process resulted in about 132 million bookmarks, or 420 million tag assignments, that were posted between September 2003 and December 2007. No spam filtering was applied. Usernames have been anonymized to protect data privacy. The final dataset is around 7 GB in size (compressed).

The full corpus is described and analyzed in: 

Analyzing social bookmarking systems: A del.icio.us cookbook. Robert Wetzker, Carsten Zimmermann, and Christian Bauckhage. In Mining Social Data (MSoDa) Workshop Proceedings, pp. 26–30. ECAI 2008 (July 2008).

The Slashdot Zoo

This dataset represents the social network of the technology news web site slashdot.org. The network contains 78,000 users and 510,000 relationships of the types friend and foe. The dataset was extracted from Slashdot between May 2008 and February 2009 and contains only the giant connected component that includes the user CmdrTaco (Rob Malda, moderator and founder of Slashdot). The relationship types friend and foe correspond to positive and negative endorsements, respectively.
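As a rough illustration of how friend and foe relationships can be treated as positive and negative edges, the following sketch builds a small signed graph in Python. The edge list, the user names, and the `(source, target, relation)` layout are made up for this example, and the edges are assumed to be directed; the actual file format of the dataset is not described on this page.

```python
from collections import defaultdict

# Hypothetical sample edges: (source_user, target_user, relation).
# These rows are invented; they are not taken from the dataset.
edges = [
    ("alice", "bob", "friend"),
    ("alice", "carol", "foe"),
    ("bob", "carol", "friend"),
    ("carol", "alice", "foe"),
]

# friend = positive endorsement, foe = negative endorsement
SIGN = {"friend": 1, "foe": -1}


def build_signed_graph(edge_list):
    """Map each source user to {target user: +1 or -1}."""
    graph = defaultdict(dict)
    for source, target, relation in edge_list:
        graph[source][target] = SIGN[relation]
    return graph


def endorsement_counts(graph):
    """Count incoming friend and foe edges for each user."""
    counts = defaultdict(lambda: {"friend": 0, "foe": 0})
    for targets in graph.values():
        for target, sign in targets.items():
            counts[target]["friend" if sign > 0 else "foe"] += 1
    return counts


graph = build_signed_graph(edges)
counts = endorsement_counts(graph)
```

Storing the sign as +1 or -1 keeps the structure compatible with a signed adjacency matrix, which is one common way to analyze networks with negative edges.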

An analysis of the dataset was presented at WWW 2009: 

The Slashdot Zoo: Mining a Social Network with Negative Edges. Jérôme Kunegis, Andreas Lommatzsch, and Christian Bauckhage. In Proceedings of the International Conference on World Wide Web, pp. 741–750, 2009.

DUC 2007 Document-Topic Annotations

This data set contains the annotations for 11 document pairs selected from the DUC 2007 multi-document summarization data set. Three annotators identified semantically similar content units (clauses in sentences) in these document pairs. The content units are similar to 'facts'. The goal of the annotation was to identify semantically similar sentence parts that occur in both documents of a document pair, possibly expressing the same fact with different wording or using synonyms. The data set is 250 KB in size and contains only the annotations, not the source documents. For legal reasons, the source documents must be obtained separately from http://www-nlpir.nist.gov/projects/duc/data.html.

A description of the annotation procedure and an analysis of the annotations can be found in:

Identifying Sentence-Level Semantic Content Units with Topic Models. Leonhard Hennig, Thomas Strecker, Sascha Narr, Ernesto William De Luca, Sahin Albayrak. In 7th International Workshop on Text-based Information Retrieval (TIR'10), DEXA 2010.

Annotated Twitter Sentiment Dataset

This dataset contains tweets that have each been annotated with sentiment labels by three Mechanical Turk workers. There are 12,597 tweets in four languages: English, German, French and Portuguese.

The possible labels are: positive, neutral, negative and n/a.
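Because each tweet carries three labels, users of the dataset will typically need to combine them into one. The sketch below shows a simple majority vote in Python. It is only one possible aggregation scheme and is not necessarily the one used in the publication below; the function name `majority_label` and the choice to return `None` when all three annotators disagree are assumptions made for this example.

```python
from collections import Counter

# The four labels listed above
LABELS = {"positive", "neutral", "negative", "n/a"}


def majority_label(votes):
    """Return the label given by at least two of three annotators.

    Returns None when all three annotators disagree.
    """
    if len(votes) != 3:
        raise ValueError("expected exactly three votes")
    if not set(votes) <= LABELS:
        raise ValueError("unknown label in votes")
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= 2 else None


agreed = majority_label(["positive", "positive", "neutral"])
disputed = majority_label(["positive", "neutral", "negative"])
```

Here `agreed` is `"positive"` and `disputed` is `None`. Depending on the task, tweets without a majority could be dropped or treated as a separate class.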

The dataset is analyzed in:

Language-Independent Twitter Sentiment Analysis. Sascha Narr, Michael Hülfenhaus and Sahin Albayrak. In Knowledge Discovery and Machine Learning (KDML), LWA (2012).