Available datasets

This page lists datasets we have used in recent publications. All datasets are publicly available on request: send an email to corpora(at)dai-labor.de.

Delicious

This dataset contains all public bookmarks of about 950,000 users, retrieved from http://delicious.com between December 2007 and April 2008. The retrieval yielded about 132 million bookmarks, corresponding to 420 million tag assignments, that were posted between September 2003 and December 2007. No spam filtering was applied. Usernames have been anonymized to protect user privacy. The final dataset is about 7 GB in size (compressed).
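
The relation between bookmarks and tag assignments can be illustrated with a minimal Python sketch. It assumes that one bookmark carrying several tags counts as one tag assignment per tag. The record layout (the Bookmark type and its field names) and the sample values are hypothetical and do not describe the actual file format of the corpus.

```python
from typing import NamedTuple, Tuple


class Bookmark(NamedTuple):
    # Hypothetical record layout, not the real corpus format.
    user: str               # anonymized user ID
    url: str                # bookmarked URL
    tags: Tuple[str, ...]   # tags attached to this bookmark


def tag_assignments(bookmarks):
    """Expand each bookmark into one (user, url, tag) triple per tag."""
    return [(b.user, b.url, tag) for b in bookmarks for tag in b.tags]


# Made-up sample data for illustration only.
bookmarks = [
    Bookmark("u1", "http://example.org/a", ("python", "tutorial")),
    Bookmark("u2", "http://example.org/a", ("programming",)),
]
assignments = tag_assignments(bookmarks)

# Average number of tags per bookmark implied by the rounded
# figures quoted above (420 million / 132 million, roughly 3.2).
avg_tags_per_bookmark = 420e6 / 132e6
```

Under this reading, the quoted figures imply roughly three tags per bookmark on average.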

The full corpus is described and analyzed in:

Analyzing social bookmarking systems: A del.icio.us cookbook. Robert Wetzker, Carsten Zimmermann, and Christian Bauckhage. In Mining Social Data (MSoDa) Workshop Proceedings, ECAI 2008, pp. 26–30, July 2008. [pdf]

The Slashdot Zoo

This dataset represents the social network of the technology news web site http://slashdot.org. The network contains 78,000 users and 510,000 relationships of the types friend and foe. The dataset was extracted from Slashdot between May 2008 and February 2009 and contains only the giant connected component that includes the user CmdrTaco (Rob Malda, moderator and founder of Slashdot). The relationship types friend and foe correspond to positive and negative endorsements, respectively.
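
A network with friend and foe links can be modeled as a directed graph with signed edges. The following Python sketch is only an illustration under assumptions: the encoding of friend as +1 and foe as -1, the function name endorsement_counts, and the sample users are all hypothetical and do not reflect the actual file format of the dataset.

```python
from collections import defaultdict

# Hypothetical sample edges: (source_user, target_user, sign).
# +1 marks a "friend" link, -1 marks a "foe" link (assumed encoding).
edges = [
    ("alice", "bob", +1),
    ("alice", "carol", -1),
    ("bob", "carol", +1),
    ("dave", "carol", -1),
]


def endorsement_counts(edges):
    """Count the friend and foe links received by each user."""
    counts = defaultdict(lambda: {"friend": 0, "foe": 0})
    for _source, target, sign in edges:
        if sign > 0:
            counts[target]["friend"] += 1
        elif sign < 0:
            counts[target]["foe"] += 1
        else:
            raise ValueError("edge sign must be nonzero")
    return dict(counts)


counts = endorsement_counts(edges)
```

Keeping the sign on the edge rather than in two separate graphs makes it easy to compare positive and negative endorsements per user.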

An analysis of the dataset was presented at WWW 2009:

The Slashdot Zoo: Mining a Social Network with Negative Edges. Jérôme Kunegis, Andreas Lommatzsch, and Christian Bauckhage. In Proceedings of the International Conference on World Wide Web, pp. 741–750, 2009. [pdf]