Sunday, August 19, 2012


I thought I should start my blog by doing some good for society. Here is a list of classification datasets that I've collected over the last 4 years, which will hopefully be useful to other people.

The format of each line in the dataset is

<class-label><,class-label>* [ feature-id:feature-value]+  # <doc-id>
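For example, a line might look like `earn,acq 3:0.27 17:0.56 # doc42`. A minimal Python parser for this format could look like the following (the function name and the choice to keep labels as strings are mine; convert labels to ints if yours are numeric):

```python
def parse_line(line):
    """Parse one dataset line:  <labels> <feat>:<val> ...  # <doc-id>"""
    body, _, doc_id = line.partition('#')
    tokens = body.split()
    labels = tokens[0].split(',')          # class labels, kept as strings
    features = {}
    for tok in tokens[1:]:
        fid, _, fval = tok.partition(':')
        features[int(fid)] = float(fval)   # feature-id -> feature-value
    return labels, features, doc_id.strip()
```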

If there is an associated hierarchy between the class-labels, the format of each line in the hierarchy file is

<node-id> <#children> [ node-id]+
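Under that format, the hierarchy file can be read into a parent-to-children map with a few lines of Python (a sketch with my own naming; node ids are kept as strings):

```python
def parse_hierarchy(lines):
    """Read hierarchy lines of the form '<node-id> <#children> <child> ...'
    into a {parent: [children]} map."""
    children = {}
    for line in lines:
        parts = line.split()
        if not parts:
            continue                      # skip blank lines
        node, count = parts[0], int(parts[1])
        kids = parts[2:]
        assert len(kids) == count         # sanity-check the declared child count
        children[node] = kids
    return children
```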


#Training Instances 7770
#Testing Instances 3019
#Class labels 90
#Features 18637
Avg #Class labels per Instance 1.23

The dataset is the ApteMod split of Reuters-21578, covering 90 classes. The original dataset had 118 classes, but only classes with at least one training example and one test example were kept. The dataset consists of Reuters newswire articles from 1987. All articles were stop-word removed, stemmed, and weighted with the 'ltc' term weighting scheme.
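For reference, 'ltc' in SMART notation means log term frequency (l), idf (t) and cosine normalization (c). Here is a sketch of how one document would be weighted (my own function; it assumes raw counts and document frequencies are already computed):

```python
import math

def ltc_weights(doc_tf, df, n_docs):
    """SMART 'ltc' weighting for one document.
    doc_tf: {term: raw count in this doc}, df: {term: document frequency},
    n_docs: number of documents in the corpus."""
    w = {t: (1.0 + math.log(tf)) * math.log(n_docs / df[t])   # l * t
         for t, tf in doc_tf.items() if tf > 0}
    norm = math.sqrt(sum(v * v for v in w.values()))          # c: cosine norm
    return {t: v / norm for t, v in w.items()} if norm else w
```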



#Training Instances 46324
#Testing Instances 28926
#Class labels 451
#Features 541869
Avg #Class labels per Instance 1

The dataset was released by the World Intellectual Property Organization (WIPO) to promote research in patent categorization. The class labels have an associated 5-level hierarchy, and patents are assigned to the leaf nodes of the hierarchy. In this version of the dataset, classification is performed at the 4th level by collapsing the patents from the 5th level, which avoids data sparsity issues. The text from all parts of each patent is indexed, stop-word removed, stemmed and term-weighted.
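The collapsing step amounts to mapping each 5th-level label to its 4th-level parent via the hierarchy file. A sketch (the function name and the {parent: children} input format are mine):

```python
def collapse_labels(labels, children):
    """Replace each label with its parent in the hierarchy, e.g. to move
    level-5 labels up to level 4. `children` maps a parent node to its
    child nodes, as read from the hierarchy file."""
    parent = {c: p for p, kids in children.items() for c in kids}
    # labels with no recorded parent are left unchanged
    return sorted({parent.get(l, l) for l in labels})
```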


#Training Instances 5303
#Testing Instances 1326
#Class labels 17
#Features 14601
Avg #Class labels per Instance 1.26

I crawled this dataset from Citeseer myself; the link I crawled it from is no longer working, though. The text of each paper's abstract is indexed, stop-word removed, stemmed and weighted with the 'ltc' term weighting scheme. The ground truth was provided by Citeseer (although I'm not sure how it was generated).


Here is another version of the dataset with finer-grained labels and an associated hierarchy between the class labels.

#Training Instances 6798
#Testing Instances 2265
#Class labels 83
#Features 19912
Avg #Class labels per Instance 1.44



#Training Instances 23149
#Testing Instances 78265
#Class labels 101
#Features 48734
Avg #Class labels per Instance 3.18

The dataset is the successor to Reuters-21578. It consists of English-language Reuters newswire articles from 1996 to 1997. An article can be associated with 3 sets of class-label codes (Topic codes, Region codes and Industry codes), each with an associated hierarchy. This version of the dataset uses the Topic codes as the class labels.


Other datasets

Here are a few other datasets that I've used in the past (mostly from UCI):

Scene Download Source
Yeast Download Source
Emotions Download Source
Image Download Source
Wine Download Source
CLEF Download Source


  1. Your stats for the Reuters ApteMod split differ from those published by Y. Yang in 'A re-examination of text categorization methods'.

    It would probably also be more beneficial to provide the much more popular 'ModApte' split, or simply refer people to one of the official Reuters-21578 pages -

  2. Yes, this is not the true 'ModApte' split, because only classes with at least one training and one testing example were considered. I will try to put up the true ModApte split sometime. Thanks!