Monday, August 20, 2012

Multiclass Classifier with Hadoop

I've been working on large-scale hierarchical classification for the last few months. The 'large-scale' part of it was thankfully handled by the Opencloud Hadoop cluster, which I got access to as a student at CMU. By large-scale I primarily mean a large number of class labels; the data itself must still fit into main memory (for large training sets, Cascade Support Vector Machines are a good alternative).

I ran a few tests using simple one-versus-rest Support Vector Machines and Regularized Logistic Regression on some of the datasets released by LSHTC (I'm not sure if the datasets can be shared here, but they can be downloaded from the site after registration). I am surprised that even such simple classifiers are able to get very good performance, pretty close to some of the best results achieved in the contest.
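For readers unfamiliar with the one-versus-rest setup, here is a minimal sketch of the idea in plain Python. It is not the code from the download and makes no claim about how that code is organized; the function names, the toy data, and the hyperparameters (`l2`, `lr`, `epochs`) are all made up for illustration. The point it shows is that each class gets its own independent binary problem, which is what makes the approach easy to spread over many machines when the number of class labels is large.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def score(model, x):
    # Linear score w.x + b for one binary model.
    w, b = model
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def train_binary(X, y, l2=0.01, lr=0.5, epochs=200):
    # L2-regularized logistic regression fitted by batch gradient descent.
    # X is a list of feature lists, y a list of 0/1 labels.
    # Hyperparameter defaults are illustrative assumptions, not tuned values.
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(score((w, b), xi)) - yi
            for j in range(d):
                gw[j] += err * xi[j]
            gb += err
        # Average gradient plus the L2 penalty on the weights (bias unpenalized).
        w = [wj - lr * (gw[j] / n + l2 * wj) for j, wj in enumerate(w)]
        b -= lr * gb / n
    return w, b

def train_one_vs_rest(X, labels, **kw):
    # One independent binary problem per class: "this class" vs "everything else".
    # Because the problems do not depend on each other, each could be trained
    # as a separate task on a cluster.
    return {c: train_binary(X, [1 if l == c else 0 for l in labels], **kw)
            for c in sorted(set(labels))}

def predict(models, x):
    # Pick the class whose binary model gives the highest score.
    return max(models, key=lambda c: score(models[c], x))

# Toy data: three well-separated classes in two dimensions (hypothetical).
X = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [-1.0, -1.0], [-0.9, -1.1]]
labels = ["a", "a", "b", "b", "c", "c"]

models = train_one_vs_rest(X, labels)
predictions = [predict(models, x) for x in X]
```

Swapping `train_binary` for a linear SVM trainer would give the one-versus-rest SVM variant; the per-class loop and the argmax prediction stay the same.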

I hope to post the results soon. In the meantime, you can download the source code.

