Saturday, December 21, 2013

Feature Hashing

Several months back, I mentioned to Alex Smola the high memory requirements of learning a large number of parameters, especially in multiclass classification when both the number of dimensions and the number of classes are large. He suggested feature hashing as a technique to reduce the number of dimensions, thereby reducing the number of parameters to be learnt. I finally got around to testing how effective feature hashing really is.

TLDR: I tried feature hashing - it works but mostly at the cost of accuracy.
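For readers unfamiliar with the idea, here is a minimal sketch of the hashing trick in plain Python. It is not the code used in my experiments: the function name `hash_features`, the choice of MD5, and the signed update are all assumptions made for illustration. Each feature name is hashed to one of `num_buckets` indices, so the model only needs `num_buckets` parameters per class no matter how many distinct features exist; colliding features share a bucket, which is where accuracy can be lost.

```python
import hashlib


def hash_features(features, num_buckets):
    """Map a sparse {feature_name: value} dict to a dense list of length num_buckets.

    Hypothetical helper for illustration. MD5 is used only because it is a
    stable hash available in the standard library, not for security.
    """
    vec = [0.0] * num_buckets
    for name, value in features.items():
        digest = hashlib.md5(name.encode("utf-8")).digest()
        # The first 8 bytes of the digest pick the bucket.
        index = int.from_bytes(digest[:8], "little") % num_buckets
        # One further bit picks a sign, so that collisions tend to cancel
        # rather than always add up.
        sign = 1.0 if (digest[8] & 1) == 0 else -1.0
        vec[index] += sign * value
    return vec


if __name__ == "__main__":
    doc = {"word=matrix": 2.0, "word=hash": 1.0, "word=logistic": 3.0}
    print(hash_features(doc, num_buckets=8))
```

Shrinking `num_buckets` saves memory but increases collisions, which is the trade-off the TLDR refers to.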

Monday, June 24, 2013

Fast Matrix Multiply and ML

Being an impatient person, I've always tried to make my 'code' run faster. In my experience, the 'core' computational bottleneck in most ML algorithms is one of the following:
  1. (Dense/Sparse) Matrix - Vector product
  2. (Dense/Sparse) Matrix - Dense Matrix product
For example, consider any binary classification or regression task with N examples and dimension P. The computational bottleneck (for training and testing) is the product of the data matrix X [NxP] (sparse or dense depending on the data) and the parameter vector w [Px1]. In any multi-class classification task with K classes, the bottleneck is the product of X and the parameter matrix W [PxK].
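To make the two products concrete, here is a small sketch in plain Python for a sparse X stored in compressed sparse row (CSR) form. The function names and the CSR layout (`indptr`, `indices`, `data`) are my choices for illustration; in practice you would call an optimized BLAS or sparse library rather than write these loops yourself.

```python
def csr_matvec(indptr, indices, data, w):
    """Compute X @ w for X [NxP] in CSR form and w a list of length P."""
    n_rows = len(indptr) - 1
    scores = [0.0] * n_rows
    for i in range(n_rows):
        total = 0.0
        # Only the non-zero entries of row i are visited.
        for k in range(indptr[i], indptr[i + 1]):
            total += data[k] * w[indices[k]]
        scores[i] = total
    return scores


def csr_matmat(indptr, indices, data, W):
    """Compute X @ W for X [NxP] in CSR form and W [PxK] as a list of rows."""
    n_rows = len(indptr) - 1
    n_classes = len(W[0])
    scores = [[0.0] * n_classes for _ in range(n_rows)]
    for i in range(n_rows):
        for k in range(indptr[i], indptr[i + 1]):
            row_of_W = W[indices[k]]
            for c in range(n_classes):
                scores[i][c] += data[k] * row_of_W[c]
    return scores


if __name__ == "__main__":
    # X = [[1, 0, 2],
    #      [0, 3, 0]]
    indptr, indices, data = [0, 2, 3], [0, 2, 1], [1.0, 2.0, 3.0]
    print(csr_matvec(indptr, indices, data, [1.0, 2.0, 3.0]))
    print(csr_matmat(indptr, indices, data, [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]))
```

The cost of the first product is proportional to the number of non-zeros in X, and the second multiplies that by K, which is why the multi-class case becomes expensive quickly.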

Wednesday, March 20, 2013

Distributed Training of Logistic Models

Recently some of our work on training large-scale logistic models got accepted into ICML. Basically, we have a training procedure for regularized Multinomial Logistic Regression (RMLR) with a very large number of multinomial outcomes. Typically, with a large number of outcomes and high dimensions, even holding all the parameters in memory simultaneously might not be possible. Therefore we devise a parallel training procedure for RMLR by replacing the objective with a more 'parallelizable' function. It turns out that optimizing the new 'parallelizable' objective does not change the optimal solution! Here is the paper and the large-scale Hadoop-based code!

Paper      Code 

Friday, February 15, 2013

Code for File-type Identification

Of late I've received many requests to share, and queries about, the code used in our work 'Statistical Learning for file-type Identification' [pdf]. Unfortunately, most of the setup and experiments were run using hacky Python scripts which I no longer have access to. However, I was able to find one tool that uses 2-gram based features from files to identify the file-type. Here is the source code.
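For a rough idea of what 2-gram features over file contents can look like, here is a short Python sketch. It is not the tool linked above: the function name `byte_bigram_features` and the choice to normalize counts into relative frequencies are assumptions for illustration only.

```python
from collections import Counter


def byte_bigram_features(data):
    """Return {(byte1, byte2): relative frequency} for consecutive byte pairs.

    Hypothetical helper: `data` is a bytes object, e.g. the contents of a file
    or a fragment of one.
    """
    # Iterating over bytes yields ints, so each key is a pair of ints in 0..255.
    counts = Counter(zip(data, data[1:]))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {pair: count / total for pair, count in counts.items()}


if __name__ == "__main__":
    features = byte_bigram_features(b"%PDF-1.4 %PDF")
    # Show the most frequent byte pairs first.
    for pair, freq in sorted(features.items(), key=lambda kv: -kv[1])[:3]:
        print(pair, round(freq, 3))
```

The resulting sparse dictionary, with up to 256 x 256 possible keys, can then be fed to any standard classifier.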