Several months back, I was mentioning to Alex Smola about the high memory requirements when dealing with learning number of parameters - especially multiclass classification when the dimensions and number of classes are large. He suggested using Feature hashing as a technique to reduce the number of dimensions - thereby reducing the number of parameters to be learnt. I finally got around to testing how effective feature hashing really is.
TLDR: I tried feature hashing - it works but mostly at the cost of accuracy.
Just to give an idea about the scale of the data that I am interested in, here is a list of datasets with the amount of storage required to store the parameters (using floats).
Dataset | #Classes | #Dims | #Params | Param size (floats) |
---|---|---|---|---|
NEWS20 | 20 | 53,975 | 1,079,500 | 4.3 MB |
LSHTC-small | 1,139 | 51,033 | 58,126,587 | 232 MB |
LSHTC-large | 12,294 | 347,256 | 4,269,165,264 | 17 GB |
Caltech-256 | 256 | 85,000 | 21,760,000 | 87 MB |
IPC | 451 | 541,869 | 244,382,919 | 1 GB |
DMOZ-3 | 11,947 | 348,548 | 4,164,102,956 | 16 GB |
Overall, there are a couple of different ways to reduce the size of the parameters,
1. Efficient Storage (Feature Quantization)
The key idea is that we do not need very high precision to do ML. We can reduce the number of bits by making the representation coarser. In this paper, they use feature quantization in conjunction with a learning algorithm - online gradient descent and show that they can learn parameters efficiently without too much loss.
2. Random Projections
The idea is to use a random matrix (different variants exist) to project the data into a lower dimensional space. Random projections mostly preserve the distances between the documents (or rather there exists a lower dimensional space that preserves most of the distances of points in high dimensional space). Since the distances are approximately preserved, the learning algorithm should still be able to do a good job.
3. Feature Hashing
Pick a parameter B (bucket size) that defines the new feature size. The idea is to hash every feature index in the dataset into a bucket from 1 to B and add the corresponding feature-value into the bucket. This vector of buckets defines the new feature representation for the document.
The authors also suggest replicating the features multiple times and apply the hashing. Procedurally, this can be viewed a simpler form of random projection, where the projection matrix consists of only 0's and 1's and each feature is projected to exactly one lower dimensional space.
Advantages
- Can be done online, i.e. hash on the fly while training.
- Lower memory requirements compared to random projections.
- If hashing is done offline, it could potentially reduce size of training dataset. When features are not replicated, it will mostly decrease.
Disadvantages
- We need an additional 4*C*log(B)*P bits (C is replication factor, B is bucket size, P is dimension, 4 is the size of an integer) to store the hash function during training and testing.
NOTE: Since our goal is reduce size of parameters, it makes no sense to have different hashing functions for different classes - it will only increase memory requirements.
I am not aware of a 'easier' way to store the hash functions - remember we need them to be pairwise independent. - Training time is 'C' times - each feature needs to be processed 'C' additional times. But it should be negligible as long as 'C' is reasonable like 4 or 5.
4. Other methods
- Feature selection Typical Feature 'selection' methods like Document frequency (typically used only in document classification) could work. Other techniques include correlation, mutual information and variants, but however require supervision i.e. class-labels.
- PCA/SVD : Other general dimensionality reduction techniques like PCA, SVD also reduce memory requirements. In my opinion, I would expect PCA or SVD to work better than random projection. In fact, the size of the PCA or SVD projection matrix will be the same as random projection matrix - both will have a matrices of size #original dimension x #lower dimension. However computing PCA/SVD is computationally intensive as opposed to generative a random matrix
TESTS
I did some small tests to see the trade-off between reduction in parameter sizes to the accuracy (Micro-F1 and Macro-F1). The final task is to classify the instances for which I used a one-versus-rest Binary SVM. Note that the regularization parameter is "tuned" on the test set though - I dint want to spend a lot of time doing validation.
I plot two graphs for each dataset, the Y axis denotes the metric (either Micro-F1 or Macro-F1) and the X axis denotes the size of the bucket used for feature hashing.
The dotted red-line indicated the performance with no hashing (the vertical line denotes the size of the original feature space)
NEWS20
LSHTC-small
LSHTC-large
Caltech-256
IPC
DMOZ3-sub
CONCLUSIONS
- Let's look at the loss in performance (Micro-f1) at an order of magnitude (~10x) reduction in feature space: The loss is around 8% in News20, 5% in LSHTC-small, 4% in Caltech256, 2% in DMOZ3-sub, 1% in LSHTC-large, .6% in IPC.
Whether these are in the acceptable range or not, is arguable. In my opinion, I think greater than 1% loss is not ok. So going by this standard, feature hashing sometimes loses performance pretty badly. - It seems to pretty well on two datasets - IPC and DMOZ3-sub. It is able to drastically reduce the number of parameters with hardly any loss in the either metric. May be these datasets are easy ones ? I should try other simpler feature selection techniques like term frequency and see how that works.
My dissertation has a similar topic. If you need help you can find assistance here http://dissertationempire.com/.
ReplyDelete