Monday, June 24, 2013

Fast Matrix Multiply and ML

Being an impatient person, I've always tried to make my code run faster. In my experience, the core computational bottleneck in most ML algorithms is one of the following:
  1. (Dense/Sparse) Matrix - Vector product
  2. (Dense/Sparse) Matrix - Dense Matrix product
For example, consider any binary classification or regression task with N examples of dimension P. The computational bottleneck (for both training and testing) is the product of the data matrix X [NxP] (sparse or dense, depending on the data) and the parameter vector w [Px1]. In any multi-class classification task with K classes, the bottleneck is the product of X and the parameter matrix W [PxK].
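To make the shapes concrete, here is a minimal NumPy sketch of the two products above (the dimensions and the random data are purely illustrative):

```python
import numpy as np

# Illustrative shapes: N examples, P features, K classes
N, P, K = 1000, 50, 4
X = np.random.randn(N, P)   # data matrix [NxP]
w = np.random.randn(P)      # parameter vector [Px1]
W = np.random.randn(P, K)   # parameter matrix [PxK]

scores_binary = X.dot(w)    # [Nx1] - the bottleneck for binary tasks
scores_multi = X.dot(W)     # [NxK] - the bottleneck for K-class tasks
```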

Even in completely unsupervised settings such as clustering, we invariably need to either compute a distance measure between the data points and the cluster means, or calculate some sufficient statistics from the data. In most cases, this can be written efficiently in terms of matrix products. I give two examples - (a) K-means and (b) Multinomial Mixtures.

For K-means, the cluster means are stored in a matrix M [PxK], where K is the number of clusters. The distance matrix D [NxK] holds the squared Euclidean distance between each row of X and each column of M. To compute D,

D = ((X∘X)e_P)e_K^T  -  2XM  +  e_N(e_P^T(M∘M))

where e_d denotes a vector of ones of dimension [dx1] and ∘ denotes the element-wise product. The first and third terms of the RHS can be computed efficiently in O(NxP) and O(PxK), but the middle term costs O(NxPxK), which is the bottleneck. In a Multinomial Mixture, the matrix M holds class probabilities instead of cluster means. In this case, the distance measure is simply the log-probability of an instance given a class, i.e. X log(M), which is again a matrix product.
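The three-term expansion of the distance matrix is easy to check numerically; a small NumPy sketch with synthetic X and M (NumPy broadcasting plays the role of the ones-vector outer products):

```python
import numpy as np

N, P, K = 200, 30, 5
X = np.random.randn(N, P)   # data, one example per row
M = np.random.randn(P, K)   # one cluster mean per column

# Squared Euclidean distances via the expansion:
# row norms [Nx1] broadcast right, -2XM in the middle, column norms [1xK] broadcast down
D = (X**2).sum(axis=1)[:, None] - 2 * X.dot(M) + (M**2).sum(axis=0)[None, :]

# Direct O(N*P*K) computation, for verification only
D_direct = ((X[:, :, None] - M[None, :, :])**2).sum(axis=1)
assert np.allclose(D, D_direct)
```

The middle term, X.dot(M), is exactly the dense matrix product whose speed this post is about.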

So, how do we make ML implementations faster? Matrix multiplication is a very well-studied topic, so it seems all that remains is to choose an efficient matrix multiplication library and plug it in. Unfortunately, choosing the 'right' library turned out to be much harder than I thought. There are tons of implementations of the BLAS interface, and each has a different implementation of the matrix multiplication routine (see the DGEMM function).
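For reference, DGEMM computes C ← alpha*A*B + beta*C for double-precision matrices. A sketch of those semantics in NumPy terms (not an actual BLAS call; the library dispatch happens inside whatever BLAS NumPy is linked against):

```python
import numpy as np

def dgemm_reference(alpha, A, B, beta, C):
    """Reference semantics of BLAS dgemm: C <- alpha*A@B + beta*C."""
    return alpha * A.dot(B) + beta * C

A = np.random.randn(3, 4)
B = np.random.randn(4, 2)
C = np.zeros((3, 2))
out = dgemm_reference(1.0, A, B, 0.0, C)   # alpha=1, beta=0: plain product A*B
```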

I performed a series of dense-dense matrix multiplication experiments on 5 BLAS implementations - ATLAS, the EIGEN matrix template library, Intel's Math Kernel Library (MKL), OpenBLAS, and AMD's Core Math Library (ACML) - so that I could choose the 'right' library. Note that MATLAB internally uses Intel's MKL, although with some overhead.

All the tests were performed on a 48-core machine with 66GB of RAM; the full configuration is given here.

Preliminaries

  1. ATLAS: The entire ATLAS source code was recompiled on the target machine (with the appropriate tuning). Unfortunately, I could not disable CPU throttling, but I doubt that made much difference in performance.
    I compiled two versions - one with multithreading and one without. Since it is not possible to control the number of threads in the multithreaded version without recompiling from scratch, I could run ATLAS only with #cores=1 or #cores=48.
  2. EIGEN: The library is header-only. I defined EIGEN_NO_DEBUG to disable run-time assertions. The single-threaded version was compiled using -DEIGEN_DONT_PARALLELIZE. Obviously, I used EIGEN's native implementation and did not link it to MKL.
  3. MKL: The multi-threaded version was linked to mkl_gnu_thread (and not mkl_intel_thread), and the single-threaded version was linked to mkl_sequential.
  4. ACML: Since ACML is not compatible with gcc, I used Open64 compiler to compile the program.
  5. OpenBLAS: The OpenBLAS source code was recompiled on the target machine to perform appropriate tuning. Unfortunately, there is no native single-threaded build of OpenBLAS, so I mimicked one by setting OMP_NUM_THREADS=1; note that this could potentially lead to slightly lower performance.
The compiler flags were set to "-O4 -msse2 -msse3 -msse4". The number of threads for EIGEN, MKL, ACML, and OpenBLAS was controlled using OMP_NUM_THREADS. All implementations except OpenBLAS also provide a separate single-threaded library.
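The timing loop in each benchmark is straightforward; a sketch of what one run looks like, written here with NumPy for brevity (NumPy dispatches to whatever BLAS it was built against, so this only approximates the C harness linked against each library):

```python
import time
import numpy as np

def time_dgemm(n, runs=5):
    """Average wall-clock time of an n x n double-precision matrix product."""
    A = np.random.randn(n, n)
    B = np.random.randn(n, n)
    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        A.dot(B)                        # dispatches to the BLAS dgemm routine
        times.append(time.perf_counter() - t0)
    return sum(times) / len(times)

# e.g. a 500 x 500 product, averaged over 5 runs
elapsed = time_dgemm(500)
```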

Single Core Timing


I performed a sequence of matrix multiplications on square matrices of increasingly larger dimensions. The time taken by each BLAS implementation (averaged over 5 runs) is shown below,

Matrix-Dimension   ATLAS       EIGEN       MKL         OpenBLAS    ACML
500                0.039407    0.049505    0.037311    0.036360    0.036820
1000               0.306288    0.367053    0.293855    0.284742    0.287390
1500               1.025238    1.236386    0.987354    0.954938    0.960183
2000               2.411613    2.885663    2.338211    2.252574    2.261928
2500               4.706949    5.714484    4.560049    4.410473    4.404316
3000               8.132255    9.766208    7.857467    7.585038    7.586482
3500               12.907831   15.584621   12.511223   12.068317   12.029348
4000               19.239628   22.971250   18.623505   17.991189   17.918091
4500               27.431941   33.320370   26.500993   25.615759   25.623473
5000               37.628072   44.882824   36.376913   35.246075   35.053436

After a few lines of R,
dat = read.table("core1.dat",
                        header = TRUE,
                        row.names = "Matrix-Dimension",
                        check.names = FALSE);
png("core1.png" , width=800 , height=500 );
par(mar=c(5,5,2,2));
barplot(t(as.matrix(dat)),
        beside = TRUE,
        col = c("LightBlue","DeepSkyBlue","SteelBlue",
                "MediumBlue","MidnightBlue") ,
        xlab="Matrix Dimension" ,
        ylab="Time (secs)" ,
        cex.lab=1.5 ,
        cex.axis=1.5 ,
        ylim=c(0,45));
legend("topleft",
       names(dat),
       cex = 1.3,
       bty = "n",
       fill =  c("LightBlue","DeepSkyBlue","SteelBlue",
                 "MediumBlue","MidnightBlue") );
dev.off();
We have a nice graph (the lower the time, the better):




OpenBLAS giving the best speed is a surprise; I would've expected ACML to win hands down because I am testing on an AMD machine. Also, my results seem to contradict EIGEN's own benchmarks - EIGEN performs pretty badly here, and it just does not scale to larger matrices. But apart from EIGEN, there is no human-noticeable difference among OpenBLAS, ACML, and MKL.

Multi Core Timing


In this setting, the implementations are allowed to use multiple cores to perform the matrix-multiplication task. All results are averaged over 5 runs. I show the timing comparisons for number of cores=8, 32 and 48.

    #CORES = 8

Matrix-Dimension   EIGEN       MKL         OpenBLAS    ACML
1000               0.065014    0.043362    0.037348    0.308233
2000               0.400732    0.326562    0.295117    2.356271
3000               1.394491    1.075646    0.984161    7.846398
4000               3.136368    2.516270    2.330076    18.471795
5000               5.844691    4.874892    4.522203    35.977965
6000               10.712579   8.347800    7.795672    61.851248
7000               16.625342   13.215622   12.402947   97.811272
8000               25.063683   19.603807   18.362227   145.382045
9000               34.539940   28.013694   26.256471   208.371312
10000              46.942471   38.141061   36.120998   286.216164


    #CORES = 32

Matrix-Dimension   EIGEN       MKL         OpenBLAS    ACML
1000               0.023529    0.016461    0.010410    0.351531
2000               0.416129    0.112076    0.077994    2.515941
3000               0.944631    0.362674    0.261565    8.274910
4000               1.751315    0.711094    0.621344    19.235383
5000               2.818210    1.321439    1.186354    37.563696
6000               4.270897    2.288235    2.039700    64.253479
7000               6.060917    3.511501    3.230410    101.239577
8000               8.294816    5.135532    4.790252    150.489878
9000               10.775887   7.258660    6.914698    217.521048
10000              14.333916   9.942657    9.327124    295.003328


    #CORES = 48

Matrix-Dimension   ATLAS       EIGEN       MKL         OpenBLAS    ACML
1000               0.043054    0.024524    0.017696    0.007697    0.381892
2000               0.133291    0.201383    0.136436    0.054589    2.639059
3000               0.355704    1.272564    0.433831    0.182858    8.563053
4000               0.799005    1.517470    1.009579    0.484006    19.069397
5000               1.509474    1.217192    1.018603    0.831552    38.072026
6000               2.317285    5.054719    1.589423    1.443241    65.568967
7000               3.727477    4.677896    2.628318    2.240029    102.940630
8000               6.116075    12.265269   3.801652    3.383855    152.986140
9000               8.611308    11.833728   5.121538    4.719928    221.550268
10000              10.627788   10.495585   6.896077    6.657801    299.492960


ACML is out of the running when it comes to multithreaded BLAS - the timings were so bad that I could not even plot them (did I make a mistake?). OpenBLAS seems to offer the best performance, closely followed by MKL. Despite being slower than MKL, ATLAS still does better than EIGEN.

Scaling


I tested how the speed of each implementation increases with the number of cores. The next two graphs show the time taken and the speedup achieved on a 5000x5000 matrix multiplication as the number of cores slowly increases from 1 to 48 (results are again averaged over 5 runs). I do not show the results of ACML and ATLAS - ACML is slow, and ATLAS cannot dynamically change the number of threads.


OpenBLAS is the only implementation that consistently makes use of additional cores. The curves for EIGEN and MKL are jagged - sometimes increasing the number of cores actually hurts their performance. Although counter-intuitive, a closer inspection reveals that MKL performs best when the number of cores is a multiple of 4; its performance degrades, or fails to improve, otherwise - something to note. I could not find a similar pattern for EIGEN though.

An equally informative (and more popular) metric is speedup: how much faster the multi-threaded version is compared to the single-threaded one.
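As a sanity check, speedup is just single-threaded time divided by multi-threaded time. Plugging in the OpenBLAS timings for the 5000x5000 case from the tables above:

```python
# Speedup = T_1 / T_n, using the OpenBLAS 5000x5000 rows from the tables above
t_single = 35.246075            # seconds on 1 core
t_multi = 0.831552              # seconds on 48 cores
speedup = t_single / t_multi    # roughly 42x
efficiency = speedup / 48       # fraction of perfect linear scaling
```

A speedup of roughly 42x on 48 cores (about 88% parallel efficiency) is what "pretty close to perfect parallelization" means concretely.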


The dotted red line shows the optimal performance with perfect parallelization. OpenBLAS does come pretty close to that.

CONCLUSION


OpenBLAS seems to be the best library (at least on the machine I ran the tests on) for dense-dense matrix multiplication. It scales well both with the number of cores and with the size of the matrices.

For multi-threaded applications,

OpenBLAS > MKL > ATLAS >> EIGEN > ACML

For single-threaded applications,

OpenBLAS ≈ ACML  > MKL > ATLAS >> EIGEN

Maybe next time I will post some comparisons on sparse-dense matrix multiplication.

SOURCE CODE


I've attached the source code for the programs and the plotting scripts here, as well as on GitHub.

3 comments:

  1. This comment has been removed by the author.

  2. I ran similar tests many many years ago (GotoBLAS times), with similar results. Except MKL really shone above all others in single precision operations, for some reason.

    Would be interesting to see the benchmarks on single precision (sgemm) too.

    (the link to configuration file says "No content to display" btw)

  3. What Eigen loses in speed, it gains in being a whole vector-matrix package, though, with lots of native C++ goodness.
