Being an impatient person, I've always tried to make my code run faster. In my experience, the core computational bottleneck in most ML algorithms is one of the following:

- (Dense/Sparse) Matrix - Vector product

- (Dense/Sparse) Matrix - Dense Matrix product

For example, consider any binary classification or regression task with **N** examples and dimension **P**. The computational bottleneck (for training and testing) is the product of the data matrix **X** [**N** x **P**] (sparse or dense depending on the data) and the parameter vector **w** [**P** x **1**]. In any multi-class classification task with **K** classes, the bottleneck is the product of **X** and the parameter matrix **W** [**P** x **K**].
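To make the shapes concrete, here is a minimal NumPy sketch of the two products for the dense case. The sizes and the random data are placeholders I picked for illustration, and the variable names simply mirror the symbols above. For sparse data, a `scipy.sparse` matrix could stand in for `X` with the same `@` operator.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only (assumed, not taken from any real dataset).
N, P, K = 1000, 50, 5

X = rng.standard_normal((N, P))   # data matrix, N x P (dense here)
w = rng.standard_normal((P, 1))   # parameter vector, P x 1
W = rng.standard_normal((P, K))   # parameter matrix, P x K

# Binary classification / regression: matrix-vector product.
scores = X @ w                    # shape (N, 1)

# Multi-class classification: matrix-matrix product.
class_scores = X @ W              # shape (N, K)

print(scores.shape, class_scores.shape)
```

Each column of `X @ W` is itself a matrix-vector product with one column of `W`, so the multi-class case amounts to K of the binary-case products computed in one call.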