Speeding Up Scikit-learn Model Training
If your scikit-learn models are taking longer than you'd like to train, there are several techniques you can use to make training more efficient. From optimizing your model's configuration to leveraging libraries that speed up training through parallelization, you can build the best scikit-learn model possible in the least amount of time.
By Michael Galarnyk, Developer Relations at Anyscale.
Scikit-learn is an easy-to-use Python library for machine learning. However, sometimes scikit-learn models can take a long time to train. The question becomes: how do you create the best scikit-learn model in the least amount of time? There are a few approaches to solving this problem, such as:
 Changing your optimization function (solver).
 Using different hyperparameter optimization techniques (grid search, random search, early stopping).
 Parallelizing or distributing your training with joblib and Ray.
This post gives an overview of each approach, discusses some limitations, and offers resources to speed up your machine learning workflow!
Changing your optimization algorithm (solver)
Some solvers can take longer to converge. Image from Gaël Varoquaux’s talk.
Better algorithms allow you to make better use of the same hardware. With a more efficient algorithm, you can produce an optimal model faster. One way to do this is to change your optimization algorithm (solver). For example, scikit-learn's logistic regression allows you to choose between solvers like ‘newton-cg’, ‘lbfgs’, ‘liblinear’, ‘sag’, and ‘saga’.
To understand how different solvers work, I encourage you to watch a talk by scikit-learn core contributor Gaël Varoquaux. To paraphrase part of his talk, a full gradient algorithm (liblinear) converges rapidly, but each iteration (shown as a white +) can be prohibitively costly because it requires you to use all of the data. In a subsampled approach, each iteration is cheap to compute, but it can converge much more slowly. Some algorithms like ‘saga’ achieve the best of both worlds. Each iteration is cheap to compute, and the algorithm converges rapidly because of a variance reduction technique. It is important to note that quick convergence doesn’t always matter in practice, and different solvers suit different problems.
Choosing the right solver for a problem can save a lot of time (code example).
To determine which solver is right for your problem, you can check out the documentation to learn more.
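As a rough illustration, the sketch below times a few of logistic regression's solvers on the same synthetic data (the dataset and sizes are made up for the example; the fastest solver will depend on your problem):

```python
import time

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic classification data; sizes chosen only for illustration.
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

# Time each solver on the same problem; the best choice is data-dependent.
for solver in ("liblinear", "lbfgs", "saga"):
    clf = LogisticRegression(solver=solver, max_iter=1000, random_state=0)
    start = time.perf_counter()
    clf.fit(X, y)
    print(f"{solver}: {time.perf_counter() - start:.3f}s, "
          f"train accuracy = {clf.score(X, y):.3f}")
```

Timing a few candidate solvers on a subsample of your data like this is a cheap way to pick one before committing to a long training run.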
Different hyperparameter optimization techniques (grid search, random search, early stopping)
To achieve high performance with most scikit-learn algorithms, you need to tune a model’s hyperparameters. Hyperparameters are the parameters of a model that are not updated during training; they configure the model or the training process. Scikit-learn natively contains a couple of techniques for hyperparameter tuning, like grid search (GridSearchCV), which exhaustively considers all parameter combinations, and randomized search (RandomizedSearchCV), which samples a given number of candidates from a parameter space with a specified distribution. Recently, scikit-learn added the experimental hyperparameter search estimators halving grid search (HalvingGridSearchCV) and halving random search (HalvingRandomSearchCV).
Successive halving is an experimental new feature in scikit-learn version 0.24.1 (January 2021). Image from the documentation.
These techniques can be used to search the parameter space using successive halving. The image above shows that all hyperparameter candidates are evaluated with a small number of resources at the first iteration, and the more promising candidates are selected and given more resources during each successive iteration.
While these new techniques are exciting, there is also a library called tune-sklearn that provides cutting-edge hyperparameter tuning techniques (Bayesian optimization, early stopping, and distributed execution) and can deliver significant speedups over grid search and random search.
Early stopping in action. Hyperparameter set 2 is a set of unpromising hyperparameters that would be detected by tune-sklearn’s early stopping mechanisms and stopped early to avoid wasting time and resources. Image from GridSearchCV 2.0.
Features of tune-sklearn include:
 Consistency with the scikit-learn API: You usually only need to change a couple of lines of code to use tune-sklearn (example).
 Accessibility to modern hyperparameter tuning techniques: It is easy to change your code to utilize techniques like Bayesian optimization, early stopping, and distributed execution.
 Framework support: There is not only support for scikit-learn models, but also for other scikit-learn wrappers such as Skorch (PyTorch), KerasClassifier (Keras), and XGBoost classifiers (XGBoost).
 Scalability: The library leverages Ray Tune, a library for distributed hyperparameter tuning, to efficiently and transparently parallelize crossvalidation on multiple cores and even multiple machines.
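To illustrate the "couple of lines" claim, the sketch below runs a standard scikit-learn GridSearchCV and notes in comments the swap to tune-sklearn's advertised drop-in replacement (the TuneGridSearchCV lines follow tune-sklearn's documented API and are not executed here, since they assume tune-sklearn is installed):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
params = {"alpha": [1e-4, 1e-3, 1e-2]}

# Standard scikit-learn grid search:
search = GridSearchCV(SGDClassifier(random_state=0), params, cv=3)
# The tune-sklearn version is meant to be a drop-in replacement, e.g.:
#   from tune_sklearn import TuneGridSearchCV
#   search = TuneGridSearchCV(SGDClassifier(random_state=0), params,
#                             early_stopping=True, max_iters=10)
search.fit(X, y)
print(search.best_params_)
```

Because the interface stays the same, the rest of your pipeline (fit, predict, best_params_) is unchanged after the swap.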
Perhaps most importantly, tune-sklearn is fast, as you can see in the image below.
You can see significant performance differences on an average laptop using tune-sklearn. Image from GridSearchCV 2.0.
If you would like to learn more about tune-sklearn, you should check out this blog post.
Parallelize or distribute your training with joblib and Ray
Resources (dark blue) that scikit-learn can utilize for single-core (A), multi-core (B), and multi-node training (C).
Another way to increase your model-building speed is to parallelize or distribute your training with joblib and Ray. By default, scikit-learn trains a model using a single core, even though virtually all computers today have multiple cores.
For the purpose of this blog, you can think of the MacBook above as a single node with 4 cores.
Consequently, there are many opportunities to speed up the training of your model by utilizing all the cores on your computer. This is especially true if your model has a high degree of parallelism, like a random decision forest.
A random decision forest is an easy type of model to parallelize as each decision tree is independent of the others.
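Since the trees are independent of one another, scikit-learn can fit them concurrently via the estimator's `n_jobs` parameter (the dataset below is synthetic, for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# n_jobs=-1 fits the independent trees on all available cores.
clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
clf.fit(X, y)
print(clf.score(X, y))
```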
Scikit-learn can parallelize training on a single node with joblib, which by default uses the ‘loky’ backend. Joblib also allows you to choose between backends like ‘loky’, ‘multiprocessing’, ‘dask’, and ‘ray’. This is a great feature, as the ‘loky’ backend is optimized for a single node, not for running distributed (multi-node) applications. Running distributed applications can introduce a host of complexities like:
 Scheduling tasks across multiple machines.
 Transferring data efficiently.
 Recovering from machine failures.
Fortunately, the ‘ray’ backend can handle these details for you, keep things simple, and give you better performance. The image below shows the normalized speedup in terms of the execution time of Ray, Multiprocessing, and Dask relative to the default ‘loky’ backend.
The performance was measured on one, five, and ten m5.8xlarge nodes with 32 cores each. The performance of Loky and Multiprocessing does not depend on the number of machines because they run on a single machine. Image source.
If you would like to learn about how to quickly parallelize or distribute your scikitlearn training, you can check out this blog post.
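A minimal sketch of switching joblib backends is shown below; the Ray lines are commented out because they assume a Ray installation, and follow Ray's documented `register_ray` helper:

```python
import joblib

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)

# Default single-node parallelism via the 'loky' backend:
with joblib.parallel_backend("loky"):
    clf.fit(X, y)

# To distribute across a Ray cluster instead (requires `pip install ray`):
#   from ray.util.joblib import register_ray
#   register_ray()
#   with joblib.parallel_backend("ray"):
#       clf.fit(X, y)
print(clf.score(X, y))
```

Because the backend is selected by a context manager, the training code itself does not change when you move from one machine to a cluster.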
Conclusion
This post went over a couple of ways you can build the best scikit-learn model possible in the least amount of time. Some are native to scikit-learn, like changing your optimization function (solver) or utilizing the experimental hyperparameter optimization techniques HalvingGridSearchCV and HalvingRandomSearchCV. You can also use plug-in libraries like tune-sklearn and Ray to further speed up your model building. If you have any questions or thoughts about tune-sklearn and Ray, please feel free to join our community through Discourse or Slack.
Original. Reposted with permission.
Bio: Michael Galarnyk works in Developer Relations at Anyscale. You can find him on Twitter, Medium, and GitHub.