Running Python in Parallel for Data Science

Most computers today are multi-core (two or more processors in one package), some with multiple physical CPUs. One of the major limitations of Python is that it uses one core by default. (It was created at a time when single cores were the norm.)
Data science projects require a lot of computation. In particular, the scientific side of data science relies on repeated tests and experiments on different data matrices. Working with huge amounts of data also means that the most time-consuming transformations repeat observation after observation (for example, identical and independent operations on different parts of a matrix).
Using more CPU cores speeds up a computation by a factor that, at best, matches the number of cores. For example, four cores can work at most four times faster. In practice you do not get the full fourfold gain because starting a parallel process has an overhead: new Python instances must be set up with the correct in-memory information and launched. The improvement is therefore smaller than the theoretical maximum, but it is still significant.
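The effect of start-up overhead on the achievable speedup can be sketched with a toy model. The function below is hypothetical, not part of any library: it assumes the work divides evenly across cores and that launching the workers costs a fixed number of seconds, which is a simplification.

```python
def estimated_speedup(serial_seconds, n_cores, overhead_seconds):
    """Toy model: parallel time = fixed start-up overhead + evenly split work."""
    parallel_seconds = overhead_seconds + serial_seconds / n_cores
    return serial_seconds / parallel_seconds


# Hypothetical numbers chosen only for illustration.
print(estimated_speedup(100.0, 4, 0.0))  # no overhead: the ideal factor of 4
print(estimated_speedup(100.0, 4, 5.0))  # 5 s of overhead: noticeably below 4
```

In this model the fixed overhead matters less as the serial running time grows, which is why larger jobs tend to benefit more from extra cores.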
So knowing how to use more than one CPU is an advanced but incredibly useful skill for increasing the number of analyses you complete and for speeding up your operations, both when you set up your data products and when you use them.
Multiprocessing works by replicating the same code and memory content in several new Python instances (the workers), computing the result in each of them, and returning the pooled results to the original main console. If the original instance already occupies much of the available RAM, it may not be possible to create new instances, and your machine may run out of memory.
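The worker model described above can be sketched with the standard library's multiprocessing module. The function names and the toy matrix are illustrative assumptions; Scikit-learn itself relies on Joblib rather than on code like this.

```python
from multiprocessing import Pool


def row_sum(row):
    """Work done independently for each observation (here, a simple sum)."""
    return sum(row)


def parallel_row_sums(matrix, n_workers=2):
    """Send the rows to a pool of worker processes and collect the results."""
    with Pool(processes=n_workers) as pool:
        # map() distributes the rows among workers and keeps the input order
        return pool.map(row_sum, matrix)


if __name__ == '__main__':
    # Toy matrix standing in for a real dataset
    data = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    print(parallel_row_sums(data))  # [6, 15, 24]
```

Each worker receives its own copy of the data it needs, which is the memory cost mentioned above: the larger the objects sent to the workers, the more RAM the whole run requires.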
 
Performing multi-core parallelism
To perform multi-core parallelism with Python, you can combine the Scikit-learn package with the Joblib package for time-consuming operations, such as replicating models to validate results or searching for the best hyperparameters. In particular, Scikit-learn allows multiprocessing when:
Cross-validating: testing the results of a machine-learning hypothesis using different training and testing data
Grid searching: systematically changing the hyperparameters of a machine-learning hypothesis and testing the consequent results
Multi-label prediction: running an algorithm multiple times against multiple targets when there are many different target outcomes to predict at the same time
Ensemble machine-learning methods: modeling a large set of classifiers, each one independent of the others, as when using RandomForest-based modeling
You don't have to do anything special to take advantage of parallel computation. You activate parallelism by setting the n_jobs parameter to a number of cores greater than 1, or by setting it to -1, which means you want to use all the available CPU cores.
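As an illustration of the n_jobs parameter outside cross-validation, here is a minimal grid-search sketch. The helper name run_grid_search, the tiny parameter grid, and the choice of three folds are assumptions made to keep the example quick; they are not taken from the article.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC


def run_grid_search(n_jobs=-1):
    """Grid-search the SVC penalty C on the digits data using n_jobs workers."""
    X, y = load_digits(return_X_y=True)
    # A deliberately tiny grid and only 3 folds, to keep the sketch quick
    search = GridSearchCV(SVC(), param_grid={'C': [0.1, 1.0, 10.0]},
                          cv=3, n_jobs=n_jobs)
    search.fit(X, y)
    return search


if __name__ == '__main__':
    result = run_grid_search(n_jobs=-1)  # -1 asks for all available cores
    print(result.best_params_, result.best_score_)
```

Changing n_jobs should affect only how long the search takes, not which parameters it selects.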
If you are not running your code from the console or from the IPython Notebook, it is extremely important that you separate your code from any package import or global variable assignment in your script by placing the if __name__ == '__main__': statement at the beginning of any code that executes multi-core parallelism. The if statement checks whether the program is run directly or is being loaded by an already running Python process, which avoids confusion or errors in the parallel workers (such as recursively launching the parallel computation).
 
Demonstrating multiprocessing
It's a good idea to use IPython when you run a demonstration of how multiprocessing can really save you time during data science projects. IPython provides the %timeit magic command for timing execution. You start by loading a multi-class dataset and a complex machine-learning algorithm (the Support Vector Classifier, or SVC), and then run a cross-validation to obtain reliable estimates of the results.
The most important thing to know is that the workload becomes quite large: the SVC has to separate 10 classes, which requires fitting several models, and cross-validation repeats that whole fit once per fold (20 folds with cv=20 in the code that follows).
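from sklearn.datasets import load_digits
from sklearn.svm import SVC
# cross_val_score lives in sklearn.model_selection; the old
# sklearn.cross_validation module was removed in scikit-learn 0.20
from sklearn.model_selection import cross_val_score
digits = load_digits()
X, y = digits.data, digits.target
%timeit single_core_learning = cross_val_score(SVC(), X, y, cv=20, n_jobs=1)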
Out [1] : 1 loops, best of 3: 17.9 s per loop
After this test, you need to activate the multicore parallelism and time the results using the following commands:
%timeit multi_core_learning = cross_val_score(SVC(), X, y, cv=20, n_jobs=-1)
Out [2] : 1 loops, best of 3: 11.7 s per loop
The example machine shows a clear advantage from multi-core processing, even though the dataset is small and Python spends much of the time starting new instances and running part of the code in each one. This overhead of a few seconds is significant when the total execution lasts only a handful of seconds. With larger datasets the benefit grows: your execution time could easily be cut by a factor of two or three.
Although the code works fine in IPython, putting it in a script and running it from a console or an IDE may cause errors because of the internal operations of a multi-core task. The solution is to put all the code under an if statement that checks whether the program was started directly and was not called afterward. Here's an example script:
from sklearn.datasets import load_digits
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Run the parallel code only when the script is started directly,
# not when a worker process loads it.
if __name__ == '__main__':
    digits = load_digits()
    X, y = digits.data, digits.target
    multi_core_learning = cross_val_score(SVC(), X, y,
                                          cv=20, n_jobs=-1)
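The %timeit command is available only in IPython, so a plain script needs another way to compare the two runs. The following is a minimal sketch, assuming time.perf_counter from the standard library is precise enough for the purpose; timed_cross_val is a hypothetical helper name, and a single timed run is noisier than the best-of-three figure that %timeit reports.

```python
import time

from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC


def timed_cross_val(X, y, n_jobs, cv=20):
    """Return (scores, elapsed_seconds) for one cross-validation run."""
    start = time.perf_counter()
    scores = cross_val_score(SVC(), X, y, cv=cv, n_jobs=n_jobs)
    return scores, time.perf_counter() - start


if __name__ == '__main__':
    X, y = load_digits(return_X_y=True)
    _, single = timed_cross_val(X, y, n_jobs=1)
    _, multi = timed_cross_val(X, y, n_jobs=-1)
    print(f'1 core: {single:.1f} s, all cores: {multi:.1f} s, '
          f'speedup: {single / multi:.2f}x')
```

The timings depend on the machine and the Scikit-learn version, so the numbers printed will differ from those shown earlier.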