Ben Levy and Jacob Gildenblat, SagivTech

PyTorch is an incredible deep learning framework for Python. It makes prototyping and debugging deep learning algorithms easier, and has great support for multi-GPU training.
However, as always with Python, you need to be careful to avoid writing low-performing code.
This is especially important in deep learning, where you're spending money on all those GPUs.
Speeding up your training code can thus have the same effect as buying many expensive GPUs.

In this post we will share a few lessons we learned while getting our PyTorch training code to run faster.

Data-loading and pre-processing

PyTorch offers a DataLoader class for loading images in batches, and supports prefetching the batches using multiple workers.
Prefetching means that while the GPU is crunching, other workers are loading the next data from disk, so the I/O-bound latency is hidden behind the GPU computation.

PyTorch lets you write your own custom data loading and augmentation objects, and then handles the parallel loading for you through DataLoader.
Loading and augmenting the data in parallel, while the training forward/backward passes run on the GPU, is crucial for a fast training loop.
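With the built-in machinery, a minimal sketch might look like this (ImageDataset, load_image, image_paths and transform are placeholder names, not code from the original post):

    import torch.utils.data

    class ImageDataset(torch.utils.data.Dataset):
        # Hypothetical dataset: loads and pre-processes one image per index.
        def __init__(self, image_paths, transform):
            self.image_paths = image_paths
            self.transform = transform

        def __len__(self):
            return len(self.image_paths)

        def __getitem__(self, index):
            image = load_image(self.image_paths[index])  # placeholder I/O function
            return self.transform(image)

    # The workers load and augment the next batches in the background
    # while the GPU is busy with the current forward/backward pass.
    loader = torch.utils.data.DataLoader(ImageDataset(image_paths, transform),
                                         batch_size=32, shuffle=True,
                                         num_workers=4, pin_memory=True)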
To understand how this works under the hood, we will look at our own small version of this.

Data Loading

Let's create a data generator where every input image is loaded and pre-processed in parallel by different threads.
The threads read the data and push batches onto a shared queue.
Since the threads iterate over a generator that fetches the data from disk, which is a common pattern, we need to take extra care to make the generator thread-safe.
We also need to be careful with random operations in the pre-processing step when it runs in multiple threads (such as applying operations stochastically or adding random noise). Storing numpy.random.RandomState objects in Python thread-local data, so that each thread gets a random number generator with its own seed, comes in handy here.
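A minimal sketch of both ideas, the locking wrapper around the generator and the per-thread RandomState, might look like this (the class and function names are ours, not from the original code):

    import threading
    import numpy as np

    class ThreadSafeIterator:
        # Wraps a generator so that several threads can pull items without racing.
        def __init__(self, iterator):
            self.iterator = iterator
            self.lock = threading.Lock()

        def __iter__(self):
            return self

        def __next__(self):
            with self.lock:
                return next(self.iterator)

    _thread_local = threading.local()

    def thread_rng():
        # One RandomState per pre-processing thread, each with its own seed,
        # so stochastic augmentations don't share (or fight over) global state.
        if not hasattr(_thread_local, 'rng'):
            seed = threading.current_thread().ident % 2**32
            _thread_local.rng = np.random.RandomState(seed)
        return _thread_local.rng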

Now to demonstrate the usage of the loader, here is an example training loop:
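A minimal sketch of such a loop (PytorchNetwork, train_batch, make_batch_generator and num_batches are placeholder names, not the original code) might look like this:

    import queue
    import threading

    images_queue = queue.Queue(maxsize=12)  # buffer of pre-processed input data in host memory
    cuda_queue = queue.Queue(maxsize=1)     # batches already copied to the GPU
    stop_event = threading.Event()

    def preprocess_worker(batch_generator):
        # Several of these threads read from disk, pre-process and push batches.
        for inputs, labels in batch_generator:
            if stop_event.is_set():
                break
            images_queue.put((inputs, labels))

    def cuda_worker():
        # A single thread moves batches from host memory to GPU memory.
        while not stop_event.is_set():
            inputs, labels = images_queue.get()
            cuda_queue.put((inputs.cuda(), labels.cuda()))

    model = PytorchNetwork().cuda()  # hypothetical model wrapper
    for _ in range(4):
        threading.Thread(target=preprocess_worker,
                         args=(make_batch_generator(),), daemon=True).start()
    threading.Thread(target=cuda_worker, daemon=True).start()

    for step in range(num_batches):
        inputs, labels = cuda_queue.get()   # already on the GPU
        model.train_batch(inputs, labels)   # one optimization step

    stop_event.set()  # signal all worker threads to terminate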

Our sample code works as follows:

  • Init of the model on the GPU.
  • Init of two queues:
    – Input images queue: responsible for buffering up to 12 pre-processed input images at a time over the program's lifetime, filled by 4 different threads.
    – CUDA images queue: responsible for transferring input images from the input images queue to GPU memory, using 1 additional thread.
  • Training loop, where we fetch an input images batch and feed it to our “PytorchNetwork.train_batch” method to perform an optimization step.
  • Resource termination, where we signal all threads to terminate.

As a consequence, we see that the scheme above succeeds in completely hiding the expensive overhead of host-to-GPU memory transfers, I/O, and pre-processing.

Some other lessons we learned along the way

Use cProfile to measure run times for everything.
It's as easy as this:
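For example (train and the output file name are placeholders for your own entry point):

    import cProfile

    # Run one training session under the profiler and dump the statistics to a file.
    cProfile.run('train()', 'train.prof')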

The generated .prof file can be visualized with the excellent snakeviz utility.
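Assuming the profile was written to train.prof as in the sketch above, the command is simply:

    snakeviz train.prof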

The output looks like this:

In the above figure (from a run of 2 training epochs, 100 batches in total) we see that our main training function (train_batch) consumes 82% of the training time, spent in PyTorch's primitive building blocks: adam.py (the optimizer), the network forward/backward passes, and the backward pass of the loss autograd variable. The rest of the time goes to session initialization, sleeps, and the validation accuracy (7%), which is shown in the code samples but computed only at the end of every epoch.
Notice how no time is spent on data-loading / preprocessing!

Do numpy-like operations on the GPU wherever you can

PyTorch tensors can do many of the things NumPy arrays can do, but on the GPU.
We had a lot of operations, like argmax, that were being done in NumPy on the CPU.
These innocent-looking operations add up when they are performed on every batch of data.
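For example, turning a batch of scores into predicted class indices (here outputs is assumed to be a tensor that already lives on the GPU):

    import numpy as np
    import torch

    # Slow: copies the whole batch to the CPU just to take an argmax.
    predictions = np.argmax(outputs.cpu().numpy(), axis=1)

    # Fast: stays on the GPU; max() over dim 1 returns (values, indices).
    _, predictions = torch.max(outputs, 1)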

Free up memory using del

This is a common pitfall for new PyTorch users, and we think it isn't documented well enough.
After you're done with a PyTorch tensor or variable, delete it using Python's del statement to free up its memory.
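For example, inside a training loop (the names here are illustrative):

    for inputs, labels in loader:
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # Without this, outputs and loss (and everything they keep alive)
        # stay in memory until the next iteration rebinds the names.
        del outputs, loss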

Avoid unnecessary transfer of data from the GPU

CUDA copies are expensive.
It turned out that a lot of our CUDA copies were for batch statistics: the loss, accuracy, and other data.
Instead of displaying the loss and other metrics for every batch, aggregate them on the GPU and copy them to the CPU for display at the end of every epoch.
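A minimal sketch of that pattern (loader is the data loader from before, and train_batch is a hypothetical helper that returns the batch loss as a GPU tensor):

    import torch

    epoch_loss = torch.zeros(1).cuda()        # accumulator lives on the GPU
    for step, (inputs, labels) in enumerate(loader):
        loss = train_batch(inputs, labels)
        epoch_loss += loss.data               # no host-device copy here

    # A single device-to-host copy per epoch instead of one per batch.
    print('mean loss:', epoch_loss.cpu() / (step + 1))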

Avoid using the PyTorch DataParallel layer in parallel with Tensor.cuda()

The DataParallel layer is used for distributing computation across multiple GPUs or CPUs.
Empirically, using the PyTorch DataParallel layer in parallel with calls to Tensor.cuda(), as in the threaded CUDA queue loop shown above, yielded wrong training results, probably because the feature was still immature as of PyTorch version 0.1.12_2. While profiling the code, the timing relations between the two had already hinted at a problem.

Use pinned memory, and use async=True to parallelize data transfer and GPU number crunching

Let’s look at the next code snippet:
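A minimal reconstruction of the pattern being described (the names are illustrative; in PyTorch 0.4 and later the argument is spelled non_blocking=True rather than async=True):

    batch_inputs = batch_inputs.cuda()                          # line 1
    batch_labels = batch_labels.pin_memory().cuda(async=True)   # line 2: asynchronous copy from pinned memory
    outputs = model(batch_inputs)                               # line 3: forward pass, overlaps with the copy
    loss = criterion(outputs, batch_labels)                     # line 4: first place the labels are needed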

batch_labels isn't actually needed until line 4.
By moving it to pinned memory and making an asynchronous copy to the GPU, the data copy adds no latency, since it happens during line 3 (the model forward pass).

In this post we shared a few lessons we learned about making PyTorch training code run faster. We invite you to share your own!

Jacob Gildenblat

Team Leader Deep Learning, SagivTech

Ben Levy

Algorithms & Software Developer, SagivTech

Legal Disclaimer:

You understand that when using the Site you may be exposed to content from a variety of sources, and that SagivTech is not responsible for the accuracy, usefulness, safety or intellectual property rights of, or relating to, such content and that such content does not express SagivTech’s opinion or endorsement of any subject matter and should not be relied upon as such. SagivTech and its affiliates accept no responsibility for any consequences whatsoever arising from use of such content. You acknowledge that any use of the content is at your own risk.