PyTorch: A Modern Library for Machine Learning

Title: PyTorch: A Modern Library for Machine Learning
Date: Monday, December 16, 2019 12PM ET/9AM PT
Duration: 1 hour

SPEAKER: Adam Paszke, Co-Author and Maintainer, PyTorch; University of Warsaw

Resources:
TechTalk Registration
PyTorch Recipes: A Problem-Solution Approach (Skillsoft book, free for ACM Members)
Concepts and Programming in PyTorch (Skillsoft book, free for ACM Members)
PyTorch for Deep Learning and Computer Vision (O’Reilly Video, free for ACM Members)
Neural Networks with TensorFlow and PyTorch (O’Reilly Learning Path, free for ACM Members)
Machine Learning, Propensity Score, & Segmentation Modeling (Skillsoft Course, free for ACM Members)

The handling of this talk is confusing. The email indicates that the talk starts at 12:00 PM EST on 12/16/2019 or 9 AM PST.

Going to the link brings up an announcement that the event is already over, and a further note that it will occur at 12:00 PM UTC (4 PM PST).

What is the actual situation?

Hi, Ron.

This link should work correctly: https://event.on24.com/wcc/r/2088405/C5C9C38E3F5BC4CE3D08BF2F5DE088D5

If you already registered and it doesn’t take you directly to the talk, click “Already Registered” and enter the email address you used to register.

Following the ACM TechTalk, Adam Paszke was kind enough to answer some additional questions we were not able to get to during the live event. These questions and answers are presented below:

Are there any plans to have a package for developing RL algorithms in PyTorch that’s backed in some way by the PyTorch team? Or is that something for third parties to implement?

The core team tries to avoid getting too deep into domain-specific utilities, mostly because they’re subject to active research and the popular methodologies change over time. We want to focus on building a solid foundation for all the other higher-level tools, so we usually try to defer the development of those to domain experts while providing them with resources to make the best use of the core we’re building. There are many packages from outside the core PyTorch team that provide helpers for RL, e.g. pytorch/ELF (a platform for game research with an AlphaGoZero/AlphaZero reimplementation) and catalyst-team/catalyst (accelerated deep learning R&D) on GitHub.

What prerequisites are there for installing PyTorch (on a Win10 machine)?

You can find the installation instructions at Start Locally | PyTorch. I think the easiest way to get a good build of PyTorch is to use Anaconda (conda).
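A quick way to verify the install afterwards is to import the package and check the version and GPU visibility, for example:

```python
# Minimal post-install sanity check (the package is imported as "torch"):
import torch

print(torch.__version__)           # installed PyTorch version
print(torch.cuda.is_available())   # True only if a CUDA-enabled build can see a GPU
```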

What’s the most interesting non-ML use of PyTorch you have seen in your opinion?

Etalumis ([1907.03382] Etalumis: Bringing Probabilistic Programming to Scientific Simulators at Scale) is a cool application of probabilistic programming approaches built on top of PyTorch to high-energy particle physics.

So the PyTorch code is similar to NumPy; how should I decide which one to pick?
Earlier, you mentioned advantages of PyTorch over NumPy with regard to machine learning. What are some benefits of using PyTorch over NumPy for more general-purpose numerical computations?

There’s no single good answer to this question; all tools have their pros and cons. If you think you could benefit from, e.g., being able to use GPUs, automatic differentiation, distributed computing, or any of the other features that we provide, then you might want to give PyTorch a try.
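As a small illustration of those features, here is a minimal sketch (not from the talk) of automatic differentiation and the optional move to a GPU:

```python
# Two things NumPy alone does not give you: autodiff and GPU execution.
import torch

x = torch.linspace(0.0, 1.0, 1000, requires_grad=True)
y = (x ** 2).sum()
y.backward()                                   # compute dy/dx for every element of x
print(torch.allclose(x.grad, 2 * x.detach()))  # dy/dx = 2x, so this prints True

if torch.cuda.is_available():
    x_gpu = x.detach().to("cuda")              # the same API runs on the GPU
```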

Why is the torch conversion called torch.as_tensor, while the reverse conversion is np.asarray? The underscore in one case but not the other is rather confusing!

I agree. Unfortunately, even NumPy’s naming conventions can themselves be inconsistent at times, and we have been trying to normalize some of those in our code. Hence there are some slight differences, but I don’t think they cause much significant confusion.
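For reference, the two conversions side by side (a small sketch; on CPU they share memory where possible):

```python
import numpy as np
import torch

a = np.arange(4, dtype=np.float32)
t = torch.as_tensor(a)   # NumPy -> PyTorch (with the underscore)
b = np.asarray(t)        # PyTorch -> NumPy (without it); t.numpy() does the same
```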

Is PyTorch able to use Google’s TPUs as accelerators?

Yes! You can find the instructions in the README file of this repository: GitHub - pytorch/xla: Enabling PyTorch on XLA Devices (e.g. Google TPU)
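A minimal sketch of what that looks like (torch_xla must be installed separately, and API names such as xla_device() come from the pytorch/xla project and may change):

```python
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()              # pick up an available XLA/TPU device
x = torch.randn(4, 4, device=device)
print((x @ x).cpu())                  # move the result back to host memory to print it
```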

Is the automatic differentiation implemented for complex numbers and complex activation functions?

Not yet, but it’s definitely on the roadmap.

Is there any aspect in which PyTorch is not compatible with NumPy, or not NumPy-like?

PyTorch as a project has a slightly different, and in some cases wider, scope than NumPy, but because the core abstraction is the same and we want our abstractions to compose well, it should also compose nicely with NumPy programs. You do have to be careful about some limitations of NumPy, though, such as the fact that NumPy cannot access GPU arrays directly; they have to be moved to CPU memory first.
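A small sketch of the GPU limitation mentioned above:

```python
# NumPy only sees host memory, so a GPU tensor has to be copied back before conversion.
import torch

if torch.cuda.is_available():
    t = torch.randn(3, device="cuda")
    # t.numpy() or np.asarray(t) would raise an error here, so copy to host memory first:
    arr = t.cpu().numpy()
```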

As a researcher, is jitscript going to be useful to me? Should I be converting all my modules to jitscript or is most of the benefit for deployment?

Good question. There are benefits to scripting models even for research purposes — the scripted versions can get optimized. During the talk I gave this example of how big of a difference JIT can make for RNNs written using the simplest possible eager code. Of course not every use case will see equally dramatic speedups, but there are cases when it can make a big difference. My rule of thumb would be to start with eager mode, and if you believe that your model is too slow then try scripting the easy parts (e.g. new activation functions should be very easy). That might help already and should not be a very significant investment of your time.
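As an illustration of scripting just the easy parts, here is a minimal sketch (the swish activation is only an example, not something from the talk):

```python
import torch

@torch.jit.script
def swish(x: torch.Tensor) -> torch.Tensor:
    # x * sigmoid(x); the JIT can fuse these pointwise operations
    return x * torch.sigmoid(x)

y = swish(torch.randn(8, requires_grad=True))
y.sum().backward()   # autograd still works through the scripted function
```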

Do you have a Model Zoo like Caffe?

Yes, packages like pytorch/torchvision and huggingface/transformers on GitHub provide a lot of pretrained models. You should also check out the PyTorch Hub (pytorch.org/hub).
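For example, loading a pretrained model from torchvision looks roughly like this (the pretrained flag shown here has been replaced by a weights argument in newer torchvision releases):

```python
import torchvision.models as models

resnet = models.resnet18(pretrained=True)  # weights are downloaded on first use
resnet.eval()                              # switch to inference mode
```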

Why use PyTorch in pure C++, instead of Torch7?

Torch7 used Lua as its primary interface, so you still needed an entirely separate language runtime. The project was based on some C libraries, but their usability was much worse than the modern C++ interface of PyTorch.

Is PyTorch Script similar to the Numba library? If so, is there a resource that compares PyTorch Script and Numba?

The idea is the same, but the focus is different. Numba focuses on taking scalar level code (e.g. matrix multiplication written out explicitly) and compiling that. PyTorch Script focuses on taking code that uses the array DSL and optimizing the programs at that level.

Compare C++ to Python script, which one is more efficient in performance?

It’s hard to answer this question. PyTorch Script will have some additional overhead compared to C++ by virtue of being an interpreted language, but it can also apply just-in-time optimizations to array code that a C++ compiler would not perform. So it really depends on the use case.

For a researcher coming from TensorFlow and Keras, how big of a learning (or relearning) curve will PyTorch be?

These days all of those libraries have relatively similar interfaces, so I don’t think it would take a lot of time to transition.

Does JIT execution provide similar speedups on CPU-only execution?

Yes, it can potentially optimize CPU code as well, but we’ve spent less time on that aspect.

Can we use script for a part of the model or the loss function and have the rest as Eager?

Very good question. Yes! Torch Script integrates seamlessly with the eager infrastructures. You can still take gradients with loss.backward() or torch.autograd.grad() etc.
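A minimal sketch of mixing a scripted loss with an eager model (the mse_loss function here is just an illustrative example):

```python
import torch
import torch.nn as nn

@torch.jit.script
def mse_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    return ((pred - target) ** 2).mean()

model = nn.Linear(10, 1)                       # plain eager module
x, y = torch.randn(32, 10), torch.randn(32, 1)
loss = mse_loss(model(x), y)
loss.backward()                                # gradients flow through both parts
```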

What is the best way to learn PyTorch?

I think I would recommend going through the official tutorials (pytorch.org/tutorials). You can also check out the fast.ai course.

How will PyTorch support large-scale distributed learning in the long term?

PyTorch already includes a set of distributed computing helpers with an MPI-like programming style. We’re also working on another paradigm that lets you execute TorchScript programs on remote machines, which would enable a single Python script to control a large pool of workers.

Is PySyft similar to PyTorch? Can PyTorch support distributed machine learning?

PySyft has a different scope than PyTorch, but it integrates with it. PyTorch does support distributed training — see the torch.distributed package.
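For reference, the MPI-style helpers mentioned above look roughly like this (a sketch assuming the script is launched once per worker, with the usual MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE environment variables set by your launcher):

```python
import torch
import torch.distributed as dist

dist.init_process_group(backend="gloo", init_method="env://")
t = torch.ones(1)
dist.all_reduce(t)                 # in-place sum of the tensor across all workers
print(dist.get_rank(), t.item())   # each rank now holds the same reduced value
```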

Is there any plan to reduce the package size, perhaps with an option to download sub-packages and datasets as and when required?

We try to keep our packages as lean as we can, but making the library easy to install forces us to e.g. bundle multiple versions of compiled CUDA kernels. That increases the size significantly, so if you want to make them smaller you’ll either have to use the CPU-only packages, or build it from source yourself.

Is there support (existing or planned) for non-NVIDIA accelerators?

PyTorch does support AMD GPUs (https://rocm.github.io/pytorch.html) and Google TPUs (GitHub - pytorch/xla: Enabling PyTorch on XLA Devices (e.g. Google TPU)).

Will torch.optim include more of the optimizers available in scipy.optimize, such as more quasi-Newton methods? Right now, in my experience, the L-BFGS optimizer in torch.optim seems not as stable and robust as scipy.optimize’s L-BFGS.

We always welcome issues and pull requests. If you find issues with the implementations in the library please do let us know!

Hi. Is the runtime improvement rate for LSTM cells, as shown on your slides, comparable with an alternative offered in TensorFlow, or is this a unique feature of PyTorch? Thanks.

TensorFlow has ways of rewriting and optimizing the programs too. I haven’t benchmarked other systems, so I can’t compare them; you’ll have to run the tests yourself. I would imagine that each one of those will be better in slightly different regimes (e.g. for different input sizes).

Have you got any tips on profiling PyTorch code? There’s torch.utils.bottleneck, nvvp, and we can use line_profiler in Python. What tips would you give for choosing between these tools?

This is a very interesting question that is also very difficult to answer, but I’ll try. One of the most important things to understand about PyTorch is that the program you write only queues operations to be executed asynchronously (and whenever you e.g. call .item() on an array, it forces your program and the stream of mathematical operations to synchronize). There are many benefits to this way of working, most notably that if your mathematical operations are beefy enough, the cost of executing the Python program is effectively zero thanks to pipelined execution. This can be visualized in the following timeline:

Python stream     | torch.matmul | Python overhead     | torch.add | program continues...   |
Math stream       | idle | torch.matmul                              | torch.add            |
Time ->

It doesn’t matter that executing Python takes time, because the matrix multiplication took long enough that the addition could not have started any sooner than when it was queued. Now, when you’re benchmarking your program, the main question you have to ask yourself is: “is my bottleneck the math stream, or the Python stream?” If it’s the math stream (as in the example above), then your machine is probably running at peak capacity, and the only things that can improve execution time are algorithmic changes. If it’s the Python stream, then you can try to optimize your Python code, use larger arrays, or try PyTorch Script (it has lower execution overhead than Python). Here’s a timeline bottlenecked by the Python stream:

Python stream     | torch.matmul            | Python | torch.add          | program continues...   |
Math stream       | idle | torch.matmul | idle         | torch.add | idle                          |
Time ->

Now, another important question to ask is “how do I figure out which stream is the slow one?”. To do this I usually use a combination of torch.autograd.profiler and nvvp. First, use torch.autograd.profiler and export a chrome trace to see if there are large gaps between PyTorch operations. This profiler has extremely low overhead and so the timeline is very faithful. If you see big gaps then your Python code is slow. If there are no gaps, now’s the time to use nvvp. If the GPU timeline is packed densely with kernels, then it means that the “math stream” is slowing you down. If it’s very sparse, but the torch.autograd.profiler timeline is densely packed with PyTorch operations then it means that the arrays you’re using are simply too small to amortize the overhead of managing the tensors and you’ve hit the limitations of the library.
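The profiling workflow described above can be sketched roughly as follows (the array size and iteration count are arbitrary):

```python
# Export a Chrome trace with torch.autograd.profiler and look for gaps between ops.
import torch
from torch.autograd import profiler

x = torch.randn(1024, 1024)
with profiler.profile() as prof:
    for _ in range(10):
        y = x @ x

prof.export_chrome_trace("trace.json")   # open in chrome://tracing
print(prof.key_averages().table(sort_by="cpu_time_total"))
```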

Hello,
this PyTorch talk by A. Paszke seems to be very interesting. Unfortunately I was not able to follow it live. Is there a way to watch it again?
Thanks!

Dear Bastian,

If you haven’t already registered, you can do so here, and you will have access to the recording: https://webinars.on24.com/acm/paszke