Computer Science Thesis Oral

— 4:00pm

In Person and Virtual - ET - Gordon Bell Conference Room, Gates Hillman 5117 and Zoom

ANDERS ØLAND, Ph.D. Candidate, Computer Science Department, Carnegie Mellon University

Efficient Deep Learning

We study various aspects related to the efficient training of deep networks. In doing so, we also discuss and contribute to various theoretical facets of the field.

Very large models unavoidably require distributing training over many computational devices. This usually induces a considerable communication overhead, which increases the risk of under-utilizing system resources. To address this, we show that the entropy of the weights decreases during training, so that the weights become highly compressible, allowing for a considerable reduction in this overhead.
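As a rough illustration of why lower entropy implies better compressibility, the sketch below quantizes a list of weights into uniform bins and computes the Shannon entropy of the resulting symbols in bits per weight, which approximates the average code length an ideal lossless coder would need if the symbols were treated as independent. The function name `entropy_bits_per_weight`, the bin count, the value range, and the two synthetic weight sets are assumptions made for this example; they are not taken from the thesis and do not reproduce its measurements.

```python
import math
import random
from collections import Counter


def entropy_bits_per_weight(weights, num_bins=256, lo=-1.0, hi=1.0):
    """Shannon entropy (bits per weight) after uniform quantization.

    Hypothetical helper for illustration; bin count and range are assumptions.
    """
    width = (hi - lo) / num_bins
    counts = Counter()
    for w in weights:
        # Clamp to [lo, hi], then map to a bin index in [0, num_bins - 1].
        idx = int((min(max(w, lo), hi) - lo) / width)
        counts[min(idx, num_bins - 1)] += 1
    n = len(weights)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


rng = random.Random(0)
# Synthetic stand-ins, NOT real training snapshots:
# a widely spread weight set versus a tightly concentrated one.
spread = [rng.uniform(-1.0, 1.0) for _ in range(20000)]
concentrated = [rng.gauss(0.0, 0.02) for _ in range(20000)]

h_spread = entropy_bits_per_weight(spread)
h_conc = entropy_bits_per_weight(concentrated)
print(f"spread:       {h_spread:.2f} bits/weight")
print(f"concentrated: {h_conc:.2f} bits/weight")
```

With 256 bins, the entropy can never exceed 8 bits per weight, so any value below that indicates room for lossless compression of the quantized weights before they are communicated.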

It is common practice to use squashing functions, such as the softmax, at the output layer of neural nets. We study the effect these functions have on the gradient signal and argue that they may contribute to the well-known vanishing gradient problem. We therefore introduce non-squashing alternatives and provide evidence suggesting that they improve the convergence rate.
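The standard softmax derivative gives one way to see how a squashing output can attenuate the gradient signal. This is a textbook identity offered as an illustration, not the analysis presented in the thesis; the symbols $z$ (pre-activations), $p$ (softmax outputs), and $L$ (loss) are notation introduced here.

```latex
p_i = \frac{e^{z_i}}{\sum_k e^{z_k}},
\qquad
\frac{\partial p_i}{\partial z_j} = p_i\,(\delta_{ij} - p_j),
\qquad
\frac{\partial L}{\partial z_j} = \sum_i \frac{\partial L}{\partial p_i}\; p_i\,(\delta_{ij} - p_j).
```

When the output saturates, with one $p_i$ close to 1 and the rest close to 0, every factor $p_i(\delta_{ij} - p_j)$ approaches 0, so a gradient passed back through the softmax is scaled down accordingly. How much this matters in practice depends on the loss: with cross-entropy, for example, the term $\partial L / \partial p_i$ cancels the factor $p_i$.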

Our main contribution is in layer-wise training of deep networks. First, we make various useful observations on the properties of hidden layers and representations. We then show that layer-wise training can match the results of full-model backprop, while considerably reducing the memory footprint of the training process. We discuss the effect of implicit interlayer regularization and introduce new conjectures on its theoretical origin. Based on these, we show that interlayer regularization can be simulated in a few simple steps. Additionally, we discuss partition-wise training, which may speed up the optimization process by allowing for larger batch sizes and improved model parallelism.

Finally, we take a look beyond gradient descent. We introduce a novel method for fitting multilayer perceptrons to training data. While it can outperform backpropagation with stochastic gradient descent on various toy problems, it tends to overfit and to be capacity-hungry on more complex real data. We discuss why and point to future ways of addressing this. The solution can be expressed in closed form, although we expect that it will evolve into a hybrid iterative approach. We also suspect that our method might be a substantially better candidate than backprop for training deep nets on quantum computers.

Thesis Committee:

Roger B. Dannenberg (Co-Chair)
Bhiksha Raj (Co-Chair)
Zico Kolter
Ruslan Salakhutdinov
Douglas Eck (Google DeepMind)

In Person and Zoom Participation. See announcement.
