[2309.10688] On the different regimes of Stochastic Gradient Descent


Abstract: Modern deep networks are trained with stochastic gradient descent (SGD), whose key hyperparameters are the number of data considered at each step, or batch size $B$, and the step size, or learning rate $\eta$. For small $B$ and large $\eta$, SGD corresponds to a stochastic evolution of the parameters whose noise amplitude is governed by the "temperature" $T \equiv \eta/B$. Yet this description is observed to break down for sufficiently large batches $B \geq B^*$, or to simplify to gradient descent (GD) when the temperature is small enough. Understanding where these cross-overs take place remains a central challenge. Here, we resolve these questions for a teacher-student perceptron classification model and show empirically that our key predictions still apply to deep networks. Specifically, we obtain a phase diagram in the $B$-$\eta$ plane that separates three dynamical phases: (i) a noise-dominated SGD governed by temperature, (ii) a large-first-step-dominated SGD, and (iii) GD. These different phases also correspond to different regimes of generalization error. Remarkably, our analysis reveals that the batch size $B^*$ separating regimes (i) and (ii) scales with the size $P$ of the training set, with an exponent that characterizes the hardness of the classification problem.
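The central quantity in the abstract is the SGD "temperature" $T \equiv \eta/B$. The sketch below is not the authors' code; it is a minimal illustration of the setup they study, training a student perceptron on data labeled by a random teacher and sweeping the batch size $B$ at fixed $T$, which is how one would probe the noise-dominated phase (i) empirically. All dimensions, function names, and hyperparameters here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Teacher-student perceptron setup (sizes chosen for illustration only).
d, P = 50, 2000                        # input dimension, training-set size
teacher = rng.standard_normal(d)
X = rng.standard_normal((P, d))
y = np.sign(X @ teacher)               # labels produced by the teacher

def hinge_grad(w, xb, yb):
    """Average gradient of the hinge loss max(0, 1 - y w.x) over a mini-batch."""
    margins = yb * (xb @ w)
    mask = margins < 1.0                # only violated margins contribute
    return -(yb[mask, None] * xb[mask]).sum(axis=0) / len(yb)

def train_sgd(B, eta, steps=5000):
    """Plain SGD on the student weights with batch size B and learning rate eta."""
    w = np.zeros(d)
    for _ in range(steps):
        idx = rng.integers(0, P, size=B)
        w -= eta * hinge_grad(w, X[idx], y[idx])
    return w

# Sweep batch size at fixed temperature T = eta / B: in the noise-dominated
# phase (i), runs sharing the same T are expected to behave similarly.
T = 0.01
for B in (1, 10, 100):
    w = train_sgd(B=B, eta=T * B)
    overlap = w @ teacher / (np.linalg.norm(w) * np.linalg.norm(teacher) + 1e-12)
    print(f"B={B:4d}  eta={T * B:.3f}  teacher overlap={overlap:.3f}")
```

The overlap between the student and teacher weight vectors is a simple proxy for generalization in this model; comparing it across the sweep gives a rough sense of whether the dynamics is controlled by $T$ alone or by $B$ and $\eta$ separately.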

Submission history

From: Antonio Sclocchi [view email]
[v1]
Tue, 19 Sep 2023 15:23:07 UTC (817 KB)
[v2]
Thu, 21 Sep 2023 13:35:04 UTC (817 KB)
[v3]
Mon, 22 Jan 2024 11:26:17 UTC (935 KB)
[v4]
Tue, 27 Feb 2024 21:52:16 UTC (1,238 KB)


