On the Computation and Communication Complexity of Parallel SGD with Dynamic Batch Sizes for Stochastic Non-Convex Optimization

Hao Yu, Rong Jin
Machine Intelligence Technology, Alibaba Group (US) Inc., Bellevue, WA

Poster @ Pacific Ballroom #103
Stochastic Non-Convex Optimization

• Stochastic non-convex optimization: $\min_{x \in \mathbb{R}^m} f(x) \triangleq \mathbb{E}_{\zeta \sim \mathcal{D}}[F(x; \zeta)]$
• SGD: $x_{t+1} = x_t - \gamma \frac{1}{B} \sum_{i=1}^{B} \nabla F(x_t; \zeta_i)$, where the stochastic gradient is averaged over a mini-batch of size $B$
• Single-node training:
  • A larger $B$ can improve the utilization of computing hardware
• Data-parallel training:
  • Multiple nodes form a bigger "mini-batch" by aggregating their individual mini-batch gradients at each step (see the sketch below)
  • Given a budget of gradient accesses, a larger batch size yields fewer update/communication steps
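A minimal sketch (not the authors' code) of one data-parallel mini-batch SGD step. The oracles `stochastic_grad(x, zeta)` and `sample_batch(n)` are hypothetical stand-ins for computing $\nabla F(x; \zeta)$ and drawing samples $\zeta \sim \mathcal{D}$:

```python
import numpy as np

def parallel_sgd_step(x, stochastic_grad, sample_batch, num_workers, batch_size, lr):
    """One data-parallel mini-batch SGD step.

    Each of the `num_workers` nodes averages `batch_size` stochastic gradients;
    the per-node averages are then averaged again (e.g. via all-reduce), so the
    effective batch size is num_workers * batch_size.
    """
    worker_grads = []
    for _ in range(num_workers):                     # simulate the workers in a loop
        zetas = sample_batch(batch_size)             # draw a mini-batch of samples
        grads = [stochastic_grad(x, z) for z in zetas]
        worker_grads.append(np.mean(grads, axis=0))  # per-worker mini-batch average
    g = np.mean(worker_grads, axis=0)                # aggregate across workers
    return x - lr * g                                # synchronized SGD update
```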
Batch size for (parallel) SGD

• Question: should we always use a batch size (BS) as large as possible in (parallel) SGD?
• You may be tempted to say "yes", since in the strongly convex case SGD with an extremely large BS is close to GD.
• Theoretically, no! [Bottou & Bousquet'08] and [Bottou et al.'18] show that with a limited budget of stochastic gradient (stochastic first-order, SFO) accesses, GD (i.e., SGD with an extremely large BS) converges more slowly than SGD with small batch sizes.
• Under a finite SFO access budget, [Bottou et al.'18] show that SGD with B = 1 achieves a better stochastic optimization error than GD.
• Recall that B = 1 means poor hardware utilization and a huge communication cost.
Dynamic BS: reduce communication without sacrificing SFO convergence

• Motivating result: for strongly convex stochastic optimization, [Friedlander & Schmidt'12] and [Bottou et al.'18] show that SGD with exponentially increasing BS can achieve the same $O(1/T)$ SFO convergence as SGD with a fixed small BS.
• This paper explores how to use a dynamic BS for non-convex optimization such that we:
  • do not sacrifice SFO convergence in (parallel) SGD; recall that $N$-node parallel SGD with $B = 1$ has $O(1/\sqrt{NT})$ SFO convergence, where $T$ is the SFO access budget at each node (linear speedup w.r.t. the number of nodes, i.e., computation power is perfectly scaled out);
  • reduce communication complexity (the number of used batches) in parallel SGD (see the batch-size schedule sketch below).
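A minimal sketch of the exponentially increasing batch-size schedule that the paper builds on. The growth factor `rho` and base size `B1` mirror the algorithm's parameters below; capping the last batch at the remaining SFO budget is an assumption made for illustration:

```python
def batch_size_schedule(B1, rho, sfo_budget):
    """Yield exponentially increasing batch sizes B_t = floor(rho**(t-1) * B1)
    until the per-node SFO (stochastic gradient) budget is exhausted."""
    used, t = 0, 1
    while used < sfo_budget:
        B_t = int(rho ** (t - 1) * B1)
        B_t = min(B_t, sfo_budget - used)  # assumption: cap the last batch at the remaining budget
        yield B_t
        used += B_t
        t += 1

# Example: with B1 = 4 and rho = 2, the schedule is 4, 8, 16, 32, ..., so the number
# of update/communication rounds grows only logarithmically in the SFO budget.
print(list(batch_size_schedule(B1=4, rho=2.0, sfo_budget=100)))  # [4, 8, 16, 32, 40]
```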
Non-Convex under PL condition

• PL condition: $\frac{1}{2}\|\nabla f(x)\|^2 \ge \mu \left( f(x) - f^* \right), \; \forall x$
• Milder than strong convexity: strong convexity implies the PL condition (a one-line derivation is given below).
• A non-convex function under PL is typically as nice as a strongly convex function.
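For completeness, the standard one-line argument (not on the slides, but a well-known fact) that $\mu$-strong convexity implies the PL condition with the same $\mu$:

```latex
% Strong convexity: f(y) >= f(x) + \nabla f(x)^\top (y - x) + (\mu/2) ||y - x||^2 for all x, y.
% Minimizing both sides over y (the right-hand side is minimized at y = x - \nabla f(x)/\mu) gives
\[
  f^* \;\ge\; f(x) - \frac{1}{2\mu}\,\|\nabla f(x)\|^2
  \quad\Longrightarrow\quad
  \frac{1}{2}\,\|\nabla f(x)\|^2 \;\ge\; \mu\bigl(f(x) - f^*\bigr),
\]
% which is exactly the PL inequality above.
```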
Algorithm 1 CR-PSGD(f, N, T, x_1, B_1, \rho, \gamma)
1: Input: N, T (the budget of SFO accesses at each worker), x_1 \in \mathbb{R}^m, \gamma, B_1 and \rho > 1.
2: Initialize t = 1.
3: while \sum_{\tau=1}^{t} B_\tau \le T do
4:   Each worker i calculates the batch gradient average g_{t,i} = \frac{1}{B_t} \sum_{j=1}^{B_t} \nabla F(x_t; \zeta_{i,j}).
5:   Each worker aggregates the gradient average g_t = \frac{1}{N} \sum_{i=1}^{N} g_{t,i}.
6:   Each worker updates in parallel via x_{t+1} = x_t - \gamma g_t.
7:   Set batch size B_{t+1} = \lfloor \rho^t B_1 \rfloor.
8:   Update t \leftarrow t + 1.
9: end while
10: Return: x_t
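A minimal, single-process Python sketch of Algorithm 1 (CR-PSGD), simulating the N workers in a loop. `stochastic_grad(x, zeta)` and `sample(n)` are the same hypothetical oracles as above, standing in for SFO access:

```python
import numpy as np

def cr_psgd(stochastic_grad, sample, N, T, x1, B1, rho, gamma):
    """CR-PSGD sketch: parallel SGD with exponentially increasing batch sizes.

    T is the per-worker budget of stochastic gradient (SFO) accesses; the loop
    runs while the total per-worker batch size used so far stays within T.
    """
    x, t, B_t, used = np.asarray(x1, dtype=float), 1, B1, 0
    while used + B_t <= T:                           # while sum_{tau <= t} B_tau <= T
        worker_avgs = []
        for _ in range(N):                           # each worker (in parallel in practice)
            zetas = sample(B_t)
            g_i = np.mean([stochastic_grad(x, z) for z in zetas], axis=0)
            worker_avgs.append(g_i)                  # per-worker batch gradient average
        g = np.mean(worker_avgs, axis=0)             # aggregate across workers
        x = x - gamma * g                            # synchronized update
        used += B_t
        B_t = int(rho ** t * B1)                     # B_{t+1} = floor(rho^t * B1)
        t += 1
    return x
```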
Non-Convex under PL condition

• Under PL, we show that using exponentially increasing batch sizes in PSGD with $N$ workers achieves $O(\frac{1}{NT})$ SFO convergence with only $O(\log T)$ communication rounds.
• The state-of-the-art $O(\frac{1}{NT})$ SFO convergence with $O(\sqrt{NT})$ inter-worker communication rounds is attained by local SGD in [Stich'18], for strongly convex optimization only.
• How about general non-convex problems without PL?
  • Inspiration from "catalyst acceleration", developed in [Lin et al.'15] and [Paquette et al.'18].
  • Instead of solving the original problem directly, it repeatedly solves "strongly convex" proximal minimization subproblems.
General Non-Convex Opt

• A new catalyst-like parallel SGD method (a runnable sketch follows the pseudocode below):

Algorithm 2 CR-PSGD-Catalyst(f, N, T, y_0, B_1, \rho, \gamma)
1: Input: N, T, \theta, y_0 \in \mathbb{R}^m, \gamma, B_1 and \rho > 1.
2: Initialize y^{(0)} = y_0 and k = 1.
3: while k \le \lfloor \sqrt{NT} \rfloor do
4:   Define h_\theta(x; y^{(k-1)}) \triangleq f(x) + \frac{\theta}{2} \|x - y^{(k-1)}\|^2, a strongly convex function whose unbiased stochastic gradients are easily estimated.
5:   Update y^{(k)} via y^{(k)} = CR-PSGD(h_\theta(\cdot\,; y^{(k-1)}), N, \lfloor \sqrt{T/N} \rfloor, y^{(k-1)}, B_1, \rho, \gamma).
6:   Update k \leftarrow k + 1.
7: end while

• We show that this catalyst-like parallel SGD (with dynamic BS) has $O(1/\sqrt{NT})$ SFO convergence with $O(\sqrt{NT}\log(\frac{T}{N}))$ communication rounds.
• The state of the art is $O(1/\sqrt{NT})$ SFO convergence with $O(N^{3/4}T^{3/4})$ inter-worker communication rounds.
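A minimal sketch of Algorithm 2 that reuses the `cr_psgd` sketch above. The only new ingredient is the proximal term added to the hypothetical `stochastic_grad` oracle, so each subproblem $h_\theta(\cdot\,; y)$ keeps an easily estimated unbiased stochastic gradient:

```python
import math
import numpy as np

def cr_psgd_catalyst(stochastic_grad, sample, N, T, theta, y0, B1, rho, gamma):
    """CR-PSGD-Catalyst sketch: repeatedly apply CR-PSGD to the strongly convex
    proximal subproblem h_theta(x; y) = f(x) + (theta / 2) * ||x - y||^2."""
    y = np.asarray(y0, dtype=float)
    num_outer = int(math.floor(math.sqrt(N * T)))     # number of outer iterations
    inner_budget = int(math.floor(math.sqrt(T / N)))  # per-worker SFO budget per CR-PSGD call
    for _ in range(num_outer):
        y_prev = y.copy()

        # Unbiased stochastic gradient of h_theta(.; y_prev): grad F plus the proximal term.
        def prox_grad(x, zeta, y_prev=y_prev):
            return stochastic_grad(x, zeta) + theta * (x - y_prev)

        y = cr_psgd(prox_grad, sample, N, inner_budget, y_prev, B1, rho, gamma)
    return y
```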
Experiments
Distributed Logistic Regression: N=10
Experiments
Training ResNet-20 on CIFAR-10: N=8
Thanks!
Poster on Wed Jun 12th 06:30 -- 09:00 PM @ Pacific Ballroom #103