```python
def update_core_cpu(self, param):
    grad = param.grad
    if grad is None:
        return
    v = self.state['v']
    if isinstance(v, intel64.mdarray):
        v.inplace_axpby(self.hyperparam.momentum,
                        -self.hyperparam.lr, grad)
        param.data += v
    else:
        v *= self.hyperparam.momentum
        v -= self.hyperparam.lr * grad
        param.data += v
```
```
initialize VdW = 0, Vdb = 0        // VdW has the same shape as dW; Vdb has the same shape as db
on iteration t:
    compute dW, db on the current mini-batch
    VdW = beta * VdW + (1 - beta) * dW
    Vdb = beta * Vdb + (1 - beta) * db
    W = W - learning_rate * VdW
    b = b - learning_rate * Vdb
```
The concrete code implementation is:
```python
import numpy as np

def initialize_velocity(parameters):
    """
    Initializes the velocity as a python dictionary with:
        - keys: "dW1", "db1", ..., "dWL", "dbL"
        - values: numpy arrays of zeros of the same shape as the corresponding gradients/parameters.
    Arguments:
    parameters -- python dictionary containing your parameters.
                  parameters['W' + str(l)] = Wl
                  parameters['b' + str(l)] = bl
    Returns:
    v -- python dictionary containing the current velocity.
         v['dW' + str(l)] = velocity of dWl
         v['db' + str(l)] = velocity of dbl
    """
    L = len(parameters) // 2  # number of layers in the neural network
    v = {}
    # Initialize velocity
    for l in range(L):
        v["dW" + str(l + 1)] = np.zeros(parameters["W" + str(l + 1)].shape)
        v["db" + str(l + 1)] = np.zeros(parameters["b" + str(l + 1)].shape)
    return v
```
```python
def update_parameters_with_momentum(parameters, grads, v, beta, learning_rate):
    """
    Update parameters using Momentum
    Arguments:
    parameters -- python dictionary containing your parameters:
                  parameters['W' + str(l)] = Wl
                  parameters['b' + str(l)] = bl
    grads -- python dictionary containing your gradients for each parameter:
             grads['dW' + str(l)] = dWl
             grads['db' + str(l)] = dbl
    v -- python dictionary containing the current velocity:
         v['dW' + str(l)] = ...
         v['db' + str(l)] = ...
    beta -- the momentum hyperparameter, scalar
    learning_rate -- the learning rate, scalar
    Returns:
    parameters -- python dictionary containing your updated parameters
    v -- python dictionary containing your updated velocities
    """
    L = len(parameters) // 2  # number of layers in the neural network
    # Momentum update for each parameter
    for l in range(L):
        # compute velocities
        v["dW" + str(l + 1)] = beta * v["dW" + str(l + 1)] + (1 - beta) * grads['dW' + str(l + 1)]
        v["db" + str(l + 1)] = beta * v["db" + str(l + 1)] + (1 - beta) * grads['db' + str(l + 1)]
        # update parameters
        parameters["W" + str(l + 1)] = parameters["W" + str(l + 1)] - learning_rate * v["dW" + str(l + 1)]
        parameters["b" + str(l + 1)] = parameters["b" + str(l + 1)] - learning_rate * v["db" + str(l + 1)]
    return parameters, v
```
Next, let's use the breast_cancer dataset from sklearn to compare plain SGD (stochastic mini-batch gradient descent) against SGD with momentum.
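Here is a minimal sketch of such a comparison (an illustration of the setup, not the original experiment). It assumes the `initialize_velocity` and `update_parameters_with_momentum` functions defined above are in scope, and it uses a single-layer logistic-regression "network"; the helper `forward_backward` is defined here purely for this example:

```python
# Minimal sketch: plain mini-batch SGD vs. SGD with momentum on breast_cancer,
# reusing initialize_velocity / update_parameters_with_momentum from above.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X).T          # shape (n_features, m)
y = y.reshape(1, -1)                             # shape (1, m)

def forward_backward(parameters, X, y):
    """Single-layer logistic regression: returns cross-entropy cost and gradients."""
    W, b = parameters["W1"], parameters["b1"]
    m = X.shape[1]
    A = 1.0 / (1.0 + np.exp(-(W @ X + b)))       # sigmoid activation
    cost = -np.mean(y * np.log(A + 1e-8) + (1 - y) * np.log(1 - A + 1e-8))
    dZ = A - y
    grads = {"dW1": dZ @ X.T / m, "db1": np.sum(dZ, axis=1, keepdims=True) / m}
    return cost, grads

def train(use_momentum, epochs=50, batch_size=64, lr=0.1, beta=0.9, seed=0):
    rng = np.random.RandomState(seed)
    parameters = {"W1": np.zeros((1, X.shape[0])), "b1": np.zeros((1, 1))}
    v = initialize_velocity(parameters)
    for _ in range(epochs):
        idx = rng.permutation(X.shape[1])
        for start in range(0, X.shape[1], batch_size):
            batch = idx[start:start + batch_size]
            _, grads = forward_backward(parameters, X[:, batch], y[:, batch])
            if use_momentum:
                parameters, v = update_parameters_with_momentum(parameters, grads, v, beta, lr)
            else:  # plain SGD step
                parameters["W1"] -= lr * grads["dW1"]
                parameters["b1"] -= lr * grads["db1"]
    return forward_backward(parameters, X, y)[0]  # final full-dataset cost

print("final cost, plain SGD     :", train(use_momentum=False))
print("final cost, SGD + momentum:", train(use_momentum=True))
```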
These frameworks all use the same formula (hint: substituting the first step of the formula above into the second step gives the equivalent form $w = w + \beta^2 V - (1 + \beta)\,\alpha \cdot grad$). I will use the Keras-style formula directly here, where $\beta$ is the momentum parameter and $\alpha$ is the learning rate:

$$
\begin{array}{lcl}
VdW &=& \beta \, VdW - \alpha \, dW \\
Vdb &=& \beta \, Vdb - \alpha \, db \\
W &=& W + \beta \, VdW - \alpha \, dW \\
b &=& b + \beta \, Vdb - \alpha \, db
\end{array}
$$
The clever part of this formulation is that we no longer need to evaluate $\nabla_\theta\sum_i L(f(x^{(i)};\theta + \alpha v),y^{(i)})$ at the look-ahead point; we can update the parameters directly with the $d\theta$ we have already computed, which saves roughly half the work.
At this point you probably have a question: why is this formula correct? Its form is clearly different from the original formula (the one given in the paper and in the *Deep Learning* book). Below I derive why the two formulas are in fact the same. First, the original formula:

$$
\begin{array}{lcl}
V_{t+1} &=& \beta V_t - \alpha \nabla_{\theta_t}L(\theta_t + \beta V_t) \\\\
\theta_{t+1} &=& \theta_{t} + V_{t+1} \tag{5}
\end{array}
$$
Here is the derivation (references: 1. the passage beginning "However, in practice people prefer to express the update…" in the Nesterov Momentum part of the cs231n notes; 2. a post on the correct way to implement Nesterov momentum in Theano). First, let ${\theta_t}' = \theta_t + \beta V_t$, so that $V_{t+1} = \beta V_t - \alpha \nabla_{{\theta_t}'}L({\theta_t}')$. Then:

$$
\begin{array}{lcl}
{\theta_{t+1}}' &=& \theta_{t+1} + \beta V_{t+1} \\\\
&=& \theta_t + V_{t+1} + \beta V_{t+1} \\\\
&=& \theta_t + (1+\beta)V_{t+1} \\\\
&=& {\theta_t}' - \beta V_t + (1+\beta)\left[\beta V_t - \alpha \nabla_{{\theta_t}'}L({\theta_t}')\right] \\\\
&=& {\theta_t}' + \beta^{2}V_t - (1+\beta)\,\alpha \nabla_{{\theta_t}'}L({\theta_t}') \tag{6}
\end{array}
$$

Finally, renaming ${\theta_t}'$ back to $\theta_t$, the update becomes:

$$
\theta_{t+1} = \theta_{t} + \beta^{2}V_t - (1+\beta)\,\alpha \nabla_{\theta_t}L(\theta_t)
$$
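As a quick sanity check of this derivation (a small numerical experiment added here, not part of the original derivation), the sketch below runs both formulations on the one-dimensional quadratic loss $L(\theta) = \tfrac{1}{2}\theta^2$ (so $\nabla L(\theta) = \theta$) and confirms that they produce identical iterates for the shifted variable $\theta' = \theta + \beta V$:

```python
import numpy as np

beta, alpha = 0.9, 0.1
grad = lambda theta: theta                      # gradient of 0.5 * theta**2

# Form (5): v_{t+1} = beta*v_t - alpha*grad(theta_t + beta*v_t); theta_{t+1} = theta_t + v_{t+1}
theta, v = 5.0, 0.0
classic = []
for _ in range(20):
    v = beta * v - alpha * grad(theta + beta * v)
    theta = theta + v
    classic.append(theta + beta * v)            # track the shifted variable theta' = theta + beta*v

# Form (6): theta'_{t+1} = theta'_t + beta^2*v_t - (1+beta)*alpha*grad(theta'_t)
theta_p, v2 = 5.0, 0.0                          # theta'_0 = theta_0 because v_0 = 0
rewritten = []
for _ in range(20):
    g = grad(theta_p)
    new_v = beta * v2 - alpha * g               # v_{t+1} = beta*v_t - alpha*grad(theta'_t)
    theta_p = theta_p + beta**2 * v2 - (1 + beta) * alpha * g
    v2 = new_v
    rewritten.append(theta_p)

print(np.allclose(classic, rewritten))          # True: the two formulations coincide
```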
Here is the implementation (the initialization is the same as for momentum, so it is not repeated):
```python
# nesterov momentum
def update_parameters_with_nesterov_momentum(parameters, grads, v, beta, learning_rate):
    """
    Update parameters using Nesterov momentum
    Arguments:
    parameters -- python dictionary containing your parameters:
                  parameters['W' + str(l)] = Wl
                  parameters['b' + str(l)] = bl
    grads -- python dictionary containing your gradients for each parameter:
             grads['dW' + str(l)] = dWl
             grads['db' + str(l)] = dbl
    v -- python dictionary containing the current velocity:
         v['dW' + str(l)] = ...
         v['db' + str(l)] = ...
    beta -- the momentum hyperparameter, scalar
    learning_rate -- the learning rate, scalar
    Returns:
    parameters -- python dictionary containing your updated parameters
    v -- python dictionary containing your updated velocities
    '''
    VdW = beta * VdW - learning_rate * dW
    Vdb = beta * Vdb - learning_rate * db
    W = W + beta * VdW - learning_rate * dW
    b = b + beta * Vdb - learning_rate * db
    '''
    """
    L = len(parameters) // 2  # number of layers in the neural network
    # Nesterov momentum update for each parameter
    for l in range(L):
        # compute velocities
        v["dW" + str(l + 1)] = beta * v["dW" + str(l + 1)] - learning_rate * grads['dW' + str(l + 1)]
        v["db" + str(l + 1)] = beta * v["db" + str(l + 1)] - learning_rate * grads['db' + str(l + 1)]
        # update parameters
        parameters["W" + str(l + 1)] += beta * v["dW" + str(l + 1)] - learning_rate * grads['dW' + str(l + 1)]
        parameters["b" + str(l + 1)] += beta * v["db" + str(l + 1)] - learning_rate * grads["db" + str(l + 1)]
    return parameters, v
```
The paper *An overview of gradient descent optimization algorithms* puts it this way: "This anticipatory update prevents us from going too fast and results in increased responsiveness, which has significantly increased the performance of RNNs on a number of tasks."
5. AdaGrad (Adaptive Gradient)
Normally we use the same learning rate for all parameters at every update. The idea behind AdaGrad is to use a different learning rate for each parameter at every update (every iteration). The AdaGrad update is:

$$
\begin{array}{lcl}
G_t &=& G_{t-1} + g^2_t \\\\
\theta_{t+1} &=& \theta_{t} - \dfrac{\alpha}{\sqrt{G_t} + \varepsilon}\cdot g_t \tag{7}
\end{array}
$$
From the formula we can see the following. Advantage: for a parameter with large gradients, $G_t$ is relatively large, so $\frac{\alpha}{\sqrt{G_t}+\varepsilon}$ is small, meaning that parameter's learning rate shrinks; for a parameter with small gradients the opposite holds. This lets parameters in flat regions move a bit faster instead of stalling. Disadvantage: because the squared gradients keep accumulating, $G_t$ eventually becomes large, so the effective step $\frac{\alpha}{\sqrt{G_t}+\varepsilon} \rightarrow 0$ and the updates effectively vanish.
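A tiny numerical illustration of this drawback (added here for concreteness): with a constant gradient of 1, the accumulator $G_t$ grows linearly with $t$, so the effective step size shrinks like $1/\sqrt{t}$:

```python
import numpy as np

alpha, eps, g = 0.01, 1e-7, 1.0
for t in [1, 10, 100, 1000, 10000]:
    G_t = t * g**2                             # accumulated squared gradients after t steps
    print(t, alpha / (np.sqrt(G_t) + eps))     # effective step size keeps shrinking (~ alpha/sqrt(t))
```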
Concretely, on each iteration $t$, compute on the current mini-batch:

$$
\begin{array}{lcl}
G_w &=& G_{w} + (dW)^2 \\
W &=& W - \dfrac{\alpha}{\sqrt{G_w}+\varepsilon}\cdot dW \\\\
G_b &=& G_{b} + (db)^2 \\
b &=& b - \dfrac{\alpha}{\sqrt{G_b}+\varepsilon}\cdot db \tag{8}
\end{array}
$$

Here $\varepsilon$ usually defaults to $10^{-7}$ (the *Deep Learning* book and mxnet), and the learning rate $\alpha$ is typically set to $0.01$ (the paper *An overview of gradient descent optimization algorithms*). Code implementation:
```python
# AdaGrad initialization
def initialize_adagrad(parameters):
    """
    Initializes the accumulated squared gradients as a python dictionary with:
        - keys: "dW1", "db1", ..., "dWL", "dbL"
        - values: numpy arrays of zeros of the same shape as the corresponding gradients/parameters.
    Arguments:
    parameters -- python dictionary containing your parameters.
                  parameters['W' + str(l)] = Wl
                  parameters['b' + str(l)] = bl
    Returns:
    G -- python dictionary containing the sum of the squares of the gradients up to step t.
         G['dW' + str(l)] = sum of the squares of the gradients up to dWl
         G['db' + str(l)] = sum of the squares of the gradients up to dbl
    """
    L = len(parameters) // 2  # number of layers in the neural network
    G = {}
    # Initialize the accumulators
    for l in range(L):
        G["dW" + str(l + 1)] = np.zeros(parameters["W" + str(l + 1)].shape)
        G["db" + str(l + 1)] = np.zeros(parameters["b" + str(l + 1)].shape)
    return G
```
```python
# AdaGrad
def update_parameters_with_adagrad(parameters, grads, G, learning_rate, epsilon=1e-7):
    """
    Update parameters using AdaGrad
    Arguments:
    parameters -- python dictionary containing your parameters:
                  parameters['W' + str(l)] = Wl
                  parameters['b' + str(l)] = bl
    grads -- python dictionary containing your gradients for each parameter:
             grads['dW' + str(l)] = dWl
             grads['db' + str(l)] = dbl
    G -- python dictionary containing the accumulated squared gradients:
         G['dW' + str(l)] = ...
         G['db' + str(l)] = ...
    learning_rate -- the learning rate, scalar
    epsilon -- hyperparameter preventing division by zero in adagrad updates
    Returns:
    parameters -- python dictionary containing your updated parameters
    '''
    GW += (dW)^2
    W -= learning_rate/(sqrt(GW) + epsilon)*dW
    Gb += (db)^2
    b -= learning_rate/(sqrt(Gb) + epsilon)*db
    '''
    """
    L = len(parameters) // 2  # number of layers in the neural network
    # AdaGrad update for each parameter
    for l in range(L):
        # accumulate the squared gradients
        G["dW" + str(l + 1)] += grads['dW' + str(l + 1)] ** 2
        G["db" + str(l + 1)] += grads['db' + str(l + 1)] ** 2
        # update parameters
        parameters["W" + str(l + 1)] -= learning_rate / (np.sqrt(G["dW" + str(l + 1)]) + epsilon) * grads['dW' + str(l + 1)]
        parameters["b" + str(l + 1)] -= learning_rate / (np.sqrt(G["db" + str(l + 1)]) + epsilon) * grads['db' + str(l + 1)]
    return parameters
```
A quick run gives an accuracy of 0.9386; the cost as a function of the number of iterations is shown in the figure below:
6. Adadelta
Adadelta is an improvement on AdaGrad, designed mainly to overcome two of AdaGrad's drawbacks (quoted from the Adadelta paper, *ADADELTA: An Adaptive Learning Rate Method*):
the continual decay of learning rates throughout training
the need for a manually selected global learning rate
To address the first problem, Adadelta accumulates gradients only over a window of the past $w$ steps; in practice this is done with the exponentially weighted average introduced earlier:

$$
E[g^2]_t = \rho E[g^2]_{t-1} + (1 - \rho)g^2_t \tag{9}
$$

As discussed in the section on exponentially weighted averages, with $\rho = 0.9$ this roughly corresponds to averaging the last 10 gradients. To address the second problem, Adadelta's final update rule does not require a learning rate $\alpha$ at all. The concrete Adadelta algorithm is shown below (from the paper *ADADELTA: An Adaptive Learning Rate Method*):
As you can see, Adadelta no longer needs a learning rate. (P.S.: looking at the Adadelta source code of Keras and Lasagne, they still use a learning rate, and I'm not sure why, whereas Chainer does not.) Rewriting the algorithm in our usual notation (writing the accumulated squared updates as $\Delta$, called `delta` in the code below):

$$
\begin{array}{lcl}
S_{dw} &=& \rho S_{dw} + (1 - \rho)(dW)^2 \\
S_{db} &=& \rho S_{db} + (1 - \rho)(db)^2 \\\\
V_{dw} &=& \sqrt{\dfrac{\Delta_w + \varepsilon}{S_{dw} + \varepsilon}}\cdot dW \\
V_{db} &=& \sqrt{\dfrac{\Delta_b + \varepsilon}{S_{db} + \varepsilon}}\cdot db \\\\
W &=& W - V_{dw} \\
b &=& b - V_{db} \\\\
\Delta_w &=& \rho \Delta_w + (1 - \rho)(V_{dw})^2 \\
\Delta_b &=& \rho \Delta_b + (1 - \rho)(V_{db})^2 \tag{10}
\end{array}
$$

Here $\varepsilon$ prevents division by zero and is usually set to $10^{-6}$. The code implementation follows:
```python
# initialize_adadelta
def initialize_adadelta(parameters):
    """
    Initializes s, v and delta as three python dictionaries with:
        - keys: "dW1", "db1", ..., "dWL", "dbL"
        - values: numpy arrays of zeros of the same shape as the corresponding gradients/parameters.
    Arguments:
    parameters -- python dictionary containing your parameters.
                  parameters["W" + str(l)] = Wl
                  parameters["b" + str(l)] = bl
    Returns:
    s -- python dictionary that will contain the exponentially weighted average of the squared gradients
         s["dW" + str(l)] = ...
         s["db" + str(l)] = ...
    v -- python dictionary that will contain the RMS-scaled updates
         v["dW" + str(l)] = ...
         v["db" + str(l)] = ...
    delta -- python dictionary that will contain the exponentially weighted average of the squared updates
             delta["dW" + str(l)] = ...
             delta["db" + str(l)] = ...
    """
    L = len(parameters) // 2  # number of layers in the neural network
    s = {}
    v = {}
    delta = {}
    # Initialize s, v, delta. Input: "parameters". Outputs: "s, v, delta".
    for l in range(L):
        s["dW" + str(l + 1)] = np.zeros(parameters["W" + str(l + 1)].shape)
        s["db" + str(l + 1)] = np.zeros(parameters["b" + str(l + 1)].shape)
        v["dW" + str(l + 1)] = np.zeros(parameters["W" + str(l + 1)].shape)
        v["db" + str(l + 1)] = np.zeros(parameters["b" + str(l + 1)].shape)
        delta["dW" + str(l + 1)] = np.zeros(parameters["W" + str(l + 1)].shape)
        delta["db" + str(l + 1)] = np.zeros(parameters["b" + str(l + 1)].shape)
    return s, v, delta
```
```python
# adadelta
def update_parameters_with_adadelta(parameters, grads, rho, s, v, delta, epsilon=1e-6):
    """
    Update parameters using Adadelta
    Arguments:
    parameters -- python dictionary containing your parameters:
                  parameters['W' + str(l)] = Wl
                  parameters['b' + str(l)] = bl
    grads -- python dictionary containing your gradients for each parameter:
             grads['dW' + str(l)] = dWl
             grads['db' + str(l)] = dbl
    rho -- decay constant similar to that used in the momentum method, scalar
    s -- python dictionary containing the exponentially weighted average of the squared gradients:
         s['dW' + str(l)] = ...
         s['db' + str(l)] = ...
    v -- python dictionary containing the RMS-scaled updates:
         v['dW' + str(l)] = ...
         v['db' + str(l)] = ...
    delta -- python dictionary containing the exponentially weighted average of the squared updates:
             delta['dW' + str(l)] = ...
             delta['db' + str(l)] = ...
    epsilon -- hyperparameter preventing division by zero in adadelta updates
    Returns:
    parameters -- python dictionary containing your updated parameters
    '''
    Sdw = rho*Sdw + (1 - rho)*(dW)^2
    Sdb = rho*Sdb + (1 - rho)*(db)^2
    Vdw = sqrt((delta_w + epsilon) / (Sdw + epsilon))*dW
    Vdb = sqrt((delta_b + epsilon) / (Sdb + epsilon))*db
    W -= Vdw
    b -= Vdb
    delta_w = rho*delta_w + (1 - rho)*(Vdw)^2
    delta_b = rho*delta_b + (1 - rho)*(Vdb)^2
    '''
    """
    L = len(parameters) // 2  # number of layers in the neural network
    # adadelta update for each parameter
    for l in range(L):
        # compute s (exponentially weighted average of squared gradients)
        s["dW" + str(l + 1)] = rho * s["dW" + str(l + 1)] + (1 - rho) * grads['dW' + str(l + 1)] ** 2
        s["db" + str(l + 1)] = rho * s["db" + str(l + 1)] + (1 - rho) * grads['db' + str(l + 1)] ** 2
        # compute the RMS-scaled update
        v["dW" + str(l + 1)] = np.sqrt((delta["dW" + str(l + 1)] + epsilon) / (s["dW" + str(l + 1)] + epsilon)) * grads['dW' + str(l + 1)]
        v["db" + str(l + 1)] = np.sqrt((delta["db" + str(l + 1)] + epsilon) / (s["db" + str(l + 1)] + epsilon)) * grads['db' + str(l + 1)]
        # update parameters
        parameters["W" + str(l + 1)] -= v["dW" + str(l + 1)]
        parameters["b" + str(l + 1)] -= v["db" + str(l + 1)]
        # compute delta (exponentially weighted average of squared updates)
        delta["dW" + str(l + 1)] = rho * delta["dW" + str(l + 1)] + (1 - rho) * v["dW" + str(l + 1)] ** 2
        delta["db" + str(l + 1)] = rho * delta["db" + str(l + 1)] + (1 - rho) * v["db" + str(l + 1)] ** 2
    return parameters
```
7. RMSprop (Root Mean Square Prop)
RMSprop was proposed by Hinton in lecture 6 of his Coursera course *Neural Networks for Machine Learning*; the method was never published as a paper (which says something about Hinton's stature — I took that course on Coursera and personally did not find it easy). Like Adadelta, RMSprop is an extension of AdaGrad that aims to work better in the non-convex setting. It also uses an exponentially weighted (exponentially decaying) average so that only gradients from a recent window are kept, which allows it to converge quickly once it reaches a convex bowl-shaped region. The algorithm is shown below (from Ian Goodfellow's *Deep Learning*):
In practice, RMSprop has proven to be an effective and practical optimizer for deep neural networks, and it is currently one of the algorithms deep learning practitioners use most often. The Keras documentation says of RMSprop: "This optimizer is usually a good choice for recurrent neural networks."
Rewriting it in our usual notation:

$$
\begin{array}{lcl}
S_{dw} &=& \beta S_{dw} + (1 - \beta)(dW)^2 \\
S_{db} &=& \beta S_{db} + (1 - \beta)(db)^2 \\\\
W &=& W - \alpha\dfrac{dW}{\sqrt{S_{dw} + \varepsilon}} \\
b &=& b - \alpha\dfrac{db}{\sqrt{S_{db} + \varepsilon}} \tag{11}
\end{array}
$$

Again $\varepsilon$ prevents division by zero, with a default of $10^{-6}$; $\beta$ defaults to $0.9$, and the learning rate $\alpha$ defaults to $0.001$. The implementation:
```python
# RMSprop
def update_parameters_with_rmsprop(parameters, grads, s, beta=0.9, learning_rate=0.01, epsilon=1e-6):
    """
    Update parameters using RMSprop
    Arguments:
    parameters -- python dictionary containing your parameters:
                  parameters['W' + str(l)] = Wl
                  parameters['b' + str(l)] = bl
    grads -- python dictionary containing your gradients for each parameter:
             grads['dW' + str(l)] = dWl
             grads['db' + str(l)] = dbl
    s -- python dictionary containing the exponentially weighted average of the squared gradients:
         s['dW' + str(l)] = ...
         s['db' + str(l)] = ...
    beta -- the decay hyperparameter, scalar
    learning_rate -- the learning rate, scalar
    epsilon -- hyperparameter preventing division by zero in rmsprop updates
    Returns:
    parameters -- python dictionary containing your updated parameters
    '''
    SdW = beta * SdW + (1-beta) * (dW)^2
    Sdb = beta * Sdb + (1-beta) * (db)^2
    W = W - learning_rate * dW/sqrt(SdW + epsilon)
    b = b - learning_rate * db/sqrt(Sdb + epsilon)
    '''
    """
    L = len(parameters) // 2  # number of layers in the neural network
    # rmsprop update for each parameter
    for l in range(L):
        # compute the exponentially weighted average of the squared gradients
        s["dW" + str(l + 1)] = beta * s["dW" + str(l + 1)] + (1 - beta) * grads['dW' + str(l + 1)] ** 2
        s["db" + str(l + 1)] = beta * s["db" + str(l + 1)] + (1 - beta) * grads['db' + str(l + 1)] ** 2
        # update parameters
        parameters["W" + str(l + 1)] = parameters["W" + str(l + 1)] - learning_rate * grads['dW' + str(l + 1)] / np.sqrt(s["dW" + str(l + 1)] + epsilon)
        parameters["b" + str(l + 1)] = parameters["b" + str(l + 1)] - learning_rate * grads['db' + str(l + 1)] / np.sqrt(s["db" + str(l + 1)] + epsilon)
    return parameters
```
Adam (Adaptive Moment Estimation), as summarized later in this post, adds bias correction and momentum to RMSprop. The recommended default values for its hyperparameters are $\alpha = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\varepsilon = 10^{-8}$. Here $t$ is the step counter: after every mini-batch, $t$ is incremented by 1.
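For reference, here is a compact statement of the Adam update in the same $W/b$ notation used above, written to match the code below (this summary is added here; see the Adam paper by Kingma & Ba for the canonical form):

$$
\begin{array}{lcl}
V_{dW} &=& \beta_1 V_{dW} + (1-\beta_1)\,dW, \qquad V_{db} \;=\; \beta_1 V_{db} + (1-\beta_1)\,db \\\\
S_{dW} &=& \beta_2 S_{dW} + (1-\beta_2)\,(dW)^2, \qquad S_{db} \;=\; \beta_2 S_{db} + (1-\beta_2)\,(db)^2 \\\\
\hat{V}_{dW} &=& \dfrac{V_{dW}}{1-\beta_1^t}, \qquad \hat{S}_{dW} \;=\; \dfrac{S_{dW}}{1-\beta_2^t} \qquad (\text{and likewise for } \hat{V}_{db}, \hat{S}_{db}) \\\\
W &=& W - \alpha\,\dfrac{\hat{V}_{dW}}{\sqrt{\hat{S}_{dW}+\varepsilon}}, \qquad b \;=\; b - \alpha\,\dfrac{\hat{V}_{db}}{\sqrt{\hat{S}_{db}+\varepsilon}}
\end{array}
$$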
Here is the concrete code:
```python
# initialize adam
def initialize_adam(parameters):
    """
    Initializes v and s as two python dictionaries with:
        - keys: "dW1", "db1", ..., "dWL", "dbL"
        - values: numpy arrays of zeros of the same shape as the corresponding gradients/parameters.
    Arguments:
    parameters -- python dictionary containing your parameters.
                  parameters["W" + str(l)] = Wl
                  parameters["b" + str(l)] = bl
    Returns:
    v -- python dictionary that will contain the exponentially weighted average of the gradient.
         v["dW" + str(l)] = ...
         v["db" + str(l)] = ...
    s -- python dictionary that will contain the exponentially weighted average of the squared gradient.
         s["dW" + str(l)] = ...
         s["db" + str(l)] = ...
    """
    L = len(parameters) // 2  # number of layers in the neural network
    v = {}
    s = {}
    # Initialize v, s. Input: "parameters". Outputs: "v, s".
    for l in range(L):
        v["dW" + str(l + 1)] = np.zeros(parameters["W" + str(l + 1)].shape)
        v["db" + str(l + 1)] = np.zeros(parameters["b" + str(l + 1)].shape)
        s["dW" + str(l + 1)] = np.zeros(parameters["W" + str(l + 1)].shape)
        s["db" + str(l + 1)] = np.zeros(parameters["b" + str(l + 1)].shape)
    return v, s
```
```python
# adam
def update_parameters_with_adam(parameters, grads, v, s, t, learning_rate=0.01,
                                beta1=0.9, beta2=0.999, epsilon=1e-8):
    """
    Update parameters using Adam
    Arguments:
    parameters -- python dictionary containing your parameters:
                  parameters['W' + str(l)] = Wl
                  parameters['b' + str(l)] = bl
    grads -- python dictionary containing your gradients for each parameter:
             grads['dW' + str(l)] = dWl
             grads['db' + str(l)] = dbl
    v -- Adam variable, moving average of the first gradient, python dictionary
    s -- Adam variable, moving average of the squared gradient, python dictionary
    t -- step counter, incremented after every mini-batch (used for bias correction)
    learning_rate -- the learning rate, scalar.
    beta1 -- Exponential decay hyperparameter for the first moment estimates
    beta2 -- Exponential decay hyperparameter for the second moment estimates
    epsilon -- hyperparameter preventing division by zero in Adam updates
    Returns:
    parameters -- python dictionary containing your updated parameters
    """
    L = len(parameters) // 2  # number of layers in the neural network
    v_corrected = {}          # Initializing first moment estimate, python dictionary
    s_corrected = {}          # Initializing second moment estimate, python dictionary
    # Perform Adam update on all parameters
    for l in range(L):
        # Moving average of the gradients. Inputs: "v, grads, beta1". Output: "v".
        v["dW" + str(l + 1)] = beta1 * v["dW" + str(l + 1)] + (1 - beta1) * grads['dW' + str(l + 1)]
        v["db" + str(l + 1)] = beta1 * v["db" + str(l + 1)] + (1 - beta1) * grads['db' + str(l + 1)]
        # Compute bias-corrected first moment estimate. Inputs: "v, beta1, t". Output: "v_corrected".
        v_corrected["dW" + str(l + 1)] = v["dW" + str(l + 1)] / (1 - np.power(beta1, t))
        v_corrected["db" + str(l + 1)] = v["db" + str(l + 1)] / (1 - np.power(beta1, t))
        # Moving average of the squared gradients. Inputs: "s, grads, beta2". Output: "s".
        s["dW" + str(l + 1)] = beta2 * s["dW" + str(l + 1)] + (1 - beta2) * np.power(grads['dW' + str(l + 1)], 2)
        s["db" + str(l + 1)] = beta2 * s["db" + str(l + 1)] + (1 - beta2) * np.power(grads['db' + str(l + 1)], 2)
        # Compute bias-corrected second raw moment estimate. Inputs: "s, beta2, t". Output: "s_corrected".
        s_corrected["dW" + str(l + 1)] = s["dW" + str(l + 1)] / (1 - np.power(beta2, t))
        s_corrected["db" + str(l + 1)] = s["db" + str(l + 1)] / (1 - np.power(beta2, t))
        # Update parameters. Inputs: "parameters, learning_rate, v_corrected, s_corrected, epsilon". Output: "parameters".
        parameters["W" + str(l + 1)] = parameters["W" + str(l + 1)] - learning_rate * v_corrected["dW" + str(l + 1)] / np.sqrt(s_corrected["dW" + str(l + 1)] + epsilon)
        parameters["b" + str(l + 1)] = parameters["b" + str(l + 1)] - learning_rate * v_corrected["db" + str(l + 1)] / np.sqrt(s_corrected["db" + str(l + 1)] + epsilon)
    return parameters
```
In practice, Adam has proven to be very effective across a wide range of neural network architectures.
8. Which optimizer should you choose?
Having introduced all these algorithms, which one should we actually choose? There is no consensus yet. Schaul et al. compared many optimizers across a large number of learning tasks and found that RMSprop, Adadelta and Adam are all quite robust, with no clear winner. Kingma et al. showed that Adam with bias correction slightly outperforms RMSprop. Overall, Adam is a very good default choice and usually gives solid results.
Below is the summary of these optimizers from the paper *An overview of gradient descent optimization algorithms*:
In summary, RMSprop is an extension of Adagrad that deals with its radically diminishing learning rates. It is identical to Adadelta, except that Adadelta uses the RMS of parameter updates in the numerator update rule. Adam, finally, adds bias-correction and momentum to RMSprop. Insofar, RMSprop, Adadelta, and Adam are very similar algorithms that do well in similar circumstances. Kingma et al. [10] show that its bias-correction helps Adam slightly outperform RMSprop towards the end of optimization as gradients become sparser. Insofar, Adam might be the best overall choice.