Understanding Optimizers in Neural Networks with Keras

   To improve accuracy and reduce loss, we train neural networks with optimization algorithms. Optimizers are one of the main components of model training: neural network optimization is the process of fitting the model to the training data by adjusting the weights until the best performance is reached.

   In this tutorial, we'll briefly look at some of the commonly used optimizers such as SGD, RMSProp, Adam, Adagrad, and Adamax, and how to use them for neural network training with the Keras API. The post covers:
  1. A brief overview of optimizers: SGD, RMSProp, Adam, Adagrad, Adamax
  2. Implementing optimizers with Keras
  3. Source code listing

Optimizers

   As stated above, optimizers are used to increase the accuracy and reduce the loss during model training. One of the best-known optimization algorithms is gradient descent, so before moving on to the other optimizers we should understand how gradient descent works.

   The gradient descent algorithm finds the coefficients that minimize a function by iteratively moving in the direction of the negative gradient (downhill). The size of the step taken in each iteration is controlled by the learning rate. Batch gradient descent calculates the gradient over the entire training dataset before each update.
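   As a rough, hand-written illustration (this is plain NumPy, not Keras code, and the toy data and learning rate below are made up for the example), one run of batch gradient descent on a simple squared-error loss could look like this:

import numpy as np

# toy data: y = 2 * x, so the best weight is 2.0
x_toy = np.array([[1.0], [2.0], [3.0]])
y_toy = np.array([2.0, 4.0, 6.0])
w = np.zeros(1)
learning_rate = 0.01
for step in range(100):
    # gradient of the mean squared error over the whole dataset
    grad = 2.0 / len(x_toy) * x_toy.T.dot(x_toy.dot(w) - y_toy)
    # move against the gradient, scaled by the learning rate
    w = w - learning_rate * grad
print(w)   # approaches 2.0, the minimizer of the loss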

   In neural networks, the output for a given epoch is compared with the expected values and the error is calculated. Based on this error, the gradients are propagated back through the network (backpropagation) and the weights are updated, and the process is repeated for the given number of epochs. Several types of optimizers are available to train neural networks; we'll look at some of the most commonly used ones provided by the Keras API.

SGD
   SGD (stochastic gradient descent) updates the parameters for each training example (or small mini-batch), rather than computing the gradient over the entire dataset for every update as batch gradient descent does.
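   A minimal sketch of the same idea with per-example updates (again plain NumPy with toy data, not the Keras implementation):

import numpy as np

# toy data: y = 2 * x
x_toy = np.array([[1.0], [2.0], [3.0]])
y_toy = np.array([2.0, 4.0, 6.0])
w = np.zeros(1)
learning_rate = 0.01
for epoch in range(50):
    for xi, yi in zip(x_toy, y_toy):
        # gradient of the squared error for a single example
        grad = 2.0 * xi * (xi.dot(w) - yi)
        # the weights are updated immediately, example by example
        w = w - learning_rate * grad
print(w)   # approaches 2.0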
 
RMSProp
   RMSProp (Root Mean Squared Propagation) is a gradient-based optimizer similar to Adagrad. It uses an exponential moving average of the squared gradients to adapt the learning rate for each parameter.
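   A rough sketch of the update rule (the parameter names follow the usual formulation, not Keras' exact internals):

import numpy as np

# one RMSProp step: s is the running average of squared gradients
def rmsprop_step(w, grad, s, lr=0.001, rho=0.9, eps=1e-7):
    s = rho * s + (1.0 - rho) * grad ** 2      # exponential moving average of grad^2
    w = w - lr * grad / (np.sqrt(s) + eps)     # per-parameter scaled step
    return w, s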

Adam
   Adam (Adaptive Moment Estimation) is a gradient descent-based optimizer that combines the advantages of RMSProp and Adagrad. It computes an adaptive learning rate for each parameter and applies bias correction to its moment estimates.
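   A sketch of a single Adam step with bias-corrected moment estimates (again the standard formulation, not Keras internals):

import numpy as np

# one Adam step at iteration t (t starts at 1)
def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    m = beta1 * m + (1.0 - beta1) * grad          # first moment: moving average of gradients
    v = beta2 * v + (1.0 - beta2) * grad ** 2     # second moment: moving average of squared gradients
    m_hat = m / (1.0 - beta1 ** t)                # bias correction for the early steps
    v_hat = v / (1.0 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v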

Adagrad
   Adagrad adapts the learning rate per parameter, giving smaller updates to parameters whose gradients have been large or frequent. It works well with sparse gradients.
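   A sketch of the Adagrad rule: squared gradients are accumulated over all steps, so parameters that have received large gradients get progressively smaller updates.

import numpy as np

# one Adagrad step: acc is the running sum of squared gradients (it never decays)
def adagrad_step(w, grad, acc, lr=0.01, eps=1e-7):
    acc = acc + grad ** 2
    w = w - lr * grad / (np.sqrt(acc) + eps)   # larger accumulation -> smaller step
    return w, acc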

Adamax
   Adamax is a variant of Adam that replaces the L2 norm-based second-moment update with one based on the infinity norm.
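   A sketch of the Adamax rule, where Adam's second-moment average is replaced by an exponentially weighted infinity norm (a running maximum of the absolute gradients):

import numpy as np

# one Adamax step at iteration t (t starts at 1)
def adamax_step(w, grad, m, u, t, lr=0.002, beta1=0.9, beta2=0.999, eps=1e-7):
    m = beta1 * m + (1.0 - beta1) * grad                  # first moment, as in Adam
    u = np.maximum(beta2 * u, np.abs(grad))               # infinity-norm style running max
    w = w - lr * m / ((1.0 - beta1 ** t) * (u + eps))     # only the first moment needs bias correction
    return w, m, u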

Hyperparameters of optimizers
    Some of the key hyperparameters of optimizers are momentum and the learning rate.
  • Momentum keeps parameter updates moving consistently in the same direction by adding a fraction of the previous update to the current one, which speeds up convergence.
  • The learning rate controls the size of each update step: larger values learn faster but can overshoot, while smaller values converge more slowly but more stably. Both can be set in Keras, as shown in the example after this list.
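   For example, with the Keras API used in this tutorial, both hyperparameters can be set when constructing the optimizer (the momentum value of 0.9 below is just a common choice for illustration, not a value tuned for this dataset):

from keras import optimizers

# SGD with an explicit learning rate and a momentum term
sgd_momentum = optimizers.SGD(lr=0.01, momentum=0.9)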

Implementing optimizers with Keras

   We'll check the above optimizers with a given network and compare the results. We'll start by loading the required modules for this tutorial.

from sklearn.datasets import load_boston
from keras.models import Sequential
from keras.layers import Dense, BatchNormalization
from keras import optimizers
import matplotlib.pyplot as plt

Optimizers can be used in two ways in Keras. We can create the optimizer by instantiating its class; this approach is useful when you want to change the optimizer's parameters.

sgd_opt = optimizers.SGD(lr=0.01)

Or we can simply pass the optimizer's name as a string when compiling the model.

model.compile(loss="mean_squared_error", optimizer='adam') 

In this tutorial we'll use the Boston housing dataset, so let's load it.

boston = load_boston()
x, y = boston.data, boston.target

Next, we'll define a function that trains the same model with each given optimizer and collects the resulting MSE (the loss) for each one.

def run_optimizer(opts):
    mses = []
    for opt in opts:
        # build the same small regression network for each optimizer
        model = Sequential()
        model.add(Dense(16, input_dim=13, activation="relu"))
        model.add(Dense(8, activation="relu"))
        model.add(BatchNormalization())
        model.add(Dense(1, kernel_initializer="normal"))
        model.compile(loss="mean_squared_error", optimizer=opt)
        model.fit(x, y, epochs=50, batch_size=8, verbose=0)
        print(model.optimizer)
        # evaluate() returns the MSE loss on the training data
        mses.append(model.evaluate(x, y))
    return mses

We'll define the list of optimizer names.

opt_names = ["adam","sgd", "rmsprop", "adagrad", "adamax"]

Next, we'll run the function and collect the MSE results. In this run, every optimizer is used with its default parameters.

mses = run_optimizer(opt_names)

In the next run, we'll set the same learning rate (0.01) for every optimizer and call the function again.

sgd = optimizers.SGD(lr=0.01)
rmsprop = optimizers.RMSprop(lr=0.01)
adagrad = optimizers.Adagrad(lr=0.01)
adam = optimizers.Adam(lr=0.01)
adamax = optimizers.Adamax(lr=0.01)

opts = [adam, sgd, rmsprop, adagrad, adamax]
mses_lr = run_optimizer(opts)

We'll print the results and visualize them in a bar plot for easier comparison.

f = plt.figure()
# left panel: optimizers with their default parameters
f.add_subplot(1, 2, 1)
plt.title("Default")
plt.bar(opt_names, mses)
plt.ylabel("MSE")
plt.xlabel("Optimizers")
# right panel: every optimizer with a learning rate of 0.01
f.add_subplot(1, 2, 2)
plt.title("Learning rate 0.01")
plt.bar(opt_names, mses_lr)
plt.ylabel("MSE")
plt.xlabel("Optimizers")
plt.show()

print("opt, default,  lr=0.01")
for i in range(len(opt_names)):
 print("%s: %.2f, %.2f" % (opt_names[i], mses[i], mses_lr[i])) 
opt, default,  lr=0.01
adam: 24.75, 31.65
sgd: 140.17, 84.51
rmsprop: 22.50, 19.16
adagrad: 88.10, 84.04
adamax: 21.97, 23.55 

   Here we can compare the performance of the optimizers. Note that the results above may change from run to run because the dataset used to train the model is small. For this dataset and network model, we can conclude:
  • Adamax and RMSProp are the better optimizers for this case.
  • A learning rate of 0.01 is a good choice for SGD and RMSProp.
  • Adagrad and SGD are not good candidates here.
    In this tutorial, we've briefly learned about optimizers and how to use them with Keras in neural networks. When you build your model, you can evaluate each optimizer and apply the most suitable one for your training. The full source code is listed below.


Source code listing

from sklearn.datasets import load_boston
from keras.models import Sequential
from keras.layers import Dense, BatchNormalization
from keras import optimizers
import matplotlib.pyplot as plt

boston = load_boston()
x, y = boston.data, boston.target

def run_optimizer(opts):
    mses = []
    for opt in opts:
        # build the same small regression network for each optimizer
        model = Sequential()
        model.add(Dense(16, input_dim=13, activation="relu"))
        model.add(Dense(8, activation="relu"))
        model.add(BatchNormalization())
        model.add(Dense(1, kernel_initializer="normal"))
        model.compile(loss="mean_squared_error", optimizer=opt)
        model.fit(x, y, epochs=50, batch_size=8, verbose=0)
        print(model.optimizer)
        # evaluate() returns the MSE loss on the training data
        mses.append(model.evaluate(x, y))
    return mses

opt_names = ["adam","sgd", "rmsprop", "adagrad", "adamax"]
mses = run_optimizer(opt_names)

sgd = optimizers.SGD(lr=0.01)
rmsprop = optimizers.RMSprop(lr=0.01)
adagrad = optimizers.Adagrad(lr=0.01)
adam = optimizers.Adam(lr=0.01)
adamax = optimizers.Adamax(lr=0.01)

opts = [adam, sgd, rmsprop, adagrad, adamax]
mses_lr = run_optimizer(opts)

print("opt, default,  lr=0.01")
for i in range(len(opt_names)):
 print("%s: %.2f, %.2f" % (opt_names[i], mses[i], mses_lr[i]))

f = plt.figure()
# left panel: optimizers with their default parameters
f.add_subplot(1, 2, 1)
plt.title("Default")
plt.bar(opt_names, mses)
plt.ylabel("MSE")
plt.xlabel("Optimizers")
# right panel: every optimizer with a learning rate of 0.01
f.add_subplot(1, 2, 2)
plt.title("Learning rate 0.01")
plt.bar(opt_names, mses_lr)
plt.ylabel("MSE")
plt.xlabel("Optimizers")
plt.show()


