In this tutorial, we'll briefly cover some commonly used optimizers, namely SGD, RMSProp, Adam, Adagrad, and Adamax, and how to use them to train a neural network with the Keras API. The post covers:

- A brief overview of the optimizers: SGD, RMSProp, Adam, Adagrad, Adamax
- Implementing optimizer with Keras
- Source code listing

**Optimizers**

Optimizers are the algorithms that update a model's weights during training in order to reduce the loss and, as a result, improve accuracy. One of the best-known optimization algorithms is gradient descent, so it is worth understanding it before moving on to the other optimizers.

The **gradient descent** algorithm finds the parameter values (coefficients) that minimize a function by moving iteratively in the direction of the negative gradient (the downhill slope). The size of the step taken in each iteration is controlled by the **learning rate**.

**Batch gradient descent** calculates the gradients over the entire training dataset before making each update.
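To make the idea concrete, here is a minimal sketch of batch gradient descent in plain Python. It fits a line `y = w * x + b` to a tiny made-up dataset by minimizing the mean squared error. The function name `batch_gradient_descent`, the toy data, and the chosen learning rate and epoch count are all assumptions for illustration, not part of Keras.

```python
def batch_gradient_descent(xs, ys, learning_rate=0.05, epochs=1000):
    """Fit y = w * x + b by minimizing the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradients of the MSE, averaged over the entire dataset.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # Step against the gradient, scaled by the learning rate.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Toy data generated from y = 2x + 1 (an illustrative assumption).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2 * x + 1 for x in xs]
w, b = batch_gradient_descent(xs, ys)
print(round(w, 3), round(b, 3))
```

Every update here looks at all five points, which is what makes it "batch" gradient descent.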

In neural networks, the output of the network is compared to the expected values and an error (loss) is calculated. The gradient of this loss is propagated back from the output layer to the beginning of the network (backpropagation), the weights are updated accordingly, and the process is repeated for a given number of epochs. Several types of optimizers are available for training neural networks. We'll look at some of the commonly used optimizers provided by the Keras API.

**SGD**

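SGD (stochastic gradient descent) updates the parameters after each training example rather than after a full pass over the data. It avoids computing the gradient over the entire dataset for every update, as batch gradient descent does. In practice, including in Keras, the update is usually computed on a small mini-batch of examples (the `batch_size` argument of `fit`).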
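As a rough sketch of the per-example update, the snippet below fits a line `y = w * x + b` in plain Python, updating the weights after every single training example. The name `sgd_fit`, the toy data, and the hyperparameter values are assumptions made for this illustration only.

```python
import random


def sgd_fit(xs, ys, learning_rate=0.01, epochs=1000, seed=0):
    """Fit y = w * x + b, updating after every single example."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    data = list(zip(xs, ys))
    for _ in range(epochs):
        rng.shuffle(data)  # visit the examples in a random order
        for x, y in data:
            error = w * x + b - y
            # Gradient of the squared error of this one example only.
            w -= learning_rate * 2 * error * x
            b -= learning_rate * 2 * error
    return w, b


# Toy data generated from y = 2x + 1 (an illustrative assumption).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2 * x + 1 for x in xs]
w, b = sgd_fit(xs, ys)
print(round(w, 2), round(b, 2))
```

Each update is cheap because it touches one example, at the cost of a noisier path toward the minimum.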

**RMSProp**

RMSProp (Root Mean Square Propagation) is a gradient-based optimizer similar to Adagrad. It divides the learning rate by the square root of an exponential moving average of the squared gradients, so each parameter gets its own effective learning rate.
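The update rule can be sketched in a few lines of plain Python. This is a simplified single-parameter version, not the Keras implementation; the function name `rmsprop_minimize`, the toy objective `(w - 3)**2`, and the values of `rho` and `epsilon` are assumptions chosen for illustration.

```python
import math


def rmsprop_minimize(grad_fn, w, learning_rate=0.01, rho=0.9,
                     epsilon=1e-7, steps=2000):
    """Minimize a one-parameter function given its gradient."""
    avg_sq = 0.0  # exponential moving average of squared gradients
    for _ in range(steps):
        g = grad_fn(w)
        avg_sq = rho * avg_sq + (1 - rho) * g * g
        # Scale the step by the root of the moving average.
        w -= learning_rate * g / (math.sqrt(avg_sq) + epsilon)
    return w


# Toy objective f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w = rmsprop_minimize(lambda w: 2 * (w - 3), w=0.0)
print(round(w, 2))
```

Because the step is normalized by the recent gradient magnitude, the result hovers close to the minimum at 3 rather than landing on it exactly when the learning rate is held constant.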

**Adam**

Adam (Adaptive Moment Estimation) is a gradient descent-based optimizer that combines the advantages of RMSProp and Adagrad. It computes an adaptive learning rate for each parameter from moving averages of the gradient and the squared gradient, and applies bias correction to those averages.
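Here is a minimal single-parameter sketch of the Adam update, showing the two moving averages and the bias correction. It is a simplified illustration rather than the Keras code; the function name `adam_minimize` and the toy objective are assumptions, and the `beta_1`, `beta_2` values follow the commonly quoted defaults.

```python
import math


def adam_minimize(grad_fn, w, learning_rate=0.01, beta_1=0.9,
                  beta_2=0.999, epsilon=1e-8, steps=2000):
    """Minimize a one-parameter function given its gradient."""
    m, v = 0.0, 0.0  # moving averages of the gradient and its square
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta_1 * m + (1 - beta_1) * g
        v = beta_2 * v + (1 - beta_2) * g * g
        # Bias correction: both averages start at zero and are
        # therefore underestimated during the first steps.
        m_hat = m / (1 - beta_1 ** t)
        v_hat = v / (1 - beta_2 ** t)
        w -= learning_rate * m_hat / (math.sqrt(v_hat) + epsilon)
    return w


# Toy objective f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w = adam_minimize(lambda w: 2 * (w - 3), w=0.0)
print(round(w, 2))
```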

**Adagrad**

Adagrad adapts the learning rate of each parameter individually: parameters that have accumulated large squared gradients receive smaller updates, while rarely updated parameters receive larger ones. It works well with sparse gradients.
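The per-parameter behaviour is easier to see with two parameters whose gradients differ in scale. The sketch below is a simplified illustration in plain Python; the names `adagrad_minimize` and `toy_grad`, the toy objective, and the learning rate are assumptions, not Keras defaults.

```python
import math


def adagrad_minimize(grad_fn, params, learning_rate=0.5,
                     epsilon=1e-7, steps=500):
    """Minimize a function of several parameters given its gradient."""
    params = list(params)
    accum = [0.0] * len(params)  # running sum of squared gradients
    for _ in range(steps):
        grads = grad_fn(params)
        for i, g in enumerate(grads):
            accum[i] += g * g
            # Each parameter is scaled by its own gradient history.
            params[i] -= learning_rate * g / (math.sqrt(accum[i]) + epsilon)
    return params


def toy_grad(p):
    # Gradient of f(p) = (p0 - 3)^2 + 10 * (p1 + 1)^2.
    return [2 * (p[0] - 3), 20 * (p[1] + 1)]


w = adagrad_minimize(toy_grad, [0.0, 0.0])
print([round(value, 3) for value in w])
```

The second parameter sees gradients ten times larger, so its accumulated sum grows faster and its effective learning rate shrinks more.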

**Adamax**

Adamax is a variant of Adam that replaces the L² norm-based scaling of the update with one based on the L∞ (infinity) norm.
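In code, the difference from Adam is that the moving average of squared gradients is replaced by a decayed running maximum of the absolute gradient. The following single-parameter sketch follows the update described in the Adam paper; the function name `adamax_minimize`, the toy objective, and the learning rate are assumptions for illustration.

```python
def adamax_minimize(grad_fn, w, learning_rate=0.01, beta_1=0.9,
                    beta_2=0.999, epsilon=1e-7, steps=2000):
    """Minimize a one-parameter function given its gradient."""
    m, u = 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta_1 * m + (1 - beta_1) * g
        # Infinity-norm term: a decayed running maximum of |g|.
        u = max(beta_2 * u, abs(g))
        # Only the first moment needs bias correction.
        w -= (learning_rate / (1 - beta_1 ** t)) * m / (u + epsilon)
    return w


# Toy objective f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w = adamax_minimize(lambda w: 2 * (w - 3), w=0.0)
print(round(w, 3))
```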

**Hyperparameters of optimizers**

Two of the key hyperparameters of optimizers are the momentum and the learning rate.

**Momentum** keeps successive updates moving in a consistent direction. It speeds up learning by adding a fraction of the previous weight update to the current one. **The learning rate** defines the size of the step taken at each update.
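A small sketch shows how the two hyperparameters interact. The function below applies gradient descent with a velocity term; setting `momentum=0.0` reduces it to plain gradient descent. The name `momentum_minimize`, the toy objective, and the step count are assumptions for illustration.

```python
def momentum_minimize(grad_fn, w, learning_rate=0.01, momentum=0.9, steps=100):
    """Gradient descent with a velocity term (momentum)."""
    velocity = 0.0
    for _ in range(steps):
        g = grad_fn(w)
        # Keep a fraction of the previous update and add the new step.
        velocity = momentum * velocity - learning_rate * g
        w += velocity
    return w


def grad(w):
    # Gradient of the toy objective f(w) = (w - 3)^2.
    return 2 * (w - 3)


plain = momentum_minimize(grad, w=0.0, momentum=0.0)
with_momentum = momentum_minimize(grad, w=0.0, momentum=0.9)
print(round(plain, 3), round(with_momentum, 3))
```

With the same learning rate and the same number of steps, the run with momentum ends closer to the minimum at 3 on this toy problem.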

**Implementing optimizer with Keras**

We'll check the above optimizers with a given network and compare the results. We'll start by loading the required modules for this tutorial.

    # Note: load_boston was removed in scikit-learn 1.2, so this needs an older release
    from sklearn.datasets import load_boston
    from keras.models import Sequential
    from keras.layers import Dense, BatchNormalization
    from keras import optimizers
    import matplotlib.pyplot as plt

Optimizers can be used in two ways in Keras. We can create an optimizer object by instantiating its class. This method is useful if you want to change the parameters of the optimizer.

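    # `lr` is the deprecated name of this argument; recent Keras versions use `learning_rate`
    sgd_opt = optimizers.SGD(learning_rate=0.01)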

Or we can simply pass the name of the optimizer as a string when compiling the model, which uses the optimizer's default parameters.

    model.compile(loss="mean_squared_error", optimizer='adam')

In this tutorial, we'll use the Boston housing dataset. We load it as follows.

    boston = load_boston()
    x, y = boston.data, boston.target

Next, we'll define a function that trains the model with each given optimizer. Here, we'll collect the MSE (the loss value) obtained with each optimizer. Note that the MSE is evaluated on the same data that is used for training.

    def run_optimizer(opts):
        mses = []
        for opt in opts:
            # Build a fresh model for every optimizer
            model = Sequential()
            model.add(Dense(16, input_dim=13, activation="relu"))
            model.add(Dense(8, activation="relu"))
            model.add(BatchNormalization())
            model.add(Dense(1, kernel_initializer="normal"))
            model.compile(loss="mean_squared_error", optimizer=opt)
            model.fit(x, y, epochs=50, batch_size=8, verbose=0)
            print(model.optimizer)
            # evaluate() returns the loss (MSE) on the training data
            mses.append(model.evaluate(x, y))
        return mses

We'll define the list of optimizer names.

    opt_names = ["adam", "sgd", "rmsprop", "adagrad", "adamax"]

Next, we'll run the function and get the MSE results. In this method, all optimizers are used with their default parameters.

`mses = run_optimizer(opt_names)`

Next, we'll set the same `learning_rate` value (0.01) for every optimizer and run the function again.

    # `learning_rate` replaces the deprecated `lr` argument
    sgd = optimizers.SGD(learning_rate=0.01)
    rmsprop = optimizers.RMSprop(learning_rate=0.01)
    adagrad = optimizers.Adagrad(learning_rate=0.01)
    adam = optimizers.Adam(learning_rate=0.01)
    adamax = optimizers.Adamax(learning_rate=0.01)

    # Same order as opt_names so the results line up
    opts = [adam, sgd, rmsprop, adagrad, adamax]
    mses_lr = run_optimizer(opts)

We'll print the results and visualize them in a plot to compare easily.

    f = plt.figure()

    # Left panel: default parameters
    f.add_subplot(1, 2, 1)
    plt.title("Default")
    plt.bar(opt_names, mses)
    plt.ylabel("MSE")
    plt.xlabel("Optimizers")

    # Right panel: learning rate fixed at 0.01
    # (plt.legend() is dropped because the bars carry no labels)
    f.add_subplot(1, 2, 2)
    plt.title("Learning rate with 0.01")
    plt.bar(opt_names, mses_lr)
    plt.ylabel("MSE")
    plt.xlabel("Optimizers")
    plt.show()

    print("opt, default, lr=0.01")
    for i in range(len(opt_names)):
        print("%s: %.2f, %.2f" % (opt_names[i], mses[i], mses_lr[i]))

    opt, default, lr=0.01
    adam: 24.75, 31.65
    sgd: 140.17, 84.51
    rmsprop: 22.50, 19.16
    adagrad: 88.10, 84.04
    adamax: 21.97, 23.55

Here, we can compare the performance of the optimizers. Please note that the above results may change on every run, because the weights are initialized randomly and the dataset used to train the model is small. For the given dataset and network model, we can conclude:
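Since the numbers vary from run to run, one way to make a comparison more stable is to repeat each experiment several times and report the mean and standard deviation. The sketch below shows only the bookkeeping. `summarize_runs` and `placeholder_train_once` are hypothetical names, and the placeholder returns made-up random scores purely so the snippet runs on its own; in practice it would build, fit, and evaluate a model.

```python
import random
import statistics


def summarize_runs(train_once, n_runs=5):
    """Call train_once(seed) several times and summarize the scores."""
    scores = [train_once(seed) for seed in range(n_runs)]
    return statistics.mean(scores), statistics.stdev(scores)


def placeholder_train_once(seed):
    # Hypothetical stand-in for training a model and returning its MSE.
    # The values are synthetic noise around 1.0, not real results.
    rng = random.Random(seed)
    return 1.0 + rng.uniform(-0.2, 0.2)


mean_mse, std_mse = summarize_runs(placeholder_train_once)
print("mean=%.3f, std=%.3f" % (mean_mse, std_mse))
```

A gap between two optimizers that is smaller than the spread across runs should not be read as a real difference.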

- Adamax and RMSProp are the better optimizers for this case.
- A learning rate of 0.01 is a good choice for SGD and RMSProp.
- Adagrad and SGD are not good candidates here.

**Source code listing**

    # Note: load_boston was removed in scikit-learn 1.2, so this needs an older release
    from sklearn.datasets import load_boston
    from keras.models import Sequential
    from keras.layers import Dense, BatchNormalization
    from keras import optimizers
    import matplotlib.pyplot as plt

    boston = load_boston()
    x, y = boston.data, boston.target

    def run_optimizer(opts):
        mses = []
        for opt in opts:
            # Build a fresh model for every optimizer
            model = Sequential()
            model.add(Dense(16, input_dim=13, activation="relu"))
            model.add(Dense(8, activation="relu"))
            model.add(BatchNormalization())
            model.add(Dense(1, kernel_initializer="normal"))
            model.compile(loss="mean_squared_error", optimizer=opt)
            model.fit(x, y, epochs=50, batch_size=8, verbose=0)
            print(model.optimizer)
            # evaluate() returns the loss (MSE) on the training data
            mses.append(model.evaluate(x, y))
        return mses

    # Run 1: optimizers with their default parameters
    opt_names = ["adam", "sgd", "rmsprop", "adagrad", "adamax"]
    mses = run_optimizer(opt_names)

    # Run 2: the same learning rate for every optimizer
    # (`learning_rate` replaces the deprecated `lr` argument)
    sgd = optimizers.SGD(learning_rate=0.01)
    rmsprop = optimizers.RMSprop(learning_rate=0.01)
    adagrad = optimizers.Adagrad(learning_rate=0.01)
    adam = optimizers.Adam(learning_rate=0.01)
    adamax = optimizers.Adamax(learning_rate=0.01)
    opts = [adam, sgd, rmsprop, adagrad, adamax]
    mses_lr = run_optimizer(opts)

    print("opt, default, lr=0.01")
    for i in range(len(opt_names)):
        print("%s: %.2f, %.2f" % (opt_names[i], mses[i], mses_lr[i]))

    # Bar charts of both runs side by side
    # (plt.legend() is dropped because the bars carry no labels)
    f = plt.figure()
    f.add_subplot(1, 2, 1)
    plt.title("Default")
    plt.bar(opt_names, mses)
    plt.ylabel("MSE")
    plt.xlabel("Optimizers")
    f.add_subplot(1, 2, 2)
    plt.title("Learning rate with 0.01")
    plt.bar(opt_names, mses_lr)
    plt.ylabel("MSE")
    plt.xlabel("Optimizers")
    plt.show()

