Understanding Optimizers in Neural Networks with Keras


   To improve accuracy and reduce loss, we train neural networks with optimization algorithms. Neural network optimization is the process of fitting a model to the training data by updating its weights until the best performance is reached. In this tutorial, we'll briefly learn about some of the commonly used optimizers such as SGD, RMSProp, Adam, Adagrad, and Adamax, and how to apply them in neural network training with the Keras API. The post covers:
  1. Optimizers: SGD, RMSProp, Adam, Adagrad, Adamax
  2. Performance evaluation
  3. Source code listing

Optimizers

   As stated above, optimizers are used to increase accuracy and reduce loss during model training. One of the best-known optimization algorithms is gradient descent.
   The gradient descent algorithm searches for the coefficients that minimize a function by moving iteratively in the direction of the negative gradient (the downhill slope). The size of the step taken in each iteration is called the learning rate. Batch gradient descent computes the gradients over the entire training dataset before each update.
   In neural networks, the output of a given epoch is compared with the expected values and the error is calculated. Based on this error, the weights are updated and the network repeats the process; the procedure of propagating the error backwards through the layers to compute these updates is called backpropagation. Several types of optimizers are available for training neural networks. We'll look at some of the commonly used optimizers provided by the Keras API.
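   To make this concrete, here is a minimal NumPy sketch of plain gradient descent on a simple one-dimensional function; the function f(w) = (w - 3)^2 and the learning rate value are illustrative choices, not part of the tutorial.

import numpy as np

# minimize f(w) = (w - 3)^2, whose gradient is f'(w) = 2 * (w - 3)
w = 0.0              # initial coefficient
learning_rate = 0.1  # size of the step taken in each iteration

for step in range(50):
    grad = 2 * (w - 3)            # gradient of the loss at the current w
    w = w - learning_rate * grad  # move in the direction of the negative slope

print(w)  # converges toward 3, the minimizer of f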

SGD
   SGD (Stochastic Gradient Descent) updates the parameters for each training example (or small batch) instead of computing the gradient over the entire dataset in every epoch, as batch gradient descent does. We can use SGD with its default values or set the parameters explicitly, as shown below.
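   A minimal sketch of creating the SGD optimizer in Keras, first with defaults and then with explicit parameters; the learning rate and momentum values below are illustrative.

from keras.optimizers import SGD

sgd = SGD()                                  # default settings
sgd = SGD(learning_rate=0.01, momentum=0.9)  # explicit parameters
# note: older Keras versions use lr instead of learning_rate

   The optimizer object is then passed to model.compile() when building the network.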
 
RMSProp
   RMSProp (Root Mean Squared Propagation) is a gradient-based optimizer similar to Adagrad. It uses an exponential moving average of the squared gradients to adapt the learning rate.
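   A minimal sketch of creating the RMSprop optimizer in Keras; the values shown match its usual defaults and are illustrative.

from keras.optimizers import RMSprop

rmsprop = RMSprop(learning_rate=0.001, rho=0.9)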

Adam
   Adam (Adaptive Moment Estimation) is a gradient descent-based optimizer that combines the advantages of RMSProp and Adagrad. The method computes an adaptive learning rate for each parameter and applies bias correction.
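   A minimal sketch of creating the Adam optimizer in Keras; the values shown are its usual defaults.

from keras.optimizers import Adam

adam = Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)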

Adagrad
   Adagrad adapts the learning rate for each parameter, applying smaller updates according to the accumulated gradient values. It works well with sparse gradients.
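   A minimal sketch of creating the Adagrad optimizer in Keras; the learning rate value is illustrative.

from keras.optimizers import Adagrad

adagrad = Adagrad(learning_rate=0.01)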

Adamax
   Adamax is a variant of Adam that replaces the L² norm-based update rule with one based on the infinity norm (L∞).
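   A minimal sketch of creating the Adamax optimizer in Keras; the learning rate shown is its usual default.

from keras.optimizers import Adamax

adamax = Adamax(learning_rate=0.002)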

Hyperparameters of optimizers
    Two key hyperparameters of optimizers are momentum and the learning rate.
  • Momentum keeps the weight updates moving more consistently in the same direction by letting each update incorporate a fraction of the previous updates. This often allows a larger learning rate to be used.
  • The learning rate controls the size of the step taken in each update; too small a value makes training slow, while too large a value can make it diverge.
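   A minimal sketch of setting these hyperparameters on an optimizer and compiling a model with it; the layer sizes, loss, and values below are illustrative, not taken from the tutorial.

from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD

model = Sequential()
model.add(Dense(16, activation="relu", input_dim=8))
model.add(Dense(1))

# learning rate and momentum are set on the optimizer, then passed to compile
opt = SGD(learning_rate=0.01, momentum=0.9)
model.compile(optimizer=opt, loss="mse")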

Bayesian Ridge Regression Example in Python


   Bayesian regression can be implemented by treating the regularization parameters as part of the estimation. The BayesianRidge estimator applies Ridge-like regularization and estimates the coefficients a posteriori under a Gaussian prior.
   In this post, we'll learn how to use scikit-learn's BayesianRidge estimator class for a regression problem. The tutorial covers:
  1. Preparing the data
  2. How to use the model
  3. Source code listing
We'll start by loading the required modules.

from sklearn.linear_model import BayesianRidge
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
from numpy import sqrt


Preparing the data

In this tutorial, we'll use the Boston housing dataset. We'll load the dataset and split it into the train and test parts.

boston = load_boston()
x, y = boston.data, boston.target
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.15)
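As a minimal sketch of the next step, we fit BayesianRidge with its default parameters and check the test error; the RMSE evaluation below is an illustrative choice.

# fit the model with default parameters and predict the test part
model = BayesianRidge()
model.fit(xtrain, ytrain)
ypred = model.predict(xtest)

# check the prediction error
rmse = sqrt(mean_squared_error(ytest, ypred))
print("RMSE: %.2f" % rmse)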

Least Angle Regression Example in Python


   Least Angle Regression (LARS) is a regression algorithm that models the response as a linear combination of the predictors and is well suited to high-dimensional data. It is related to forward stepwise regression: at each step, the predictor most correlated with the residual is selected, and the coefficients are moved in a direction equiangular to the selected predictors.
   In this tutorial, we'll learn how to fit regression data with the LARS and Lasso LARS algorithms in Python. We'll use scikit-learn's Lars and LassoLars estimators and the Boston housing dataset. The post covers:
  1. Preparing the data
  2. How to use LARS
  3. How to use Lasso LARS
  4. Source code listing
 Let's get started by loading the required packages.

from sklearn import linear_model
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
from numpy import sqrt


Preparing the data

We'll load the Boston dataset and split it into the train and test parts.

boston = load_boston()
x, y = boston.data, boston.target
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.15)
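As a minimal sketch of fitting LARS on this split, we use the Lars estimator with its default parameters; the RMSE evaluation below is an illustrative choice.

# fit LARS with default parameters and predict the test part
lars = linear_model.Lars()
lars.fit(xtrain, ytrain)
ypred = lars.predict(xtest)

# check the prediction error
rmse = sqrt(mean_squared_error(ytest, ypred))
print("RMSE: %.2f" % rmse)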

Understanding Activation Functions with Python


   The activation function is one of the important building blocks of a neural network. Given the inputs coming from one or more neurons in the previous layer, the activation function decides whether the neuron should be activated or not. The decision is made after summing the weighted inputs and adding the bias value. This step introduces nonlinearity between the input and output values of the network.
   In this tutorial, we'll learn about some of the commonly used activation functions in neural networks, such as sigmoid, tanh, ReLU, and Leaky ReLU, and their implementation with Keras in Python. The tutorial covers:
  1. Sigmoid function
  2. Tanh function
  3. ReLU (Rectified Linear Unit) function
  4. Leaky ReLU function
We'll start by loading the following libraries.

import numpy as np
import matplotlib.pyplot as plt
from keras.models import Sequential
from keras.layers import Activation, Dense, LeakyReLU 

To check the behavior of the activation functions, we'll generate a sequence of input values x.

x = np.arange(-5, 5, 0.1)
print(x[1:10])
[-4.9 -4.8 -4.7 -4.6 -4.5 -4.4 -4.3 -4.2 -4.1]


Sigmoid function

The sigmoid function transforms the input value into an output in the range from 0 to 1. It is also called the logistic function, and its curve is S-shaped. It is typically used for the final decision in the output layer of a binary classification network.
Let's define the function in Python.
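A minimal sketch, using the x values generated above; the function name and the plot are my own choices.

def sigmoid(x):
    # logistic function: maps any real value into the (0, 1) range
    return 1 / (1 + np.exp(-x))

y = sigmoid(x)
plt.plot(x, y)
plt.title("Sigmoid")
plt.show()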