# Optimizers in Tensorflow

• Last Updated : 27 Jan, 2022

Optimizers are techniques or algorithms that reduce loss (error) by tuning a model's weights and other parameters, thereby minimizing the loss function and helping the model reach better accuracy faster.

## Optimizers in Tensorflow

Optimizer is the base class in Tensorflow from which all optimizers are extended; it is initialized with the model's parameters, but no tensor is given to it. The basic optimizer provided by Tensorflow is:

```
tf.train.Optimizer            # Tensorflow version 1.x
tf.compat.v1.train.Optimizer  # Tensorflow version 2.x
```

This class is never used directly; instead, its subclasses are instantiated.

Before discussing the individual optimizers, let's first look at the algorithm on which most of them are built: gradient descent. Gradient descent links weights and the loss function. Since a gradient is a measure of change, the gradient descent algorithm uses partial derivatives to determine how each weight should be adjusted to reduce the loss (for example, add 0.7, subtract 0.27, and so on). An obstacle arises, however, when the algorithm gets stuck in a local minimum instead of the global minimum, which is common with large, multi-dimensional datasets.
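
The core idea can be sketched without TensorFlow at all. Below is a minimal gradient-descent loop on the one-dimensional loss (w - 3)^2, whose gradient is 2(w - 3); every name here is illustrative, not part of any API.

```python
# Plain gradient descent on loss(w) = (w - 3)^2.
# The gradient is 2 * (w - 3); the minimum sits at w = 3.

def gradient(w):
    return 2.0 * (w - 3.0)

w = 0.0               # initial weight
learning_rate = 0.1   # step size

for _ in range(100):
    w -= learning_rate * gradient(w)   # step against the gradient

print(round(w, 4))  # → 3.0
```

Each step shrinks the distance to the minimum by a constant factor (here 1 - 2 * learning_rate), which is why the loop converges quickly on this convex loss.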

```
Syntax: tf.compat.v1.train.GradientDescentOptimizer(learning_rate,
                                                    use_locking=False,
                                                    name='GradientDescent')
Parameters:
learning_rate: rate at which the algorithm updates the parameter.
               Tensor or float type of value.
use_locking: Use locks for update operations if True.
name: Optional name for the operation.
```

## Tensorflow Keras Optimizers Classes

Tensorflow predominantly supports 9 optimizer classes, including its base class (Optimizer):

• SGD
• RMSprop
• Adagrad
• Adadelta
• Adam
• Adamax
• Nadam
• FTRL

## SGD Optimizer (Stochastic Gradient Descent)

The Stochastic Gradient Descent (SGD) optimization method executes a parameter update for every training example. On huge datasets, SGD performs redundant computations, and its frequent updates have high variance, causing the objective function to fluctuate heavily.

```
Syntax: tf.keras.optimizers.SGD(learning_rate=0.01,
                                momentum=0.0,
                                nesterov=False,
                                name='SGD',
                                **kwargs)
Parameters:
learning_rate: rate at which the algorithm updates the parameter.
               Tensor or float type of value. Default value is 0.01.
momentum: accelerates gradient descent in the appropriate direction.
          Float type of value. Default value is 0.0.
nesterov: Whether or not to apply Nesterov momentum.
          Boolean type of value. Default value is False.
name: Optional name for the operation.
**kwargs: Keyword arguments of variable length.
```

Advantages:

1. Requires less memory.
2. Frequently updates the model parameters.
3. If momentum is used, it helps to reduce noise.

Disadvantages:

1. High variance.
2. Computationally expensive.
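
To make the momentum argument concrete, here is a hand-rolled sketch of the SGD-with-momentum update rule on a toy quadratic loss; the loss and variable names are illustrative, not part of the TensorFlow API.

```python
# SGD with momentum on loss(w) = (w - 2)^2 (minimum at w = 2).

def grad(w):
    return 2.0 * (w - 2.0)

w, velocity = 0.0, 0.0
learning_rate, momentum = 0.1, 0.9   # same roles as the SGD arguments above

for _ in range(300):
    velocity = momentum * velocity - learning_rate * grad(w)
    w += velocity   # the velocity term smooths and accelerates the update
```

tf.keras.optimizers.SGD(learning_rate=0.1, momentum=0.9) applies essentially the same rule, tensor-wise, to every trainable variable.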

## Adagrad Optimizer

Adagrad (Adaptive Gradient) adapts the learning rate of each parameter individually, scaling updates by the accumulated sum of that parameter's squared gradients, which makes it well suited to sparse data.

```
Syntax: tf.keras.optimizers.Adagrad(learning_rate=0.001,
                                    initial_accumulator_value=0.1,
                                    epsilon=1e-07,
                                    name='Adagrad',
                                    **kwargs)
Parameters:
learning_rate: rate at which the algorithm updates the parameter.
               Tensor or float type of value. Default value is 0.001.
initial_accumulator_value: Starting value for the per-parameter
               gradient accumulator. Floating point type of value.
               Must be non-negative. Default value is 0.1.
epsilon: Small value used to sustain numerical stability.
         Floating point type of value. Default value is 1e-07.
name: Optional name for the operation.
**kwargs: Keyword arguments of variable length.
```

Advantages:

1. Best suited for sparse datasets.
2. The learning rate adapts with each iteration.

Disadvantages:

1. The learning rate keeps shrinking as squared gradients accumulate, which can stall learning in deep networks.
2. May result in the dead neuron problem.
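
A scalar sketch of the Adagrad update shows how the accumulated squared gradients shrink the effective step size over time; the loss and names below are illustrative.

```python
import math

# Adagrad on loss(w) = (w - 5)^2 (minimum at w = 5).

def grad(w):
    return 2.0 * (w - 5.0)

w = 0.0
learning_rate = 1.0
accumulator = 0.1   # plays the role of initial_accumulator_value
epsilon = 1e-7

for _ in range(500):
    g = grad(w)
    accumulator += g * g                                  # grows monotonically
    w -= learning_rate * g / (math.sqrt(accumulator) + epsilon)
```

Because the accumulator only ever grows, the effective learning rate only ever shrinks, which is exactly the decaying-learning-rate disadvantage noted above.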

## RMSprop Optimizer

RMSprop stands for Root Mean Square Propagation. Rather than letting gradients accumulate indefinitely, the RMSprop optimizer accumulates gradients only over a fixed recent window. It can be considered an updated version of AdaGrad with a few improvements. RMSprop uses plain momentum instead of Nesterov momentum.

```
Syntax: tf.keras.optimizers.RMSprop(learning_rate=0.001,
                                    rho=0.9,
                                    momentum=0.0,
                                    epsilon=1e-07,
                                    centered=False,
                                    name='RMSprop',
                                    **kwargs)
Parameters:
learning_rate: rate at which the algorithm updates the parameter.
               Tensor or float type of value. Default value is 0.001.
rho: Discounting factor for gradients. Default value is 0.9.
momentum: accelerates RMSprop in the appropriate direction.
          Float type of value. Default value is 0.0.
epsilon: Small value used to sustain numerical stability.
         Floating point type of value. Default value is 1e-07.
centered: If True, gradients are normalized by their estimated
          variance. This may help training but is computationally
          more expensive. Boolean type of value. Default value is False.
name: Optional name for the operation.
**kwargs: Keyword arguments of variable length.
```

Advantages:

1. The learning rate is adjusted automatically.
2. A separate learning rate is maintained for every parameter.
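
The rho discounting can be sketched per coordinate as below. Unlike Adagrad, the squared-gradient average forgets old history; names and loss are illustrative.

```python
import math

# RMSprop on loss(w) = (w - 4)^2 (minimum at w = 4).

def grad(w):
    return 2.0 * (w - 4.0)

w = 0.0
learning_rate, rho, epsilon = 0.01, 0.9, 1e-7
mean_square = 0.0   # exponential moving average of g^2

for _ in range(2000):
    g = grad(w)
    mean_square = rho * mean_square + (1 - rho) * g * g
    w -= learning_rate * g / (math.sqrt(mean_square) + epsilon)
```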

## Adadelta Optimizer

The Adaptive Delta (Adadelta) optimizer is an extension of AdaGrad (similar to the RMSprop optimizer); however, Adadelta removes the need for a hand-tuned learning rate by replacing it with an exponential moving average of squared deltas (the differences between current and updated weights). It also addresses the decaying learning rate problem.

```
Syntax: tf.keras.optimizers.Adadelta(learning_rate=0.001,
                                     rho=0.95,
                                     epsilon=1e-07,
                                     name='Adadelta',
                                     **kwargs)
Parameters:
learning_rate: rate at which the algorithm updates the parameter.
               Tensor or float type of value. Default value is 0.001.
rho: Decay rate. Tensor or floating point type of value.
     Default value is 0.95.
epsilon: Small value used to sustain numerical stability.
         Floating point type of value. Default value is 1e-07.
name: Optional name for the operation.
**kwargs: Keyword arguments of variable length.
```

Advantage: A learning rate does not have to be tuned by hand.
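
A scalar sketch of the Adadelta rule follows the original paper's formulation, which has no explicit learning rate (the Keras class additionally scales the step by its learning_rate argument); all names are illustrative.

```python
import math

# Adadelta on loss(w) = (w - 1)^2 (minimum at w = 1).

def grad(w):
    return 2.0 * (w - 1.0)

w = 0.0
rho, epsilon = 0.95, 1e-6
avg_sq_grad = 0.0    # running average of squared gradients
avg_sq_delta = 0.0   # running average of squared updates

for _ in range(5000):
    g = grad(w)
    avg_sq_grad = rho * avg_sq_grad + (1 - rho) * g * g
    delta = -math.sqrt(avg_sq_delta + epsilon) / math.sqrt(avg_sq_grad + epsilon) * g
    avg_sq_delta = rho * avg_sq_delta + (1 - rho) * delta * delta
    w += delta
```

The ratio of the two running averages gives each step the right units without any hand-set learning rate, which is the advantage noted above.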

## Adam Optimizer

Adaptive Moment Estimation (Adam) is among the most widely used optimization techniques today. It computes an adaptive learning rate for each parameter and combines the advantages of both RMSprop and momentum: it stores a decaying average of past gradients and a decaying average of past squared gradients.

```
Syntax: tf.keras.optimizers.Adam(learning_rate=0.001,
                                 beta_1=0.9,
                                 beta_2=0.999,
                                 epsilon=1e-07,
                                 amsgrad=False,
                                 name='Adam',
                                 **kwargs)
Parameters:
learning_rate: rate at which the algorithm updates the parameter.
               Tensor or float type of value. Default value is 0.001.
beta_1: Exponential decay rate for the 1st moment. Constant float
        tensor or float type of value. Default value is 0.9.
beta_2: Exponential decay rate for the 2nd moment. Constant float
        tensor or float type of value. Default value is 0.999.
epsilon: Small value used to sustain numerical stability.
         Floating point type of value. Default value is 1e-07.
amsgrad: Whether to apply the AMSGrad variant. Boolean type of value.
         Default value is False.
name: Optional name for the operation.
**kwargs: Keyword arguments of variable length.
```

Advantages:

1. Easy to implement.
2. Requires less memory.
3. Computationally efficient.

Disadvantages:

1. Can have weight decay problems.
2. Sometimes may not converge to an optimal solution.
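
The beta_1/beta_2 machinery can be sketched per coordinate; this follows the published Adam update with bias correction, with an illustrative loss and variable names.

```python
import math

# Adam on loss(w) = (w - 3)^2 (minimum at w = 3).

def grad(w):
    return 2.0 * (w - 3.0)

w = 0.0
learning_rate, beta_1, beta_2, epsilon = 0.01, 0.9, 0.999, 1e-7
m = v = 0.0

for t in range(1, 2001):
    g = grad(w)
    m = beta_1 * m + (1 - beta_1) * g        # decaying average of gradients
    v = beta_2 * v + (1 - beta_2) * g * g    # decaying average of squared gradients
    m_hat = m / (1 - beta_1 ** t)            # bias correction for the zero init
    v_hat = v / (1 - beta_2 ** t)
    w -= learning_rate * m_hat / (math.sqrt(v_hat) + epsilon)
```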

## Adamax Optimizer

AdaMax is a variant of the Adam optimizer. It is built on the adaptive estimation of low-order moments, based on the infinity norm. For models with embeddings, AdaMax is sometimes considered better than Adam.

```
Syntax: tf.keras.optimizers.Adamax(learning_rate=0.001,
                                   beta_1=0.9,
                                   beta_2=0.999,
                                   epsilon=1e-07,
                                   name='Adamax',
                                   **kwargs)
Parameters:
learning_rate: rate at which the algorithm updates the parameter.
               Tensor or float type of value. Default value is 0.001.
beta_1: Exponential decay rate for the 1st moment. Constant float
        tensor or float type of value. Default value is 0.9.
beta_2: Exponential decay rate for the exponentially weighted
        infinity norm. Constant float tensor or float type of value.
        Default value is 0.999.
epsilon: Small value used to sustain numerical stability.
         Floating point type of value. Default value is 1e-07.
name: Optional name for the operation.
**kwargs: Keyword arguments of variable length.
```

Advantages:

1. The infinity norm makes the algorithm stable.
2. Requires less tuning of hyperparameters.
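
Swapping Adam's squared-gradient average for an exponentially weighted infinity norm gives the AdaMax rule, sketched below per coordinate (illustrative loss and names):

```python
# AdaMax on loss(w) = (w - 2)^2 (minimum at w = 2).

def grad(w):
    return 2.0 * (w - 2.0)

w = 0.0
learning_rate, beta_1, beta_2, epsilon = 0.01, 0.9, 0.999, 1e-7
m = 0.0   # 1st-moment estimate
u = 0.0   # exponentially weighted infinity norm

for t in range(1, 2001):
    g = grad(w)
    m = beta_1 * m + (1 - beta_1) * g
    u = max(beta_2 * u, abs(g))              # infinity-norm accumulator
    m_hat = m / (1 - beta_1 ** t)            # bias correction
    w -= learning_rate * m_hat / (u + epsilon)
```

Because u tracks the largest recent gradient magnitude rather than an average, the denominator never underestimates the gradient scale, which is what makes the updates stable.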

## Nadam Optimizer

Nadam (Nesterov-accelerated Adaptive Moment Estimation) combines Adam with Nesterov momentum, looking ahead along the momentum direction before computing the update.

```
Syntax: tf.keras.optimizers.Nadam(learning_rate=0.001,
                                  beta_1=0.9,
                                  beta_2=0.999,
                                  epsilon=1e-07,
                                  name='Nadam',
                                  **kwargs)
Parameters:
learning_rate: rate at which the algorithm updates the parameter.
               Tensor or float type of value. Default value is 0.001.
beta_1: Exponential decay rate for the 1st moment. Constant float
        tensor or float type of value. Default value is 0.9.
beta_2: Exponential decay rate for the 2nd moment. Constant float
        tensor or float type of value. Default value is 0.999.
epsilon: Small value used to sustain numerical stability.
         Floating point type of value. Default value is 1e-07.
name: Optional name for the operation.
**kwargs: Keyword arguments of variable length.
```

Advantages:

1. Gives better results for noisy gradients or gradients with high curvature.
2. Learns faster.

Disadvantage: Sometimes may not converge to an optimal solution
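
A simplified scalar sketch of the Nadam rule is below; it assumes a constant momentum schedule, which differs slightly from Keras's internal scheduling, and all names are illustrative.

```python
import math

# Nadam on loss(w) = (w - 1.5)^2 (minimum at w = 1.5).

def grad(w):
    return 2.0 * (w - 1.5)

w = 0.0
learning_rate, beta_1, beta_2, epsilon = 0.01, 0.9, 0.999, 1e-7
m = v = 0.0

for t in range(1, 2001):
    g = grad(w)
    m = beta_1 * m + (1 - beta_1) * g
    v = beta_2 * v + (1 - beta_2) * g * g
    # Nesterov lookahead: blend the fresh gradient into the momentum term.
    m_hat = (beta_1 * m / (1 - beta_1 ** (t + 1))
             + (1 - beta_1) * g / (1 - beta_1 ** t))
    v_hat = v / (1 - beta_2 ** t)
    w -= learning_rate * m_hat / (math.sqrt(v_hat) + epsilon)
```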

## FTRL Optimizer

Follow The Regularized Leader (FTRL) is an optimization algorithm best suited to shallow models with large, sparse feature spaces. This version supports both shrinkage-type L2 regularization (the summation of the L2 penalty and the loss function) and online L2 regularization.

```
Syntax: tf.keras.optimizers.Ftrl(learning_rate=0.001,
                                 learning_rate_power=-0.5,
                                 initial_accumulator_value=0.1,
                                 l1_regularization_strength=0.0,
                                 l2_regularization_strength=0.0,
                                 name='Ftrl',
                                 l2_shrinkage_regularization_strength=0.0,
                                 beta=0.0,
                                 **kwargs)
Parameters:
learning_rate: rate at which the algorithm updates the parameter.
               Tensor or float type of value. Default value is 0.001.
learning_rate_power: Controls how the learning rate drops during
               training. Float type of value. Must be less than or
               equal to 0. Default value is -0.5.
initial_accumulator_value: Initial value for the accumulator. Must be
               greater than or equal to zero. Default value is 0.1.
l1_regularization_strength: Stabilization penalty. Must be greater
               than or equal to 0. Float type of value.
               Default value is 0.0.
l2_regularization_strength: Stabilization penalty. Must be greater
               than or equal to 0. Float type of value.
               Default value is 0.0.
name: Optional name for the operation.
l2_shrinkage_regularization_strength: Magnitude penalty. Must be
               greater than or equal to 0. Float type of value.
               Default value is 0.0.
beta: Beta value from the paper. Float type of value.
      Default value is 0.0.
**kwargs: Keyword arguments of variable length.
```

Advantage: Works well on large, sparse feature spaces, and its L1 support can drive many weights to exactly zero, yielding compact models.
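
The per-coordinate FTRL-proximal update can be sketched as below. This follows the McMahan et al. paper's formulation rather than TensorFlow's internal implementation, and every name is illustrative. The first loop shows the L1 shrinkage behavior: a weak gradient signal against a strong L1 penalty leaves the weight at exactly zero.

```python
import math

def ftrl_step(z, n, g, alpha, beta, l1, l2, w):
    """One per-coordinate FTRL-proximal update (paper formulation)."""
    sigma = (math.sqrt(n + g * g) - math.sqrt(n)) / alpha
    z += g - sigma * w          # accumulated (adjusted) gradients
    n += g * g                  # accumulated squared gradients
    if abs(z) <= l1:
        w = 0.0                 # L1 penalty zeroes out weak signals
    else:
        w = -(z - math.copysign(l1, z)) / ((beta + math.sqrt(n)) / alpha + l2)
    return z, n, w

# Strong L1 penalty vs. a weak, constant gradient: the weight stays sparse.
z = n = w = 0.0
for _ in range(50):
    z, n, w = ftrl_step(z, n, g=0.1, alpha=0.5, beta=1.0, l1=10.0, l2=0.0, w=w)
print(w)  # → 0.0

# Without the L1 penalty, the same signal produces a nonzero weight.
z2 = n2 = w2 = 0.0
for _ in range(50):
    z2, n2, w2 = ftrl_step(z2, n2, g=0.1, alpha=0.5, beta=1.0, l1=0.0, l2=0.0, w=w2)
```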