Glossary term
Gradient Descent
An iterative first-order optimization method that moves variables in the direction of the negative objective-function gradient.
Definition
methodGradient descent is an iterative optimization algorithm that minimizes a differentiable objective function by repeatedly stepping opposite to the gradient.
The gradient points in the direction of steepest local increase of a differentiable function. Gradient descent therefore uses the negative gradient as a local descent direction. Its behaviour depends on the objective's smoothness, convexity, conditioning, step-size rule, stopping criterion, and numerical scaling. The method is foundational in applied mathematics, machine learning, parameter estimation, signal processing, and engineering optimization.
Gradient descent minimizes a differentiable function by repeatedly moving in the direction that most rapidly decreases the function locally. For an objective function f(x), the gradient \nabla f(x_k) at the current iterate x_k points toward steepest local increase. The negative gradient points toward steepest local decrease, so the basic update is:
where \alpha_k > 0 is the step size, often called the learning rate. Starting from an initial guess, the algorithm generates a sequence of iterates that ideally approaches a local minimum. For convex objectives, any local minimum is also a global minimum; for nonconvex objectives, the result depends on initialization, landscape geometry, and the step-size strategy.
Step size and stability
The step size is the central practical decision. If it is too small, the method converges slowly. If it is too large, the iterates can overshoot, oscillate, or diverge. A fixed step size can work when the smoothness of the objective is known. For a smooth convex function with Lipschitz gradient constant L, a step size on the order of 1/L is a common theoretical reference.
Line-search methods choose the step adaptively. Backtracking line search starts with a trial step and reduces it until the objective decreases sufficiently. This is more expensive per iteration but often more reliable than using a fixed learning rate. In large-scale machine learning, exact line search is rarely practical, so scheduled, adaptive, or stochastic step-size rules are used.
Convergence behaviour
On strongly convex smooth objectives, gradient descent can converge linearly, meaning the error is reduced by a roughly constant factor each iteration. The rate depends on the condition number of the problem. Ill-conditioned objectives produce the familiar zig-zag path: the steepest descent direction is locally correct but poorly aligned with the direct path to the minimum.
For general convex objectives that are smooth but not strongly convex, convergence is slower. For nonconvex objectives, gradient descent may converge to a stationary point that is not the desired solution. Saddle points, flat regions, poor scaling, and noisy gradients can all slow or distort the process.
Variants
Many practical algorithms are modifications of gradient descent. Momentum methods combine the current gradient with a running direction to reduce oscillation and accelerate progress. Nesterov acceleration changes where the gradient is evaluated to improve theoretical convergence for convex problems. Stochastic gradient descent estimates the gradient from one sample or a mini-batch, making optimization feasible for large datasets. Adaptive methods scale components of the update based on past gradients.
Second-order methods such as Newton’s method use curvature information and can converge faster near a solution, but they require more expensive derivative information and linear solves. Gradient descent remains valuable when the number of variables is large, the gradient is cheap relative to exact curvature, or approximate solutions are sufficient.
Engineering cautions
Gradient descent is not a black-box guarantee. Variables should be scaled so one dimension does not dominate the gradient numerically. Constraints require projection, penalty methods, barrier methods, or a different optimization formulation. Stopping criteria should consider gradient norm, objective decrease, constraint violation, and physical meaning of the design variables. In engineering models, gradient quality also matters: noisy simulations, discontinuities, meshing artifacts, and poorly converged solvers can produce descent directions that are mathematically plausible but physically misleading.