Optimization is a cornerstone of data science, enabling us to fine-tune models by adjusting parameters to minimize or maximize an objective function. This blog explores the key components of optimization problems, their classifications, practical applications, and advanced techniques like gradient descent, stochastic gradient descent, and backpropagation for deep neural networks (DNNs).
An optimization problem typically involves:
- Parameters: Adjustable values, denoted $\theta$, which can be scalars, vectors, or arrays.
- Objective Function: A function $L(\theta)$ that quantifies the quality of a parameter configuration, often minimized to find the optimum $\theta^*$: $$ \theta^* = \underset{\theta \in \Theta}{\arg \min}\, L(\theta) $$
- Constraints: Limitations on parameters, defining a feasible region. For example, a synthesizer knob might be restricted to a range like [0, 10].
Optimization problems are classified based on the nature of parameters and constraints:
- Continuous vs. Discrete: Continuous optimization involves parameters in a continuous space ($\mathbb{R}^n$), leveraging smoothness for efficiency. Discrete optimization deals with discrete parameter sets, like integers, which can be much harder to search.
- Constrained vs. Unconstrained: Constrained optimization imposes limits (e.g., equality $c(\theta) = 0$ or inequality $c(\theta) \leq 0$) on parameters, such as a synthesizer knob's physical range. Unconstrained optimization allows any parameter configuration, which can lead to impractical solutions.
- Convex vs. Non-Convex: Convex problems guarantee that any local minimum is the global minimum, simplifying the search. Non-convex problems may have multiple local minima or saddle points, complicating optimization.
The objective function, often called a loss function, measures how well parameters fit the desired outcome. For example, in machine learning, the objective might minimize the squared difference between predicted and actual values: $$ L(\theta) = \sum_{i} \left( y_i - f(x_i; \theta) \right)^2 $$
Constraints define the feasible set. For instance, a box constraint restricts each parameter to a fixed interval (e.g., $0 \leq \theta_i \leq 10$).
The geometric median in $\mathbb{R}^n$ minimizes the sum of Euclidean distances to a set of data points $x_1, \dots, x_N$: $$ \theta^* = \underset{\theta}{\arg \min} \sum_{i=1}^{N} \| \theta - x_i \|_2 $$
This can be solved using SciPy’s minimize function with the Nelder-Mead method, which iteratively adjusts a simplex to find the optimal point.
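As a minimal sketch of this approach (the data points and starting guess are illustrative assumptions, not from the original post):

```python
import numpy as np
from scipy.optimize import minimize

# Example data points whose geometric median we seek.
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

# Objective: sum of Euclidean distances from theta to every point.
def total_distance(theta):
    return np.sum(np.linalg.norm(points - theta, axis=1))

# Nelder-Mead needs no gradients; it only evaluates the objective.
result = minimize(total_distance, x0=np.zeros(2), method="Nelder-Mead")
print(result.x)  # close to the center of symmetry, (0.5, 0.5)
```

By symmetry, the geometric median of the four corners of the unit square is its center, which the simplex search recovers numerically.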
For problems like line fitting ($y = mx + c$), the objective is a linear least squares problem of the form: $$ x^* = \underset{x}{\arg \min} \| Ax - b \|_2^2 $$
This can be solved directly using the pseudo-inverse: $$ x^* = A^{+} b $$
For example, given $A = \begin{pmatrix} \frac{1}{2} & 0 \\ 0 & \frac{1}{2} \end{pmatrix}$, $b = \begin{pmatrix} 1 \\ 1 \end{pmatrix}$, the solution is $x^* = A^{+} b = \begin{pmatrix} 2 \\ 2 \end{pmatrix}$.
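This small example can be checked directly in NumPy:

```python
import numpy as np

A = np.array([[0.5, 0.0], [0.0, 0.5]])
b = np.array([1.0, 1.0])

# Least squares solution via the Moore-Penrose pseudo-inverse.
x = np.linalg.pinv(A) @ b
print(x)  # [2. 2.]
```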
Consider the least squares problem above with an additional equality constraint $c(x) = 0$. Such problems can be handled with a Lagrangian $$ \mathcal{L}(x, \lambda) = \| Ax - b \|_2^2 + \lambda\, c(x) $$ Taking partial derivatives with respect to $x$ and $\lambda$ and setting them to zero yields a system of equations whose solution is the constrained optimum.
- Linear Least Squares: Solves convex problems directly using normal equations or pseudo-inverse, ideal for problems with a single global minimum.
- Grid Search: Evaluates the objective function across a grid of parameter values. It’s simple but inefficient in high dimensions due to the curse of dimensionality.
- Random Search: Randomly samples parameter configurations; it can escape local minima but explores the space inefficiently.
- Hill Climbing: Adjusts parameters incrementally, but can get stuck in local minima.
- Simulated Annealing: Extends hill climbing with a temperature schedule, allowing occasional uphill moves to escape local minima.
- Genetic Algorithms: Mimic evolution with mutation, selection, and crossover, maintaining a population of solutions to explore the parameter space.
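To make the simplest of these concrete, here is a minimal random-search sketch on an assumed toy non-convex function (the objective, search range, and sample budget are illustrative choices):

```python
import math
import random

random.seed(0)

# Toy non-convex objective: quadratic bowl plus a sinusoidal ripple.
# The global minimum sits near x = 2.2, where the ripple dips lowest.
def objective(x):
    return (x - 2.0) ** 2 + math.sin(5.0 * x)

# Random search: sample candidates uniformly and keep the best one seen.
best_x, best_val = None, float("inf")
for _ in range(5000):
    x = random.uniform(-10.0, 10.0)
    val = objective(x)
    if val < best_val:
        best_x, best_val = x, val

print(best_x, best_val)
```

Because it only needs objective evaluations, this zeroth-order approach works even when gradients are unavailable, at the cost of many function calls.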
Convex problems benefit from specialized solvers like scipy.optimize.linprog for linear programming or minimize with methods like trust-constr for quadratic programming, ensuring efficient convergence to the global minimum.
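For instance, a small linear program can be stated directly for linprog (the specific objective and constraint here are an assumed example):

```python
from scipy.optimize import linprog

# Minimize x + y subject to x + 2y >= 2, x >= 0, y >= 0.
# linprog expects "<=" constraints, so x + 2y >= 2 becomes -x - 2y <= -2.
result = linprog(c=[1, 1], A_ub=[[-1, -2]], b_ub=[-2],
                 bounds=[(0, None), (0, None)])
print(result.x, result.fun)  # optimum at (0, 1) with objective value 1
```

Because the problem is convex, the solver is guaranteed to return the global minimum, here at the vertex (0, 1).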
Heuristic methods like random search or genetic algorithms are inefficient for DNNs with billions of parameters. They lack convergence guarantees and struggle with hyperparameter tuning.
For a scalar function $L(\theta)$ of a parameter vector $\theta \in \mathbb{R}^n$, the gradient is the vector of partial derivatives, pointing in the direction of steepest ascent: $$ \nabla L(\theta) = \left( \frac{\partial L}{\partial \theta_1}, \dots, \frac{\partial L}{\partial \theta_n} \right)^T $$
The Jacobian matrix generalizes this for vector functions, while the Hessian matrix ($\nabla^2 L(\theta)$) captures second-order derivatives, indicating curvature:
Eigenvalues of the Hessian reveal the nature of critical points: positive definite (minimum), negative definite (maximum), mixed signs (saddle point), or semidefinite (plateau/ridge).
Gradient descent updates parameters in the direction of the negative gradient: $$ \theta^{(t+1)} = \theta^{(t)} - \delta\, \nabla L(\theta^{(t)}) $$ where $\delta$ is the learning rate (step size).
For the least squares problem with $A = \begin{pmatrix} \frac{1}{2} & 0 \\ 0 & \frac{1}{2} \end{pmatrix}$, $b = \begin{pmatrix} 1 \\ 1 \end{pmatrix}$, starting from $x^{(0)} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$ and using a suitable fixed learning rate, the iterates converge to the solution $x^* = \begin{pmatrix} 2 \\ 2 \end{pmatrix}$.
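A minimal NumPy sketch of gradient descent on this least squares example (the learning rate and iteration count are assumed values chosen for illustration):

```python
import numpy as np

A = np.array([[0.5, 0.0], [0.0, 0.5]])
b = np.array([1.0, 1.0])

# Gradient of L(x) = ||Ax - b||^2 is 2 A^T (Ax - b).
def grad(x):
    return 2.0 * A.T @ (A @ x - b)

x = np.zeros(2)
delta = 0.5  # learning rate (an assumed value for this sketch)
for _ in range(100):
    x = x - delta * grad(x)

print(x)  # approaches the exact solution (2, 2)
```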
SGD uses a single (or a few) training examples per iteration, addressing the cost of computing the full gradient over large datasets: $$ \theta^{(t+1)} = \theta^{(t)} - \delta\, \nabla L_i(\theta^{(t)}) $$ where $L_i$ is the loss on a randomly chosen example $i$.
Mini-batch SGD updates parameters using a small subset (batch) $B$ of the data: $$ \theta^{(t+1)} = \theta^{(t)} - \frac{\delta}{|B|} \sum_{i \in B} \nabla L_i(\theta^{(t)}) $$
It balances speed and stability, reducing oscillations compared to SGD.
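A minimal mini-batch SGD sketch fitting a line to noiseless synthetic data (the data, batch size, and learning rate are assumed values for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from the true line y = 2x + 1.
X = rng.uniform(-1, 1, size=200)
y = 2.0 * X + 1.0

m, c = 0.0, 0.0   # line parameters to learn
delta = 0.1       # learning rate
batch_size = 16

for _ in range(2000):
    idx = rng.choice(len(X), size=batch_size, replace=False)
    xb, yb = X[idx], y[idx]
    err = (m * xb + c) - yb           # residuals on the mini-batch
    grad_m = 2.0 * np.mean(err * xb)  # d/dm of mean squared error
    grad_c = 2.0 * np.mean(err)       # d/dc of mean squared error
    m -= delta * grad_m
    c -= delta * grad_c

print(m, c)  # close to the true slope 2 and intercept 1
```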
Momentum reduces oscillations in mini-batch SGD by accumulating past gradients: $$ v^{(t+1)} = \mu\, v^{(t)} - \delta\, \nabla L(\theta^{(t)}), \qquad \theta^{(t+1)} = \theta^{(t)} + v^{(t+1)} $$ Typically, the momentum coefficient is set to around $\mu = 0.9$.
Adagrad adapts the learning rate for each parameter based on the history of past gradients: $$ G^{(t)} = G^{(t-1)} + \left( \nabla L(\theta^{(t)}) \right)^2, \qquad \theta^{(t+1)} = \theta^{(t)} - \frac{\delta}{\sqrt{G^{(t)} + \epsilon}}\, \nabla L(\theta^{(t)}) $$ where $G^{(t)}$ accumulates squared gradients elementwise and $\epsilon$ is a small constant that prevents division by zero.
RMSprop modifies Adagrad with an exponentially decaying average of squared gradients, improving performance on non-stationary problems.
Adam combines momentum and RMSprop, using exponentially decaying moving averages of gradients and squared gradients: $$ m^{(t)} = \beta_1 m^{(t-1)} + (1 - \beta_1)\, \nabla L(\theta^{(t)}), \qquad v^{(t)} = \beta_2 v^{(t-1)} + (1 - \beta_2) \left( \nabla L(\theta^{(t)}) \right)^2 $$ After bias correction ($\hat{m}^{(t)} = m^{(t)} / (1 - \beta_1^t)$, $\hat{v}^{(t)} = v^{(t)} / (1 - \beta_2^t)$), the update is $$ \theta^{(t+1)} = \theta^{(t)} - \frac{\delta\, \hat{m}^{(t)}}{\sqrt{\hat{v}^{(t)}} + \epsilon} $$ Typically, $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 10^{-8}$.
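The update rules above can be sketched from scratch on an assumed toy quadratic (the target, learning rate, and iteration count are illustrative choices):

```python
import numpy as np

# Minimize the convex quadratic L(theta) = ||theta - target||^2 with Adam.
target = np.array([3.0, -1.0])

def grad(theta):
    return 2.0 * (theta - target)

theta = np.zeros(2)
m = np.zeros(2)   # first-moment (mean) estimate
v = np.zeros(2)   # second-moment (uncentered variance) estimate
delta, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8

for t in range(1, 1001):
    g = grad(theta)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)   # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta -= delta * m_hat / (np.sqrt(v_hat) + eps)

print(theta)  # approaches the minimizer (3, -1)
```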
Newton’s method uses the Hessian for faster, second-order convergence: $$ \theta^{(t+1)} = \theta^{(t)} - \left[ \nabla^2 L(\theta^{(t)}) \right]^{-1} \nabla L(\theta^{(t)}) $$
However, computing and inverting the Hessian is costly ($O(n^2)$ entries and $O(n^3)$ for the inversion), which makes the method impractical for models with many parameters.
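For a quadratic objective the Hessian is constant and a single Newton step lands exactly on the minimum; here is a small sketch on an assumed example problem:

```python
import numpy as np

# Minimize L(x) = 0.5 x^T Q x - b^T x, whose Hessian is the constant Q.
Q = np.array([[4.0, 1.0], [1.0, 3.0]])  # positive definite
b = np.array([1.0, 2.0])

def grad(x):
    return Q @ x - b

x = np.zeros(2)
# One Newton step: solve the Hessian system instead of inverting it.
x = x - np.linalg.solve(Q, grad(x))

print(x)  # satisfies the optimality condition Q x = b
```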
DNNs approximate complex functions by minimizing a loss averaged over the training data: $$ L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \ell\left( f(x_i; \theta),\, y_i \right) $$ where $f$ is the network and $\ell$ a per-example loss.
A DNN consists of layers with linear transformations followed by nonlinear activations (e.g., ReLU, sigmoid).
Backpropagation efficiently computes gradients using the chain rule, propagating derivatives backward from the loss through each layer's weights: $$ \frac{\partial L}{\partial W^{(l)}} = \frac{\partial L}{\partial a^{(l)}} \cdot \frac{\partial a^{(l)}}{\partial W^{(l)}} $$
It adjusts weights layer by layer, enabling DNNs to learn complex patterns in fields like image classification and speech recognition.
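The layer-by-layer gradient computation can be written out by hand for a tiny one-hidden-layer network; this sketch trains on an assumed toy regression task ($y = x^2$), with sizes and learning rate chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy task: fit y = x^2 with one ReLU hidden layer, trained by
# hand-coded backpropagation and plain gradient descent.
X = np.linspace(-1, 1, 20).reshape(-1, 1)
y = X ** 2

W1 = rng.normal(scale=0.5, size=(1, 8)); b1 = np.zeros(8)
W2 = rng.normal(scale=0.5, size=(8, 1)); b2 = np.zeros(1)
delta = 0.1

def loss():
    h = np.maximum(0, X @ W1 + b1)
    return np.mean((h @ W2 + b2 - y) ** 2)

initial_loss = loss()
for _ in range(2000):
    # Forward pass.
    z1 = X @ W1 + b1
    h = np.maximum(0, z1)                # ReLU activation
    pred = h @ W2 + b2
    # Backward pass: chain rule, layer by layer.
    d_pred = 2.0 * (pred - y) / len(X)   # dL/d(pred) for mean squared error
    dW2 = h.T @ d_pred
    db2 = d_pred.sum(axis=0)
    d_h = d_pred @ W2.T
    d_z1 = d_h * (z1 > 0)                # ReLU derivative
    dW1 = X.T @ d_z1
    db1 = d_z1.sum(axis=0)
    # Gradient descent update.
    W1 -= delta * dW1; b1 -= delta * db1
    W2 -= delta * dW2; b2 -= delta * db2

print(initial_loss, loss())  # training reduces the loss
```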
Automatic differentiation computes exact derivatives of complex functions by composing the chain rule over elementary operations, as seen in libraries like Autograd and JAX.
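The core idea behind such libraries can be sketched in a few lines with forward-mode dual numbers (a minimal illustration, not how Autograd or JAX are actually implemented):

```python
import math

# Forward-mode automatic differentiation with dual numbers: each value
# carries (value, derivative), and every operation applies the chain rule.
class Dual:
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val * other.val,
                    self.val * other.dot + self.dot * other.val)

    __rmul__ = __mul__

def sin(x):
    return Dual(math.sin(x.val), math.cos(x.val) * x.dot)

def grad(f, x):
    return f(Dual(x, 1.0)).dot  # seed the input's derivative with 1

# d/dx [x^2 + sin(x)] = 2x + cos(x); at x = 0 this is exactly 1.
g = grad(lambda x: x * x + sin(x), 0.0)
print(g)  # 1.0
```

Unlike finite differences, the derivative is exact to machine precision because it is built from the exact derivatives of each elementary operation.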
- Choose the Right Algorithm: Use direct methods for least-squares, convex solvers for convex problems, and first-order methods like gradient descent when derivatives are available. For unknown problem types, use zeroth-order methods like simulated annealing.
- Continuity and Differentiability: Continuous and differentiable functions are easier to optimize. Non-differentiable functions require careful handling.
- Convergence: Convex optimization guarantees global minima, while non-convex methods may find local minima.
Optimization is critical for solving real-world problems in data science, from fitting simple models to training complex DNNs. By understanding parameters, objective functions, constraints, and advanced techniques like SGD, momentum, and backpropagation, practitioners can tackle a wide range of challenges effectively.
For more details, refer to the original lecture slides or explore SciPy's optimization tools in the scipy.optimize module.