
What Is Gradient Descent? A Practical Explanation

Updated: 3 days ago

Peter Lin



Gradient descent is one of those machine learning terms that gets mentioned everywhere, yet often feels abstract or intimidating to people who aren’t deep in the field. In this article, we break it down in practical terms—what gradient descent is, why it matters, and how it actually works inside modern machine learning models.


This explanation is based on a conversation from the Practical AI podcast with hosts Jeff and Peter.



Why Gradient Descent Exists

At its core, gradient descent is an optimization technique. In machine learning, optimization means adjusting a model so that it makes better predictions over time.


When a model is first created, it doesn’t know anything. Its internal parameters—called weights—start as random values. The job of training is to adjust those weights so the model’s predictions get closer and closer to the correct answers.


Gradient descent is the math-driven process that makes those adjustments. In simple terms, it is the process of finding good parameter values through billions of small, mathematically guided adjustments rather than random guessing.



Learning From Mistakes: Predictions and Loss

To understand gradient descent, it helps to look at what happens during training:

  1. The model makes a prediction.

  2. That prediction is compared to the correct answer.

  3. A loss function measures how wrong the prediction was.

  4. The model adjusts its weights to reduce that loss.


For example, imagine training a model on the CIFAR-10 dataset, which contains 10 classes of images (cats, dogs, airplanes, etc.). If the model predicts “airplane” when the image is actually a “cat,” the loss function quantifies how wrong that prediction was.


Gradient descent then asks:

How should every single weight in the model change so the next prediction is a little better?


That question is answered mathematically and applied repeatedly—sometimes billions or trillions of times.
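The four steps above can be sketched as a minimal gradient-descent loop. This is an illustrative toy, not an image classifier: it fits a simple line y = 3x + 1 using plain NumPy, with made-up data.

```python
import numpy as np

# Toy problem: learn y = 3x + 1 from noisy samples (hypothetical data).
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 3 * x + 1 + rng.normal(0, 0.1, size=100)

w, b = 0.0, 0.0   # the weights start with no knowledge
lr = 0.1          # learning rate: the size of each downhill step

for step in range(500):
    pred = w * x + b                 # 1. make a prediction
    error = pred - y                 # 2. compare to the correct answer
    loss = np.mean(error ** 2)       # 3. the loss measures how wrong we were
    grad_w = 2 * np.mean(error * x)  # 4. the gradient tells each weight
    grad_b = 2 * np.mean(error)      #    which way to move...
    w -= lr * grad_w                 # ...and we take a small step downhill
    b -= lr * grad_b

print(round(w, 2), round(b, 2))  # w and b end up close to the true 3 and 1
```

A real model repeats the same four steps, just with millions of weights instead of two and gradients computed by backpropagation.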



The “Descent” Metaphor: Valleys and Landscapes

The term gradient descent comes from visualizing optimization as moving downhill on a surface.


If you imagine a curve or landscape where the lowest point represents the best possible model performance, gradient descent is the process of taking steps downhill toward that minimum.


In simple explanations, this landscape is shown as a smooth half-pipe with one clear bottom. But real machine learning models are far messier.


Instead of one clean valley, the landscape looks more like:

  • Multiple valleys

  • Uneven slopes

  • Ridges and plateaus


The model might descend into a valley that looks good locally, but isn’t the best possible solution overall. This is known as getting stuck in a local minimum.
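Getting stuck is easy to demonstrate on a tiny one-dimensional landscape. The function below is an illustrative toy with two valleys, not a real model’s loss surface: starting on the right side, gradient descent settles into the shallower local valley even though a deeper one exists.

```python
def f(x):
    """A landscape with two valleys: a deep one near x = -1.47
    and a shallower one near x = 1.35."""
    return x**4 - 4 * x**2 + x

def grad(x):
    """The slope of the landscape at x."""
    return 4 * x**3 - 8 * x + 1

def descend(x, lr=0.01, steps=1000):
    for _ in range(steps):
        x -= lr * grad(x)  # always step downhill from the current position
    return x

left = descend(-2.0)   # lands in the deeper (global) valley
right = descend(2.0)   # gets stuck in the shallower local valley
print(f(left) < f(right))  # same algorithm, worse result from the right start
```

The only difference between the two runs is the starting point, which is exactly why initialization matters so much in practice.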



Why Models Get Stuck

When gradient descent updates weights, it only knows one thing: the local slope at its current position, i.e., whether the last step made the predictions better or worse.


It does not know where the global best solution is.


That means:

  • The model can overshoot optimal values.

  • It can settle into the wrong valley.

  • It can become “stuck” in a solution that isn’t ideal.


In higher dimensions, where models have millions or even hundreds of millions of weights, this problem becomes dramatically more complex. Instead of a 2D curve, the optimization space looks more like a three-dimensional skate park with countless dips and paths, and real models operate in far more dimensions than we can visualize at all.



The Complexity of Many Weights

Modern models don’t have just one weight; they can have hundreds of millions, billions, or even trillions.


Each weight:

  • Has its own optimal value

  • Is influenced by other weights

  • Interacts across layers of the model

  • Affects predictions for everything the model is trained to recognize


If one part of the model becomes a “roadblock,” it can affect everything downstream, much like a traffic jam.


This interconnectedness is one reason training large models takes so long and requires massive amounts of computation.



Overparameterization and Waste

A major open question in machine learning is:

What is the minimum number of weights needed to solve a given problem?


In practice, models are often overparameterized, meaning they contain far more weights than are truly necessary.


For example:

  • A 100-million-parameter model might only need 20% of those weights to correctly identify a skateboard in an image.

  • The remaining weights may contribute little—or even interfere with learning.


Despite this, larger models are often used because:

  • They increase the odds of finding a good solution.

  • We lack strong tools to fully interpret and prune models efficiently.


The result is higher costs, longer training times, and wasted compute.



Conflicting Signals During Training

Another challenge arises when models make multiple incorrect predictions at once.


If the model strongly believes an image belongs to several wrong classes, gradient descent must adjust weights to correct all of those mistakes simultaneously. These competing adjustments can conflict, causing the model to:

  • Overcorrect

  • Oscillate

  • Temporarily perform worse


The hope is that, over many iterations, these conflicts average out and the model converges to a good solution—but that’s not guaranteed.



Why Training Often Means Re-Training

Because gradient descent is sensitive to initial conditions, many models are trained multiple times using different random starting points (called random seeds).


A common workflow looks like this:

  1. Train multiple models on the same data.

  2. Compare results.

  3. Keep the best-performing versions.

  4. Improve and retrain those models.


If a model fails catastrophically or begins to overfit, training can be stopped early to save time and money.
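That workflow can be sketched in a few lines. Here `train` is a hypothetical stand-in that just returns a seed-dependent score; a real run would initialize a model from that seed, fit it, and evaluate it on held-out data.

```python
import random

def train(seed):
    """Hypothetical stand-in for a full training run: returns a
    validation score that depends on the random starting point."""
    random.seed(seed)
    return random.uniform(0.6, 0.9)

scores = {seed: train(seed) for seed in range(5)}  # 1. train several models
best_seed = max(scores, key=scores.get)            # 2. compare results
best_score = scores[best_seed]                     # 3. keep the best one
print(best_seed, round(best_score, 3))             # 4. this run gets refined further
```

The point of the sketch is the shape of the process: the same data, the same architecture, and several random seeds, with only the strongest runs surviving to the next round.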


This trial-and-error process is the default approach across the industry.



Why Gradient Descent Still Matters

Despite its imperfections, gradient descent remains the backbone of modern machine learning. It’s the mechanism that allows models to:

  • Learn from data

  • Improve over time

  • Scale to complex tasks like vision, language, and reasoning


Understanding gradient descent—even at a high level—helps demystify how AI systems actually learn, and why training them is both powerful and expensive.



Final Thoughts

Gradient descent isn’t magic. It’s a mathematical guessing game—one that asks better and better questions with each step.


While today’s models rely heavily on brute force, more data, and larger architectures, ongoing research continues to explore smarter, more efficient ways to optimize models without unnecessary waste.


In future discussions, we’ll explore techniques that help improve this process and make models smaller, faster, and more cost-effective.


Want more practical explanations like this? Follow the Practical AI podcast and send us your questions—we’re always happy to dig in.


Watch the podcast here: https://youtu.be/V0KM1C2GQrg


