Understanding Quantization and Precision

Notebooks
Code-along
Compute
Deep learning
LLM
Quantization
GPU
Hugging Face
PyTorch
Explore quantization and floating-point precision in deep learning — covering FP32, FP16, BF16, INT8, and 4-bit formats and their impact on GPU memory and inference speed.
Author

Chris Endemann

Published

March 9, 2026

When working with large language models, you’ll often encounter terms like “FP32”, “FP16”, “INT8”, and “4-bit quantization.” These describe how a model’s weights are stored in memory — and they have a direct impact on how much GPU memory a model requires, how fast it runs, and whether it fits on your hardware at all.

This notebook unpacks these concepts step by step:

  1. Precision: What floating-point formats (FP32, FP16, BF16) mean and how they affect memory.
  2. Quantization: How tools like bitsandbytes reduce precision further (to 8-bit or 4-bit) to shrink memory footprints.
  3. Parameter counts vs. memory: Why the number of model parameters stays the same, but memory usage changes.
  4. A PyTorch gotcha: Why model.parameters() can report misleading numbers after quantization — and how to correctly count parameters.
  5. When to quantize: Practical guidance on where quantization helps most, and where it doesn’t.

Prerequisites

  • Basic familiarity with PyTorch and Hugging Face transformers
  • Access to a GPU runtime (e.g., Google Colab with T4)

Setup

!pip install -q transformers accelerate bitsandbytes torch
import torch
import gc
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

Part 1: What is precision?

Every number in a neural network — every weight, bias, and activation — is stored as a sequence of bits. The precision (or data type) determines how many bits are used per number, which controls both the range and granularity of values that can be represented.

The ruler analogy

Think of precision like the markings on a ruler. A high-precision ruler has markings at every millimeter, so you can represent fine distinctions like 3.2 cm vs. 3.3 cm. A low-precision ruler might only have markings at each centimeter. You can still measure things, but 3.2 cm and 3.3 cm both round to 3 cm. You've lost the ability to distinguish them, but you need far less space to write down your measurement.

That’s exactly what happens with neural network weights. At FP32, a weight might be stored as 0.31415927. At FP16, it becomes 0.3142 — close, but not identical. At 4-bit, it gets mapped to one of only 16 possible values, like 0.3125. The question is whether those small differences matter for the model’s outputs. For most deep learning tasks, they don’t.
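You can see this rounding directly in PyTorch by casting a single value:

```python
import torch

# The same value stored at different precisions (illustrative):
x = torch.tensor(0.31415927, dtype=torch.float32)
print(f"FP32: {x.item():.8f}")
print(f"FP16: {x.to(torch.float16).item():.8f}")   # coarser rounding (10 fraction bits)
print(f"BF16: {x.to(torch.bfloat16).item():.8f}")  # even coarser (7 fraction bits)
```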

How floating-point numbers are stored

A floating-point number is stored in three parts:

  • Sign bit (1 bit): positive or negative
  • Exponent bits: control the range — how large or small the number can be (like the power in scientific notation)
  • Fraction bits (aka mantissa): control the precision — how many significant digits you get

For example, FP32 uses 1 sign + 8 exponent + 23 fraction = 32 bits. FP16 cuts this to 1 + 5 + 10 = 16 bits. Fewer fraction bits means coarser rounding; fewer exponent bits means a narrower range of representable values. The table below summarizes the common formats:

Data type        Bits  Exponent  Fraction  Approximate range   Typical use
FP32 (float32)   32    8         23        ~1e-38 to ~3e38     Default training precision
FP16 (float16)   16    5         10        ~6e-5 to 65504      Mixed-precision training
BF16 (bfloat16)  16    8         7         ~1e-38 to ~3e38     Training on modern GPUs (A100, H100)
INT8             8     —         —         -128 to 127         Post-training quantization
NF4 (4-bit)      4     —         —         16 discrete values  Aggressive quantization via bitsandbytes

Note that INT8 and NF4 are integer/discrete formats — they don’t have exponent and fraction parts at all. They can only represent a small, fixed set of values, and real-valued weights must be mapped onto those values (more on this in Part 3).
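To make the sign/exponent/fraction split concrete, here is a small stdlib-only sketch that unpacks the bits of an FP32 value. The input 0.15625 (= 1.25 * 2^-3) is chosen because it is exactly representable:

```python
import struct

def fp32_bits(x: float):
    """Split a float's IEEE-754 single-precision bits into sign/exponent/fraction."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF  # 8 exponent bits (biased by 127)
    fraction = bits & 0x7FFFFF      # 23 fraction bits
    return sign, exponent, fraction

s, e, f = fp32_bits(0.15625)
print(f"sign={s}  exponent={e} (biased; actual = {e - 127})  fraction={f:023b}")
```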

Key insight: precision controls memory per parameter

A model with 1 billion parameters requires:

  • 4 GB at FP32 (4 bytes per param)
  • 2 GB at FP16/BF16 (2 bytes per param)
  • 1 GB at INT8 (1 byte per param)
  • ~0.5 GB at 4-bit (0.5 bytes per param)

The number of parameters hasn’t changed — only how much memory each one occupies.
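The arithmetic is simple enough to check in a couple of lines (weights only; this ignores activations, optimizer state, and the KV cache):

```python
def model_memory_gb(num_params: int, bits_per_param: float) -> float:
    """Approximate weight-storage memory in (decimal) GB."""
    return num_params * bits_per_param / 8 / 1e9

for fmt, bits in [("FP32", 32), ("FP16/BF16", 16), ("INT8", 8), ("4-bit", 4)]:
    print(f"{fmt:>10}: {model_memory_gb(1_000_000_000, bits):.2f} GB")
```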

Let’s verify this with a real model.

Part 2: Loading a model at different precisions

We’ll use a small model — GPT-2 (124M parameters) — to keep things manageable and demonstrate the concepts clearly.

Helper: Measure GPU memory

def get_gpu_memory_mb():
    """Return current GPU memory allocated in MB."""
    if torch.cuda.is_available():
        return torch.cuda.memory_allocated() / 1024**2
    return 0.0

def load_and_measure(model_name, dtype=None, quantization_config=None, label=""):
    """Load a model and report memory usage and parameter info."""
    gc.collect()
    torch.cuda.empty_cache()

    before = get_gpu_memory_mb()

    kwargs = {"device_map": "auto"}
    if dtype:
        kwargs["torch_dtype"] = dtype
    if quantization_config:
        kwargs["quantization_config"] = quantization_config

    model = AutoModelForCausalLM.from_pretrained(model_name, **kwargs)

    after = get_gpu_memory_mb()
    mem_used = after - before

    # Count parameters
    # Count parameters (one element per weight at full precision)
    total_params = sum(p.numel() for p in model.parameters())

    print(f"\n{'='*60}")
    print(f"  {label}")
    print(f"{'='*60}")
    print(f"  GPU memory used:     {mem_used:,.1f} MB")
    print(f"  model.parameters():  {total_params:,} (via numel())")
    print(f"  Expected params:     ~124,000,000 (GPT-2)")

    # Show dtypes present in the model
    dtypes = {str(p.dtype) for p in model.parameters()}
    print(f"  Parameter dtypes:    {dtypes}")

    return model, mem_used, total_params

FP32 (default)

model_name = "gpt2"

model_fp32, mem_fp32, params_fp32 = load_and_measure(
    model_name, dtype=torch.float32, label="FP32 (32-bit floating point)"
)
del model_fp32
gc.collect()
torch.cuda.empty_cache()

FP16 (half precision)

model_fp16, mem_fp16, params_fp16 = load_and_measure(
    model_name, dtype=torch.float16, label="FP16 (16-bit floating point)"
)
del model_fp16
gc.collect()
torch.cuda.empty_cache()

BF16 (bfloat16)

BF16 uses 16 bits like FP16, but allocates them differently (as shown in the table in Part 1). FP16 gives 10 bits to the fraction for finer precision, but only 5 bits to the exponent — which is why it caps out around 65,504 and can’t represent very small values. BF16 flips this tradeoff: it keeps the same 8-bit exponent as FP32 (giving it the same massive range), at the cost of only 7 fraction bits. In practice, this works well for deep learning — the range matters more than fine-grained precision, and BF16 avoids the overflow/underflow issues that can plague FP16 during training.
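A quick sanity check makes the range tradeoff visible: values that overflow or underflow in FP16 survive in BF16.

```python
import torch

big = torch.tensor(70000.0)
print(big.to(torch.float16))    # overflows to inf (FP16 max is ~65504)
print(big.to(torch.bfloat16))   # representable: BF16 shares FP32's 8-bit exponent

tiny = torch.tensor(1e-10)
print(tiny.to(torch.float16))   # underflows to 0
print(tiny.to(torch.bfloat16))  # still nonzero
```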

model_bf16, mem_bf16, params_bf16 = load_and_measure(
    model_name, dtype=torch.bfloat16, label="BF16 (bfloat16)"
)
del model_bf16
gc.collect()
torch.cuda.empty_cache()

Compare: precision vs. memory

print(f"\nMemory comparison (GPT-2, 124M params):")
print(f"  FP32:  {mem_fp32:,.1f} MB")
print(f"  FP16:  {mem_fp16:,.1f} MB")
print(f"  BF16:  {mem_bf16:,.1f} MB")
print(f"\nParameter count (should be identical):")
print(f"  FP32:  {params_fp32:,}")
print(f"  FP16:  {params_fp16:,}")
print(f"  BF16:  {params_bf16:,}")

At this point, the key takeaway should be clear: reducing precision halves memory, but the parameter count is unchanged. Every weight is still there — it just takes up less space.

Precision reduction vs. quantization: what’s the difference?

What we’ve done so far — loading a model in FP16 or BF16 instead of FP32 — is precision reduction (sometimes called “casting” or “downcasting”). It’s straightforward: each floating-point value is converted to a format with fewer bits, using standard IEEE rounding rules. The value 0.31415927 in FP32 becomes 0.3142 in FP16. There’s no special algorithm involved — it’s just rounding.

Quantization is fundamentally different. It doesn’t just round values to a lower-precision float — it maps them onto a small, discrete set of values (like the 256 integers in INT8, or just 16 values in NF4). This mapping requires decisions that simple rounding can’t make:

  • What range of weight values should map to what integers? (This is called calibration.)
  • Should all layers use the same mapping, or should each layer be calibrated separately?
  • What do you do about outlier weights that fall far outside the typical range?

Different quantization algorithms answer these questions differently, and their choices directly affect how much quality you lose. That’s why quantization is a more involved process than just picking torch.float16 — it’s a compression technique with real engineering behind it.

Part 3: Quantization — mapping weights to fewer values

Going back to our ruler analogy: precision reduction is like switching from a millimeter ruler to a centimeter ruler — you still have a continuous ruler, just with fewer markings. Quantization is like replacing the ruler entirely with a set of labeled bins. Every weight gets sorted into the nearest bin, and from that point on, it’s stored as just a bin number (an integer). The bins are chosen carefully so that the most common weight values land close to a bin center, minimizing the error introduced by this binning.

Here’s the key idea more concretely. Suppose a layer has weights ranging from -1.0 to 1.0, and you’re quantizing to INT8 (256 possible values). A simple approach would:

  1. Find the range of the weights: min = -1.0, max = 1.0.
  2. Divide the range into 256 equally spaced bins, each spanning ~0.0078.
  3. Map each weight to the nearest bin center and store just the bin index (an integer from 0 to 255).
  4. Store the scale factor (bin width) and zero point so you can approximately reconstruct the original value later: reconstructed ≈ scale × integer + zero_point.
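Here is a toy implementation of those four steps (illustrative only; real quantizers work per-channel, handle outliers, and make smarter calibration choices):

```python
import torch

def linear_quantize(w: torch.Tensor, n_bits: int = 8):
    """Toy linear quantization: map weights to integer bin indices."""
    n_bins = 2 ** n_bits
    w_min, w_max = w.min(), w.max()
    scale = (w_max - w_min) / (n_bins - 1)            # bin width
    indices = torch.round((w - w_min) / scale).to(torch.uint8)
    return indices, scale, w_min                      # zero point = w_min here

def linear_dequantize(indices, scale, zero_point):
    """Approximately reconstruct: scale * integer + zero_point."""
    return indices.float() * scale + zero_point

w = torch.randn(1000).clamp(-1, 1)
idx, scale, zp = linear_quantize(w)
w_hat = linear_dequantize(idx, scale, zp)
print(f"max reconstruction error: {(w - w_hat).abs().max():.5f}")  # at most half a bin width
```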

This is called linear (uniform) quantization, and it’s the simplest scheme. More advanced methods — like the ones used in practice — improve on this in important ways:

  • LLM.int8() (Dettmers et al., 2022) discovered that a small fraction of “outlier” features in transformer models have very large magnitudes. If you force these into the same bins as normal-range weights, quality collapses. Their solution: detect outlier features at runtime, keep them in FP16, and quantize only the remaining ~99.9% of values to INT8. This mixed-precision decomposition makes 8-bit quantization effectively lossless.
  • NF4 (Dettmers et al., 2023) takes a different approach for 4-bit. Instead of spacing bins evenly, it places them at the quantiles of a normal distribution — because neural network weights are approximately normally distributed. This means bins are denser where weights are most concentrated (near zero) and sparser in the tails, making optimal use of only 16 possible values. Double quantization further compresses the scale factors themselves, saving additional memory.
  • GPTQ (Frantar et al., 2023) uses a one-shot weight quantization approach that considers the interaction between weights: when one weight is rounded to a bin, it adjusts the remaining weights to compensate for the rounding error. This layer-wise optimization enables 3-4 bit quantization of 175B-parameter models with negligible accuracy loss.
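To see why quantile-based bins help, here is an illustrative construction that places 16 bin centers at evenly spaced quantiles of a standard normal distribution. (This is not the exact NF4 codebook; the real construction differs in details, such as guaranteeing an exact zero value, but it shows the key property.)

```python
import torch

# 16 bin centers at evenly spaced quantiles of N(0, 1), avoiding the infinite tails.
normal = torch.distributions.Normal(0.0, 1.0)
quantile_bins = normal.icdf(torch.linspace(0.02, 0.98, 16))
print(quantile_bins)
# Bins cluster near 0, where normally distributed weights are densest,
# and spread out toward the tails.
```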

The tools below make these algorithms accessible through a simple configuration interface.

Using bitsandbytes for quantization

bitsandbytes integrates directly with Hugging Face transformers, letting you apply these quantization algorithms at model load time with just a configuration flag. Let’s see the memory savings in practice.

8-bit quantization

bnb_config_8bit = BitsAndBytesConfig(load_in_8bit=True)

model_8bit, mem_8bit, params_8bit = load_and_measure(
    model_name, quantization_config=bnb_config_8bit, label="INT8 (8-bit via bitsandbytes)"
)
del model_8bit
gc.collect()
torch.cuda.empty_cache()

4-bit quantization (NF4)

bnb_config_4bit = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # normalized float 4-bit
    bnb_4bit_use_double_quant=True,       # further compress quantization constants
    bnb_4bit_compute_dtype=torch.float16  # compute in FP16 for speed
)

model_4bit, mem_4bit, params_4bit = load_and_measure(
    model_name, quantization_config=bnb_config_4bit, label="NF4 (4-bit via bitsandbytes)"
)

Compare all configurations

print(f"\n{'='*60}")
print(f"  Summary: GPT-2 (124M params) at different precisions")
print(f"{'='*60}")
print(f"  {'Config':<12} {'Memory (MB)':>14} {'numel()':>16}")
print(f"  {'-'*44}")
print(f"  {'FP32':<12} {mem_fp32:>14,.1f} {params_fp32:>16,}")
print(f"  {'FP16':<12} {mem_fp16:>14,.1f} {params_fp16:>16,}")
print(f"  {'BF16':<12} {mem_bf16:>14,.1f} {params_bf16:>16,}")
print(f"  {'INT8':<12} {mem_8bit:>14,.1f} {params_8bit:>16,}")
print(f"  {'NF4 (4-bit)':<12} {mem_4bit:>14,.1f} {params_4bit:>16,}")

Part 4: The PyTorch parameter count gotcha

If you ran the summary above, you may have noticed something odd: the 4-bit model reports ~ 82M parameters via numel() instead of the expected ~ 124M. Did quantization remove 42 million weights?

No. The model still has the same architecture and the same logical number of parameters. The discrepancy comes from how bitsandbytes stores quantized weights — and from the fact that not all parameters get quantized.

What gets quantized (and what doesn’t)

When you load a model with bitsandbytes, only the large linear layer weight matrices are quantized. Smaller parameters — biases, layer normalization weights, and embedding layers — are kept in their original precision (typically FP16 or FP32). This is by design: these small parameters contribute little to total memory, and quantizing them would hurt quality disproportionately.

Why numel() is misleading for quantized parameters

For the parameters that are quantized, bitsandbytes packs multiple low-bit values into each byte:

  • 8-bit: Each weight occupies 1 byte, stored as a uint8 tensor. numel() still returns the correct count here, since it’s one element per weight.
  • 4-bit: Two weights are packed into a single byte, stored as a uint8 tensor of half the length.

When you call p.numel() on a 4-bit quantized parameter, PyTorch reports the number of elements in the storage tensor (the packed uint8 values), not the number of logical weights. Since two 4-bit values are packed into one uint8 element, numel() returns half the true count for those parameters. Combined with the non-quantized parameters (which report correctly), the total numel() across the model ends up somewhere between the true count and half — in GPT-2’s case, ~ 82M instead of ~ 124M.
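The arithmetic works out if you assume a rough split of GPT-2's parameters (approximate figures for illustration: about 85M weights in quantized linear layers, about 39M in embeddings and other unquantized parameters):

```python
# Rough split for GPT-2 (illustrative; the exact numbers depend on the model):
quantized = 85_000_000    # linear-layer weights, packed two-per-byte at 4-bit
unquantized = 39_000_000  # embeddings, biases, LayerNorms, stored normally

naive_numel = quantized // 2 + unquantized
print(f"naive numel() total ≈ {naive_numel:,}")  # ~82M, not ~124M
print(f"true parameter count = {quantized + unquantized:,}")
```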

Correctly counting parameters

To get the real parameter count, we need to check whether each parameter is a quantized bitsandbytes type and recover the original shape:

def count_parameters_correct(model):
    """Count parameters correctly, handling bitsandbytes quantized layers."""
    total = 0
    quantized = 0
    non_quantized = 0

    for name, param in model.named_parameters():
        if hasattr(param, "quant_state"):
            # This is a bitsandbytes quantized parameter.
            # The original shape is stored in quant_state.
            original_numel = param.quant_state.shape.numel()
            total += original_numel
            quantized += original_numel
        else:
            total += param.numel()
            non_quantized += param.numel()

    return total, quantized, non_quantized

naive_count = sum(p.numel() for p in model_4bit.parameters())
correct_total, quantized_params, non_quantized_params = count_parameters_correct(model_4bit)

print(f"4-bit quantized GPT-2:")
print(f"  Naive numel() count:          {naive_count:,}")
print(f"  Correct parameter count:      {correct_total:,}")
print(f"    - Quantized (true count):   {quantized_params:,}  (numel reports ~{quantized_params // 2:,} due to packing)")
print(f"    - Non-quantized:            {non_quantized_params:,}  (numel reports correctly)")
print(f"  Expected (GPT-2):             ~124,000,000")
print(f"\n  Sanity check: {quantized_params // 2:,} (packed) + {non_quantized_params:,} (unquantized) ≈ {naive_count:,} (naive total) ✓")

What’s happening under the hood

Let’s peek at an individual layer to see the difference between the stored tensor shape and the logical weight shape:

for name, param in model_4bit.named_parameters():
    if hasattr(param, "quant_state"):
        print(f"Layer: {name}")
        print(f"  Storage tensor shape: {param.shape}")
        print(f"  Storage dtype:        {param.dtype}")
        print(f"  numel() reports:      {param.numel():,}")
        print(f"  Original shape:       {param.quant_state.shape}")
        print(f"  Original numel:       {param.quant_state.shape.numel():,}")
        print()
        break  # just show one example

This confirms that quantization doesn’t remove parameters — it repacks them into a more compact representation. The model’s architecture and logical weight count are unchanged, but the storage is compressed.

Summary

What you check        What it tells you
p.numel()             Number of elements in the storage tensor (misleading for quantized params)
p.quant_state.shape   The original logical shape of the weight matrix
GPU memory usage      The actual memory footprint — the metric that matters for fitting on your hardware

The bottom line: quantization does not remove parameters. It changes how they’re stored. Always use quant_state or measure GPU memory directly if you want an accurate picture of a quantized model.

Part 5: When to quantize (and how aggressively)

Now that we’ve seen how quantization works mechanically, the natural next question is: when should you actually use it, and how far should you go?

Quantization is primarily an inference technique

Quantization shines at inference time. Training requires high-precision gradients to make stable updates to model weights, and aggressive quantization (8-bit or below) introduces too much noise for standard backpropagation to work well. For this reason, most models are trained at FP32, BF16, or with mixed-precision strategies (FP16 compute with FP32 accumulation), and then quantized after training for deployment.

The notable exception is QLoRA (Dettmers et al., 2023), which freezes a 4-bit quantized base model and trains only small low-rank adapter (LoRA) layers in higher precision. This makes it possible to fine-tune a 65B-parameter model on a single 48GB GPU — but the base weights themselves are never updated in low precision.
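A QLoRA-style setup with Hugging Face tooling might look roughly like the sketch below (it assumes the peft library; the LoRA hyperparameters shown are illustrative placeholders, not tuned recommendations):

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# Frozen 4-bit base model...
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)

# ...plus small trainable LoRA adapters kept in higher precision.
lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")

# model = AutoModelForCausalLM.from_pretrained(
#     model_name, quantization_config=bnb_config, device_map="auto")
# model = prepare_model_for_kbit_training(model)  # from peft
# model = get_peft_model(model, lora_config)
```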

A bigger model at lower precision often beats a smaller model at full precision

One of the most practical insights from quantization research: you can often get better results by running a larger model at 4-bit than a smaller model at FP16, using the same GPU memory. For example:

  • A 70B model at 4-bit (~ 35 GB) can fit on a single 48GB GPU and typically outperforms a 13B model at FP16 (~ 26 GB) on reasoning and knowledge benchmarks.
  • A 13B model at 4-bit (~ 6.5 GB) fits comfortably on a 12GB consumer GPU and often outperforms a 7B model at FP16 (~ 14 GB, which wouldn’t even fit).

The rule of thumb: spend your memory budget on more parameters first, then reduce precision to fit. A 4-bit model loses a small amount of quality from quantization, but it gains far more from having access to more learned knowledge and capacity.

Speed and cost: quantization isn’t just about fitting

Quantization isn’t just about fitting a model onto your GPU — it also makes inference faster. Lower-precision operations use less memory bandwidth, and for small batch sizes (which are common in interactive applications), memory bandwidth is often the bottleneck. So a 4-bit model doesn’t just use ~8x less memory than FP32 — it can also generate tokens noticeably faster.

This creates a practical consideration:

  • The question worth asking is not just “which model is most accurate?” but “which model gives me acceptable quality at the speed and cost I need?”
  • For batch applications (processing thousands of documents), the speed improvement from quantization can cut costs significantly.
  • For interactive applications (chatbots, coding assistants), faster token generation directly improves user experience.

How low can you go?

Research suggests 4-bit is the practical sweet spot for inference:

  • Dettmers & Zettlemoyer (2023) ran over 35,000 quantization experiments and found that 4-bit precision is nearly universally optimal when trading off total model bits against zero-shot accuracy. At 3-bit, quality degrades sharply.
  • 8-bit quantization (via LLM.int8()) is effectively lossless for most models — it’s a safe default when memory is tight but you don’t want to risk any quality loss.
  • GPTQ (Frantar et al., 2023) demonstrated that one-shot weight quantization to 3-4 bits is feasible even for 175B-parameter models with negligible accuracy loss, enabling single-GPU inference for models that otherwise require multiple GPUs.

A recent study on Scaling Laws for Precision (Kumar, Ankner et al., 2024) adds important nuance: the quality degradation from post-training quantization grows as models are trained on more data. A model trained to its full data budget may be more sensitive to aggressive quantization than one that was undertrained. This means the “safe” bit-width may shift upward as foundation models continue to scale their training data.

Decision flowchart

When deciding whether and how to quantize, work through these questions:

  1. Does the model fit on your GPU at FP16/BF16? If yes, start there — no quantization needed unless you want faster inference.
  2. Is this for training or inference?
    • Training from scratch: Use BF16 (or FP32 if your GPU lacks BF16 support). Don’t quantize.
    • Fine-tuning: If the full model doesn’t fit, use QLoRA (4-bit base + FP16 adapters).
    • Inference: Continue to step 3.
  3. How much quality loss can you tolerate?
    • None: Use INT8. It’s effectively lossless for most models.
    • Minimal, with significant memory savings: Use 4-bit NF4 with double quantization.
    • You need extreme compression: Try 3-bit, but benchmark on your specific task — quality loss at 3-bit is sharp and task-dependent.
  4. Could you run a larger model by quantizing more aggressively? Often the answer is yes. A 70B model at 4-bit typically beats a 13B model at 16-bit.

Quick reference

Scenario                                               Recommended precision
Training from scratch                                  FP32 or BF16 (mixed precision)
Fine-tuning (full)                                     BF16 or FP16
Fine-tuning (parameter-efficient, limited hardware)    QLoRA (4-bit base + FP16 adapters)
Inference (quality-sensitive)                          8-bit (INT8) — effectively lossless
Inference (memory/speed-constrained)                   4-bit (NF4 or GPTQ) — slight quality loss, large memory/speed gain
Inference (extreme compression)                        3-bit or below — expect meaningful quality loss, benchmark carefully
Choosing between model sizes                           Prefer larger model at lower precision over smaller model at full precision

Cleanup

del model_4bit
gc.collect()
torch.cuda.empty_cache()
print("GPU memory cleared.")
