<?xml version="1.0" encoding="UTF-8"?>
<rss  xmlns:atom="http://www.w3.org/2005/Atom" 
      xmlns:media="http://search.yahoo.com/mrss/" 
      xmlns:content="http://purl.org/rss/1.0/modules/content/" 
      xmlns:dc="http://purl.org/dc/elements/1.1/" 
      version="2.0">
<channel>
<title>ML+X Nexus: All Resources</title>
<link>https://uw-madison-datascience.github.io/ML-X-Nexus/</link>
<atom:link href="https://uw-madison-datascience.github.io/ML-X-Nexus/index.xml" rel="self" type="application/rss+xml"/>
<description>All new ML and AI resources from the UW-Madison ML+X community</description>
<generator>quarto-1.9.36</generator>
<lastBuildDate>Tue, 10 Mar 2026 00:00:00 GMT</lastBuildDate>
<item>
  <title>Claude Code Best Practices</title>
  <dc:creator>Chris Endemann</dc:creator>
  <link>https://uw-madison-datascience.github.io/ML-X-Nexus/Learn/Blogs/claude-code-best-practices.html</link>
  <description><![CDATA[ 




<p>AI coding agents — tools that can autonomously write, edit, run, and test code on your behalf — are rapidly changing how software gets built. The space is crowded and evolving fast: <a href="https://docs.anthropic.com/en/docs/claude-code/overview">Claude Code</a>, <a href="https://github.com/features/copilot">GitHub Copilot</a>, <a href="https://www.cursor.com/">Cursor</a>, <a href="https://windsurf.com/">Windsurf</a>, <a href="https://www.augmentcode.com/">Augment Code</a>, <a href="https://aws.amazon.com/q/developer/">Amazon Q Developer</a>, <a href="https://cloud.google.com/products/gemini/code-assist">Gemini Code Assist</a>, and <a href="https://about.gitlab.com/gitlab-duo/">GitLab Duo</a> are among the most prominent, with new entrants appearing regularly.</p>
<p>Best practices in this space are still being discovered — by the ML+X community and the broader developer ecosystem alike. This guide is our attempt to start mapping what works, using <strong><a href="https://code.claude.com/">Claude Code</a></strong> as the primary lens. Claude Code is Anthropic’s agentic coding tool — distinct from the <a href="https://claude.ai">Claude.ai</a> chat interface — and it comes in several forms: a <strong>CLI</strong>, a <strong>desktop app</strong>, <strong>IDE extensions</strong> (<a href="https://code.claude.com/docs/en/vs-code">VS Code</a>, <a href="https://code.claude.com/docs/en/jetbrains">JetBrains</a>), and a <strong>web IDE</strong> at <a href="https://claude.ai/code">claude.ai/code</a>. All give the agent real shell access to read, write, and execute code, which makes it a great lens for exploring the trade-offs of agentic coding: permissions, context management, and cost. We’ll reference Claude.ai, GitHub Copilot, and other tools for comparison where useful.</p>
<p>This is a first pass based on early experience — we expect it to evolve as the ML+X community builds more hands-on knowledge. If you have tips, corrections, or experiences to share, please leave a comment below.</p>
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Note</span>This guide reflects the author’s understanding as of the date it was last modified
</div>
</div>
<div class="callout-body-container callout-body">
<p>AI tools, pricing, features, and contractual terms change frequently. This post is <strong>community guidance, not official UW-Madison policy</strong>. For the latest institutional policies, data-use agreements, or questions about what data types are permitted with specific tools, consult <a href="https://it.wisc.edu/about/division-of-information-technology/research-cyberinfrastructure/">UW-Madison Research Cyberinfrastructure</a> or your department’s IT office.</p>
</div>
</div>
<div class="callout callout-style-default callout-tip callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Tip</span>UW-Madison cloud users: Get started with Claude Code + Vertex AI or Bedrock
</div>
</div>
<div class="callout-body-container callout-body">
<p>If you’re at UW-Madison and want to use Claude Code through your institutional cloud account (GCP or AWS), check out our <strong><a href="../../Learn/Guides/claude-code-cloud-setup.html">Claude Code Cloud Setup Guide</a></strong> for a step-by-step walkthrough — from cloud project setup to running your first session. Note: UW does not yet have a direct data agreement with Anthropic, so <strong>avoid using Claude Code with restricted or sensitive data</strong>. Cloud routing is suitable for general, non-sensitive research code. See Data Privacy for details.</p>
</div>
</div>
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center collapsed" data-bs-toggle="collapse" data-bs-target=".callout-3-contents" aria-controls="callout-3" aria-expanded="false" aria-label="Toggle callout">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Note</span>Sources and attribution
</div>
<div class="callout-btn-toggle d-inline-block border-0 py-1 ps-1 pe-0 float-end"><i class="callout-toggle"></i></div>
</div>
<div id="callout-3" class="callout-3-contents callout-collapse collapse">
<div class="callout-body-container callout-body">
<p>Much of the Claude Code-specific guidance in this post draws on Anthropic’s official documentation, including their <a href="https://code.claude.com/docs/en/best-practices">best practices guide</a>, <a href="https://code.claude.com/docs/en/permissions">permissions</a> and <a href="https://code.claude.com/docs/en/sandboxing">sandboxing</a> docs, <a href="https://code.claude.com/docs/en/memory">CLAUDE.md reference</a>, <a href="https://code.claude.com/docs/en/data-usage">data usage policy</a>, and <a href="https://code.claude.com/docs/en/costs">cost management guide</a>. GitHub Copilot sections draw on GitHub’s <a href="https://github.blog/news-insights/product-news/github-copilot-meet-the-new-coding-agent/">coding agent docs</a> and <a href="https://github.blog/changelog/2026-02-04-claude-and-codex-are-now-available-in-public-preview-on-github/">changelog</a>. Where we paraphrase official documentation, we’ve linked to the source. Community perspectives and independent analyses are cited inline throughout.</p>
</div>
</div>
</div>
<section id="what-is-agentic-coding" class="level2">
<h2 class="anchored" data-anchor-id="what-is-agentic-coding">What is agentic coding?</h2>
<p>Traditional AI code assistants (like early GitHub Copilot or ChatGPT) work in a simple loop: you ask, they suggest, you accept or reject. <strong>Agentic</strong> coding tools go further. They can:</p>
<ul>
<li>Read and navigate your entire codebase</li>
<li>Execute shell commands and run tests</li>
<li>Edit multiple files in a single pass</li>
<li>Iterate on their own output (fix errors, re-run tests, refine)</li>
<li>Operate semi-autonomously over multi-step tasks</li>
</ul>
<p>This is powerful, but it also means these tools have real access to your system — and the potential to do real damage if not managed carefully.</p>
</section>
<section id="the-landscape-at-a-glance" class="level2">
<h2 class="anchored" data-anchor-id="the-landscape-at-a-glance">The landscape at a glance</h2>
<p>Before diving into Claude Code specifically, here’s a rough map of the major agentic coding tools as of early 2026:</p>
<table class="caption-top table">
<thead>
<tr class="header">
<th>Tool</th>
<th>Interface</th>
<th>Cost model</th>
<th>Notable strengths</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><a href="https://code.claude.com/"><strong>Claude Code</strong></a></td>
<td>CLI, desktop app, IDE extensions, <a href="https://claude.ai/code">web IDE</a></td>
<td>Pay-per-token (API) or Max plan</td>
<td>Strong multi-step reasoning, explicit permission model, <code>CLAUDE.md</code> project config</td>
</tr>
<tr class="even">
<td><a href="https://github.com/features/copilot"><strong>GitHub Copilot</strong></a></td>
<td>VS Code/IDE, GitHub.com</td>
<td>Subscription + usage-based</td>
<td>Native GitHub integration, async PR creation via coding agent, multi-model support</td>
</tr>
<tr class="odd">
<td><a href="https://www.cursor.com/"><strong>Cursor</strong></a></td>
<td>Custom IDE (VS Code fork)</td>
<td>Subscription</td>
<td>Polished IDE experience, fast inline edits, multi-file context handling</td>
</tr>
<tr class="even">
<td><a href="https://windsurf.com/"><strong>Windsurf</strong></a></td>
<td>Custom IDE</td>
<td>Subscription (free tier available)</td>
<td>Low-friction agentic workflow, accessible pricing</td>
</tr>
<tr class="odd">
<td><a href="https://www.augmentcode.com/"><strong>Augment Code</strong></a></td>
<td>IDE extension</td>
<td>Subscription</td>
<td>Large context window, whole-codebase awareness</td>
</tr>
<tr class="even">
<td><a href="https://aws.amazon.com/q/developer/"><strong>Amazon Q Developer</strong></a></td>
<td>IDE, CLI, AWS console</td>
<td>Free tier / Pro</td>
<td>Deep AWS service integration, infrastructure-aware suggestions</td>
</tr>
<tr class="odd">
<td><a href="https://codeassist.google/"><strong>Gemini Code Assist</strong></a></td>
<td>IDE, Google Cloud</td>
<td>Free tier / Enterprise</td>
<td>Google Cloud integration, Gemini model access</td>
</tr>
<tr class="even">
<td><a href="https://about.gitlab.com/gitlab-duo/"><strong>GitLab Duo</strong></a></td>
<td>GitLab IDE, MR workflows</td>
<td>GitLab subscription add-on</td>
<td>Native GitLab CI/CD and merge request integration</td>
</tr>
</tbody>
</table>
<p>This space is moving fast — capabilities and pricing change frequently. See <a href="https://artificialanalysis.ai/insights/coding-agents-comparison">Coding Agents Comparison</a> for up-to-date benchmarks and pricing.</p>
<p>In practice, many developers use <strong>multiple tools</strong>: a chat UI for brainstorming and review, an agentic tool for multi-step feature work, and an IDE copilot for inline completions throughout the day.</p>
<section id="what-the-same-task-looks-like-across-different-tools" class="level3">
<h3 class="anchored" data-anchor-id="what-the-same-task-looks-like-across-different-tools">What the same task looks like across different tools</h3>
<p>To make these distinctions concrete, let’s walk through the same scenario — <em>“I have a repo on GitHub and I want Claude to add a utility function, write tests, and open a PR”</em> — across Claude.ai, Claude Code, and GitHub Copilot.</p>
<section id="claude.ai-chat-not-claude-code" class="level4">
<h4 class="anchored" data-anchor-id="claude.ai-chat-not-claude-code">Claude.ai (Chat — not Claude Code)</h4>
<p><a href="https://claude.ai">Claude.ai</a> is Anthropic’s general-purpose chat interface. It’s <em>not</em> an agentic coding tool — it can’t execute code, edit files, or run commands on your system. You provide context by pasting code into the conversation, and you copy the output back into your editor.</p>
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center collapsed" data-bs-toggle="collapse" data-bs-target=".callout-4-contents" aria-controls="callout-4" aria-expanded="false" aria-label="Toggle callout">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Note</span>The Claude Desktop app has multiple tabs — don’t confuse them
</div>
<div class="callout-btn-toggle d-inline-block border-0 py-1 ps-1 pe-0 float-end"><i class="callout-toggle"></i></div>
</div>
<div id="callout-4" class="callout-4-contents callout-collapse collapse">
<div class="callout-body-container callout-body">
<p>The <a href="https://claude.ai/download">Claude Desktop app</a> includes three tabs: <strong>Chat</strong> (standard conversation, with <a href="https://modelcontextprotocol.io/">MCP</a> support), <strong>Code</strong> (the full Claude Code agentic experience — see below), and <strong>Cowork</strong> (an autonomous agent for knowledge work — it can execute multi-step tasks on your desktop, like research, file organization, and document creation). Only the <strong>Code</strong> tab is an agentic coding tool. The Chat tab provides the same experience as <a href="https://claude.ai">claude.ai</a> in a native app.</p>
</div>
</div>
</div>
<ol type="1">
<li>Start a new conversation at <a href="https://claude.ai">claude.ai</a> or in the Claude Desktop app’s Chat tab</li>
<li>Paste in the relevant code (e.g., the contents of <code>src/utils/</code> and a few example utilities)</li>
<li>Ask: <em>“Add a <code>slugify</code> function that matches the style of these existing utilities. Also write tests.”</em></li>
<li>Claude generates the code and tests in the chat</li>
<li><strong>You</strong> copy the output back into your editor, create a branch, commit, and open the PR yourself</li>
</ol>
<p><strong>Friction:</strong> You’re the middleware in both directions — pasting code in and copying code out. But notice what’s <em>not</em> here: permission prompts, approve/deny flows, or any risk of it running a bad command. Claude can’t touch your system, so the conversation feels fast and fluid even though you do all the manual work.</p>
<p><strong>Best for:</strong> Quick code generation, architecture discussions, explaining unfamiliar code, and brainstorming — any task where you’re happy to provide context manually and apply changes yourself.</p>
</section>
<section id="claude-code" class="level4">
<h4 class="anchored" data-anchor-id="claude-code">Claude Code</h4>
<p><a href="https://code.claude.com/">Claude Code</a> is Anthropic’s agentic coding tool — completely different from the Claude.ai chat interface. You point it at a repository (by attaching a GitHub repo on the web or desktop, or launching it from a project directory in the terminal), and it can read your code, edit files, run shell commands, execute tests, and iterate on its own output — all within the scope of that project.</p>
<p>Claude Code is available across multiple surfaces — a <a href="https://code.claude.com/docs/en/desktop">desktop app</a>, a <a href="https://code.claude.com/docs/en/claude-code-on-the-web">web IDE</a>, a <a href="https://code.claude.com/docs/en/getting-started">terminal CLI</a>, and IDE extensions for <a href="https://code.claude.com/docs/en/vs-code">VS Code</a> and <a href="https://code.claude.com/docs/en/jetbrains">JetBrains</a> — but the core agentic engine is the same everywhere. You describe what you want, it reads your code, makes changes, runs tests, and iterates until the task is done. The differences between surfaces are mostly about <em>how you interact</em>, <em>where the work runs</em>, and <em>how much the agent can do autonomously</em>.</p>
<p>Here’s what a typical Claude Code session looks like. You type a request like:</p>
<blockquote class="blockquote">
<p><em>Look at src/utils/ and add a slugify function that matches the style of existing utilities. Write tests too. Create a branch, commit, and open a PR when you’re done.</em></p>
</blockquote>
<p>Claude Code will:</p>
<ol type="1">
<li>Read your existing utils to understand the style</li>
<li>Write the function and tests</li>
<li>Run <code>pytest</code> (or whatever your test runner is), see results</li>
<li>If tests fail, iterate — fix the code, re-run</li>
<li>Create a branch, commit, push, and open a PR</li>
</ol>
<p><strong>How much you’re involved depends on the surface.</strong> On the <strong><a href="https://claude.ai/code">web version</a></strong>, Claude runs in an isolated cloud VM and auto-accepts edits — you review the results (a PR, a diff, test output) rather than approving each individual action. The <strong>desktop app</strong> and <strong>CLI</strong> both default to “ask permissions” mode, where Claude proposes changes and waits for your approval before applying them. The desktop app shows visual diffs with accept/reject buttons; the CLI prompts in the terminal. You can reduce this friction on either surface by switching to “auto accept edits” mode, configuring <a href="https://code.claude.com/docs/en/permissions">allow rules</a>, or enabling <a href="https://code.claude.com/docs/en/sandboxing">sandboxing</a> to auto-approve actions that stay within your project directory. Many experienced users auto-approve most actions and invest their review time at the PR stage instead. If you’re just getting started, a good sweet spot is: auto-approve reads and test execution, manually approve writes and git operations.</p>
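<p>As a concrete starting point, the “auto-approve reads and test execution, manually approve writes and git operations” split above can be sketched as allow/deny rules in a project-level <code>.claude/settings.json</code>. The rule strings and globs below are illustrative, not a recommended policy; check the <a href="https://code.claude.com/docs/en/permissions">permissions docs</a> for the exact rule syntax your version supports.</p>

```shell
# Hypothetical starter permission rules (illustrative rule strings and globs;
# verify against the Claude Code permissions docs before relying on them).
mkdir -p .claude
cat > .claude/settings.json <<'EOF'
{
  "permissions": {
    "allow": [
      "Read(**)",
      "Bash(pytest:*)"
    ],
    "deny": [
      "Read(./.env)",
      "Read(~/.ssh/**)"
    ]
  }
}
EOF
```

<p>With a file like this in place, reads and test runs proceed without prompts, secrets stay hard-blocked, and everything else (writes, git operations) still asks for approval.</p>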
<section id="claude-code-desktop-web" class="level5">
<h5 class="anchored" data-anchor-id="claude-code-desktop-web">Desktop &amp; Web</h5>
<p>The easiest way to get started is through the <a href="https://code.claude.com/docs/en/desktop">Claude Desktop app</a> (Code tab) or <a href="https://code.claude.com/docs/en/claude-code-on-the-web">Claude Code on the web</a> at <a href="https://claude.ai/code">claude.ai/code</a>. Both provide the same GUI experience. The main difference is where it runs: the desktop app works with local git repositories on your machine, with each session getting its own isolated <a href="https://git-scm.com/docs/git-worktree">git worktree</a> so parallel tasks don’t collide. The web version clones your GitHub repo into an isolated cloud VM — no local setup needed. The web version is also available on mobile (<a href="https://apps.apple.com/us/app/claude-by-anthropic/id6473753684">iOS</a> / <a href="https://play.google.com/store/apps/details?id=com.anthropic.claude">Android</a>) for kicking off and monitoring tasks on the go. <strong>Note:</strong> the desktop app requires Git — your project must be a git repo with at least one commit.</p>
<p>Key capabilities:</p>
<ul>
<li><strong>Visual diff review</strong> — see exactly what Claude changed, leave inline comments on specific lines, and ask Claude to revise</li>
<li><strong>Live app preview</strong> — Claude can start a dev server and verify its own changes in an embedded browser, taking screenshots and fixing issues it finds</li>
<li><strong>Parallel sessions</strong> — run multiple tasks simultaneously in separate tabs, each on its own isolated branch</li>
<li><strong>GitHub PR monitoring</strong> — watch CI status, auto-fix failing checks, and auto-merge when everything passes</li>
<li><strong><a href="https://code.claude.com/docs/en/scheduled-tasks">Scheduled tasks</a></strong> — set up recurring tasks using cron expressions (e.g., daily dependency checks, periodic code reviews, deployment monitoring). On desktop, these persist across sessions; on the web, you can <a href="https://support.claude.com/en/articles/13854387-schedule-recurring-tasks-in-cowork">schedule them in Cowork</a>. In the CLI, use the <code>/loop</code> skill for lightweight in-session polling</li>
<li><strong>Connectors</strong> — one-click integrations for GitHub, Slack, Linear, Notion, and more</li>
<li><strong>Async handoff</strong> — start a task on the web and close your laptop; it runs in the cloud and notifies you when done. You can also start a task from the terminal with <code>claude --remote</code>, or pull a web session into your terminal with <code>claude --teleport</code></li>
</ul>
<p><strong>Best for:</strong> Users who prefer a GUI, want visual diff review and parallel task management, or want to get started without installing anything. The web version is the fastest way to try Claude Code — just open <a href="https://claude.ai/code">claude.ai/code</a> and point it at a repo.</p>
</section>
<section id="terminal-cli" class="level5">
<h5 class="anchored" data-anchor-id="terminal-cli">Terminal (CLI)</h5>
<p>Claude Code is also available as a CLI, installed via npm (<code>npm install -g @anthropic-ai/claude-code</code>). It’s the same agentic engine, but the terminal interface offers some distinct advantages:</p>
<ul>
<li><strong>IDE extensions</strong> — Claude Code integrates directly into <a href="https://code.claude.com/docs/en/vs-code">VS Code</a> and <a href="https://code.claude.com/docs/en/jetbrains">JetBrains</a>, so you can use it without leaving your editor</li>
<li><strong>Scriptability</strong> — pipe commands, chain with shell tools, and integrate into automated workflows (CI/CD, git hooks)</li>
<li><strong><code>CLAUDE.md</code> authoring</strong> — the terminal is the natural place to set up and iterate on your project’s <code>CLAUDE.md</code> configuration</li>
<li><strong>SSH and remote environments</strong> — works anywhere you have a terminal, including remote servers, containers, and cloud dev environments</li>
<li><strong>Full local control</strong> — no cloud dependency; everything runs on your machine (or wherever your terminal is)</li>
<li><strong>Flexible auth and billing</strong> — the desktop and web apps require an Anthropic login (Max plan or API credits). The CLI also supports routing requests through <a href="https://code.claude.com/docs/en/google-vertex-ai">Google Vertex AI</a> or <a href="https://code.claude.com/docs/en/amazon-bedrock">AWS Bedrock</a>, so organizations that need to keep API traffic within their own cloud environment (for compliance, billing, or data residency reasons) can do so. See our <a href="../../Learn/Guides/claude-code-cloud-setup.html">Claude Code Cloud Setup Guide</a> for a step-by-step walkthrough using UW-Madison GCP or AWS</li>
</ul>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb1-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Install npm if needed (e.g., on a fresh WSL2 or Ubuntu setup)</span></span>
<span id="cb1-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sudo</span> apt install npm</span>
<span id="cb1-3"></span>
<span id="cb1-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Install and start Claude Code (sudo needed on Linux/WSL2)</span></span>
<span id="cb1-5"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sudo</span> npm install <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-g</span> @anthropic-ai/claude-code</span>
<span id="cb1-6"></span>
<span id="cb1-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Install sandbox dependencies (WSL2/Linux only)</span></span>
<span id="cb1-8"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sudo</span> apt-get update <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">&amp;&amp;</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sudo</span> apt-get install bubblewrap socat</span>
<span id="cb1-9"></span>
<span id="cb1-10"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Navigate to your project and launch Claude Code</span></span>
<span id="cb1-11"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Note: in WSL2, your Windows files are at /mnt/c/Users/&lt;username&gt;/...</span></span>
<span id="cb1-12"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">cd</span> yourrepo <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">&amp;&amp;</span> <span class="ex" style="color: null;
background-color: null;
font-style: inherit;">claude</span></span></code></pre></div></div>
<p><strong>Best for:</strong> Developers comfortable with the terminal, CI/CD integration, scripting and automation, working in remote/SSH environments, and organizations that need to route traffic through their own cloud provider.</p>
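<p>As one example of the scriptability mentioned above, you can drive Claude Code non-interactively. This sketch assumes the CLI’s print mode (<code>claude -p</code>) is available in your version; the script name and prompt are hypothetical.</p>

```shell
# Sketch of a pre-push review helper (assumes `claude -p` print mode exists
# in your CLI version). We only write and syntax-check the script here;
# running it requires Claude Code to be installed and authenticated.
cat > review.sh <<'EOF'
#!/usr/bin/env bash
set -euo pipefail
# Summarize the diff against main and ask Claude for a quick review.
git diff main...HEAD | claude -p "Review this diff for bugs and style issues"
EOF
chmod +x review.sh
bash -n review.sh   # syntax-check without invoking claude
```

<p>A script like this can be wired into a git pre-push hook or a CI job, which is where the terminal surface pulls ahead of the desktop and web apps.</p>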
</section>
<section id="security-note" class="level5">
<h5 class="anchored" data-anchor-id="security-note">A note on security: Claude Code runs with <em>your</em> permissions</h5>
<div class="callout callout-style-default callout-warning callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Warning</span>Agentic coding tools can access more than you might expect
</div>
</div>
<div class="callout-body-container callout-body">
<p>The level of system access depends on which surface you use:</p>
<ul>
<li><strong>CLI and desktop app</strong> — Claude Code operates with your full user-level filesystem and shell permissions. It can read your SSH keys, modify files outside your project, run arbitrary shell commands, and access anything your user account can reach.</li>
<li><strong>IDE extensions</strong> (VS Code, JetBrains) — same access as the CLI, since the extension runs Claude Code as a local process under your user account.</li>
<li><strong>Web version</strong> (claude.ai/code) — runs in an isolated cloud VM with access only to your cloned GitHub repo. It cannot reach your local filesystem, SSH keys, or other local resources. This is the most restricted surface by default.</li>
</ul>
<p>This isn’t unique to Claude Code — any agentic tool with shell access (Cursor, Windsurf, Copilot coding agent) has similar access on your local machine. The difference is in what mitigations each tool provides.</p>
</div>
</div>
<p>Claude Code mitigates this with several layers of protection:</p>
<ul>
<li><strong>Permission prompts</strong> — Claude asks before every file write, shell command, and git operation. You can configure allow/deny rules to auto-approve trusted actions and hard-block sensitive paths.</li>
<li><strong>Built-in sandboxing</strong> — an <a href="https://code.claude.com/docs/en/sandboxing">OS-level sandbox</a> restricts filesystem access to your project directory and limits outbound network traffic. <strong>This is the single most impactful security measure you can enable.</strong></li>
<li><strong>Desktop app</strong> — adds git worktree isolation on top of the sandbox, so changes in one session don’t affect others until committed.</li>
<li><strong>Web version</strong> (claude.ai/code) — the most restricted surface. Each task runs in a fresh, ephemeral VM with <a href="https://gvisor.dev/">gVisor-based kernel isolation</a>; storage is wiped when the task completes and credentials never exist inside the sandbox.</li>
</ul>
<p>See Security fundamentals below for configuration details, deny-rule examples, and container guidance — or the <a href="../../Learn/Guides/claude-code-cloud-setup.html#understand-risk">Cloud Setup Guide’s security section</a> for a step-by-step walkthrough.</p>
</section>
</section>
<section id="github-copilot" class="level4">
<h4 class="anchored" data-anchor-id="github-copilot">GitHub Copilot</h4>
<p><a href="https://github.com/features/copilot">GitHub Copilot</a> is GitHub’s AI coding assistant. It’s a <strong>multi-model platform</strong> — you can choose from Claude, GPT, Gemini, and others as the underlying model. This is fundamentally different from Claude Code, and the distinction matters.</p>
<section id="claude-in-copilot-vs.-claude-code-whats-actually-different" class="level5">
<h5 class="anchored" data-anchor-id="claude-in-copilot-vs.-claude-code-whats-actually-different">“Claude” in Copilot vs.&nbsp;Claude Code: what’s actually different?</h5>
<p>When you select Claude as the model in Copilot (whether in VS Code agent mode or the async coding agent), you’re using Claude’s <em>language model</em> — but <strong>GitHub’s orchestration layer</strong> is driving it. GitHub controls the system prompts, the tool-calling framework, the context management, and how your instructions are delivered to the model. Think of it as Claude’s brain in GitHub’s body.</p>
<p>Claude Code, by contrast, is Anthropic’s <em>own</em> agentic system built specifically around Claude. Anthropic controls the entire stack: the system prompts are purpose-built for agentic coding, the tool framework is designed for Claude’s strengths, and features like extended thinking, <code>CLAUDE.md</code> project configuration, and the permission model are all tightly integrated.</p>
<p><strong>Why this matters in practice:</strong></p>
<ul>
<li><strong>Context handling</strong> — Copilot primarily derives context from open tabs and (when indexing is enabled) broader repo structure, with a <a href="https://docs.github.com/en/copilot/reference/ai-models/supported-models">platform-level cap of ~128k tokens</a>. Claude Code uses Claude’s full 200k-token context window and maps your entire repository, accumulating context through conversation threading. For multi-file tasks, Claude Code generally <a href="https://www.sitepoint.com/github-copilot-vs-claude-code-accuracy-speed-2026/">understands project architecture more holistically</a>.</li>
<li><strong>Instruction following</strong> — Claude Code reads your <code>CLAUDE.md</code> files natively. Copilot has its own instruction mechanism (<code>copilot-instructions.md</code>), but users have <a href="https://github.com/orgs/community/discussions/176156">reported</a> that Claude models don’t always follow Copilot’s instruction files as reliably — because the model is being orchestrated by a system designed for multiple models, not optimized for any one.</li>
<li><strong>Extended thinking</strong> — Claude Code uses extended thinking by default with adjustable token budgets. Copilot support for thinking tokens has been <a href="https://github.com/orgs/community/discussions/176156">inconsistent</a>, with some configurations producing errors when extended thinking parameters are passed.</li>
<li><strong>Tools and sub-agents</strong> — Claude Code ships with 18+ built-in tools (file editing, bash, search, git, sub-agents), plus full <a href="https://modelcontextprotocol.io/">MCP</a> support and hooks. Copilot agent mode uses its own curated tool set, which is capable but less extensive.</li>
<li><strong>Quality on complex tasks</strong> — In a <a href="https://www.sitepoint.com/github-copilot-vs-claude-code-accuracy-speed-2026/">50-session benchmark study</a>, Claude Code produced a higher accept rate (44% vs 38%) and scored significantly better on bug-fixing context fidelity (8.5/10 vs 5.9/10). Copilot was ~15 seconds faster per task on average and excels at inline completions.</li>
</ul>
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center collapsed" data-bs-toggle="collapse" data-bs-target=".callout-6-contents" aria-controls="callout-6" aria-expanded="false" aria-label="Toggle callout">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Note</span>Claude as a standalone agent on GitHub (Feb 2026)
</div>
<div class="callout-btn-toggle d-inline-block border-0 py-1 ps-1 pe-0 float-end"><i class="callout-toggle"></i></div>
</div>
<div id="callout-6" class="callout-6-contents callout-collapse collapse">
<div class="callout-body-container callout-body">
<p>As of <a href="https://github.blog/changelog/2026-02-04-claude-and-codex-are-now-available-in-public-preview-on-github/">February 2026</a>, Claude is also available as a <strong>standalone agent</strong> on GitHub — not just a model choice within Copilot. You can assign issues directly to <code>@claude</code> (or <code>@copilot</code>, or <code>@codex</code>) on GitHub.com, and in <a href="https://code.visualstudio.com/blogs/2026/02/05/multi-agent-development">VS Code 1.109+</a> you can start Claude agent sessions that use Anthropic’s own agent harness rather than Copilot’s orchestration. In these modes, you get the same prompts, tools, and architecture as Claude Code — which should close the quality gap vs.&nbsp;using Claude as a model within Copilot. Initially available for Pro+ and Enterprise plans; <a href="https://github.blog/changelog/2026-02-26-claude-and-codex-now-available-for-copilot-business-pro-users/">expanded to Copilot Business and Pro</a> on Feb 26 at no additional cost.</p>
</div>
</div>
</div>
</section>
<section id="agent-mode-in-vs-code" class="level5">
<h5 class="anchored" data-anchor-id="agent-mode-in-vs-code">Agent mode (in VS Code)</h5>
<ol type="1">
<li>Open your repo in VS Code with the Copilot extension installed</li>
<li>Open the Copilot chat panel (Ctrl/Cmd+Shift+I)</li>
<li>Select agent mode, choose Claude as the model</li>
<li>Type: <em>“Add a slugify function to src/utils/ matching the existing style. Write tests.”</em></li>
</ol>
<p>Copilot will:</p>
<ol type="1">
<li>Read relevant files</li>
<li>Create/edit files directly — <strong>no permission prompt by default</strong> in many configurations</li>
<li>Run tests if it decides to (or if you ask it to)</li>
</ol>
<p>From there, <strong>you</strong> review the changes in VS Code’s diff view and handle the git workflow (branch, commit, push, PR) — or use the async coding agent for that.</p>
<p><strong>Friction:</strong> The IDE experience is smooth, but you have less visibility into <em>why</em> the agent made certain choices. Agent mode is still evolving — for complex multi-step tasks it may not iterate as effectively as Claude Code’s agentic loop. The upside is zero context-switching: you’re already in your editor.</p>
</section>
<section id="coding-agent-async" class="level5">
<h5 class="anchored" data-anchor-id="coding-agent-async">Coding agent (async)</h5>
<p>GitHub’s async coding agents let you delegate work directly from issues and PRs — no IDE or terminal needed:</p>
<ol type="1">
<li>Go to your repo on GitHub.com</li>
<li>Create an issue: <em>“Add a slugify utility function to src/utils/ with tests”</em></li>
<li>Assign the issue to <code>@copilot</code>, <code>@claude</code>, or <code>@codex</code> via the Assignees dropdown</li>
<li>Walk away — the agent works in a secure cloud environment</li>
</ol>
<p>The agent will:</p>
<ol type="1">
<li>Create a branch</li>
<li>Implement the function and tests in an ephemeral environment</li>
<li>Open a draft PR referencing the issue</li>
</ol>
<p>You get a notification when the PR is ready to review, and you can leave review comments mentioning <code>@claude</code> to request changes — the agent iterates like a human collaborator.</p>
<p><strong>What’s running under the hood?</strong> When you assign to <code>@claude</code>, GitHub runs Anthropic’s <a href="https://github.com/anthropics/claude-code-action">Claude Code Action</a> — which uses the same Claude Code engine (agentic loop, tools, extended thinking) that powers the CLI and desktop app. The key difference is that it runs in GitHub’s managed environment rather than your local machine, and its scope is limited to the repo and issue context. Assigning to <code>@copilot</code> uses GitHub’s own orchestration with your selected model, and <code>@codex</code> uses OpenAI’s agent.</p>
<p>By default, the async coding agent uses Claude Sonnet 4.6 when no model is explicitly selected. You can choose from Claude Opus 4.6, Claude Sonnet 4.5, GPT-5.1-Codex-Max, GPT-5.2-Codex, and others via the model picker.</p>
<p><strong>Friction:</strong> This is the most hands-off option, but you have the least control during execution. Works best for well-scoped, clearly described issues. If the task is ambiguous or requires judgment calls, you may end up doing multiple rounds of PR review and comments to guide it.</p>
<p><strong>Best for:</strong> Inline autocomplete, single-file edits, and quick agent tasks within the IDE. Also excellent for async PR generation on well-defined issues. Many developers use Copilot <em>alongside</em> Claude Code — Copilot for inline completions in the editor, Claude Code in the terminal for deep multi-file work.</p>
</section>
</section>
<section id="key-takeaway" class="level4">
<h4 class="anchored" data-anchor-id="key-takeaway">Key takeaway</h4>
<p>The same task ranges from fully manual (Claude.ai — you apply every change) to fully hands-off (Copilot coding agent — you just review the PR). But “more autonomous” doesn’t always mean “better results.”</p>
<p>Counterintuitively, <strong>Claude.ai can feel lower-friction than Claude Code</strong> for many tasks — the chat interface just <em>answers</em>, with no permission prompts or approve/deny flow. You lose the ability to have Claude execute things directly, but you gain a frictionless conversation. Claude Code (in any form) is far more capable — it can run tests, iterate on failures, and push code — but its default guardrails (which exist for good reason) mean more interruptions until you tune them.</p>
<p>The trade-off is between <strong>autonomy</strong>, <strong>control</strong>, and <strong>optimization</strong>:</p>
<ul>
<li><strong>Claude.ai (chat)</strong> — not agentic, but fluid and zero-risk. You do the manual work.</li>
<li><strong>Claude Code (desktop, web, CLI, or IDE extension)</strong> — fully agentic, with Anthropic’s purpose-built orchestration optimized for Claude. The deepest integration between model and tooling.</li>
<li><strong>Copilot with Claude model (IDE)</strong> — agentic within the IDE, fewer interruptions, but Claude is running through GitHub’s orchestration layer rather than Anthropic’s. Good for inline work; less optimized for complex multi-step reasoning.</li>
<li><strong>Claude agent on GitHub (async)</strong> — Anthropic’s own agent harness running on GitHub’s infrastructure. Assign issues to <code>@claude</code> for async PR generation.</li>
</ul>
<p>Pick based on the task. Sensitive work or unfamiliar codebase? Claude Code’s guardrails are a feature. Quick question or brainstorming? Claude.ai chat is hard to beat. Already in VS Code and want inline help? Copilot is the natural fit. Need Claude’s full reasoning depth on a complex refactor? Claude Code is the most direct path to the model’s capabilities.</p>
</section>
</section>
</section>
<section id="working-effectively-with-claude-code" class="level2">
<h2 class="anchored" data-anchor-id="working-effectively-with-claude-code">Working effectively with Claude Code</h2>
<p>This is the core of the guide. Whether you’re using the CLI, desktop app, an IDE extension, or the web IDE, these practices apply across all Claude Code surfaces. Anthropic’s own <a href="https://code.claude.com/docs/en/best-practices">best practices guide</a> goes deeper on context management, prompt patterns, and scaling across parallel sessions — we’ll highlight the essentials here and add our own perspective.</p>
<section id="think-in-features-not-projects" class="level3">
<h3 class="anchored" data-anchor-id="think-in-features-not-projects">Think in features, not projects</h3>
<p>One of the biggest lessons from working with agentic coding tools: <strong>use them for feature-level development, not for building entire projects in one shot.</strong></p>
<p>Why? Because agents work best with clear, well-scoped requests. The less clarity you provide, the more the agent has to guess — and guessing leads to:</p>
<ul>
<li>Agentic loops (trying approaches, failing, trying again)</li>
<li>Drift from your intended architecture</li>
<li>Wasted tokens and time</li>
<li>Code that technically works but doesn’t match your vision</li>
</ul>
<p><strong>Precise requests get precise results.</strong> Instead of “build me a web app with auth,” try:</p>
<ul>
<li>“Add a login form component that submits to <code>/api/auth/login</code> and stores the JWT in a httpOnly cookie”</li>
<li>“Write a pytest fixture that creates a test database with the schema from <code>models.py</code>”</li>
<li>“Refactor the <code>process_data</code> function in <code>pipeline.py</code> to handle the case where <code>input_df</code> has missing columns”</li>
</ul>
<p>Each of these is a single, well-defined task that an agent can execute without ambiguity.</p>
</section>
<section id="use-claude.md-as-your-control-surface" class="level3">
<h3 class="anchored" data-anchor-id="use-claude.md-as-your-control-surface">Use <code>CLAUDE.md</code> as your control surface</h3>
<p><a href="https://code.claude.com/docs/en/memory"><code>CLAUDE.md</code></a> is a markdown file you place in your project root that gives Claude Code persistent context about your project. Think of it as a README for the agent — it’s loaded automatically at the start of every session and shapes how Claude behaves. You can include things like:</p>
<ul>
<li>How your project is structured (key directories, entry points)</li>
<li>Coding conventions (naming, formatting, patterns to follow or avoid)</li>
<li>Testing and build commands</li>
<li>Safety rules (“never force-push,” “don’t modify migrations/”)</li>
<li>Links to docs or specs the agent should reference</li>
</ul>
<p>Claude Code also supports <code>CLAUDE.md</code> files in subdirectories (loaded when Claude works in that directory) and a global <code>~/.claude/CLAUDE.md</code> for preferences that apply across all projects. The file is advisory — Claude will follow these instructions in good faith, but they’re not enforced at the system level the way <a href="https://code.claude.com/docs/en/hooks">hooks</a> or <a href="https://code.claude.com/docs/en/permissions">deny rules</a> are. For anything safety-critical, back it up with a hook or deny rule.</p>
<p>This is one of the most underrated features — it’s your main lever for shaping how the agent behaves across sessions.</p>
<p><strong>Prevent runaway loops:</strong></p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode markdown code-with-copy"><code class="sourceCode markdown"><span id="cb2-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">## Testing requirements</span></span>
<span id="cb2-2"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">- </span>Always run the full test suite (<span class="in" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">`pytest tests/`</span>) after making changes</span>
<span id="cb2-3"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">- </span>If tests fail, fix the failing tests before moving on</span>
<span id="cb2-4"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">- </span>Do not push code with failing tests</span></code></pre></div></div>
<p>These few lines save enormous headaches. Without them, the agent might push broken code, you discover the test failures in CI, and you end up fixing things that should have been caught locally.</p>
<p><strong>Enforce project conventions:</strong></p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode markdown code-with-copy"><code class="sourceCode markdown"><span id="cb3-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">## Code style</span></span>
<span id="cb3-2"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">- </span>Use type hints for all function signatures</span>
<span id="cb3-3"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">- </span>Follow the existing import ordering convention</span>
<span id="cb3-4"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">- </span>Do not add new dependencies without asking first</span></code></pre></div></div>
<p><strong>Limit destructive actions:</strong></p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode markdown code-with-copy"><code class="sourceCode markdown"><span id="cb4-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">## Safety</span></span>
<span id="cb4-2"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">- </span>Never run <span class="in" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">`rm -rf`</span> on any directory</span>
<span id="cb4-3"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">- </span>Never force-push to any branch</span>
<span id="cb4-4"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">- </span>Never modify files in the <span class="in" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">`config/production/`</span> directory</span>
<span id="cb4-5"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">- </span>Always create a new branch for changes; never commit directly to main</span></code></pre></div></div>
<p><strong>Provide architectural context:</strong></p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode markdown code-with-copy"><code class="sourceCode markdown"><span id="cb5-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">## Project structure</span></span>
<span id="cb5-2"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">- </span>API routes go in <span class="in" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">`src/routes/`</span></span>
<span id="cb5-3"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">- </span>Business logic goes in <span class="in" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">`src/services/`</span></span>
<span id="cb5-4"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">- </span>Database models are in <span class="in" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">`src/models/`</span></span>
<span id="cb5-5"><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">- </span>Tests mirror the source structure under <span class="in" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">`tests/`</span></span></code></pre></div></div>
<p>Good <code>CLAUDE.md</code> context reduces agentic loops — the agent spends less time exploring your project and more time doing useful work. But there’s a trade-off: <strong><code>CLAUDE.md</code> loads into every session</strong>, and a bloated file can hurt more than it helps.</p>
<div class="callout callout-style-default callout-warning callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Warning</span>Don’t overload <code>CLAUDE.md</code>
</div>
</div>
<div class="callout-body-container callout-body">
<p>LLMs perform best when their context window is full of <em>focused, relevant</em> content. Since <code>CLAUDE.md</code> is injected into every conversation, everything in it competes for attention with the actual task at hand. A few things to keep in mind:</p>
<ul>
<li><strong>Aim for under 300 lines.</strong> Some teams keep theirs <a href="https://www.humanlayer.dev/blog/writing-a-good-claude-md">under 60</a>. There’s no hard limit, but shorter files mean less context pollution.</li>
<li><strong>Frontier models can track ~150–200 instructions consistently.</strong> Beyond that, the model starts selectively ignoring rules it considers irrelevant to the current task — which means your most important instructions may get lost in the noise.</li>
<li><strong>Use linters, not <code>CLAUDE.md</code>, for code style.</strong> If a tool can enforce a rule deterministically, don’t spend context budget asking an LLM to follow it.</li>
<li><strong>Prefer progressive disclosure.</strong> Rather than documenting everything upfront, tell Claude <em>where to find</em> information (e.g., “see <code>docs/api-spec.md</code> for endpoint details”) so it can load context on demand.</li>
<li><strong>Move specialized knowledge into <a href="https://code.claude.com/docs/en/skills">skills</a>.</strong> Skills load on demand — a <code>/deploy</code> skill or a <code>migration-conventions</code> skill only enters context when relevant, keeping your base <code>CLAUDE.md</code> lean.</li>
</ul>
<p>The bottom line: write <code>CLAUDE.md</code> for the <em>model</em>, not for humans. Keep it concise, universally applicable, and structured. As <a href="https://platform.claude.com/docs/en/build-with-claude/context-windows">Anthropic has noted</a>, larger context windows won’t solve this — “context pollution and information relevance concerns” apply at any window size.</p>
</div>
</div>
<p>See the <a href="https://code.claude.com/docs/en/memory">official <code>CLAUDE.md</code> reference</a> for the full spec, including file resolution order and advanced features.</p>
</section>
<section id="tune-the-permission-dial" class="level3">
<h3 class="anchored" data-anchor-id="tune-the-permission-dial">Tune the permission dial</h3>
<p>Claude Code’s permission system is the main thing that distinguishes it from tools that “just go.” By default, it asks before every file write, shell command, and git operation. This is safe but slow.</p>
<p>The key insight: <strong>permissions aren’t all-or-nothing</strong>. You can configure a spectrum:</p>
<ul>
<li><strong>Start conservative</strong> — approve everything manually while you’re learning what the agent does</li>
<li><strong>Auto-approve low-risk actions</strong> — file reads, grep/search, test execution. These rarely cause harm and the prompts add friction without adding safety.</li>
<li><strong>Manually approve writes and git operations</strong> — this is where real damage can happen (overwriting files, force-pushing, committing secrets)</li>
<li><strong>Use <code>CLAUDE.md</code> safety rules</strong> as a second layer — even if you auto-approve shell commands, the agent will respect instructions like “never force-push”</li>
</ul>
<p>The sweet spot for most developers: auto-approve reads and test runs, manually approve everything else. As you build trust with specific workflows, you can loosen further. See Anthropic’s <a href="https://code.claude.com/docs/en/permissions">permissions reference</a> for the full rule syntax and available tool names.</p>
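<p>To make this concrete, here is one way that sweet spot might look in a project-level <code>.claude/settings.json</code>. This is a sketch, not a recommendation: the <code>allow</code>/<code>ask</code>/<code>deny</code> keys and rule syntax come from Anthropic’s permissions reference, but the specific rules (e.g., <code>pytest</code> as your test runner) are assumptions to adapt to your project.</p>

```json
{
  "permissions": {
    "allow": [
      "Read",
      "Grep",
      "Bash(pytest:*)"
    ],
    "ask": [
      "Edit",
      "Write",
      "Bash(git commit:*)"
    ],
    "deny": [
      "Bash(git push --force:*)",
      "Read(./.env)"
    ]
  }
}
```

<p>Deny rules take precedence over allow rules, so secrets files stay off-limits even though reads are otherwise auto-approved.</p>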
</section>
<section id="use-branches-and-commit-frequently" class="level3">
<h3 class="anchored" data-anchor-id="use-branches-and-commit-frequently">Use branches and commit frequently</h3>
<p>The non-negotiable: <strong>always work on a branch</strong>, never let an agent commit directly to <code>main</code>. Beyond that, there are two common workflows:</p>
<ul>
<li><strong>Auto-commit freely, review at the PR stage.</strong> Let the agent commit (and even push) as it works. You review the full diff when you open the PR, just like you would with a human contributor. This keeps momentum high and works well when you have CI checks and a good test suite gating your merges.</li>
<li><strong>Commit manually after reviewing each change.</strong> Approve each commit yourself so you stay close to every change as it happens. This is safer when you’re learning the tool, working on sensitive code, or don’t yet have strong CI guardrails.</li>
</ul>
<p>Either way, frequent commits help — they give you clean revert points if the agent goes off track. A good <code>CLAUDE.md</code> instruction like <em>“commit after each completed task”</em> keeps things granular regardless of which workflow you prefer.</p>
</section>
<section id="review-everything-at-the-right-level" class="level3">
<h3 class="anchored" data-anchor-id="review-everything-at-the-right-level">Review everything (at the right level)</h3>
<p>Agent-generated code isn’t exempt from review — but <em>when</em> you review is a matter of workflow. Some developers review each diff before committing; others let the agent run and review the full PR diff before merging. Both are valid. What matters is that <em>someone</em> (you, a teammate, or CI) checks the code before it lands on <code>main</code>:</p>
<ul>
<li>Read the diffs</li>
<li>Check for security issues (hardcoded secrets, SQL injection, etc.)</li>
<li>Verify it matches your architectural patterns</li>
<li>Make sure it doesn’t introduce unnecessary complexity</li>
</ul>
</section>
<section id="give-the-agent-a-way-to-verify-its-own-work" class="level3">
<h3 class="anchored" data-anchor-id="give-the-agent-a-way-to-verify-its-own-work">Give the agent a way to verify its own work</h3>
<p>This is the single highest-leverage thing you can do. Claude performs dramatically better when it can check its own output — running tests, comparing screenshots, validating behavior — rather than relying on you as the only feedback loop.</p>
<ul>
<li><strong>Include test cases in your prompt</strong>: <em>“Write a <code>validateEmail</code> function. Test cases: <code>user@example.com</code> → true, <code>invalid</code> → false, <code>user@.com</code> → false. Run the tests after implementing.”</em></li>
<li><strong>Ask it to verify UI changes visually</strong>: <em>“[paste screenshot] Implement this design. Take a screenshot of the result and compare it to the original.”</em></li>
<li><strong>Point to the symptom, not just the fix</strong>: <em>“The build fails with this error: [paste error]. Fix it and verify the build succeeds. Address the root cause, don’t suppress the error.”</em></li>
</ul>
<p>The more you invest in making your verification rock-solid (a good test suite, a linter, a build check), the more autonomously the agent can work.</p>
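<p>As a sketch of the first pattern: handing the agent a small, runnable test file (or asking it to write one first) turns your prompt’s examples into a feedback loop it can run itself. The <code>validate_email</code> implementation below is purely illustrative, not what Claude would necessarily produce.</p>

```python
import re

def validate_email(address: str) -> bool:
    """Illustrative implementation; the agent would write its own."""
    # One '@', non-empty local part, dotted domain with a 2+ letter TLD
    pattern = r"^[^@\s]+@(?:[A-Za-z0-9-]+\.)+[A-Za-z]{2,}$"
    return re.fullmatch(pattern, address) is not None

def test_validate_email():
    # The exact cases from the prompt above
    assert validate_email("user@example.com") is True
    assert validate_email("invalid") is False
    assert validate_email("user@.com") is False
```

<p>With this file in place, “run the tests after implementing” gives the agent a pass/fail signal it can iterate against without you in the loop.</p>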
</section>
<section id="explore-first-then-plan-then-code" class="level3">
<h3 class="anchored" data-anchor-id="explore-first-then-plan-then-code">Explore first, then plan, then code</h3>
<p>For complex tasks, resist the urge to let Claude jump straight to implementation. Use <strong><a href="https://code.claude.com/docs/en/common-workflows#use-plan-mode-for-safe-code-analysis">Plan Mode</a></strong> (toggle with <code>Shift+Tab</code>) to separate exploration from execution:</p>
<ol type="1">
<li><strong>Explore</strong>: In Plan Mode, Claude reads files and answers questions without making changes. <em>“Read <code>src/auth/</code> and understand how we handle sessions and login.”</em></li>
<li><strong>Plan</strong>: Ask Claude to create an implementation plan. <em>“I want to add Google OAuth. What files need to change? Create a plan.”</em></li>
<li><strong>Implement</strong>: Switch back to Normal Mode and let Claude execute the plan, verifying against tests.</li>
<li><strong>Commit</strong>: Ask Claude to commit with a descriptive message.</li>
</ol>
<p>Skip this for small, clear tasks — if you could describe the diff in one sentence, just ask Claude to do it directly. Planning is most useful when you’re uncertain about the approach or the change touches multiple files.</p>
</section>
<section id="manage-context-aggressively" class="level3">
<h3 class="anchored" data-anchor-id="manage-context-aggressively">Manage context aggressively</h3>
<p>Claude’s context window is your most important resource. As it fills up with conversation history, file contents, and command outputs, performance degrades — Claude may “forget” earlier instructions or make more mistakes. (This section is adapted from Anthropic’s <a href="https://code.claude.com/docs/en/best-practices">official best practices</a>.)</p>
<ul>
<li><strong>Use <code>/clear</code> between unrelated tasks</strong> — a clean context dramatically improves quality</li>
<li><strong>Use <code>/compact</code> to summarize long conversations</strong> — run <code>/compact focus on the API changes</code> to keep what matters and discard the rest</li>
<li><strong>Delegate exploration to subagents</strong> — when Claude needs to read dozens of files to investigate something, have it use a <a href="https://code.claude.com/docs/en/sub-agents">subagent</a>. The subagent works in its own context and returns a summary, keeping your main conversation lean.</li>
<li><strong>Run <code>/context</code></strong> to see what’s consuming your context window (MCP servers can be surprisingly expensive)</li>
<li><strong>Course-correct early</strong> — if Claude is going in the wrong direction, interrupt with <code>Esc</code> rather than letting it generate more output that clutters context. After two failed corrections, <code>/clear</code> and start fresh with a better prompt.</li>
</ul>
</section>
<section id="extend-claude-code-with-skills-hooks-and-mcp" class="level3">
<h3 class="anchored" data-anchor-id="extend-claude-code-with-skills-hooks-and-mcp">Extend Claude Code with skills, hooks, and MCP</h3>
<p>Beyond <code>CLAUDE.md</code>, Claude Code has a rich <a href="https://code.claude.com/docs/en/features-overview">extension system</a> for customizing behavior:</p>
<ul>
<li><strong><a href="https://code.claude.com/docs/en/skills">Skills</a></strong> — reusable knowledge and workflows. Create a <code>/deploy</code> skill that runs your deployment checklist, or an API conventions skill that Claude loads when working on your endpoints. Skills load on demand, so they don’t bloat every session like <code>CLAUDE.md</code> does.</li>
<li><strong><a href="https://code.claude.com/docs/en/hooks">Hooks</a></strong> — deterministic scripts that run at specific points in Claude’s workflow. Unlike <code>CLAUDE.md</code> instructions (which are advisory), hooks are guaranteed to fire. Use them for things like running ESLint after every file edit or blocking writes to a <code>migrations/</code> directory.</li>
<li><strong><a href="https://code.claude.com/docs/en/mcp">MCP</a></strong> — connect Claude to external services. Query your database, post to Slack, control a browser, or pull issues from your project tracker — all from within a Claude Code session.</li>
<li><strong><a href="https://code.claude.com/docs/en/sub-agents">Subagents</a></strong> — isolated workers with their own context. Useful for research tasks, code review, or any work where you don’t want the intermediate steps cluttering your main conversation.</li>
</ul>
<p>Start with <code>CLAUDE.md</code> for your core conventions. Add skills when you find yourself repeating the same workflows. Add hooks when you need guaranteed automation. Add MCP when you need external integrations. For a deeper dive, see Anthropic’s <a href="https://code.claude.com/docs/en/features-overview">extension system overview</a>, which covers when to use each mechanism.</p>
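<p>For a flavor of what a hook looks like: the ESLint example above might be wired up in <code>.claude/settings.json</code> roughly like this. The event name and matcher follow the hooks reference, and hooks receive a JSON payload on stdin; the <code>jq</code>/<code>eslint</code> pipeline is an assumption about your toolchain.</p>

```json
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Edit|Write",
        "hooks": [
          {
            "type": "command",
            "command": "jq -r '.tool_input.file_path' | xargs -r npx eslint --fix"
          }
        ]
      }
    ]
  }
}
```

<p>Because this runs after every matching edit, it fires deterministically, with no reliance on the model remembering a <code>CLAUDE.md</code> instruction.</p>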
</section>
</section>
<section id="managing-costs" class="level2">
<h2 class="anchored" data-anchor-id="managing-costs">Managing costs</h2>
<p>Agentic coding tools billed through an API (as Claude Code can be) charge per token, and agentic workflows are token-hungry — the agent reads files, reasons through problems, writes code, runs commands, reads output, and iterates. A single focused task might use 50K–200K tokens; a sprawling, underspecified session can easily burn through 1M+ tokens.</p>
<section id="what-does-this-actually-cost" class="level3">
<h3 class="anchored" data-anchor-id="what-does-this-actually-cost">What does this actually cost?</h3>
<p>There are two ways to pay for Claude Code: <strong>subscription plans</strong> (fixed monthly cost) or <strong>API tokens</strong> (pay-per-use). Most individuals should start with a subscription; API pricing is better for automation and CI/CD pipelines. (Pricing details adapted from <a href="https://www.anthropic.com/pricing">Anthropic’s pricing page</a> and <a href="https://code.claude.com/docs/en/costs">Claude Code cost management docs</a> — verify current prices, as they change frequently.)</p>
<p><strong>Subscription plans</strong> (as of early 2026):</p>
<table class="caption-top table">
<colgroup>
<col style="width: 33%">
<col style="width: 33%">
<col style="width: 33%">
</colgroup>
<thead>
<tr class="header">
<th>Plan</th>
<th>Price</th>
<th>What you get</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>Pro</strong></td>
<td>$20/month</td>
<td>Claude Code access with moderate usage limits</td>
</tr>
<tr class="even">
<td><strong>Max 5x</strong></td>
<td>$100/month</td>
<td>5× the Pro usage limit — sweet spot for most active developers</td>
</tr>
<tr class="odd">
<td><strong>Max 20x</strong></td>
<td>$200/month</td>
<td>20× the Pro usage limit — for heavy agentic work or parallel sessions</td>
</tr>
<tr class="even">
<td><strong>Team (premium seats)</strong></td>
<td>$150/user/month (min 5 seats)</td>
<td>Team management, shared billing, org-level policies</td>
</tr>
</tbody>
</table>
<p>With subscription plans, you never get a surprise bill — you hit rate limits instead. The <code>/cost</code> command shows your token usage in a session, but on a subscription plan this is informational only; it doesn’t affect your bill.</p>
<p><strong>API token pricing</strong> (pay-per-use):</p>
<table class="caption-top table">
<colgroup>
<col style="width: 25%">
<col style="width: 25%">
<col style="width: 25%">
<col style="width: 25%">
</colgroup>
<thead>
<tr class="header">
<th>Model</th>
<th>Input tokens</th>
<th>Output tokens</th>
<th>Best for</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>Haiku 4.5</strong></td>
<td>$1/MTok</td>
<td>$5/MTok</td>
<td>Fast, cheap tasks (linting, simple edits)</td>
</tr>
<tr class="even">
<td><strong>Sonnet 4.6</strong></td>
<td>$3/MTok</td>
<td>$15/MTok</td>
<td>Default for most coding work</td>
</tr>
<tr class="odd">
<td><strong>Opus 4.6</strong></td>
<td>$5/MTok</td>
<td>$25/MTok</td>
<td>Complex reasoning, architecture decisions</td>
</tr>
</tbody>
</table>
<p>Note that output tokens cost <strong>5× as much as</strong> input tokens across all models — and code generation is output-heavy. Also, requests exceeding 200K input tokens are charged at 2× input / 1.5× output rates, which matters for large codebases. <a href="https://platform.claude.com/docs/en/build-with-claude/prompt-caching">Prompt caching</a> can reduce input costs by up to 90% on repeated system prompts, and the <a href="https://platform.claude.com/docs/en/build-with-claude/batch-processing">Batch API</a> offers a 50% discount for async processing.</p>
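<p>A quick back-of-envelope helper makes that output-heaviness concrete. It uses the Sonnet 4.6 rates from the table above; verify current prices before relying on it.</p>

```python
# Rough session-cost estimate at the Sonnet 4.6 rates listed above
# ($3/MTok input, $15/MTok output); prices change, so treat as a sketch.
INPUT_PER_MTOK = 3.00
OUTPUT_PER_MTOK = 15.00

def session_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one session, ignoring caching and long-context surcharges."""
    return (input_tokens / 1e6) * INPUT_PER_MTOK + (output_tokens / 1e6) * OUTPUT_PER_MTOK

# A focused task: ~150K tokens read, ~40K tokens generated
print(round(session_cost(150_000, 40_000), 2))  # prints 1.05
```

<p>Even though this hypothetical session read nearly four times as many tokens as it wrote, the output side accounts for the larger share of the bill.</p>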
<p>Anthropic reports that the average Claude Code user on API pricing spends <a href="https://code.claude.com/docs/en/costs">roughly $6/day</a>, with 90% of users under $12/day. That translates to <strong>$100–200/month</strong> for active development with Sonnet. But averages hide a lot of variance — one developer documented a single session that <a href="https://jeffry.in/expensive-ai/">hit ~$150/hour</a> running multiple parallel agents. One detailed <a href="https://ccusage.com/guide/monthly-reports">usage report</a> showed ~892K output tokens vs ~45K input tokens in a single month on a mix of Opus and Sonnet, costing ~$1,248.</p>
<p><strong>Rule of thumb:</strong> If your monthly API costs would exceed $60–80, Max 5x is cheaper. If they’d exceed $150, Max 20x is the clear winner.</p>
<p>For comparison, GitHub Copilot runs $10–$39/month depending on tier, with usage-based pricing for premium models beyond included allowances.</p>
</section>
<section id="watch-out-for-runaway-costs" class="level3">
<h3 class="anchored" data-anchor-id="watch-out-for-runaway-costs">Watch out for runaway costs</h3>
<p>Agentic workflows can burn through tokens fast, especially when things go wrong:</p>
<ul>
<li><strong>Agentic loops</strong>: A vague prompt can send the agent into cycles of trying approaches, failing, reading more files, and trying again — each loop consuming thousands of tokens.</li>
<li><strong>Context accumulation</strong>: As your conversation grows, every new message includes the full context window — so the 50th message in a session costs far more than the 1st. Use <code>/clear</code> between unrelated tasks and <code>/compact</code> to summarize long conversations.</li>
<li><strong>Parallel sessions</strong>: Running multiple Claude Code sessions simultaneously (especially on the web or with agent teams) multiplies your token consumption proportionally. Five parallel sessions = 5× the cost.</li>
<li><strong>Extended thinking</strong>: Thinking tokens are billed as output tokens. A complex Opus session with deep reasoning can generate thousands of thinking tokens per turn.</li>
</ul>
<p><strong>How to protect yourself:</strong></p>
<ul>
<li><strong>On a subscription</strong>: You can’t overspend, but you can hit rate limits mid-task. Monitor with <code>/cost</code> and plan your usage around your limit.</li>
<li><strong>On API pricing</strong>: Set <a href="https://console.anthropic.com/">spending alerts and hard limits</a> on your Anthropic account. Use separate API keys for different projects so you can track spending.</li>
<li><strong>In both cases</strong>: Use <code>/cost</code> to monitor token usage mid-session. If a session is getting expensive, <code>/clear</code> and start fresh with a more specific prompt. Break large tasks into focused sessions.</li>
</ul>
</section>
<section id="for-uw-madison-researchers-institutional-cloud-benefits" class="level3">
<h3 class="anchored" data-anchor-id="for-uw-madison-researchers-institutional-cloud-benefits">For UW-Madison researchers: institutional cloud benefits</h3>
<p>If you’re at UW-Madison (or a similar research institution), routing AI API costs through a UW-provisioned cloud account offers two main benefits: <strong>institutional billing</strong> (charges go to your cloud project, not your personal card — important for grants and shared budgets) and <strong>lower overhead on grants</strong> (UW’s <a href="https://rsp.wisc.edu/proposalprep/cloudComputeInfo.cfm">Cloud Computing Pilot</a> cuts F&amp;A from 55.5% to 26%, saving ~$2,950 per $10,000 in cloud spending). NIH-funded researchers may get additional discounts through <a href="https://kb.wisc.edu/109813">STRIDES</a>. Note that these savings are on the <em>overhead and billing side</em> — Anthropic’s per-token pricing is the same whether you route through Vertex AI, Bedrock, or the direct API. Also note that institutional cloud agreements cover the cloud provider’s services — they do <strong>not</strong> extend to Anthropic’s data handling (see Data Privacy below).</p>
<p>Contact your department’s IT staff or <a href="https://it.wisc.edu/about/division-of-information-technology/research-cyberinfrastructure/">Research Computing</a> to ask about available cloud credits and whether AI API costs are eligible.</p>
</section>
<section id="strategies-to-keep-costs-down" class="level3">
<h3 class="anchored" data-anchor-id="strategies-to-keep-costs-down">Strategies to keep costs down</h3>
<ul>
<li><strong>Be specific in your prompts</strong> — vague requests lead to more agentic loops, which means more tokens. “Add a login form” costs more than “Add a React component at <code>src/components/LoginForm.tsx</code> that posts email/password to <code>/api/auth/login</code>”.</li>
<li><strong>Use <code>/clear</code> aggressively</strong> — reset context between unrelated tasks. A clean context means fewer input tokens per message.</li>
<li><strong>Use <code>/compact</code></strong> — summarize long conversations to free up context space without losing key information.</li>
<li><strong>Use the right model for the task</strong> — Haiku or Sonnet for straightforward tasks, reserve Opus for complex reasoning.</li>
<li><strong>Break large tasks into smaller sessions</strong> — each focused session is cheaper than one sprawling conversation that loses context and re-reads files.</li>
<li><strong>Use <code>CLAUDE.md</code></strong> to provide project context upfront — this reduces the amount of exploration the agent needs to do.</li>
<li><strong>Delegate exploration to subagents</strong> — they run in isolated context and return summaries, keeping your main session lean.</li>
<li><strong>Monitor session costs</strong> — run <code>/cost</code> periodically to see where you stand.</li>
</ul>
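<p>To make the <code>CLAUDE.md</code> point concrete, here is a minimal sketch; the stack, commands, and directory names are invented for illustration:</p>

```markdown
# Project context for Claude Code (illustrative example)

## Stack
- Python 3.11, FastAPI, pytest

## Conventions
- Run `pytest -q` before declaring a task done.
- Never edit files under `migrations/`; generate new ones instead.

## Layout
- `src/api/` — route handlers
- `src/models/` — database models
```

<p>A few lines like these spare the agent from re-discovering your project structure every session, which saves input tokens on every request.</p>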
<p>For more detail, see Anthropic’s <a href="https://code.claude.com/docs/en/costs">cost management guide</a>.</p>
</section>
<section id="energy-and-environmental-considerations" class="level3">
<h3 class="anchored" data-anchor-id="energy-and-environmental-considerations">Energy and environmental considerations</h3>
<p>Agentic coding is more compute-intensive than a simple chat query or web search. A single LLM text query now uses <a href="https://epoch.ai/gradient-updates/how-much-energy-does-chatgpt-use">roughly 0.3 Wh</a> — about the same as a Google search — thanks to hardware improvements and model optimization. But an agentic coding session chains <strong>hundreds or thousands</strong> of such calls together as the agent reads files, reasons, writes code, runs commands, and iterates.</p>
<p><strong>How much energy does agentic coding actually use?</strong></p>
<p>A <a href="https://www.simonpcouch.com/blog/2026-01-20-cc-impact/">detailed analysis by Simon P. Couch</a> estimated Claude Code’s energy footprint at roughly <strong>41 Wh per session</strong> — over 130× a single chat query. A heavy day of usage (multiple sessions, parallel agents) can reach <strong>~1,300 Wh/day</strong>. To put that in perspective:</p>
<table class="caption-top table">
<thead>
<tr class="header">
<th>Activity</th>
<th>Energy</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Google search or single AI chat query</td>
<td>~0.3 Wh</td>
</tr>
<tr class="even">
<td>LED lightbulb (1 hour)</td>
<td>~10 Wh</td>
</tr>
<tr class="odd">
<td>One Claude Code session</td>
<td>~41 Wh</td>
</tr>
<tr class="even">
<td>Streaming 1 hour of video (incl.&nbsp;device)</td>
<td>~36–80 Wh</td>
</tr>
<tr class="odd">
<td>Heavy Claude Code daily use</td>
<td>~1,300 Wh</td>
</tr>
<tr class="even">
<td>Running a dishwasher once</td>
<td>~1,300 Wh</td>
</tr>
<tr class="odd">
<td>Daily refrigerator use</td>
<td>~1,200–1,500 Wh</td>
</tr>
</tbody>
</table>
<p>So a heavy day of agentic coding is roughly equivalent to running your dishwasher — modest at the individual level, but significant in aggregate.</p>
<p><strong>The bigger picture:</strong></p>
<ul>
<li>The <a href="https://www.iea.org/reports/energy-and-ai/energy-demand-from-ai">IEA projects</a> that global data center electricity consumption will roughly double from ~415 TWh in 2024 to over 945 TWh by 2030, driven largely by AI workloads. In the US, data centers are projected to consume <a href="https://www.carbonbrief.org/ai-five-charts-that-put-data-centre-energy-use-and-emissions-into-context/">more electricity than all energy-intensive manufacturing combined</a> (aluminum, steel, cement, chemicals) by 2030.</li>
<li>An estimated <a href="https://aimultiple.com/ai-energy-consumption">60–90% of AI computing energy</a> goes to inference (running models), not training. Training grabs headlines, but inference — every agentic session, every chat query — is where the ongoing energy cost lives.</li>
<li>Cloud providers are investing in renewable energy, but coverage varies. Anthropic has <a href="https://www.anthropic.com/news/investing-in-energy-to-secure-america-s-ai-future">pledged to offset energy costs</a> and invested in grid optimization research, though the company <a href="https://ditchcarbon.com/organizations/anthropic">lacks formal carbon reduction targets</a> and a significant portion of new capacity is <a href="https://www.fastcompany.com/91336991/openai-anthropic-deepseek-ai-models-environmental-impact">natural gas powered</a>.</li>
<li>On the efficiency side, a <a href="https://www.fastcompany.com/91336991/openai-anthropic-deepseek-ai-models-environmental-impact">University of Rhode Island study</a> found Claude Sonnet to be among the most energy-efficient frontier models, and energy per token has <a href="https://muxup.com/2026q1/per-query-energy-consumption-of-llms">improved ~120× from GPT-3 to current models</a> due to hardware and architecture advances.</li>
</ul>
<p><strong>What this means for you:</strong></p>
<p>This doesn’t mean you shouldn’t use agentic tools — the productivity gains can be substantial, and the energy per unit of <em>useful output</em> may be better than the alternative (a human developer running builds, searching docs, and context-switching for hours). But it’s a reason to be intentional: don’t let an agent spin in wasteful loops when a well-scoped prompt would get the job done in one pass. <strong>Efficient prompting is both cheaper and greener.</strong></p>
</section>
</section>
<section id="data-privacy" class="level2">
<h2 class="anchored" data-anchor-id="data-privacy">Data privacy: who sees your code?</h2>
<p>When you use Claude Code, your code and prompts are sent to Anthropic’s servers for inference. That naturally raises privacy questions — here’s what actually happens to that data, depending on how you access Claude.</p>
<section id="is-your-data-used-for-model-training" class="level3">
<h3 class="anchored" data-anchor-id="is-your-data-used-for-model-training">Is your data used for model training?</h3>
<table class="caption-top table">
<thead>
<tr class="header">
<th>Access method</th>
<th>Used for training?</th>
<th>Default retention</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>Claude API, Team, Enterprise</strong> (commercial terms)</td>
<td><strong>No</strong> — prohibited unless you explicitly opt in (e.g., <a href="https://support.claude.com/en/articles/11174108-about-the-development-partner-program">Development Partner Program</a>)</td>
<td>30 days</td>
</tr>
<tr class="even">
<td><strong>Free / Pro / Max</strong> (consumer plans)</td>
<td><strong>Your choice</strong> — controlled via <a href="https://claude.ai/settings/data-privacy-controls">Privacy Settings</a></td>
<td>5 years (training on) / 30 days (training off)</td>
</tr>
</tbody>
</table>
<div class="callout callout-style-default callout-warning callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Warning</span>If you’re on a Free, Pro, or Max plan: check your training settings
</div>
</div>
<div class="callout-body-container callout-body">
<p>Anthropic gives you the choice to allow training on your data — check your setting at <a href="https://claude.ai/settings/data-privacy-controls">claude.ai/settings/data-privacy-controls</a>. This applies to Claude Code sessions on consumer plans too.</p>
</div>
</div>
<p><strong>Important nuances for consumer plans:</strong></p>
<ul>
<li><strong>Safety exception:</strong> Even if you disable training, conversations flagged for <a href="https://www.anthropic.com/legal/aup">safety review</a> may still be used to improve Anthropic’s ability to detect and enforce their Usage Policy (e.g., training safeguard models).</li>
<li><strong>What’s included:</strong> When training is enabled, Anthropic may use the entire conversation — prompts, outputs, custom styles, and conversation preferences.</li>
<li><strong>What’s excluded:</strong> Raw content from connectors (Google Drive, MCP servers) is <strong>not</strong> included in training data, unless you directly copy that content into your conversation.</li>
<li><strong>Feedback (thumbs up/down):</strong> Submitting feedback stores the full related conversation for up to 5 years, de-linked from your user ID. This data may be used for training regardless of your training setting.</li>
</ul>
<p><strong>For researchers with sensitive or restricted data:</strong> Routing through a cloud provider (Vertex AI, Bedrock) ensures your data is not used for training and limits retention to 30 days — but <strong>your prompts still reach Anthropic’s infrastructure</strong> for inference. UW-Madison has agreements with Google, AWS, and Microsoft for their cloud services, but does <strong>not</strong> yet have a direct data-use agreement with Anthropic. This means cloud routing alone does not provide UW-sanctioned data protections for restricted data (HIPAA/PHI, FERPA, CUI, export-controlled, or data under a DUA that prohibits third-party processing). <strong>Avoid using Claude Code with restricted data until a formal UW-Anthropic agreement is in place.</strong></p>
<p>For general, non-sensitive research code, cloud-routed Claude Code is fine to use today. UW is actively exploring institutional Anthropic licenses and data agreements. Enterprise customers can negotiate <a href="https://privacy.claude.com/en/articles/8956058-i-have-a-zero-data-retention-agreement-with-anthropic-what-products-does-it-apply-to">zero-data retention (ZDR)</a> agreements where Anthropic stores nothing after the API response. See our <a href="../../Learn/Guides/claude-code-cloud-setup.html">Cloud Setup Guide</a> for how UW-Madison researchers can use institutional cloud accounts (GCP or AWS) and for more details on <a href="../../Learn/Guides/claude-code-cloud-setup.html#use-caution-with-restricted-or-sensitive-data">data sensitivity considerations</a>.</p>
</section>
<section id="can-anthropic-employees-see-your-code" class="level3">
<h3 class="anchored" data-anchor-id="can-anthropic-employees-see-your-code">Can Anthropic employees see your code?</h3>
<p>Not by default. Employee access to conversation data requires one of:</p>
<ul>
<li><strong>You submit feedback</strong> (thumbs up/down, <code>/bug</code> command) — the full related conversation becomes reviewable, stored for up to 5 years (de-linked from your user ID for thumbs up/down)</li>
<li><strong>A trust &amp; safety investigation</strong> — if Anthropic’s automated systems flag a policy violation (this data may also be used for training safeguard models)</li>
<li><strong>Explicit consent</strong> — you voluntarily share data with Anthropic</li>
</ul>
<p>Under commercial terms (API, Vertex AI, Bedrock), access is further restricted by contractual obligations.</p>
</section>
<section id="what-about-the-web-version" class="level3">
<h3 class="anchored" data-anchor-id="what-about-the-web-version">What about the web version?</h3>
<p>When you use Claude Code on the web (claude.ai/code), your GitHub repo is cloned into an ephemeral VM. The VM is destroyed when the task completes — there’s no persistent repo storage between sessions. The same retention policies above apply to any code Claude reads during the session.</p>
</section>
<section id="telemetry-and-error-reporting" class="level3">
<h3 class="anchored" data-anchor-id="telemetry-and-error-reporting">Telemetry and error reporting</h3>
<p>Claude Code sends operational telemetry (latency, reliability metrics — <strong>no code or file paths</strong>) to Statsig, and error reports to Sentry. These are enabled by default on the direct Claude API but <strong>disabled by default on Vertex AI, Bedrock, and Foundry</strong>.</p>
<p>To opt out individually: <code>DISABLE_TELEMETRY=1</code>, <code>DISABLE_ERROR_REPORTING=1</code>, <code>DISABLE_BUG_COMMAND=1</code>, or <code>CLAUDE_CODE_DISABLE_FEEDBACK_SURVEY=1</code>. To disable all non-essential traffic at once: <code>CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1</code>. These can be set in your <a href="https://code.claude.com/docs/en/settings"><code>settings.json</code></a>.</p>
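<p>For example, you could pin the blanket opt-out in a fragment like the following. This is a sketch: the <code>env</code> key is Claude Code’s settings mechanism for applying environment variables to sessions, but verify the exact schema against the current settings docs:</p>

```json
{
  "env": {
    "CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC": "1"
  }
}
```

<p>Setting it in <code>settings.json</code> rather than your shell profile keeps the opt-out in effect regardless of how Claude Code is launched.</p>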
<p>For a detailed breakdown by provider, see the <a href="../../Learn/Guides/claude-code-cloud-setup.html#data-usage">Cloud Setup Guide — Data Usage &amp; Privacy</a>.</p>
</section>
<section id="further-reading-on-data-privacy" class="level3">
<h3 class="anchored" data-anchor-id="further-reading-on-data-privacy">Further reading on data privacy</h3>
<ul>
<li><a href="https://privacy.claude.com/en/articles/10023580-is-my-data-used-for-model-training">Is my data used for model training?</a> — Anthropic Privacy Center</li>
<li><a href="https://privacy.claude.com/en/articles/10023548-how-long-do-you-store-my-data">How long do you store my data?</a> — retention periods by account type</li>
<li><a href="https://code.claude.com/docs/en/data-usage">Data usage — Claude Code docs</a> — what Claude Code specifically transmits and how cloud sessions handle your repo</li>
<li><a href="https://code.claude.com/docs/en/security">Security — Claude Code docs</a> — prompt injection safeguards, data retention, and web session isolation</li>
<li><a href="https://privacy.claude.com/en/articles/12109829-how-do-i-change-my-model-improvement-privacy-settings">How do I change my model improvement privacy settings?</a> — step-by-step opt-out instructions</li>
<li><a href="https://privacy.claude.com/en/articles/10458704-how-does-anthropic-protect-the-personal-data-of-claude-users">How does Anthropic protect personal data?</a> — security practices and encryption</li>
</ul>
</section>
</section>
<section id="security-fundamentals" class="level2">
<h2 class="anchored" data-anchor-id="security-fundamentals">Security fundamentals</h2>
<p>When you launch Claude Code from the CLI, it runs with your user’s full filesystem permissions. It can read, modify, or delete files anywhere your account can reach — not just your project directory. A poorly worded prompt, an agentic loop, or a prompt injection attack could cause changes you didn’t intend. Here’s how to limit the blast radius, from most important to least.</p>
<section id="use-permissions-and-deny-rules" class="level3">
<h3 class="anchored" data-anchor-id="use-permissions-and-deny-rules">Use permissions and deny rules</h3>
<p>Claude Code has a <a href="https://code.claude.com/docs/en/permissions">built-in permissions system</a> that controls what it can do. In the default mode, it asks for approval before file writes, shell commands, and git operations. You can customize this with rules in <code>settings.json</code>:</p>
<ul>
<li><strong><code>deny</code></strong> — hard block. Claude can’t use the tool, period. Deny rules always win, even if you accidentally click “always allow” on a prompt.</li>
<li><strong><code>allow</code></strong> — auto-approve. Skips the approval prompt for things you trust (e.g., <code>git add</code>, <code>pytest</code>).</li>
</ul>
<p><strong>Deny rules are your most important security layer.</strong> They protect sensitive paths — SSH keys, cloud credentials, <code>.env</code> files — regardless of what the agent tries to do. The approval prompt is your first line of defense; deny rules are the backup that can’t be bypassed.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode json code-with-copy"><code class="sourceCode json">{
  "permissions": {
    "deny": [
      "Read(//home/youruser/.ssh/**)",
      "Edit(//home/youruser/.ssh/**)",
      "Read(//home/youruser/.aws/**)",
      "Edit(//home/youruser/.aws/**)",
      "Read(./.env)",
      "Edit(./.env)",
      "Bash(rm -rf *)",
      "Bash(curl:*)",
      "Bash(wget:*)",
      "Bash(cat:*)"
    ]
  }
}</code></pre></div></div>
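<p>The flip side is an <code>allow</code> list for commands you run constantly. A hypothetical fragment (adjust to the tools you actually trust):</p>

```json
{
  "permissions": {
    "allow": [
      "Bash(git add:*)",
      "Bash(git diff:*)",
      "Bash(pytest:*)"
    ]
  }
}
```

<p>Auto-approving read-only or low-risk commands cuts down on prompt fatigue, which in turn makes you less likely to reflexively approve something you shouldn’t.</p>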
<p>See the <a href="../../Learn/Guides/claude-code-cloud-setup.html#understand-risk">Cloud Setup Guide’s security section</a> for a full walkthrough with platform-specific examples, or the <a href="https://code.claude.com/docs/en/permissions">official permissions docs</a> for the complete rule syntax.</p>
</section>
<section id="enable-claude-codes-built-in-sandbox" class="level3">
<h3 class="anchored" data-anchor-id="enable-claude-codes-built-in-sandbox">Enable Claude Code’s built-in sandbox</h3>
<p>Claude Code’s <a href="https://code.claude.com/docs/en/sandboxing">built-in sandbox</a> uses OS-level isolation (Linux namespaces / macOS Seatbelt) to restrict what shell commands can do — limiting filesystem writes to your project directory and blocking unauthorized network requests. This is separate from running inside a container (covered below). It’s lightweight, adds negligible overhead, and <a href="https://www.anthropic.com/engineering/claude-code-sandboxing">Anthropic’s internal testing</a> found it reduced permission prompts by 84% while <em>increasing</em> security. Use it alongside deny rules for the strongest protection — Anthropic calls this <a href="https://code.claude.com/docs/en/sandboxing">“defense in depth”</a>.</p>
</section>
<section id="scope-your-credentials" class="level3">
<h3 class="anchored" data-anchor-id="scope-your-credentials">Scope your credentials</h3>
<p>Even with deny rules and sandboxing, it’s good practice to limit what credentials the agent has access to in the first place.</p>
<p><strong>Use minimal-scope tokens.</strong> Create fine-grained GitHub tokens scoped to only the repos and permissions the agent needs. If it only pushes to one repo, don’t give it access to your entire account. Use a bot account for agent-driven git operations, and generate dedicated deploy keys rather than reusing your personal SSH keys.</p>
<p><strong>Set spending limits</strong> on API keys and use separate keys from your personal or production ones.</p>
<p><strong>Add secrets to <code>.gitignore</code></strong> — <code>.env</code>, <code>credentials.json</code>, <code>*.pem</code>, <code>*.key</code>, <code>.netrc</code> — before the agent ever runs. Once a secret is committed, it’s in the history. (But note: <code>.gitignore</code> prevents <em>committing</em> secrets, not <em>reading</em> them. Deny rules are what actually block the agent from accessing sensitive files.)</p>
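<p>A starting <code>.gitignore</code> block covering the patterns above:</p>

```
# Secrets — keep these out of version control
.env
credentials.json
*.pem
*.key
.netrc
```

<p>Add these lines before the agent’s first run; retroactively scrubbing a committed secret from git history is far harder than never committing it.</p>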
</section>
<section id="consider-containers-for-cicd-and-headless-environments" class="level3">
<h3 class="anchored" data-anchor-id="consider-containers-for-cicd-and-headless-environments">Consider containers for CI/CD and headless environments</h3>
<p><strong>For interactive development, the built-in sandbox is the right choice</strong> — it’s what Anthropic recommends, it’s lightweight, and combined with deny rules it provides strong isolation without any setup overhead. You don’t need Docker for local coding sessions.</p>
<p>Containers solve a different problem: <strong>unattended, non-interactive execution</strong> where there’s no human to approve permission prompts. In CI/CD pipelines, GitHub Actions, and headless automation, running Claude Code inside a container lets you use <code>--dangerously-skip-permissions</code> safely — the container itself is the isolation boundary, so there’s nothing outside it to damage. This is also the pattern Anthropic’s own <a href="https://code.claude.com/docs/en/github-actions">GitHub Actions</a> and <a href="https://code.claude.com/docs/en/gitlab-ci-cd">GitLab CI/CD</a> integrations use.</p>
<p>Options include a plain <a href="https://docs.docker.com/ai/sandboxes/agents/claude-code/">Docker container</a> with your project mounted as a volume, <a href="https://www.docker.com/blog/docker-sandboxes-run-claude-code-and-other-coding-agents-unsupervised-but-safely/">Docker sandboxes</a> (microVM-based isolation), or cloud sandbox platforms like <a href="https://e2b.dev/">E2B</a>. For CI pipelines, ephemeral containers that are destroyed after each run are the safest option — nothing persists between runs.</p>
<p><strong>Don’t stack them.</strong> The built-in sandbox and Docker containers are alternative isolation strategies. Running bubblewrap inside Docker introduces <a href="https://code.claude.com/docs/en/sandboxing">nested sandbox complexity</a> without meaningful security benefit. Pick one: sandbox for interactive work, containers for headless automation.</p>
</section>
<section id="watch-for-prompt-injection-and-runaway-agents" class="level3">
<h3 class="anchored" data-anchor-id="watch-for-prompt-injection-and-runaway-agents">Watch for prompt injection and runaway agents</h3>
<p><strong>Prompt injection</strong> is when an agent reads a file or message that contains hidden instructions designed to hijack its behavior. A malicious <code>README.md</code>, issue body, or <a href="https://www.promptarmor.com/resources/claude-cowork-exfiltrates-files"><code>.docx</code> attachment</a> could trick the agent into exfiltrating files or running harmful commands. Be especially cautious when pointing an agent at untrusted repositories or external content. Deny rules and sandboxing are your main defenses here — they limit what the agent can do even if it’s been tricked.</p>
<p><strong>Runaway agents</strong> burn tokens and make unwanted changes when they get stuck in loops. Commit your work frequently so you can recover from mistakes, set spending limits on your API keys, and don’t hesitate to interrupt (<code>Ctrl+C</code>) and redirect. Set up git hooks or CI checks as safety nets — for example, preventing force-pushes to <code>main</code>.</p>
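<p>As one concrete safety net, a <code>pre-push</code> hook can refuse direct pushes to <code>main</code>. This is a sketch; the protected branch name and the policy are yours to adjust:</p>

```shell
# Sketch: write a pre-push hook that blocks direct pushes to main.
# Install by saving the script as .git/hooks/pre-push in your repo.
cat > pre-push <<'EOF'
#!/bin/sh
# Git feeds this hook one line per ref being pushed:
#   <local ref> <local sha> <remote ref> <remote sha>
while read -r local_ref local_sha remote_ref remote_sha; do
  if [ "$remote_ref" = "refs/heads/main" ]; then
    echo "pre-push: direct pushes to main are blocked." >&2
    exit 1
  fi
done
exit 0
EOF
chmod +x pre-push
```

<p>Client-side hooks can be bypassed with <code>--no-verify</code>, so treat this as a guardrail against accidental agent pushes, not a substitute for server-side branch protection.</p>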
<p><strong>Never</strong> give an agent unsupervised access to production systems, databases, or deployment pipelines.</p>
</section>
</section>
<section id="platform-and-deployment-notes" class="level2">
<h2 class="anchored" data-anchor-id="platform-and-deployment-notes">Platform and deployment notes</h2>
<section id="running-claude-code-remotely" class="level3">
<h3 class="anchored" data-anchor-id="running-claude-code-remotely">Running Claude Code remotely</h3>
<p>You don’t have to run Claude Code on your local machine. Running it over <strong>SSH on a cloud VM or remote server</strong> keeps your local system untouched and gives you access to more powerful hardware. For CI/CD integration — running Claude Code in GitHub Actions, GitLab CI, or similar systems — see the container discussion in Security fundamentals above, plus the official docs for <a href="https://code.claude.com/docs/en/github-actions">GitHub Actions</a> and <a href="https://code.claude.com/docs/en/gitlab-ci-cd">GitLab CI/CD</a>.</p>
</section>
<section id="a-note-for-gitlab-users" class="level3">
<h3 class="anchored" data-anchor-id="a-note-for-gitlab-users">A note for GitLab users</h3>
<p>Many teams — including many at UW-Madison — use GitLab rather than GitHub. Claude Code works with GitLab, but the integration is <strong>less mature</strong> than the GitHub experience.</p>
<p><strong>What works well:</strong></p>
<ul>
<li><strong>Claude Code CLI with GitLab repos</strong> — the core experience (reading code, editing files, running commands) works identically regardless of your git host. Claude Code operates on your local checkout, so the remote platform doesn’t matter for day-to-day coding.</li>
<li><strong>GitLab CI/CD integration</strong> — Anthropic provides <a href="https://code.claude.com/docs/en/gitlab-ci-cd">official documentation for running Claude Code in GitLab CI/CD pipelines</a>, including merge request review and test scaffolding.</li>
<li><strong>Git operations</strong> — push, pull, branching, and committing all work normally since these are standard git operations.</li>
</ul>
<p><strong>What’s different or limited compared to GitHub:</strong></p>
<ul>
<li><strong>No native GitLab integration in Claude Code’s Slack bot</strong> — the Slack integration currently only supports GitHub repos. GitLab support is <a href="https://github.com/anthropics/claude-code/issues/21527">an open feature request</a>.</li>
<li><strong>No <code>@claude</code> mention in GitLab issues/MRs</strong> — GitHub Copilot’s coding agent lets you assign issues to Copilot or mention it in PRs. There’s no equivalent native integration for GitLab yet, though <a href="https://gitlab.com/gitlab-org/gitlab/-/issues/557820">GitLab is working on it</a>.</li>
<li><strong>Community-built CI/CD tooling</strong> — while official docs exist, you may find yourself using <a href="https://github.com/RealMikeChong/claude-code-for-gitlab">community solutions</a> to replicate the smoother GitHub Actions experience.</li>
<li><strong>Self-hosted GitLab</strong> — if your institution runs a self-hosted GitLab instance, be aware that Claude Code sends code context to Anthropic’s API for processing. This may raise compliance concerns depending on your institution’s data policies.</li>
</ul>
<p><strong>Practical advice:</strong> The CLI workflow is essentially identical — focus your setup effort on CI/CD integration. For MR review automation, use Claude Code in your <code>.gitlab-ci.yml</code> with the <code>claude -p</code> (prompt) flag for non-interactive pipeline usage. If your institution has data sensitivity requirements, check with your IT governance team before sending code to external APIs — this applies to <strong>all</strong> cloud-based AI coding tools, not just Claude.</p>
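<p>A minimal job sketch for that pattern follows. The image, install step, and prompt are illustrative assumptions to adapt, and <code>ANTHROPIC_API_KEY</code> should be supplied as a masked CI/CD variable rather than written into the file:</p>

```yaml
# Illustrative .gitlab-ci.yml job — adapt image, rules, and prompt to your project.
claude-mr-review:
  image: node:20
  rules:
    - if: '$CI_PIPELINE_SOURCE == "merge_request_event"'
  script:
    - npm install -g @anthropic-ai/claude-code
    # -p runs a single non-interactive prompt, prints the result, and exits
    - claude -p "Review the changes in this merge request and summarize any bugs or risky patterns."
```

<p>Because CI runners are ephemeral, each run starts from a clean container, which also serves as the isolation boundary discussed in Security fundamentals above.</p>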
</section>
</section>
<section id="summary" class="level2">
<h2 class="anchored" data-anchor-id="summary">Summary</h2>
<p>Agentic coding tools are genuinely powerful — they can dramatically accelerate feature development, help you explore unfamiliar codebases, and automate tedious multi-step tasks. But they require a different mindset than traditional code assistants:</p>
<ol type="1">
<li><strong>Scope your requests tightly</strong> — features, not projects</li>
<li><strong>Use <code>CLAUDE.md</code></strong> to encode guardrails and project context</li>
<li><strong>Tune permissions deliberately</strong> — start conservative, loosen as you build trust</li>
<li><strong>Set deny rules and enable the sandbox</strong> — your two strongest security layers</li>
<li><strong>Scope your credentials</strong> — fine-grained tokens, dedicated keys, <code>.gitignore</code> for secrets</li>
<li><strong>Monitor costs</strong> — set limits, be specific, use the right model for the task</li>
<li><strong>Commit frequently</strong> — keep escape hatches available</li>
<li><strong>Review everything</strong> — you’re the engineer; the agent is a very fast intern</li>
</ol>
<p>The technology is moving fast, and best practices will continue to evolve. The core principle stays the same: <strong>give agents the minimum access they need, provide maximum clarity in your instructions, and always keep a human in the loop for decisions that matter</strong>.</p>
</section>
<section id="further-reading-and-perspectives" class="level2">
<h2 class="anchored" data-anchor-id="further-reading-and-perspectives">Further reading and perspectives</h2>
<p>Agentic coding is evolving fast. Here are some of the best resources for staying current:</p>
<section id="official-documentation-and-guides" class="level3">
<h3 class="anchored" data-anchor-id="official-documentation-and-guides">Official documentation and guides</h3>
<ul>
<li><a href="https://code.claude.com/docs/en/best-practices">Best Practices for Claude Code</a> — Anthropic’s official guide, covering context management, prompt patterns, CLAUDE.md, and scaling across parallel sessions</li>
<li><a href="https://code.claude.com/docs/en/how-claude-code-works">How Claude Code Works</a> — the agentic loop architecture, built-in tools, and how Claude interacts with your project</li>
<li><a href="https://code.claude.com/docs/en/features-overview">Extend Claude Code</a> — when to use CLAUDE.md vs skills vs subagents vs hooks vs MCP</li>
<li><a href="https://code.claude.com/docs/en/common-workflows">Common Workflows</a> — step-by-step guides for debugging, refactoring, testing, creating PRs, and more</li>
<li><a href="https://code.claude.com/docs/en/claude-code-on-the-web">Claude Code on the Web</a> — running Claude Code tasks asynchronously on cloud infrastructure</li>
<li><a href="https://code.claude.com/docs/en/desktop">Claude Code Desktop</a> — the desktop GUI with visual diffs, parallel sessions, and managed updates</li>
<li><a href="https://code.claude.com/docs/en/sandboxing">Claude Code Sandboxing Documentation</a> — reference for configuring Claude Code’s built-in sandboxing, including OS-level primitives (Linux bubblewrap, macOS Seatbelt) and deny rules for sensitive files</li>
<li><a href="https://www.anthropic.com/engineering/claude-code-sandboxing">Making Claude Code More Secure and Autonomous</a> — Anthropic Engineering’s deep-dive into their dual-layer sandboxing architecture (filesystem + network isolation)</li>
<li><a href="https://www.anthropic.com/research/prompt-injection-defenses">Mitigating the Risk of Prompt Injections</a> — Anthropic Research on defending AI agents against prompt injection, including their use of reinforcement learning to build injection robustness into Claude</li>
<li><a href="https://github.blog/news-insights/product-news/github-copilot-meet-the-new-coding-agent/">GitHub Copilot: Meet the New Coding Agent</a> — GitHub’s announcement of their enterprise-ready coding agent that spins up secure environments via GitHub Actions</li>
<li><a href="https://github.blog/ai-and-ml/github-copilot/github-copilot-coding-agent-101-getting-started-with-agentic-workflows-on-github/">GitHub Copilot Coding Agent 101</a> — GitHub’s getting-started guide for agentic workflows, including environment setup and PR creation</li>
<li><a href="https://github.blog/ai-and-ml/github-copilot/whats-new-with-github-copilot-coding-agent/">What’s New with GitHub Copilot Coding Agent</a> — latest updates including self-review, security scanning, and custom agents</li>
</ul>
</section>
<section id="community-voices-and-analysis" class="level3">
<h3 class="anchored" data-anchor-id="community-voices-and-analysis">Community voices and analysis</h3>
<ul>
<li><a href="https://simonwillison.net/2026/Feb/23/agentic-engineering-patterns/">Agentic Engineering Patterns</a> — Simon Willison’s guide to coding practices for getting the best results from agents like Claude Code and Codex. He frames this as “expertise amplification, not expertise replacement”</li>
<li><a href="https://www.oneusefulthing.org/p/a-guide-to-which-ai-to-use-in-the">A Guide to Which AI to Use in the Agentic Era</a> — Ethan Mollick’s updated guide arguing that “using AI” now means agents with tools, not chatbots, and that users must think in terms of Models, Apps, and Harnesses</li>
<li><a href="https://www.builder.io/blog/claude-code">How I Use Claude Code (+ My Best Tips)</a> — practical walkthrough from Builder.io on real-world Claude Code workflows</li>
<li><a href="https://www.teamday.ai/blog/complete-guide-agentic-coding-2026">The Complete Guide to Agentic Coding in 2026</a> — broad overview comparing tools, workflows, and team strategies</li>
<li><a href="https://cacm.acm.org/opinion/redefining-the-software-engineering-profession-for-ai/">Redefining the Software Engineering Profession for AI</a> — ACM opinion piece on how AI amplifies senior talent but risks leaving junior developers without the chance to develop architectural intuition</li>
</ul>
</section>
<section id="tool-comparisons" class="level3">
<h3 class="anchored" data-anchor-id="tool-comparisons">Tool comparisons</h3>
<ul>
<li><a href="https://dev.to/pockit_tools/cursor-vs-windsurf-vs-claude-code-in-2026-the-honest-comparison-after-using-all-three-3gof">Cursor vs Windsurf vs Claude Code in 2026</a> — hands-on comparison arguing Cursor has the best IDE UX, Claude Code leads on deep reasoning and terminal-first workflows, and Windsurf offers the best value</li>
</ul>
</section>
<section id="benchmarks-and-leaderboards" class="level3">
<h3 class="anchored" data-anchor-id="benchmarks-and-leaderboards">Benchmarks and leaderboards</h3>
<p>Agentic coding benchmarks are evolving rapidly. These track how well different models and agent scaffolds perform on real-world software engineering tasks:</p>
<ul>
<li><a href="https://www.swebench.com/">SWE-bench Leaderboards</a> — the most widely cited benchmark for agentic coding. Models are evaluated on their ability to resolve real GitHub issues from open-source Python repos. The “Verified” split is the standard comparison point, though <a href="https://scale.com/blog/swe-bench-pro">contamination concerns</a> have motivated harder variants</li>
<li><a href="https://scale.com/leaderboard/swe_bench_pro_public">SWE-bench Pro</a> — Scale AI’s harder benchmark (1,865 tasks across 41 repos). Top models that score 70%+ on SWE-bench Verified score only ~23% here</li>
<li><a href="https://openai.com/index/swe-lancer/">SWE-Lancer</a> — OpenAI’s benchmark based on 1,400+ real Upwork freelance tasks valued at $1M in payouts, ranging from $50 bug fixes to $32K feature implementations. Provides a natural difficulty gradient tied to real-world economics</li>
<li><a href="https://www.tbench.ai/">Terminal-Bench</a> — evaluates agents on multi-step terminal workflows (not just code generation). Tests planning, execution, and recovery in sandboxed command-line environments</li>
<li><a href="https://artificialanalysis.ai/insights/coding-agents-comparison">Coding Agents Comparison</a> — Artificial Analysis’s ongoing comparison with pricing breakdowns alongside benchmark scores</li>
<li><a href="https://www.anthropic.com/engineering/infrastructure-noise">Quantifying Infrastructure Noise in Agentic Coding Evals</a> — Anthropic’s analysis showing that a 2-point leaderboard lead may reflect hardware differences rather than genuine capability gaps — important context for interpreting any benchmark</li>
</ul>
<p><strong>Caveat:</strong> Benchmarks measure specific capabilities under controlled conditions. Real-world performance depends heavily on your prompt quality, project structure, and <code>CLAUDE.md</code> configuration. Use benchmarks to track the field’s trajectory, not to pick a tool.</p>
</section>
<section id="security" class="level3">
<h3 class="anchored" data-anchor-id="security">Security</h3>
<ul>
<li><a href="https://fortune.com/2025/12/15/ai-coding-tools-security-exploit-software/">AI Coding Tools Exploded in 2025. The First Security Exploits Show What Could Go Wrong</a> — Fortune’s reporting on the “IDEsaster” vulnerabilities found across Cursor, Copilot, Windsurf, and other tools</li>
<li><a href="https://thehackernews.com/2025/12/researchers-uncover-30-flaws-in-ai.html">Researcher Uncovers 30+ Flaws in AI Coding Tools</a> — technical breakdown of the universal attack chains affecting major AI IDEs</li>
<li><a href="https://devops.com/security-flaws-in-anthropics-claude-code-risk-stolen-data-system-takeover/">Security Flaws in Claude Code Risk Stolen Data, System Takeover</a> — Check Point’s findings on Claude Code-specific CVEs, including hook injection and API key theft</li>
<li><a href="https://blog.cyberdesserts.com/ai-agent-security-risks/">AI Agent Security Risks in 2026: A Practitioner’s Guide</a> — practical guide to defending against prompt injection, credential theft, and MCP vulnerabilities</li>
</ul>
<p>This is an area where best practices are being written in real time. What works today may be outdated in six months. Stay plugged into the communities above, and don’t assume any single tool or configuration is permanently “safe.”</p>
</section>
</section>
<section id="comments" class="level2">
<h2 class="anchored" data-anchor-id="comments">Comments</h2>


</section>

 ]]></description>
  <category>Blogs</category>
  <category>GenAI</category>
  <category>LLM</category>
  <category>Agentic coding</category>
  <category>Security</category>
  <guid>https://uw-madison-datascience.github.io/ML-X-Nexus/Learn/Blogs/claude-code-best-practices.html</guid>
  <pubDate>Tue, 10 Mar 2026 00:00:00 GMT</pubDate>
  <media:content url="https://uw-madison-datascience.github.io/ML-X-Nexus/images/claudecode.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>Understanding Quantization and Precision</title>
  <dc:creator>Chris Endemann</dc:creator>
  <link>https://uw-madison-datascience.github.io/ML-X-Nexus/Learn/Notebooks/Quantization-and-Precision.html</link>
  <description><![CDATA[ 




<p><a href="https://colab.research.google.com/github/UW-Madison-DataScience/ML-X-Nexus/blob/main/Learn/Notebooks/Quantization-and-Precision.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" class="img-fluid"></a></p>
<p>When working with large language models, you’ll often encounter terms like “FP32”, “FP16”, “INT8”, and “4-bit quantization.” These describe how a model’s weights are stored in memory — and they have a direct impact on how much GPU memory a model requires, how fast it runs, and whether it fits on your hardware at all.</p>
<p>This notebook unpacks these concepts step by step:</p>
<ol type="1">
<li>Precision: What floating-point formats (FP32, FP16, BF16) mean and how they affect memory.</li>
<li>Quantization: How tools like <code>bitsandbytes</code> reduce precision further (to 8-bit or 4-bit) to shrink memory footprints.</li>
<li>Parameter counts vs.&nbsp;memory: Why the number of model parameters stays the same, but memory usage changes.</li>
<li>A PyTorch gotcha: Why <code>model.parameters()</code> can report misleading numbers after quantization — and how to correctly count parameters.</li>
<li>When to quantize: Practical guidance on where quantization helps most, and where it doesn’t.</li>
</ol>
<section id="prerequisites" class="level3">
<h3 class="anchored" data-anchor-id="prerequisites">Prerequisites</h3>
<ul>
<li>Basic familiarity with PyTorch and Hugging Face <code>transformers</code></li>
<li>Access to a GPU runtime (e.g., Google Colab with T4)</li>
</ul>
</section>
<section id="setup" class="level2">
<h2 class="anchored" data-anchor-id="setup">Setup</h2>
<div id="e3b69479" class="cell" data-execution_count="1">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span>pip install <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>q transformers accelerate bitsandbytes torch</span></code></pre></div></div>
</div>
<div id="70fec532" class="cell" data-execution_count="2">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> torch</span>
<span id="cb2-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> gc</span>
<span id="cb2-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> transformers <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig</span></code></pre></div></div>
</div>
</section>
<section id="part-1-what-is-precision" class="level2">
<h2 class="anchored" data-anchor-id="part-1-what-is-precision">Part 1: What is precision?</h2>
<p>Every number in a neural network — every weight, bias, and activation — is stored as a sequence of bits. The <strong>precision</strong> (or data type) determines how many bits are used per number, which controls both the range and granularity of values that can be represented.</p>
<section id="the-ruler-analogy" class="level3">
<h3 class="anchored" data-anchor-id="the-ruler-analogy">The ruler analogy</h3>
<p>Think of precision like the markings on a ruler. A high-precision ruler has markings at every millimeter — you can represent fine distinctions like 3.217 cm vs.&nbsp;3.218 cm. A low-precision ruler might only have markings at each centimeter — you can still measure things, but 3.217 cm and 3.218 cm both round to 3 cm. You’ve lost the ability to distinguish them, but you need far less space to write down your measurement.</p>
<p>That’s exactly what happens with neural network weights. At FP32, a weight might be stored as <code>0.31415927</code>. At FP16, it becomes <code>0.3142</code> — close, but not identical. At 4-bit, it gets mapped to one of only 16 possible values, like <code>0.3125</code>. The question is whether those small differences matter for the model’s outputs. For most deep learning tasks, they don’t.</p>
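<p>You can reproduce this rounding without any ML libraries: Python’s <code>struct</code> module supports IEEE 754 half precision via the <code>'e'</code> format, so a round-trip through it shows exactly what FP16 storage does to a value (the helper name here is ours, for illustration):</p>

```python
import struct

def round_trip_fp16(x):
    """Round a Python float through IEEE 754 half-precision (FP16) storage."""
    # struct's 'e' format packs/unpacks 16-bit half-precision floats
    return struct.unpack('<e', struct.pack('<e', x))[0]

print(round_trip_fp16(0.31415927))  # ≈ 0.3142 — close, but not identical
print(round_trip_fp16(1.0))         # 1.0 — exactly representable, no loss
```

<p>Values like 1.0 survive unchanged because they fit exactly in 10 fraction bits; most “arbitrary” weights pick up a small rounding error, which is the coarser ruler in action.</p>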
</section>
<section id="how-floating-point-numbers-are-stored" class="level3">
<h3 class="anchored" data-anchor-id="how-floating-point-numbers-are-stored">How floating-point numbers are stored</h3>
<p>A floating-point number is stored in three parts:</p>
<ul>
<li><strong>Sign bit</strong> (1 bit): positive or negative</li>
<li><strong>Exponent bits</strong>: control the <em>range</em> — how large or small the number can be (like the power in scientific notation)</li>
<li><strong>Fraction bits</strong> (aka mantissa): control the <em>precision</em> — how many significant digits you get</li>
</ul>
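<p>These fields can be inspected directly in pure Python by reinterpreting a value’s raw bits with the standard <code>struct</code> module (the helper name is ours, for illustration):</p>

```python
import struct

def fp32_fields(x):
    """Split an FP32 value into its sign, exponent, and fraction bit strings."""
    # Reinterpret the 4-byte float as a 32-bit unsigned int, then format as bits
    bits = format(struct.unpack('>I', struct.pack('>f', x))[0], '032b')
    return bits[0], bits[1:9], bits[9:]  # 1 sign + 8 exponent + 23 fraction bits

sign, exponent, fraction = fp32_fields(1.0)
# 1.0 = +1.0 x 2^0: sign '0', biased exponent 127 ('01111111'), fraction all zeros
```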
<p>For example, FP32 uses 1 sign + 8 exponent + 23 fraction = 32 bits. FP16 cuts this to 1 + 5 + 10 = 16 bits. Fewer fraction bits means coarser rounding; fewer exponent bits means a narrower range of representable values. The table below summarizes the common formats:</p>
<table class="caption-top table">
<colgroup>
<col style="width: 15%">
<col style="width: 8%">
<col style="width: 14%">
<col style="width: 14%">
<col style="width: 27%">
<col style="width: 18%">
</colgroup>
<thead>
<tr class="header">
<th>Data type</th>
<th>Bits</th>
<th>Exponent</th>
<th>Fraction</th>
<th>Approximate range</th>
<th>Typical use</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>FP32 (float32)</td>
<td>32</td>
<td>8</td>
<td>23</td>
<td>~1e-38 to ~3e38</td>
<td>Default training precision</td>
</tr>
<tr class="even">
<td>FP16 (float16)</td>
<td>16</td>
<td>5</td>
<td>10</td>
<td>~6e-5 to 65504</td>
<td>Mixed-precision training</td>
</tr>
<tr class="odd">
<td>BF16 (bfloat16)</td>
<td>16</td>
<td>8</td>
<td>7</td>
<td>~1e-38 to ~3e38</td>
<td>Training on modern GPUs (A100, H100)</td>
</tr>
<tr class="even">
<td>INT8</td>
<td>8</td>
<td>—</td>
<td>—</td>
<td>-128 to 127</td>
<td>Post-training quantization</td>
</tr>
<tr class="odd">
<td>NF4 (4-bit)</td>
<td>4</td>
<td>—</td>
<td>—</td>
<td>16 discrete values</td>
<td>Aggressive quantization via bitsandbytes</td>
</tr>
</tbody>
</table>
<p>Note that INT8 and NF4 are <strong>integer/discrete</strong> formats — they don’t have exponent and fraction parts at all. They can only represent a small, fixed set of values, and real-valued weights must be <em>mapped</em> onto those values (more on this in Part 3).</p>
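<p>To make that mapping concrete, here is a toy absmax quantizer in plain Python. This is a deliberately simplified sketch of one common INT8 scheme — real libraries such as <code>bitsandbytes</code> quantize per block and handle outliers separately:</p>

```python
def quantize_int8(weights):
    """Toy absmax INT8 quantization: map floats onto integers in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127   # one scale for the whole list
    q = [round(w / scale) for w in weights]      # what gets stored (1 byte each)
    dequantized = [qi * scale for qi in q]       # approximate reconstruction
    return q, dequantized

q, deq = quantize_int8([0.5, -1.0, 0.31415927])
# The largest-magnitude weight maps to the edge of the range (-127);
# the others land on the nearest representable step and pick up rounding error.
```

<p>The stored integers plus one scale factor are all that survive — dequantization recovers only an approximation, and the gap between <code>deq</code> and the original weights is the quantization error.</p>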
</section>
<section id="key-insight-precision-controls-memory-per-parameter" class="level3">
<h3 class="anchored" data-anchor-id="key-insight-precision-controls-memory-per-parameter">Key insight: precision controls memory per parameter</h3>
<p>A model with 1 billion parameters requires:</p>
<ul>
<li><strong>4 GB</strong> at FP32 (4 bytes per param)</li>
<li><strong>2 GB</strong> at FP16/BF16 (2 bytes per param)</li>
<li><strong>1 GB</strong> at INT8 (1 byte per param)</li>
<li><strong>~0.5 GB</strong> at 4-bit (0.5 bytes per param)</li>
</ul>
<p>The number of parameters hasn’t changed — only how much memory each one occupies.</p>
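<p>The arithmetic behind those figures is just parameters × bytes per parameter (using 1 GB = 10⁹ bytes, as in the list above):</p>

```python
BYTES_PER_PARAM = {"FP32": 4, "FP16/BF16": 2, "INT8": 1, "4-bit": 0.5}

def weight_memory_gb(n_params, fmt):
    """Memory needed just for the weights, in GB (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[fmt] / 1e9

for fmt in BYTES_PER_PARAM:
    print(f"{fmt}: {weight_memory_gb(1_000_000_000, fmt):.1f} GB")
# FP32: 4.0 GB, FP16/BF16: 2.0 GB, INT8: 1.0 GB, 4-bit: 0.5 GB
```

<p>Keep in mind this counts weights only — actual GPU usage runs higher once activations, the KV cache, and (during training) optimizer state are included.</p>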
<p>Let’s verify this with a real model.</p>
</section>
</section>
<section id="part-2-loading-a-model-at-different-precisions" class="level2">
<h2 class="anchored" data-anchor-id="part-2-loading-a-model-at-different-precisions">Part 2: Loading a model at different precisions</h2>
<p>We’ll use a small model — GPT-2 (124M parameters) — to keep things manageable and demonstrate the concepts clearly.</p>
<section id="helper-measure-gpu-memory" class="level3">
<h3 class="anchored" data-anchor-id="helper-measure-gpu-memory">Helper: Measure GPU memory</h3>
<div id="02faabe8" class="cell" data-execution_count="3">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> get_gpu_memory_mb():</span>
<span id="cb3-2">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""Return current GPU memory allocated in MB."""</span></span>
<span id="cb3-3">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> torch.cuda.is_available():</span>
<span id="cb3-4">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> torch.cuda.memory_allocated() <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1024</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span></span>
<span id="cb3-5">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0</span></span>
<span id="cb3-6"></span>
<span id="cb3-7"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> load_and_measure(model_name, dtype<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>, quantization_config<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>, label<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">""</span>):</span>
<span id="cb3-8">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""Load a model and report memory usage and parameter info."""</span></span>
<span id="cb3-9">    gc.collect()</span>
<span id="cb3-10">    torch.cuda.empty_cache()</span>
<span id="cb3-11"></span>
<span id="cb3-12">    before <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> get_gpu_memory_mb()</span>
<span id="cb3-13"></span>
<span id="cb3-14">    kwargs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"device_map"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"auto"</span>}</span>
<span id="cb3-15">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> dtype:</span>
<span id="cb3-16">        kwargs[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"torch_dtype"</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> dtype</span>
<span id="cb3-17">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> quantization_config:</span>
<span id="cb3-18">        kwargs[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"quantization_config"</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> quantization_config</span>
<span id="cb3-19"></span>
<span id="cb3-20">    model <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> AutoModelForCausalLM.from_pretrained(model_name, <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span>kwargs)</span>
<span id="cb3-21"></span>
<span id="cb3-22">    after <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> get_gpu_memory_mb()</span>
<span id="cb3-23">    mem_used <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> after <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> before</span>
<span id="cb3-24"></span>
<span id="cb3-25">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Count parameters</span></span>
<span id="cb3-26">    total_params <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sum</span>(p.numel() <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> p <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> model.parameters())</span>
<span id="cb3-28"></span>
<span id="cb3-29">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\n</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'='</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">60</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb3-30">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"  </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>label<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb3-31">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'='</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">60</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb3-32">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"  GPU memory used:     </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>mem_used<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:,.1f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> MB"</span>)</span>
<span id="cb3-33">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"  model.parameters():  </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>total_params<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:,}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> (via numel())"</span>)</span>
<span id="cb3-34">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"  Expected params:     ~124,000,000 (GPT-2)"</span>)</span>
<span id="cb3-35"></span>
<span id="cb3-36">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Show dtypes present in model</span></span>
<span id="cb3-37">    dtypes <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">set</span>()</span>
<span id="cb3-38">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> p <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> model.parameters():</span>
<span id="cb3-39">        dtypes.add(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span>(p.dtype))</span>
<span id="cb3-40">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"  Parameter dtypes:    </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>dtypes<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb3-41"></span>
<span id="cb3-42">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> model, mem_used, total_params</span></code></pre></div></div>
</div>
</section>
<section id="fp32-default" class="level3">
<h3 class="anchored" data-anchor-id="fp32-default">FP32 (default)</h3>
<div id="5cdfddf5" class="cell" data-execution_count="4">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1">model_name <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"gpt2"</span></span>
<span id="cb4-2"></span>
<span id="cb4-3">model_fp32, mem_fp32, params_fp32 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> load_and_measure(</span>
<span id="cb4-4">    model_name, dtype<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>torch.float32, label<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"FP32 (32-bit floating point)"</span></span>
<span id="cb4-5">)</span>
<span id="cb4-6"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">del</span> model_fp32</span>
<span id="cb4-7">gc.collect()</span>
<span id="cb4-8">torch.cuda.empty_cache()</span></code></pre></div></div>
</div>
</section>
<section id="fp16-half-precision" class="level3">
<h3 class="anchored" data-anchor-id="fp16-half-precision">FP16 (half precision)</h3>
<div id="a4cb038f" class="cell" data-execution_count="5">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1">model_fp16, mem_fp16, params_fp16 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> load_and_measure(</span>
<span id="cb5-2">    model_name, dtype<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>torch.float16, label<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"FP16 (16-bit floating point)"</span></span>
<span id="cb5-3">)</span>
<span id="cb5-4"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">del</span> model_fp16</span>
<span id="cb5-5">gc.collect()</span>
<span id="cb5-6">torch.cuda.empty_cache()</span></code></pre></div></div>
</div>
</section>
<section id="bf16-bfloat16" class="level3">
<h3 class="anchored" data-anchor-id="bf16-bfloat16">BF16 (bfloat16)</h3>
<p>BF16 uses 16 bits like FP16, but allocates them differently (as shown in the table in Part 1). FP16 gives 10 bits to the fraction for finer precision, but only 5 bits to the exponent — which is why it caps out at 65,504 and can’t represent very small values. BF16 flips this trade-off: it keeps the same 8-bit exponent as FP32 (giving it the same massive range), at the cost of only 7 fraction bits. In practice, this works well for deep learning — the range matters more than fine-grained precision, and BF16 avoids the overflow/underflow issues that can plague FP16 during training.</p>
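<p>These two layouts can be verified with quick arithmetic from the bit counts alone. The following stdlib-only sketch (no PyTorch required) derives each format’s largest finite value from the usual IEEE-754 pattern, max = (2 − 2<sup>−fraction bits</sup>) × 2<sup>max exponent</sup>:</p>

```python
def max_finite(exp_bits: int, frac_bits: int) -> float:
    """Largest finite value for an IEEE-754-style format: 1 sign bit + exp_bits + frac_bits."""
    bias = 2 ** (exp_bits - 1) - 1        # 15 for FP16, 127 for BF16
    max_exp = (2 ** exp_bits - 2) - bias  # the top exponent code is reserved for inf/NaN
    return (2 - 2 ** -frac_bits) * 2.0 ** max_exp

fp16_max = max_finite(exp_bits=5, frac_bits=10)  # FP16: 5 exponent bits, 10 fraction bits
bf16_max = max_finite(exp_bits=8, frac_bits=7)   # BF16: 8 exponent bits, 7 fraction bits

print(f"FP16 max: {fp16_max:,.0f}")  # 65,504: the ceiling mentioned above
print(f"BF16 max: {bf16_max:.2e}")   # 3.39e+38, essentially FP32's range
```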
<div id="9b0ce0a3" class="cell" data-execution_count="6">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1">model_bf16, mem_bf16, params_bf16 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> load_and_measure(</span>
<span id="cb6-2">    model_name, dtype<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>torch.bfloat16, label<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"BF16 (bfloat16)"</span></span>
<span id="cb6-3">)</span>
<span id="cb6-4"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">del</span> model_bf16</span>
<span id="cb6-5">gc.collect()</span>
<span id="cb6-6">torch.cuda.empty_cache()</span></code></pre></div></div>
</div>
</section>
<section id="compare-precision-vs.-memory" class="level3">
<h3 class="anchored" data-anchor-id="compare-precision-vs.-memory">Compare: precision vs.&nbsp;memory</h3>
<div id="136f77a4" class="cell" data-execution_count="7">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\n</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">Memory comparison (GPT-2, 124M params):"</span>)</span>
<span id="cb7-2"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"  FP32:  </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>mem_fp32<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:,.1f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> MB"</span>)</span>
<span id="cb7-3"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"  FP16:  </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>mem_fp16<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:,.1f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> MB"</span>)</span>
<span id="cb7-4"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"  BF16:  </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>mem_bf16<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:,.1f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> MB"</span>)</span>
<span id="cb7-5"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\n</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">Parameter count (should be identical):"</span>)</span>
<span id="cb7-6"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"  FP32:  </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>params_fp32<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:,}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb7-7"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"  FP16:  </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>params_fp16<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:,}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb7-8"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"  BF16:  </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>params_bf16<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:,}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span></code></pre></div></div>
</div>
<p>At this point, the key takeaway should be clear: reducing precision halves memory, but the parameter count is unchanged. Every weight is still there — it just takes up less space.</p>
</section>
<section id="precision-reduction-vs.-quantization-whats-the-difference" class="level3">
<h3 class="anchored" data-anchor-id="precision-reduction-vs.-quantization-whats-the-difference">Precision reduction vs.&nbsp;quantization: what’s the difference?</h3>
<p>What we’ve done so far — loading a model in FP16 or BF16 instead of FP32 — is <strong>precision reduction</strong> (sometimes called “casting” or “downcasting”). It’s straightforward: each floating-point value is converted to a format with fewer bits, using standard IEEE rounding rules. The value <code>0.31415927</code> in FP32 becomes <code>0.3142</code> in FP16. There’s no special algorithm involved — it’s just rounding.</p>
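<p>This rounding can be reproduced with nothing but the Python standard library: <code>struct</code>’s <code>'e'</code> format is IEEE-754 half precision, so a round-trip through it shows exactly what FP16 keeps of a value (a sketch of the casting step itself, not of how <code>transformers</code> performs it internally):</p>

```python
import struct

def to_fp16(x: float) -> float:
    # Pack as IEEE-754 half precision ('e' format) and unpack: plain rounding, no algorithm.
    return struct.unpack('<e', struct.pack('<e', x))[0]

print(to_fp16(0.31415927))  # 0.314208984375, displayed as 0.3142 at four significant digits
print(to_fp16(1e-8))        # 0.0: underflows, since FP16's smallest subnormal is ~6e-8
```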
<p><strong>Quantization</strong> is fundamentally different. It doesn’t just round values to a lower-precision float — it <em>maps</em> them onto a small, discrete set of values (like the 256 integers in INT8, or just 16 values in NF4). This mapping requires decisions that simple rounding can’t make:</p>
<ul>
<li><strong>What range of weight values should map to what integers?</strong> (This is called calibration.)</li>
<li><strong>Should all layers use the same mapping, or should each layer be calibrated separately?</strong></li>
<li><strong>What do you do about outlier weights that fall far outside the typical range?</strong></li>
</ul>
<p>Different quantization algorithms answer these questions differently, and their choices directly affect how much quality you lose. That’s why quantization is a more involved process than just picking <code>torch.float16</code> — it’s a compression technique with real engineering behind it.</p>
</section>
</section>
<section id="part-3-quantization-mapping-weights-to-fewer-values" class="level2">
<h2 class="anchored" data-anchor-id="part-3-quantization-mapping-weights-to-fewer-values">Part 3: Quantization — mapping weights to fewer values</h2>
<p>Going back to our ruler analogy: precision reduction is like switching from a millimeter ruler to a centimeter ruler — you still have a continuous ruler, just with fewer markings. Quantization is like replacing the ruler entirely with a set of labeled bins. Every weight gets sorted into the nearest bin, and from that point on, it’s stored as just a bin number (an integer). The bins are chosen carefully so that the most common weight values land close to a bin center, minimizing the error introduced by this binning.</p>
<p>Here’s the key idea more concretely. Suppose a layer has weights ranging from -1.0 to 1.0, and you’re quantizing to INT8 (256 possible values). A simple approach would:</p>
<ol type="1">
<li><strong>Find the range</strong> of the weights: min = -1.0, max = 1.0.</li>
<li><strong>Divide the range</strong> into 256 equally spaced bins, each spanning ~0.0078.</li>
<li><strong>Map each weight</strong> to the nearest bin center and store just the bin index (an integer from 0 to 255).</li>
<li><strong>Store the scale factor</strong> (bin width) and <strong>zero point</strong> so you can approximately reconstruct the original value later: <code>reconstructed ≈ scale × integer + zero_point</code>.</li>
</ol>
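<p>The four steps fit in a few lines of plain Python. This is an illustrative sketch of the scheme, not the implementation any particular library uses:</p>

```python
def quantize_uniform(weights):
    """Uniform (linear) quantization to 256 bins, following steps 1-4 above."""
    lo, hi = min(weights), max(weights)             # step 1: find the range
    scale = (hi - lo) / 255                         # step 2: 256 equally spaced levels
    q = [round((w - lo) / scale) for w in weights]  # step 3: nearest bin index, 0..255
    return q, scale, lo                             # step 4: keep scale + zero point

def dequantize(q, scale, zero_point):
    # Approximate reconstruction: reconstructed = scale * integer + zero_point
    return [scale * i + zero_point for i in q]

weights = [-1.0, -0.5, 0.0, 0.31415927, 0.5, 1.0]
q, scale, zp = quantize_uniform(weights)
approx = dequantize(q, scale, zp)
max_err = max(abs(w - a) for w, a in zip(weights, approx))

print(q)                            # [0, 64, 128, 168, 191, 255]: only bin indices are stored
print(f"max error: {max_err:.4f}")  # 0.0039: bounded by half a bin width
```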
<p>This is called <strong>linear (uniform) quantization</strong>, and it’s the simplest scheme. More advanced methods — like the ones used in practice — improve on this in important ways:</p>
<ul>
<li><a href="https://arxiv.org/abs/2208.07339"><strong>LLM.int8()</strong></a> (Dettmers et al., 2022) discovered that a small fraction of “outlier” features in transformer models have very large magnitudes. If you force these into the same bins as normal-range weights, quality collapses. Their solution: detect outlier features at runtime, keep them in FP16, and quantize only the remaining ~99.9% of values to INT8. This mixed-precision decomposition makes 8-bit quantization effectively lossless.</li>
<li><a href="https://arxiv.org/abs/2305.14314"><strong>NF4</strong></a> (Dettmers et al., 2023) takes a different approach for 4-bit. Instead of spacing bins evenly, it places them at the quantiles of a normal distribution — because neural network weights are approximately normally distributed. This means bins are denser where weights are most concentrated (near zero) and sparser in the tails, making optimal use of only 16 possible values. <strong>Double quantization</strong> further compresses the scale factors themselves, saving additional memory.</li>
<li><a href="https://arxiv.org/abs/2210.17323"><strong>GPTQ</strong></a> (Frantar et al., 2023) uses a one-shot weight quantization approach that considers the interaction between weights: when one weight is rounded to a bin, it adjusts the remaining weights to compensate for the rounding error. This layer-wise optimization enables 3-4 bit quantization of 175B-parameter models with negligible accuracy loss.</li>
</ul>
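<p>The quantile idea behind NF4 is easy to see with the standard library’s <code>statistics.NormalDist</code>. This sketch places 16 bin centers at evenly spaced quantiles of a standard normal; it illustrates the principle only, since the actual NF4 code values in the QLoRA paper differ in detail (for instance, they guarantee an exact zero):</p>

```python
from statistics import NormalDist

# 16 bin centers at the midpoint quantiles of a standard normal distribution,
# rescaled so the outermost centers land at -1 and +1 (weights are normalized per block).
n = NormalDist()
centers = [n.inv_cdf((i + 0.5) / 16) for i in range(16)]
limit = max(abs(c) for c in centers)
centers = [c / limit for c in centers]

# Bins are densest where weights cluster (near zero) and sparsest in the tails:
gap_near_zero = centers[8] - centers[7]
gap_at_tail = centers[15] - centers[14]
print(f"gap near zero: {gap_near_zero:.3f}, gap at tail: {gap_at_tail:.3f}")
```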
<p>The tools below make these algorithms accessible through a simple configuration interface.</p>
<section id="using-bitsandbytes-for-quantization" class="level3">
<h3 class="anchored" data-anchor-id="using-bitsandbytes-for-quantization">Using bitsandbytes for quantization</h3>
<p><a href="https://github.com/bitsandbytes-foundation/bitsandbytes"><code>bitsandbytes</code></a> integrates directly with Hugging Face <code>transformers</code>, letting you apply these quantization algorithms at model load time with just a configuration flag. Let’s see the memory savings in practice.</p>
</section>
<section id="bit-quantization" class="level3">
<h3 class="anchored" data-anchor-id="bit-quantization">8-bit quantization</h3>
<div id="a911d281" class="cell" data-execution_count="8">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1">bnb_config_8bit <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> BitsAndBytesConfig(load_in_8bit<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span>
<span id="cb8-2"></span>
<span id="cb8-3">model_8bit, mem_8bit, params_8bit <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> load_and_measure(</span>
<span id="cb8-4">    model_name, quantization_config<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>bnb_config_8bit, label<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"INT8 (8-bit via bitsandbytes)"</span></span>
<span id="cb8-5">)</span>
<span id="cb8-6"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">del</span> model_8bit</span>
<span id="cb8-7">gc.collect()</span>
<span id="cb8-8">torch.cuda.empty_cache()</span></code></pre></div></div>
</div>
</section>
<section id="bit-quantization-nf4" class="level3">
<h3 class="anchored" data-anchor-id="bit-quantization-nf4">4-bit quantization (NF4)</h3>
<div id="7c6cf339" class="cell" data-execution_count="9">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb9" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb9-1">bnb_config_4bit <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> BitsAndBytesConfig(</span>
<span id="cb9-2">    load_in_4bit<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>,</span>
<span id="cb9-3">    bnb_4bit_quant_type<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"nf4"</span>,           <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># normalized float 4-bit</span></span>
<span id="cb9-4">    bnb_4bit_use_double_quant<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>,       <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># further compress quantization constants</span></span>
<span id="cb9-5">    bnb_4bit_compute_dtype<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>torch.float16  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># compute in FP16 for speed</span></span>
<span id="cb9-6">)</span>
<span id="cb9-7"></span>
<span id="cb9-8">model_4bit, mem_4bit, params_4bit <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> load_and_measure(</span>
<span id="cb9-9">    model_name, quantization_config<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>bnb_config_4bit, label<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"NF4 (4-bit via bitsandbytes)"</span></span>
<span id="cb9-10">)</span></code></pre></div></div>
</div>
</section>
<section id="compare-all-configurations" class="level3">
<h3 class="anchored" data-anchor-id="compare-all-configurations">Compare all configurations</h3>
<div id="2fc37166" class="cell" data-execution_count="10">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb10" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb10-1"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\n</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'='</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">60</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb10-2"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"  Summary: GPT-2 (124M params) at different precisions"</span>)</span>
<span id="cb10-3"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'='</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">60</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb10-4"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"  </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Config'</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:&lt;12}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Memory (MB)'</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:&gt;14}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'numel()'</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:&gt;16}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb10-5"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"  </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'-'</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">44</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb10-6"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"  </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'FP32'</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:&lt;12}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>mem_fp32<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:&gt;14,.1f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>params_fp32<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:&gt;16,}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb10-7"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"  </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'FP16'</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:&lt;12}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>mem_fp16<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:&gt;14,.1f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>params_fp16<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:&gt;16,}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb10-8"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"  </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'BF16'</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:&lt;12}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>mem_bf16<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:&gt;14,.1f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>params_bf16<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:&gt;16,}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb10-9"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"  </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'INT8'</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:&lt;12}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>mem_8bit<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:&gt;14,.1f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>params_8bit<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:&gt;16,}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb10-10"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"  </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'NF4 (4-bit)'</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:&lt;12}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>mem_4bit<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:&gt;14,.1f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>params_4bit<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:&gt;16,}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span></code></pre></div></div>
</div>
</section>
</section>
<section id="part-4-the-pytorch-parameter-count-gotcha" class="level2">
<h2 class="anchored" data-anchor-id="part-4-the-pytorch-parameter-count-gotcha">Part 4: The PyTorch parameter count gotcha</h2>
<p>If you ran the summary above, you may have noticed something odd: the 4-bit model reports ~82M parameters via <code>numel()</code> instead of the expected ~124M. Did quantization remove 42 million weights?</p>
<p>No.&nbsp;The model still has the same architecture and the same logical number of parameters. The discrepancy comes from how <code>bitsandbytes</code> stores quantized weights — and from the fact that <strong>not all parameters get quantized</strong>.</p>
<section id="what-gets-quantized-and-what-doesnt" class="level3">
<h3 class="anchored" data-anchor-id="what-gets-quantized-and-what-doesnt">What gets quantized (and what doesn’t)</h3>
<p>When you load a model with <code>bitsandbytes</code>, only the large linear layer weight matrices are quantized. Smaller parameters — biases, layer normalization weights, and embedding layers — are kept in their original precision (typically FP16 or FP32). This is by design: these small parameters contribute little to total memory, and quantizing them would hurt quality disproportionately.</p>
</section>
<section id="why-numel-is-misleading-for-quantized-parameters" class="level3">
<h3 class="anchored" data-anchor-id="why-numel-is-misleading-for-quantized-parameters">Why <code>numel()</code> is misleading for quantized parameters</h3>
<p>For the parameters that <em>are</em> quantized, <code>bitsandbytes</code> packs multiple low-bit values into each byte:</p>
<ul>
<li><strong>8-bit</strong>: Each weight occupies 1 byte, stored as a <code>uint8</code> tensor. <code>numel()</code> still returns the correct count here, since it’s one element per weight.</li>
<li><strong>4-bit</strong>: Two weights are packed into a single byte, stored as a <code>uint8</code> tensor of <strong>half the length</strong>.</li>
</ul>
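<p>The packing itself is simple bit arithmetic: two 4-bit codes share one byte. Here is a sketch of the storage layout (not <code>bitsandbytes</code>’ actual kernels):</p>

```python
def pack_4bit(codes):
    """Pack pairs of 4-bit values (0..15) into single bytes, halving storage numel()."""
    assert len(codes) % 2 == 0 and all(0 <= c <= 15 for c in codes)
    return bytes((codes[i] << 4) | codes[i + 1] for i in range(0, len(codes), 2))

def unpack_4bit(packed):
    out = []
    for b in packed:
        out += [b >> 4, b & 0x0F]  # high nibble first, then low nibble
    return out

codes = [3, 12, 0, 15, 7, 7]         # six logical 4-bit weights...
packed = pack_4bit(codes)
print(len(packed))                   # 3: ...stored in three bytes
print(unpack_4bit(packed) == codes)  # True: the packing itself is lossless
```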
<p>When you call <code>p.numel()</code> on a 4-bit quantized parameter, PyTorch reports the number of elements in the storage tensor (the packed <code>uint8</code> values), not the number of logical weights. Since two 4-bit values are packed into one <code>uint8</code> element, <code>numel()</code> returns half the true count for those parameters. Combined with the non-quantized parameters (which report correctly), the total <code>numel()</code> across the model ends up somewhere between the true count and half — in GPT-2’s case, ~82M instead of ~124M.</p>
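<p>A back-of-the-envelope check makes the reported number plausible. The split below is approximate (GPT-2’s embeddings, biases, and layer norms come to roughly 40M parameters; the exact figures depend on the model):</p>

```python
total = 124_000_000         # logical GPT-2 parameter count (approximate)
non_quantized = 40_000_000  # embeddings, biases, layer norms: counted correctly by numel()
quantized = total - non_quantized  # linear-layer weights, packed two per byte at 4-bit

reported = quantized // 2 + non_quantized  # what summing numel() actually sees
print(f"{reported:,}")  # 82,000,000: in line with the ~82M numel() reports
```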
</section>
<section id="correctly-counting-parameters" class="level3">
<h3 class="anchored" data-anchor-id="correctly-counting-parameters">Correctly counting parameters</h3>
<p>To get the real parameter count, we need to check whether each parameter is a quantized <code>bitsandbytes</code> type and recover the original shape:</p>
<div id="df53baa6" class="cell" data-execution_count="11">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb11" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb11-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> bitsandbytes <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> bnb</span>
<span id="cb11-2"></span>
<span id="cb11-3"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> count_parameters_correct(model):</span>
<span id="cb11-4">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""Count parameters correctly, handling bitsandbytes quantized layers."""</span></span>
<span id="cb11-5">    total <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span></span>
<span id="cb11-6">    quantized <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span></span>
<span id="cb11-7">    non_quantized <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span></span>
<span id="cb11-8"></span>
<span id="cb11-9">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> name, param <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> model.named_parameters():</span>
<span id="cb11-10">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">hasattr</span>(param, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"quant_state"</span>):</span>
<span id="cb11-11">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># This is a bitsandbytes quantized parameter.</span></span>
<span id="cb11-12">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># The original shape is stored in quant_state.</span></span>
<span id="cb11-13">            original_numel <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> param.quant_state.shape.numel()</span>
<span id="cb11-14">            total <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> original_numel</span>
<span id="cb11-15">            quantized <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> original_numel</span>
<span id="cb11-16">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span>:</span>
<span id="cb11-17">            total <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> param.numel()</span>
<span id="cb11-18">            non_quantized <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> param.numel()</span>
<span id="cb11-19"></span>
<span id="cb11-20">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> total, quantized, non_quantized</span></code></pre></div></div>
</div>
<div id="8e34a5c5" class="cell" data-execution_count="12">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb12" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb12-1">naive_count <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sum</span>(p.numel() <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> p <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> model_4bit.parameters())</span>
<span id="cb12-2">correct_total, quantized_params, non_quantized_params <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> count_parameters_correct(model_4bit)</span>
<span id="cb12-3"></span>
<span id="cb12-4"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"4-bit quantized GPT-2:"</span>)</span>
<span id="cb12-5"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"  Naive numel() count:          </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>naive_count<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:,}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb12-6"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"  Correct parameter count:      </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>correct_total<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:,}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb12-7"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"    - Quantized (true count):   </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>quantized_params<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:,}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">  (numel reports ~</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>quantized_params <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">//</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:,}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> due to packing)"</span>)</span>
<span id="cb12-8"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"    - Non-quantized:            </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>non_quantized_params<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:,}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">  (numel reports correctly)"</span>)</span>
<span id="cb12-9"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"  Expected (GPT-2):             ~124,000,000"</span>)</span>
<span id="cb12-10"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\n</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">  Sanity check: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>quantized_params <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">//</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:,}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> (packed) + </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>non_quantized_params<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:,}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> (unquantized) ≈ </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>naive_count<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:,}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> (naive total) ✓"</span>)</span></code></pre></div></div>
</div>
</section>
<section id="whats-happening-under-the-hood" class="level3">
<h3 class="anchored" data-anchor-id="whats-happening-under-the-hood">What’s happening under the hood</h3>
<p>Let’s peek at an individual layer to see the difference between the stored tensor shape and the logical weight shape:</p>
<div id="1b42b8bd" class="cell" data-execution_count="13">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb13" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb13-1"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> name, param <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> model_4bit.named_parameters():</span>
<span id="cb13-2">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">hasattr</span>(param, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"quant_state"</span>):</span>
<span id="cb13-3">        <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Layer: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>name<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb13-4">        <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"  Storage tensor shape: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>param<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>shape<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb13-5">        <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"  Storage dtype:        </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>param<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>dtype<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb13-6">        <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"  numel() reports:      </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>param<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>numel()<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:,}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb13-7">        <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"  Original shape:       </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>param<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>quant_state<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>shape<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb13-8">        <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"  Original numel:       </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>param<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>quant_state<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>shape<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>numel()<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:,}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb13-9">        <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>()</span>
<span id="cb13-10">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">break</span>  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># just show one example</span></span></code></pre></div></div>
</div>
<p>This confirms that quantization doesn’t remove parameters — it repacks them into a more compact representation. The model’s architecture and logical weight count are unchanged, but the storage is compressed.</p>
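<p>To see this counting logic without a GPU or bitsandbytes installed, here is a toy illustration using plain-Python mock objects (the <code>MockParam</code> and <code>MockQuantState</code> names are invented for this sketch; real bitsandbytes tensors behave analogously but are not these classes):</p>

```python
import math

# Toy stand-ins for bitsandbytes objects (invented for illustration):
# a 4-bit layer packs two values per uint8 byte, so its storage tensor
# holds half as many elements as the logical weight matrix.
class MockQuantState:
    def __init__(self, logical_shape):
        self.shape = logical_shape  # original (out_features, in_features)

class MockShape(tuple):
    def numel(self):
        return math.prod(self)

class MockParam:
    def __init__(self, logical_shape, quantized):
        if quantized:
            # packed storage: two 4-bit values per byte
            self._stored = math.prod(logical_shape) // 2
            self.quant_state = MockQuantState(MockShape(logical_shape))
        else:
            self._stored = math.prod(logical_shape)

    def numel(self):
        return self._stored

layer = MockParam((768, 3072), quantized=True)
print(layer.numel())                    # storage elements: 1,179,648
print(layer.quant_state.shape.numel())  # logical elements: 2,359,296
```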
</section>
<section id="summary" class="level3">
<h3 class="anchored" data-anchor-id="summary">Summary</h3>
<table class="caption-top table">
<colgroup>
<col style="width: 45%">
<col style="width: 54%">
</colgroup>
<thead>
<tr class="header">
<th>What you check</th>
<th>What it tells you</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><code>p.numel()</code></td>
<td>Number of elements in the storage tensor (misleading for quantized params)</td>
</tr>
<tr class="even">
<td><code>p.quant_state.shape</code></td>
<td>The original logical shape of the weight matrix</td>
</tr>
<tr class="odd">
<td>GPU memory usage</td>
<td>The actual memory footprint — the metric that matters for fitting on your hardware</td>
</tr>
</tbody>
</table>
<p>The bottom line: quantization does not remove parameters. It changes how they’re stored. Always use <code>quant_state</code> or measure GPU memory directly if you want an accurate picture of a quantized model.</p>
</section>
</section>
<section id="part-5-when-to-quantize-and-how-aggressively" class="level2">
<h2 class="anchored" data-anchor-id="part-5-when-to-quantize-and-how-aggressively">Part 5: When to quantize (and how aggressively)</h2>
<p>Now that we’ve seen how quantization works mechanically, the natural next question is: when should you actually use it, and how far should you go?</p>
<section id="quantization-is-primarily-an-inference-technique" class="level3">
<h3 class="anchored" data-anchor-id="quantization-is-primarily-an-inference-technique">Quantization is primarily an inference technique</h3>
<p>Quantization shines at inference time. Training requires high-precision gradients to make stable updates to model weights, and aggressive quantization (8-bit or below) introduces too much noise for standard backpropagation to work well. For this reason, most models are trained at FP32, BF16, or with mixed-precision strategies (FP16 compute with FP32 accumulation), and then quantized after training for deployment.</p>
<p>The notable exception is <a href="https://arxiv.org/abs/2305.14314">QLoRA</a> (Dettmers et al., 2023), which freezes a 4-bit quantized base model and trains only small low-rank adapter (LoRA) layers in higher precision. This makes it possible to fine-tune a 65B-parameter model on a single 48GB GPU — but the base weights themselves are never updated in low precision.</p>
</section>
<section id="a-bigger-model-at-lower-precision-often-beats-a-smaller-model-at-full-precision" class="level3">
<h3 class="anchored" data-anchor-id="a-bigger-model-at-lower-precision-often-beats-a-smaller-model-at-full-precision">A bigger model at lower precision often beats a smaller model at full precision</h3>
<p>One of the most practical insights from quantization research: you can often get better results by running a larger model at 4-bit than a smaller model at FP16, using the same GPU memory. For example:</p>
<ul>
<li>A <strong>70B model at 4-bit</strong> (~35 GB) can fit on a single 48GB GPU and typically outperforms a <strong>13B model at FP16</strong> (~26 GB) on reasoning and knowledge benchmarks.</li>
<li>A <strong>13B model at 4-bit</strong> (~6.5 GB) fits comfortably on a 12GB consumer GPU and often outperforms a <strong>7B model at FP16</strong> (~14 GB, which wouldn’t even fit).</li>
</ul>
<p>The rule of thumb: <strong>spend your memory budget on more parameters first, then reduce precision to fit.</strong> A 4-bit model loses a small amount of quality from quantization, but it gains far more from having access to more learned knowledge and capacity.</p>
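<p>The arithmetic behind those numbers is easy to reproduce. A rough weight-only estimator (it ignores KV cache, activations, and quantization metadata such as scales and block constants, so real footprints run somewhat higher):</p>

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Rough weight-only memory estimate in GB; ignores KV cache,
    activations, and quantization metadata (scales, zero points)."""
    return n_params * bits_per_weight / 8 / 1e9

print(f"70B @ 4-bit:  {weight_memory_gb(70e9, 4):.1f} GB")   # 35.0 GB
print(f"13B @ 16-bit: {weight_memory_gb(13e9, 16):.1f} GB")  # 26.0 GB
print(f"13B @ 4-bit:  {weight_memory_gb(13e9, 4):.1f} GB")   # 6.5 GB
```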
</section>
<section id="speed-and-cost-quantization-isnt-just-about-fitting" class="level3">
<h3 class="anchored" data-anchor-id="speed-and-cost-quantization-isnt-just-about-fitting">Speed and cost: quantization isn’t just about fitting</h3>
<p>Quantization isn’t just about fitting a model onto your GPU — it also makes inference faster. Lower-precision operations use less memory bandwidth, and for small batch sizes (common in interactive applications), memory bandwidth is often the bottleneck. So a 4-bit model doesn’t just use about one-eighth the memory of FP32 — it can also generate tokens noticeably faster.</p>
<p>This creates a practical consideration:</p>
<ul>
<li>The question worth asking is not just “which model is most accurate?” but “which model gives me acceptable quality at the speed and cost I need?”</li>
<li>For batch applications (processing thousands of documents), the speed improvement from quantization can cut costs significantly.</li>
<li>For interactive applications (chatbots, coding assistants), faster token generation directly improves user experience.</li>
</ul>
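<p>A back-of-envelope model of why this matters: at batch size 1, every weight must be streamed through the GPU roughly once per generated token, so memory bandwidth divided by model size gives a ceiling on decoding speed. The bandwidth and model-size numbers below are illustrative assumptions, not benchmarks:</p>

```python
def decode_tokens_per_sec_ceiling(model_bytes, bandwidth_bytes_per_sec):
    """Bandwidth-bound ceiling on single-stream decoding speed:
    every weight byte is read about once per generated token."""
    return bandwidth_bytes_per_sec / model_bytes

bw = 2e12         # ~2 TB/s, roughly an A100-class GPU (assumed figure)
fp16_13b = 26e9   # 13B model at 16-bit
int4_13b = 6.5e9  # same model at 4-bit

print(f"FP16 ceiling:  {decode_tokens_per_sec_ceiling(fp16_13b, bw):.0f} tok/s")
print(f"4-bit ceiling: {decode_tokens_per_sec_ceiling(int4_13b, bw):.0f} tok/s")
# The 4-bit ceiling is 4x higher, purely from reading fewer bytes per token.
```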
</section>
<section id="how-low-can-you-go" class="level3">
<h3 class="anchored" data-anchor-id="how-low-can-you-go">How low can you go?</h3>
<p>Research suggests <strong>4-bit is the practical sweet spot</strong> for inference:</p>
<ul>
<li><a href="https://arxiv.org/abs/2212.09720">Dettmers &amp; Zettlemoyer (2023)</a> ran over 35,000 quantization experiments and found that 4-bit precision is nearly universally optimal when trading off total model bits against zero-shot accuracy. At 3-bit, quality degrades sharply.</li>
<li>8-bit quantization (via <a href="https://arxiv.org/abs/2208.07339">LLM.int8()</a>) is effectively lossless for most models — it’s a safe default when memory is tight but you don’t want to risk any quality loss.</li>
<li><a href="https://arxiv.org/abs/2210.17323">GPTQ</a> (Frantar et al., 2023) demonstrated that one-shot weight quantization to 3-4 bits is feasible even for 175B-parameter models with negligible accuracy loss, enabling single-GPU inference for models that otherwise require multiple GPUs.</li>
</ul>
<p>A recent study on <a href="https://arxiv.org/abs/2411.04330">Scaling Laws for Precision</a> (Kumar, Ankner et al., 2024) adds important nuance: <strong>the quality degradation from post-training quantization grows as models are trained on more data.</strong> A model trained to its full data budget may be more sensitive to aggressive quantization than one that was undertrained. This means the “safe” bit-width may shift upward as foundation models continue to scale their training data.</p>
</section>
<section id="decision-flowchart" class="level3">
<h3 class="anchored" data-anchor-id="decision-flowchart">Decision flowchart</h3>
<p>When deciding whether and how to quantize, work through these questions:</p>
<ol type="1">
<li><strong>Does the model fit on your GPU at FP16/BF16?</strong> If yes, start there — no quantization needed unless you want faster inference.</li>
<li><strong>Is this for training or inference?</strong>
<ul>
<li><em>Training from scratch</em>: Use BF16 (or FP32 if your GPU lacks BF16 support). Don’t quantize.</li>
<li><em>Fine-tuning</em>: If the full model doesn’t fit, use QLoRA (4-bit base + FP16 adapters).</li>
<li><em>Inference</em>: Continue to step 3.</li>
</ul></li>
<li><strong>How much quality loss can you tolerate?</strong>
<ul>
<li><em>None</em>: Use INT8. It’s effectively lossless for most models.</li>
<li><em>Minimal, with significant memory savings</em>: Use 4-bit NF4 with double quantization.</li>
<li><em>You need extreme compression</em>: Try 3-bit, but <strong>benchmark on your specific task</strong> — quality loss at 3-bit is sharp and task-dependent.</li>
</ul></li>
<li><strong>Could you run a larger model by quantizing more aggressively?</strong> Often the answer is yes. A 70B model at 4-bit typically beats a 13B model at 16-bit.</li>
</ol>
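<p>The decision steps above can be sketched as a small helper function (a paraphrase for illustration, not an official API; the labels and branch order are this sketch’s own simplification):</p>

```python
def recommend_precision(fits_at_16bit, task, quality_tolerance="minimal"):
    """Paraphrase of the decision steps above.
    task: 'train', 'finetune', or 'inference'
    quality_tolerance: 'none', 'minimal', or 'extreme'"""
    if task == "train":
        return "BF16 (or FP32 if BF16 unsupported); do not quantize"
    if task == "finetune":
        if fits_at_16bit:
            return "BF16/FP16 full fine-tune"
        return "QLoRA: 4-bit base + FP16 adapters"
    # inference
    if fits_at_16bit and quality_tolerance == "none":
        return "FP16/BF16, no quantization needed"
    return {
        "none": "INT8 (effectively lossless)",
        "minimal": "4-bit NF4 with double quantization",
        "extreme": "3-bit, but benchmark on your task",
    }[quality_tolerance]

print(recommend_precision(False, "inference", "minimal"))
# -> 4-bit NF4 with double quantization
```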
</section>
<section id="quick-reference" class="level3">
<h3 class="anchored" data-anchor-id="quick-reference">Quick reference</h3>
<table class="caption-top table">
<colgroup>
<col style="width: 31%">
<col style="width: 68%">
</colgroup>
<thead>
<tr class="header">
<th>Scenario</th>
<th>Recommended precision</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Training from scratch</td>
<td>FP32 or BF16 (mixed precision)</td>
</tr>
<tr class="even">
<td>Fine-tuning (full)</td>
<td>BF16 or FP16</td>
</tr>
<tr class="odd">
<td>Fine-tuning (parameter-efficient on limited hardware)</td>
<td>QLoRA (4-bit base + FP16 adapters)</td>
</tr>
<tr class="even">
<td>Inference (quality-sensitive)</td>
<td>8-bit (INT8) — effectively lossless</td>
</tr>
<tr class="odd">
<td>Inference (memory/speed-constrained)</td>
<td>4-bit (NF4 or GPTQ) — slight quality loss, large memory/speed gain</td>
</tr>
<tr class="even">
<td>Inference (extreme compression)</td>
<td>3-bit or below — expect meaningful quality loss, benchmark carefully</td>
</tr>
<tr class="odd">
<td>Choosing between model sizes</td>
<td>Prefer larger model at lower precision over smaller model at full precision</td>
</tr>
</tbody>
</table>
</section>
</section>
<section id="cleanup" class="level2">
<h2 class="anchored" data-anchor-id="cleanup">Cleanup</h2>
<div id="d2355f02" class="cell" data-execution_count="14">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb14" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb14-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">del</span> model_4bit</span>
<span id="cb14-2">gc.collect()</span>
<span id="cb14-3">torch.cuda.empty_cache()</span>
<span id="cb14-4"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"GPU memory cleared."</span>)</span></code></pre></div></div>
</div>
</section>
<section id="related-resources" class="level2">
<h2 class="anchored" data-anchor-id="related-resources">Related resources</h2>
<ul>
<li><a href="https://uw-madison-datascience.github.io/ML-X-Nexus/Learn/Notebooks/2025-05-07_RAG-Romeo-Juliet.html"><strong>Notebook</strong>: Exploring Fact-Based QA with RAG: Romeo and Juliet</a>: A RAG tutorial that uses bitsandbytes for running quantized models in resource-constrained environments.</li>
<li><a href="https://huggingface.co/docs/transformers/main/en/quantization/bitsandbytes">Hugging Face: bitsandbytes integration</a>: Official documentation on using bitsandbytes with the <code>transformers</code> library.</li>
<li><a href="https://github.com/bitsandbytes-foundation/bitsandbytes">bitsandbytes on GitHub</a>: The bitsandbytes library for 8-bit and 4-bit quantization.</li>
</ul>
</section>
<section id="comments" class="level2">
<h2 class="anchored" data-anchor-id="comments">Comments</h2>


</section>

 ]]></description>
  <category>Notebooks</category>
  <category>Code-along</category>
  <category>Compute</category>
  <category>Deep learning</category>
  <category>LLM</category>
  <category>Quantization</category>
  <category>GPU</category>
  <category>Hugging Face</category>
  <category>PyTorch</category>
  <guid>https://uw-madison-datascience.github.io/ML-X-Nexus/Learn/Notebooks/Quantization-and-Precision.html</guid>
  <pubDate>Mon, 09 Mar 2026 00:00:00 GMT</pubDate>
  <media:content url="https://uw-madison-datascience.github.io/ML-X-Nexus/images/bitsandbytes.png" medium="image" type="image/png" height="144" width="144"/>
</item>
<item>
  <title>Intro to GCP for Machine Learning &amp; AI</title>
  <dc:creator>Chris Endemann</dc:creator>
  <link>https://uw-madison-datascience.github.io/ML-X-Nexus/Learn/Workshops/Intro-GCP.html</link>
  <description><![CDATA[ 




<p>This <a href="https://qualiamachine.github.io/Intro_GCP_for_ML/">Intro to GCP</a> workshop teaches core workflows for building, training, and tuning ML/AI models in Google Cloud’s Vertex AI platform. Participants learn to set up data, configure Vertex AI Workbench notebooks, launch training and tuning jobs, and optimize resource costs effectively within GCP. The workshop also includes a section on building retrieval-augmented generation (RAG) pipelines using Gemini models.</p>
<div class="callout callout-style-default callout-tip callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Tip</span>UW-Madison Cloud Users
</div>
</div>
<div class="callout-body-container callout-body">
<p>A personal GCP account is fine for this workshop. However, for <strong>long-term research use</strong>, we recommend switching to a <strong>UW-provisioned GCP account</strong>. You’ll get institutional pricing, <a href="https://rsp.wisc.edu/proposalprep/cloudComputeInfo.cfm">lower overhead on grants</a> (26% instead of 55.5% — saving ~$2,950 per $10k in cloud costs), data protection agreements (including BAA for HIPAA), and dedicated support from the <a href="https://kb.wisc.edu/page.php?id=109785">Public Cloud Team</a>. NIH-funded researchers can get additional discounts through the <a href="https://kb.wisc.edu/109813">STRIDES Initiative</a>. You can also <a href="https://edu.google.com/intl/ALL_us/programs/credits/research/">apply for $5,000 in Google Cloud Research Credits</a>.</p>
<p><strong><a href="https://kb.wisc.edu/data/100171">Request a UW GCP account</a></strong> | <strong><a href="https://kb.wisc.edu/page.php?id=109785">Why use a UW account?</a></strong> | <strong><a href="https://uw-madison-datascience.github.io/ML-X-Nexus/Toolbox/Compute/UW-Cloud-Services.html">Full details: UW Cloud Services</a></strong></p>
</div>
</div>
<section id="cost-estimate" class="level4">
<h4 class="anchored" data-anchor-id="cost-estimate">Cost estimate</h4>
<p>Running through this workshop should cost approximately <strong>$3–$8</strong> on GCP, assuming short GPU runs and limited hyperparameter tuning trials. Using <code>n2-standard-4</code> or <code>e2-standard-4</code> instances with a single T4 GPU generally stays within this range. New accounts may be eligible for <strong>$300 in free GCP credits</strong>, which typically cover the full cost of this workshop. Track your usage in the <a href="https://console.cloud.google.com/billing">GCP Billing Console</a> and delete unused resources when you finish.</p>
</section>
<section id="prerequisites" class="level4">
<h4 class="anchored" data-anchor-id="prerequisites">Prerequisites</h4>
<ul>
<li><a href="https://uw-madison-datascience.github.io/ML-X-Nexus/Learn/Workshops/Intro-ML_Sklearn.html"><strong>Workshop</strong>: Intro to Machine Learning</a></li>
<li><a href="https://uw-madison-datascience.github.io/ML-X-Nexus/Learn/Workshops/Intro-Python_Gapminder.html"><strong>Workshop</strong>: Basic Python Programming</a></li>
</ul>
</section>
<section id="estimated-time-to-complete" class="level4">
<h4 class="anchored" data-anchor-id="estimated-time-to-complete">Estimated time to complete</h4>
<p><strong>4–6 hours</strong>: Based on running through training, tuning, and the Gemini RAG pipeline example.</p>
</section>
<section id="related-resources" class="level2">
<h2 class="anchored" data-anchor-id="related-resources">Related resources</h2>
<ul>
<li><a href="https://uw-madison-datascience.github.io/ML-X-Nexus/Toolbox/Compute/UW-Cloud-Services.html"><strong>Compute</strong>: UW-Madison Cloud Services (AWS, GCP, Azure)</a> – Institutional discounts, lower grant overhead, data protections, research credits, and how to request a UW cloud account.</li>
<li><a href="https://kb.wisc.edu/101516">Public Cloud Team Office Hours</a> – Drop-in hours on Thursdays, 2–3:15 PM via Zoom. Get answers to cloud-related questions from the RCI and Public Cloud Team.</li>
<li><a href="https://cloud.google.com/free">GCP Free Tier</a>: Overview of free-tier limits and credits for new users.</li>
<li><a href="https://edu.google.com/intl/ALL_us/programs/credits/research/">Google Cloud Research Credits</a>: Apply for up to $5,000 in GCP credits for research.</li>
<li><a href="https://uw-madison-datascience.github.io/ML-X-Nexus/Toolbox/Compute/BadgerCompute.html"><strong>Compute</strong>: BadgerCompute</a> – UW–Madison’s lightweight, NetID-authenticated Jupyter service for short interactive sessions and classroom use. Sessions are capped at 4 hours of runtime (often more generous than free-tier Colab).</li>
<li><a href="https://uw-madison-datascience.github.io/ML-X-Nexus/Toolbox/Compute/GoogleColab.html"><strong>Compute</strong>: Google Colab</a> – Learn how to use Google Colab for machine learning workflows.</li>
<li><a href="https://uw-madison-datascience.github.io/ML-X-Nexus/Toolbox/Compute/CHTC.html"><strong>Compute</strong>: Center for High Throughput Computing (CHTC)</a> – Learn how to use CHTC for machine learning jobs.</li>
<li><a href="https://uw-madison-datascience.github.io/ML-X-Nexus/Learn/Workshops/Intro-Amazon_SageMaker.html"><strong>Compute</strong>: AWS SageMaker</a> – Parallel workshop covering similar cloud ML concepts using AWS infrastructure.</li>
</ul>
</section>
<section id="comments" class="level2">
<h2 class="anchored" data-anchor-id="comments">Comments</h2>


</section>

 ]]></description>
  <category>Workshops</category>
  <category>Code-along</category>
  <category>Carpentries</category>
  <category>Compute</category>
  <category>Cloud</category>
  <category>Google</category>
  <category>GCP</category>
  <category>GPU</category>
  <category>LLM</category>
  <category>RAG</category>
  <category>Retrieval</category>
  <guid>https://uw-madison-datascience.github.io/ML-X-Nexus/Learn/Workshops/Intro-GCP.html</guid>
  <pubDate>Thu, 05 Mar 2026 00:00:00 GMT</pubDate>
  <media:content url="https://uw-madison-datascience.github.io/ML-X-Nexus/images/Google-Cloud-Logo.png" medium="image" type="image/png" height="81" width="144"/>
</item>
<item>
  <title>UW-Madison Cloud Services (AWS, GCP, Azure)</title>
  <dc:creator>Chris Endemann</dc:creator>
  <link>https://uw-madison-datascience.github.io/ML-X-Nexus/Toolbox/Compute/UW-Cloud-Services.html</link>
  <description><![CDATA[ 




<p>UW-Madison offers enterprise cloud computing through contracts with <strong>Amazon Web Services (AWS)</strong>, <strong>Google Cloud Platform (GCP)</strong>, and <strong>Microsoft Azure</strong>. These services are managed by the <a href="https://kb.wisc.edu/page.php?id=109785">UW Public Cloud Team</a>, a cross-disciplinary group of operations, cybersecurity, and research cyberinfrastructure (RCI) professionals.</p>
<p>Using a UW-provisioned cloud account — rather than a personal one — gives you access to institutional pricing discounts, lower overhead on grants, data protection agreements, security monitoring, and dedicated support. If you’re doing any research or university work in the cloud, start here.</p>
<section id="why-run-mlai-in-the-cloud" class="level2">
<h2 class="anchored" data-anchor-id="why-run-mlai-in-the-cloud">Why run ML/AI in the cloud?</h2>
<p>You have ML/AI code that works on your laptop. But at some point you need more — a bigger GPU (or several), a dataset that won’t fit on disk, or the ability to run dozens of training experiments overnight. You could invest in local hardware or compete for time on a shared HPC cluster, but cloud platforms let you rent exactly the hardware you need, for exactly as long as you need it, and then shut it down.</p>
<section id="cloud-vs.-university-hpc-clusters" class="level3">
<h3 class="anchored" data-anchor-id="cloud-vs.-university-hpc-clusters">Cloud vs.&nbsp;university HPC clusters</h3>
<p>Most universities offer shared HPC clusters with GPUs. These are excellent resources — but they have tradeoffs worth understanding:</p>
<table class="caption-top table">
<colgroup>
<col style="width: 22%">
<col style="width: 41%">
<col style="width: 36%">
</colgroup>
<thead>
<tr class="header">
<th>Factor</th>
<th>University HPC</th>
<th>Cloud (AWS, GCP, Azure)</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>Cost</strong></td>
<td>Free or subsidized</td>
<td>Pay per hour</td>
</tr>
<tr class="even">
<td><strong>GPU availability</strong></td>
<td>Shared queue; wait times during peak periods and per-job runtime limits (often 24–72 hrs) that may require checkpointing long training runs</td>
<td>On-demand (subject to quota); jobs run as long as needed</td>
</tr>
<tr class="odd">
<td><strong>Hardware variety</strong></td>
<td>Fixed hardware refresh cycle (3–5 years)</td>
<td>Latest GPUs available immediately (A100, H100, B200)</td>
</tr>
<tr class="even">
<td><strong>Scaling</strong></td>
<td>Limited by cluster size</td>
<td>Spin up hundreds of jobs in parallel</td>
</tr>
<tr class="odd">
<td><strong>Multi-GPU / NVLink</strong></td>
<td>Sometimes available, depends on cluster</td>
<td>Available on demand — essential for training, fine-tuning, or serving large LLMs that don’t fit in a single GPU’s memory</td>
</tr>
<tr class="even">
<td><strong>Job orchestration</strong></td>
<td>Writing scheduler scripts, packaging environments, and wiring up parallel job arrays can take significant refactoring</td>
<td>Managed ML platforms (Vertex AI, SageMaker, Azure ML) handle provisioning, parallelism, and teardown</td>
</tr>
<tr class="odd">
<td><strong>Software environment</strong></td>
<td>Module system; some clusters support containers — research computing staff can often help with setup</td>
<td>Prebuilt containers for common ML frameworks (PyTorch, TensorFlow, XGBoost); bring your own Docker image for full control</td>
</tr>
</tbody>
</table>
<p><strong>The short version:</strong> use your university cluster when it has the hardware you need and the queue isn’t blocking you. Use the cloud when you need hardware your cluster doesn’t have, need to scale beyond what the queue allows, or need a specific software environment you can’t easily get on campus. Many researchers use both — develop and test on HPC, then scale to cloud for large experiments or specialized hardware.</p>
</section>
<section id="when-does-model-size-justify-cloud-compute" class="level3">
<h3 class="anchored" data-anchor-id="when-does-model-size-justify-cloud-compute">When does model size justify cloud compute?</h3>
<p>Not every model needs cloud hardware. Here’s a rough guide:</p>
<table class="caption-top table">
<colgroup>
<col style="width: 24%">
<col style="width: 20%">
<col style="width: 29%">
<col style="width: 25%">
</colgroup>
<thead>
<tr class="header">
<th>Model scale</th>
<th>Parameters</th>
<th>Example models</th>
<th>Where to run</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Small</td>
<td>&lt; 10M</td>
<td>Logistic regression, small CNNs, XGBoost</td>
<td>Laptop — HPC or cloud adds overhead without much benefit</td>
</tr>
<tr class="even">
<td>Medium</td>
<td>10M–500M</td>
<td>ResNets, BERT-base, mid-sized transformers</td>
<td>HPC with a single GPU (RTX 2080 Ti, L40) or cloud (T4, L4)</td>
</tr>
<tr class="odd">
<td>Large</td>
<td>500M–10B</td>
<td>GPT-2, LLaMA-7B, fine-tuning large transformers</td>
<td>HPC with A100 (40/80 GB) or cloud — both work well</td>
</tr>
<tr class="even">
<td>Very large</td>
<td>10B–70B</td>
<td>LLaMA-70B, Mixtral</td>
<td>HPC with H100/H200 (80–141 GB) or cloud</td>
</tr>
<tr class="odd">
<td>Frontier</td>
<td>70B+</td>
<td>GPT-4-scale, multi-expert models</td>
<td>Cloud — requires multi-node NVLink clusters beyond what most HPC queues offer</td>
</tr>
</tbody>
</table>
<p><strong>CHTC’s <a href="https://chtc.cs.wisc.edu/uw-research-computing/gpu-lab">GPU Lab</a> covers more than you might think.</strong> The GPU Lab includes A100s (40 and 80 GB), H100s (80 GB), and H200s (141 GB) — enough VRAM to run inference on models up to ~70B parameters with quantization, or to fine-tune smaller models on a single high-memory GPU. For many UW researchers, this hardware handles “large model” workloads without needing cloud. Note that CHTC GPUs are not NVLink-connected, so multi-GPU parallelism is limited to methods that don’t require fast inter-GPU communication. Jobs have time limits (12 hrs for short, 24 hrs for medium, 7 days for long jobs), so plan your runs accordingly.</p>
<p>Cloud becomes the clear choice when you need NVLink multi-GPU or multi-node setups for frontier-scale training or inference, long-running services like RAG applications or model endpoints that need to stay up beyond HPC job time limits, or when queue wait times are blocking a deadline.</p>
</section>
<section id="llm-apis-skip-the-infrastructure-entirely" class="level3">
<h3 class="anchored" data-anchor-id="llm-apis-skip-the-infrastructure-entirely">LLM APIs: skip the infrastructure entirely</h3>
<p>For many GenAI tasks, you don’t need to provision GPUs at all. Services like the OpenAI API, Google’s Vertex AI, and Amazon Bedrock let you call frontier models (GPT-4o, Gemini, Claude, etc.) with a simple API request — no GPU provisioning, no model hosting. LLM API calls cost fractions of a cent each and are often the fastest, most cost-effective path. See <a href="https://uw-madison-datascience.github.io/ML-X-Nexus/Toolbox/GenAI/GenAI-at-UW-Madison.html">GenAI at UW-Madison</a> for available services.</p>
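<p>To make “fractions of a cent” concrete, per-call cost is just <em>tokens × price per token</em>. A quick sketch (the per-million-token prices below are hypothetical placeholders; check your provider’s current pricing page):</p>

```python
def call_cost_usd(input_tokens: int, output_tokens: int,
                  usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Cost of one LLM API call, given per-million-token prices."""
    return (input_tokens * usd_per_m_input
            + output_tokens * usd_per_m_output) / 1_000_000

# e.g. a 500-token prompt with a 300-token reply at
# $0.15 / $0.60 per million tokens (illustrative rates)
cost = call_cost_usd(500, 300, 0.15, 0.60)
print(f"${cost:.6f} per call")
```

<p>At rates like these, even thousands of calls cost only a few dollars, which is why the API route is often cheaper than provisioning a GPU for inference.</p>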
</section>
<section id="a-note-on-cloud-costs" class="level3">
<h3 class="anchored" data-anchor-id="a-note-on-cloud-costs">A note on cloud costs</h3>
<p>Cloud computing is not free, but it’s worth putting costs in context:</p>
<ul>
<li><strong>Hardware is expensive and ages fast.</strong> A single A100 GPU costs ~$15,000 and is outdated within a few years. Cloud lets you rent the latest hardware by the hour.</li>
<li><strong>You pay only for what you use.</strong> Stop a VM and the meter stops — valuable for bursty research workloads. A single T4 GPU instance runs ~$1–3/hr. Fine-tuning a small model on a moderate dataset might cost $10–50.</li>
<li><strong>Managed services save development time.</strong> You don’t have to write scheduling logic, package custom containers, or maintain orchestration infrastructure — managed ML platforms handle that plumbing so you can focus on the ML.</li>
<li><strong>Budgets and alerts keep you safe.</strong> All three platforms offer billing dashboards and budget alerts to prevent surprise bills.</li>
</ul>
<p>The key habit: choose the right machine size, stop resources when idle, and monitor spending.</p>
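<p>The rent-vs-buy trade-off in the first bullet reduces to a break-even calculation. Using the rough figures quoted above (~$15,000 purchase price, ~$3/hr rental; both are approximations, and the calculation ignores power, hosting, and resale value):</p>

```python
def break_even_hours(purchase_price_usd: float,
                     rental_rate_per_hr: float) -> float:
    """GPU-hours at which renting costs as much as buying outright."""
    return purchase_price_usd / rental_rate_per_hr

hours = break_even_hours(15_000, 3.0)
print(f"Break-even after {hours:,.0f} GPU-hours "
      f"(~{hours / 24 / 365:.1f} years of continuous use)")
```

<p>For bursty research workloads that run well below continuous utilization, renting usually wins; for a lab running GPUs around the clock for years, owned or shared hardware can come out ahead.</p>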
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Note
</div>
</div>
<div class="callout-body-container callout-body">
<p>Cloud isn’t the right fit for every workload. If you want to avoid cloud costs, UW’s <a href="https://uw-madison-datascience.github.io/ML-X-Nexus/Toolbox/Compute/CHTC.html">CHTC</a> offers free GPU access for batch jobs (though jobs are queued and have runtime limits). Many researchers use a mix of both.</p>
</div>
</div>
<div class="callout callout-style-default callout-tip callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Tip
</div>
</div>
<div class="callout-body-container callout-body">
<p>There is a learning curve, as with any new tool. But UW-developed workshop materials are available to help you get started — see the Related resources at the bottom of this page.</p>
</div>
</div>
</section>
</section>
<section id="why-use-a-uw-provisioned-account" class="level2">
<h2 class="anchored" data-anchor-id="why-use-a-uw-provisioned-account">Why use a UW-provisioned account?</h2>
<p>A self-provisioned cloud account (e.g., one you create directly with Google or AWS) is a personal agreement between you and the vendor — it is <strong>not</strong> covered by UW-Madison’s institutional contracts. By going through the UW Public Cloud Team, you get:</p>
<ul>
<li><strong>Negotiated pricing</strong>: UW contracts leverage <a href="https://internet2.edu/cloud/cloud-solutions-community/net-plus/">Internet2 NET+</a> agreements and institutional reseller rates. For example, GCP accounts include a <a href="https://kb.wisc.edu/100173">network egress waiver</a> (up to 15% of your total bill), and Azure accounts receive ~3.5% off retail pricing.</li>
<li><strong>Lower overhead on grants</strong>: Normally, UW adds 55.5% in overhead (F&amp;A) to cloud expenses on grants. With a UW cloud account, that drops to 26% — so for every $10,000 you spend on cloud computing, you save about $2,950 in overhead. See the <a href="https://rsp.wisc.edu/proposalprep/cloudComputeInfo.cfm">Cloud Computing Pilot</a> for details.</li>
<li><strong>NIH STRIDES discounts</strong>: NIH-funded researchers get additional cloud pricing discounts (on top of the UW contract rates) through the <a href="https://kb.wisc.edu/109813">STRIDES Initiative</a>. The UW cloud team can transition you into or out of STRIDES at any time — no data migration needed.</li>
<li><strong>Business Associates Agreement (BAA)</strong>: UW’s contracts include a BAA that governs vendor access to your data, which is critical for HIPAA-regulated health data.</li>
<li><strong>Security monitoring</strong>: UW accounts benefit from Security Command Center monitoring with alerts escalated to the UW Cybersecurity Operations Team (CSOC).</li>
<li><strong>Baseline security configuration</strong>: Accounts come pre-configured to meet <a href="https://www.cisecurity.org/cis-benchmarks">CIS benchmark</a> standards with NetID authentication built in.</li>
<li><strong>Dedicated support</strong>: Get help from the DoIT Cloud Team via email (<a href="mailto:cloud-services@cio.wisc.edu">cloud-services@cio.wisc.edu</a>), <a href="https://kb.wisc.edu/101516">office hours</a>, and in-person/video consultations.</li>
</ul>
<p>For the full breakdown, see <a href="https://kb.wisc.edu/page.php?id=109785">Why Should I Use a UW Madison Public Cloud Account?</a> on the UW KnowledgeBase.</p>
</section>
<section id="paying-for-cloud-compute-with-grant-money" class="level2">
<h2 class="anchored" data-anchor-id="paying-for-cloud-compute-with-grant-money">Paying for cloud compute with grant money</h2>
<p>If you’re using grant funding to pay for cloud compute — from NIH, NSF, DOE, or any other sponsor — a UW-provisioned account can significantly reduce what your grant actually pays.</p>
<section id="lower-overhead-cloud-computing-pilot" class="level3">
<h3 class="anchored" data-anchor-id="lower-overhead-cloud-computing-pilot">Lower overhead (Cloud Computing Pilot)</h3>
<p>UW-Madison normally adds 55.5% in overhead (formally called “F&amp;A” or “facilities &amp; administrative costs”) to cloud expenses on grants. The <a href="https://rsp.wisc.edu/proposalprep/cloudComputeInfo.cfm">Cloud Computing Pilot</a> cuts that to 26% when you use a UW-provisioned cloud account. In practice, that means for every $10,000 in cloud spending, you’ll pay ~$2,600 in overhead instead of ~$5,550 — a savings of about $2,950.</p>
<ul>
<li>Applies to new proposals and awards (including new funding increments).</li>
<li>You must use a UW cloud account — costs paid via purchasing card or personal accounts are charged the full 55.5%.</li>
<li>RSP provides <a href="https://rsp.wisc.edu/proposalprep/cloudComputeInfo.cfm">budget templates</a> to help you plan proposals with the reduced rate.</li>
<li>Contact <a href="https://rsp.wisc.edu/">RSP</a> with questions about grant compliance.</li>
</ul>
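<p>The overhead arithmetic above is easy to check for any budget. A small sketch using the two F&amp;A rates quoted by RSP:</p>

```python
def overhead_savings(cloud_spend_usd: float,
                     full_rate: float = 0.555,
                     pilot_rate: float = 0.26):
    """F&A charged at the standard rate vs. the Cloud Computing Pilot rate."""
    full = cloud_spend_usd * full_rate
    pilot = cloud_spend_usd * pilot_rate
    return full, pilot, full - pilot

full, pilot, saved = overhead_savings(10_000)
print(f"Standard F&A: ${full:,.0f}, pilot F&A: ${pilot:,.0f}, "
      f"saved: ${saved:,.0f}")
```

<p>For $10,000 in cloud spend this reproduces the ~$5,550 vs.&nbsp;~$2,600 figures above, a savings of about $2,950.</p>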
</section>
<section id="nih-strides-initiative" class="level3">
<h3 class="anchored" data-anchor-id="nih-strides-initiative">NIH STRIDES Initiative</h3>
<p>If you have NIH funding specifically, you can get additional cloud discounts on top of the standard UW rates through the <a href="https://kb.wisc.edu/109813">STRIDES Initiative</a>. STRIDES covers AWS, GCP, and Azure:</p>
<ul>
<li>Discounted pricing on cloud services, layered on top of UW’s institutional rates.</li>
<li>Professional service consultations and technical support from STRIDES partners.</li>
<li>No data or configuration changes needed — the UW cloud team can transition you in or out at any time.</li>
</ul>
</section>
</section>
<section id="how-to-request-a-uw-cloud-account" class="level2">
<h2 class="anchored" data-anchor-id="how-to-request-a-uw-cloud-account">How to request a UW cloud account</h2>
<p>To get started with any of the three platforms:</p>
<ol type="1">
<li><strong>Get a DoIT Billing Customer ID</strong> — you’ll need this to tie your cloud usage to a funding source.</li>
<li><strong>Fill out the <a href="https://kb.wisc.edu/sbsedirbs/page.php?id=104090">UW-Madison Cloud Account Request Form</a></strong> — this covers AWS, GCP, and Azure. Indicate your intended data types and use case.</li>
<li><strong>For sensitive/restricted data</strong> — you must complete a <a href="https://kb.wisc.edu/115296">Cybersecurity risk assessment</a> before processing HIPAA, FERPA, or other regulated data in the cloud.</li>
</ol>
<p>Platform-specific details:</p>
<ul>
<li><a href="https://it.wisc.edu/services/amazon-web-services/">AWS service page</a> | <a href="https://kb.wisc.edu/data/page.php?id=65532">AWS pricing &amp; billing FAQ</a></li>
<li><a href="https://it.wisc.edu/services/google-cloud-platform/">GCP service page</a> | <a href="https://kb.wisc.edu/100173">GCP pricing</a> | <a href="https://kb.wisc.edu/data/100171">Requesting a GCP project</a></li>
<li><a href="https://it.wisc.edu/services/microsoft-azure/">Azure service page</a> | <a href="https://kb.wisc.edu/69212">Azure pricing</a></li>
</ul>
</section>
<section id="research-credits-training" class="level2">
<h2 class="anchored" data-anchor-id="research-credits-training">Research credits &amp; training</h2>
<section id="research-credits" class="level3">
<h3 class="anchored" data-anchor-id="research-credits">Research credits</h3>
<p>All three cloud providers offer credit programs for academic researchers:</p>
<table class="caption-top table">
<thead>
<tr class="header">
<th>Platform</th>
<th>Program</th>
<th>Amount</th>
<th>Eligibility</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>GCP</strong></td>
<td><a href="https://edu.google.com/intl/ALL_us/programs/credits/research/">Cloud Research Credits</a></td>
<td>Up to $5,000 (faculty/postdocs); $1,000 (PhD students)</td>
<td>Faculty, postdocs, non-profit researchers, PhD students</td>
</tr>
<tr class="even">
<td><strong>AWS</strong></td>
<td><a href="https://aws.amazon.com/government-education/research-and-technical-computing/cloud-credit-for-research/">Cloud Credit for Research</a></td>
<td>Varies by proposal</td>
<td>Researchers at accredited institutions; students may receive up to $5,000</td>
</tr>
<tr class="odd">
<td><strong>Azure</strong></td>
<td><a href="https://www.microsoft.com/en-us/azure-academic-research/">Azure for Research</a></td>
<td>Varies by proposal</td>
<td>Faculty, researchers, and graduate students at accredited institutions</td>
</tr>
<tr class="even">
<td><strong>Azure</strong></td>
<td><a href="https://microsoft.qualtrics.com/jfe/form/SV_3fl9dfFrkC3g0aG?aq_source=acom">Azure Quantum Credits</a></td>
<td>Up to $10,000</td>
<td>Project-by-project basis; evaluated on research, educational, or commercial value</td>
</tr>
</tbody>
</table>
<p>All three programs are rolling applications. You’ll need a research proposal describing your intended cloud usage and the specific services you plan to use.</p>
</section>
<section id="grants-for-social-impact-sustainability-research" class="level3">
<h3 class="anchored" data-anchor-id="grants-for-social-impact-sustainability-research">Grants for social impact &amp; sustainability research</h3>
<p>The major cloud providers also offer larger grants for research focused on public good — sustainability, environmental science, public health, education, and underserved communities:</p>
<ul>
<li><strong>Google</strong>: The <a href="https://opportunitydesk.org/2026/02/25/google-org-impact-challenge-ai-for-science-2026/">Google.org Impact Challenge: AI for Science</a> awards $500K–$3M for projects using AI to tackle scientific challenges, with a specific focus area on climate resilience and environmental science (biodiversity, agriculture, oceans). Applications open through April 17, 2026.</li>
<li><strong>AWS</strong>: The <a href="https://aws.amazon.com/government-education/nonprofits/aws-imagine-grant-program/">AWS Imagine Grant</a> provides up to $200K in unrestricted funding plus AWS credits to nonprofits and research organizations working on social impact. Past winners include sustainability, public health, and underserved community projects. The 2026–2027 cycle opens spring 2026.</li>
<li><strong>Microsoft</strong>: The <a href="https://www.microsoft.com/en-us/research/academic-program/ai-for-good-lab-open-call/">AI for Good Lab</a> runs open calls awarding Azure credits and scientific collaboration for projects in sustainability, public health, education, and human rights. Academic institutions are eligible. Microsoft also offers free access to petabytes of environmental data through the <a href="https://planetarycomputer.microsoft.com/">Planetary Computer</a>.</li>
</ul>
</section>
<section id="free-cloud-training" class="level3">
<h3 class="anchored" data-anchor-id="free-cloud-training">Free cloud training</h3>
<p>Each platform offers free, self-paced training to help you get started:</p>
<ul>
<li><strong>GCP</strong>: UW-Madison has a limited number of seats for <a href="https://www.cloudskillsboost.google/">Google Cloud Skills Boost</a> — reach out to the Public Cloud Team at <a href="mailto:cloud-services@cio.wisc.edu">cloud-services@cio.wisc.edu</a> to request access.</li>
<li><strong>AWS</strong>: <a href="https://skillbuilder.aws/">AWS Skill Builder</a> offers 600+ free courses covering compute, ML, and more.</li>
<li><strong>Azure</strong>: <a href="https://learn.microsoft.com/en-us/training/azure/">Microsoft Learn</a> provides free, structured learning paths for Azure services.</li>
</ul>
</section>
</section>
<section id="data-protection-compliance" class="level2">
<h2 class="anchored" data-anchor-id="data-protection-compliance">Data protection &amp; compliance</h2>
<p>UW-Madison classifies institutional data into four risk categories: <strong>Restricted, Sensitive, Internal, and Public</strong>. Cloud eligibility depends on data classification:</p>
<table class="caption-top table">
<thead>
<tr class="header">
<th>Data type</th>
<th>Cloud eligible?</th>
<th>Requirements</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Public / Internal</td>
<td>Yes</td>
<td>Standard UW cloud account</td>
</tr>
<tr class="even">
<td>Sensitive</td>
<td>Yes, with assessment</td>
<td><a href="https://kb.wisc.edu/115296">Cybersecurity risk assessment</a> required</td>
</tr>
<tr class="odd">
<td>Restricted (HIPAA, etc.)</td>
<td>Yes, with assessment</td>
<td>Risk assessment + risk executive approval + HIPAA-eligible services</td>
</tr>
</tbody>
</table>
<p>Key compliance resources:</p>
<ul>
<li><a href="https://kb.wisc.edu/itpolicy/page.php?id=59205">Data classification policy</a></li>
<li><a href="https://kb.wisc.edu/100124">Data elements allowed in public cloud</a></li>
<li><a href="https://kb.wisc.edu/115296">GCP for sensitive and restricted data</a></li>
<li><a href="https://kb.wisc.edu/data/page.php?id=115300">Shared responsibility model for cloud platforms</a></li>
<li><a href="https://it.wisc.edu/about/division-of-information-technology/enterprise-information-security-services/office-of-cybersecurity/hipaa-security-program/">HIPAA Security Program</a></li>
<li>SMPH researchers using Azure: contact <a href="mailto:platformx-support@mailplus.wisc.edu">platformx-support@mailplus.wisc.edu</a> about <a href="https://it.wisc.edu/services/microsoft-azure/">Platform X</a> for HIPAA workloads.</li>
</ul>
</section>
<section id="getting-help" class="level2">
<h2 class="anchored" data-anchor-id="getting-help">Getting help</h2>
<ul>
<li><strong>Office hours</strong>: The RCI and Public Cloud Team hold drop-in hours on <strong>Thursdays, 2–3:15 PM</strong> via <a href="https://kb.wisc.edu/101516">Zoom</a>. Open to the entire UW community.</li>
<li><strong>Cloud Community</strong>: Join the <a href="https://it.wisc.edu/research-ci/building-cloud-community-at-uw-madison/">UW Cloud Community</a> group — they meet every other month to share cloud computing experiences and tips.</li>
<li><strong>Email</strong>: <a href="mailto:cloud-services@cio.wisc.edu">cloud-services@cio.wisc.edu</a></li>
<li><strong>Public Cloud KnowledgeBase</strong>: <a href="https://kb.wisc.edu/page.php?id=109785">kb.wisc.edu</a> — FAQs, pricing info, and how-to guides.</li>
<li><strong>ML+X Community</strong>: Join <a href="https://hub.datascience.wisc.edu/communities/mlx/">ML+X</a> for monthly meetings on machine learning and AI.</li>
</ul>
</section>
<section id="related-resources" class="level2">
<h2 class="anchored" data-anchor-id="related-resources">Related resources</h2>
<ul>
<li><a href="https://uw-madison-datascience.github.io/ML-X-Nexus/Learn/Workshops/Intro-GCP.html"><strong>Workshop/Compute</strong>: Intro to GCP for Machine Learning &amp; AI</a> – Hands-on workshop covering Vertex AI, model training/tuning, and RAG pipelines on GCP.</li>
<li><a href="https://uw-madison-datascience.github.io/ML-X-Nexus/Learn/Workshops/Intro-Amazon_SageMaker.html"><strong>Workshop/Compute</strong>: Intro to AWS SageMaker for Predictive ML/AI</a> – Workshop covering ML workflows in AWS SageMaker.</li>
<li><a href="https://uw-madison-datascience.github.io/ML-X-Nexus/Toolbox/Compute/GoogleColab.html"><strong>Compute</strong>: Google Colab</a> – Free cloud-based Jupyter notebooks with GPU access.</li>
<li><a href="https://uw-madison-datascience.github.io/ML-X-Nexus/Toolbox/Compute/CHTC.html"><strong>Compute</strong>: Center for High Throughput Computing (CHTC)</a> – Free on-campus HPC/HTC resources for UW researchers.</li>
<li><a href="https://uw-madison-datascience.github.io/ML-X-Nexus/Toolbox/Compute/BadgerCompute.html"><strong>Compute</strong>: BadgerCompute</a> – UW-Madison’s lightweight, NetID-authenticated Jupyter service.</li>
<li><a href="https://uw-madison-datascience.github.io/ML-X-Nexus/Toolbox/GenAI/GenAI-at-UW-Madison.html"><strong>GenAI</strong>: UW Generative AI Services &amp; Policies</a> – Overview of UW-vetted AI tools including pay-as-you-go cloud AI services.</li>
</ul>
</section>
<section id="comments" class="level2">
<h2 class="anchored" data-anchor-id="comments">Comments</h2>


</section>

 ]]></description>
  <category>Compute</category>
  <category>Cloud</category>
  <category>UW-Madison</category>
  <category>AWS</category>
  <category>GCP</category>
  <category>Azure</category>
  <category>GPU</category>
  <guid>https://uw-madison-datascience.github.io/ML-X-Nexus/Toolbox/Compute/UW-Cloud-Services.html</guid>
  <pubDate>Mon, 02 Mar 2026 00:00:00 GMT</pubDate>
  <media:content url="https://uw-madison-datascience.github.io/ML-X-Nexus/images/uw-logo-vertical-color-web-digital.png" medium="image" type="image/png" height="95" width="144"/>
</item>
<item>
  <title>SWE-bench: Evaluating AI on Real-World Software Engineering</title>
  <dc:creator>Chris Endemann</dc:creator>
  <link>https://uw-madison-datascience.github.io/ML-X-Nexus/Toolbox/Benchmarks/SWE-Bench.html</link>
  <description><![CDATA[ 




<p><a href="https://www.swebench.com/">SWE-bench</a> is a benchmark designed to evaluate whether AI models can solve real-world software engineering tasks. Rather than testing code generation in isolation, SWE-bench presents models with actual GitHub issues from popular open-source Python repositories and asks them to produce a patch that resolves the issue and passes the associated test suite.</p>
<p>The benchmark was introduced in the paper <a href="https://arxiv.org/abs/2310.06770"><em>SWE-bench: Can Language Models Resolve Real-World GitHub Issues?</em></a> by Carlos E. Jimenez et al.&nbsp;at Princeton University (released on arXiv in 2023 and published at ICLR 2024).</p>
<section id="how-it-works" class="level2">
<h2 class="anchored" data-anchor-id="how-it-works">How it works</h2>
<p>Each SWE-bench task consists of:</p>
<ul>
<li><strong>A GitHub issue description</strong> — the natural-language problem statement as written by the original issue author.</li>
<li><strong>A codebase snapshot</strong> — the state of the repository at the time the issue was filed.</li>
<li><strong>A gold patch and test suite</strong> — the model’s output is evaluated by checking whether it passes the same tests used to validate the human-authored fix.</li>
</ul>
<p>Models are scored on <strong>% resolved</strong> — the fraction of issues where the generated patch passes the full test suite. This makes SWE-bench more rigorous than benchmarks that only check if code compiles or passes a single test case.</p>
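<p>The scoring rule is simple to state in code. A minimal sketch of the %-resolved metric (the result structure here is illustrative, not the official evaluation harness):</p>

```python
def percent_resolved(results: list[dict]) -> float:
    """Fraction of tasks whose generated patch passed the full test suite.

    Each result dict carries a boolean 'passed' flag set by running the
    task's test suite against the model's patch.
    """
    if not results:
        return 0.0
    return 100 * sum(r["passed"] for r in results) / len(results)

runs = [{"task": "django-101", "passed": True},
        {"task": "sympy-7", "passed": False},
        {"task": "sklearn-42", "passed": True}]
print(f"{percent_resolved(runs):.1f}% resolved")
```

<p>Because the pass/fail signal comes from the repository’s own test suite, a patch that merely compiles or handles one happy path still scores zero.</p>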
</section>
<section id="swe-bench-verified" class="level2">
<h2 class="anchored" data-anchor-id="swe-bench-verified">SWE-bench Verified</h2>
<p>The original SWE-bench dataset contains 2,294 tasks, but not all of them are well-specified or reliably solvable. To address this, <a href="https://openai.com/index/introducing-swe-bench-verified/">OpenAI collaborated with the SWE-bench team</a> to create <strong>SWE-bench Verified</strong> — a human-filtered subset of 500 tasks where annotators confirmed that:</p>
<ul>
<li>The issue description contains enough information to identify the problem.</li>
<li>The test suite reliably validates correct solutions.</li>
<li>The task is not ambiguous or under-specified.</li>
</ul>
<p>SWE-bench Verified is now the standard subset used for most leaderboard comparisons.</p>
</section>
<section id="current-state-of-the-leaderboard-early-2025" class="level2">
<h2 class="anchored" data-anchor-id="current-state-of-the-leaderboard-early-2025">Current state of the leaderboard (early 2026)</h2>
<p>On the <a href="https://www.swebench.com/#test">Bash Only leaderboard</a> — which evaluates all models on SWE-bench Verified using the same shell-based interface — the top models are resolving around 74% of issues:</p>
<table class="caption-top table">
<thead>
<tr class="header">
<th>Model</th>
<th>% Resolved (Verified)</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Claude 4.5 Opus (medium)</td>
<td>74.40%</td>
</tr>
<tr class="even">
<td>Gemini 3 Pro Preview</td>
<td>74.20%</td>
</tr>
<tr class="odd">
<td>Claude 4.5 Sonnet</td>
<td>70.60%</td>
</tr>
<tr class="even">
<td>Claude 4 Opus (May 2025)</td>
<td>67.60%</td>
</tr>
<tr class="odd">
<td>GPT-5 (medium reasoning)</td>
<td>65.00%</td>
</tr>
</tbody>
</table>
<p>These numbers have been climbing quickly — for context, the best scores were around 50% in late 2024.</p>
</section>
<section id="interpreting-the-results" class="level2">
<h2 class="anchored" data-anchor-id="interpreting-the-results">Interpreting the results</h2>
<p>It’s tempting to read “74% resolved” as meaning AI can fix 74% of real-world software bugs, but several important caveats apply:</p>
<ul>
<li><strong>Curated subset</strong>: SWE-bench Verified deliberately filters out ambiguous, under-documented, or hard-to-test issues. Real-world GitHub issues are messier.</li>
<li><strong>Issue specification quality</strong>: In practice, much of the difficulty in software engineering lies in understanding vague requirements, reproducing bugs, and navigating large unfamiliar codebases. SWE-bench tasks are relatively well-scoped.</li>
<li><strong>Single-repo Python focus</strong>: The benchmark currently draws from a set of well-maintained Python libraries (e.g., Django, scikit-learn, sympy). Generalization to other languages, less-documented codebases, or proprietary software is an open question.</li>
<li><strong>No deployment or integration testing</strong>: SWE-bench tests whether a patch passes unit/integration tests, not whether it would be accepted in a real code review or function correctly at scale.</li>
</ul>
<section id="the-self-driving-car-analogy" class="level3">
<h3 class="anchored" data-anchor-id="the-self-driving-car-analogy">The self-driving car analogy</h3>
<p>The trajectory of SWE-bench scores is reminiscent of autonomous driving predictions circa 2015–2017, when rapid progress on structured benchmarks led many companies to predict full autonomy was just a year or two away. A decade later, the long tail of edge cases turned out to be the hardest part.</p>
<p>Similarly, while the pace of improvement on SWE-bench is genuinely impressive, the remaining 25–30% of unresolved issues — and the much larger space of tasks not captured by the benchmark — may prove disproportionately difficult. Benchmarks measure a specific, well-defined slice of capability, and the gap between benchmark performance and reliable, general-purpose software engineering likely remains significant.</p>
</section>
</section>
<section id="why-it-matters" class="level2">
<h2 class="anchored" data-anchor-id="why-it-matters">Why it matters</h2>
<p>Despite these caveats, SWE-bench provides a useful signal for tracking progress in AI-assisted software engineering. It tests end-to-end problem-solving (reading an issue, understanding a codebase, writing a correct fix) rather than narrow code completion, making it one of the more meaningful benchmarks for evaluating practical coding ability.</p>
<p>For researchers and practitioners in ML, SWE-bench offers:</p>
<ul>
<li><strong>A rough barometer</strong> for how quickly AI coding capabilities are improving.</li>
<li><strong>A reality check</strong> on what “AI can code” actually means today — useful for calibrating expectations when adopting AI tools.</li>
<li><strong>An evaluation framework</strong> that can be adapted for domain-specific benchmarks (e.g., testing AI on bioinformatics pipelines or data analysis workflows).</li>
</ul>
</section>
<section id="related-resources" class="level2">
<h2 class="anchored" data-anchor-id="related-resources">Related resources</h2>
<ul>
<li><a href="https://uw-madison-datascience.github.io/ML-X-Nexus/Toolbox/GenAI/GenAI-at-UW-Madison.html"><strong>GenAI</strong>: GenAI at UW-Madison</a>: Overview of generative AI tools and resources available at UW-Madison.</li>
</ul>
</section>
<section id="comments" class="level2">
<h2 class="anchored" data-anchor-id="comments">Comments</h2>


</section>

 ]]></description>
  <category>Benchmarking</category>
  <category>Agentic coding</category>
  <category>LLM</category>
  <category>GenAI</category>
  <guid>https://uw-madison-datascience.github.io/ML-X-Nexus/Toolbox/Benchmarks/SWE-Bench.html</guid>
  <pubDate>Fri, 27 Feb 2026 00:00:00 GMT</pubDate>
  <media:content url="https://uw-madison-datascience.github.io/ML-X-Nexus/images/SWE_bench.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>OpenScholar: Scientific Literature Synthesis with Retrieval-Augmented LMs</title>
  <dc:creator>Chris Endemann</dc:creator>
  <link>https://uw-madison-datascience.github.io/ML-X-Nexus/Toolbox/GenAI/OpenScholar.html</link>
  <description><![CDATA[ 




<p><a href="https://openscholar.allen.ai/">OpenScholar</a> is an open-source, retrieval-augmented language model (LM) designed to help researchers navigate and synthesize scientific literature. Developed by the Allen Institute for AI (AI2) and the University of Washington, OpenScholar answers scientific queries by searching a datastore of 45 million open-access papers, retrieving relevant passages, and generating citation-backed responses. The work was <a href="https://www.nature.com/articles/s41586-025-10072-4">published in <em>Nature</em></a> in February 2026.</p>
<p>Unlike general-purpose LLMs that frequently hallucinate citations (GPT-4o hallucinates citations 78–90% of the time), OpenScholar achieves citation accuracy on par with human experts. In human evaluations conducted by 16 PhD-level experts, OpenScholar’s responses were preferred over expert-written ones 51% of the time for the 8B variant and 70% of the time for the GPT-4o-augmented variant.</p>
<section id="key-features" class="level4">
<h4 class="anchored" data-anchor-id="key-features">Key features</h4>
<ul>
<li><strong>Retrieval-augmented generation over 45M papers</strong>: OpenScholar searches a datastore of 45 million open-access papers (~236 million passage embeddings) drawn from <a href="https://www.semanticscholar.org/">Semantic Scholar</a>, ensuring responses are grounded in real, retrievable literature rather than model memory.</li>
<li><strong>Iterative self-feedback inference</strong>: At inference time, OpenScholar uses a self-feedback loop to iteratively refine its outputs — each iteration retrieves additional papers, improving factuality, coverage, and citation accuracy through natural language feedback.</li>
<li><strong>Highly accurate citations</strong>: While GPT-4o hallucinates the vast majority of its cited papers, OpenScholar’s retrieval-first design ensures all citations correspond to real, retrievable sources.</li>
<li><strong>Fully open-source</strong>: All code, model checkpoints, retriever/reranker weights, retrieval index, training data, and evaluation benchmarks are publicly available — the first complete open release of a scientific assistant LM pipeline.</li>
</ul>
</section>
<section id="model-variants-and-sizes" class="level4">
<h4 class="anchored" data-anchor-id="model-variants-and-sizes">Model variants and sizes</h4>
<p>OpenScholar can be used with different underlying language models:</p>
<ul>
<li><strong>OpenScholar-8B (OS-8B)</strong>: A fine-tuned version of <a href="https://huggingface.co/meta-llama/Llama-3.1-8B">Llama 3.1 8B</a>, optimized for scientific literature synthesis. This is the flagship open-weight model. Available on <a href="https://huggingface.co/OpenSciLM/Llama-3.1_OpenScholar-8B">Hugging Face</a>. Despite its compact size, it outperforms GPT-4o by 6.1% in correctness on multi-paper synthesis tasks, and is <strong>100x more cost-efficient</strong> than comparable systems like PaperQA2.</li>
<li><strong>OpenScholar-GPT4o (OS-GPT4o)</strong>: The OpenScholar pipeline (datastore, retriever, reranker, and self-feedback loop) applied on top of GPT-4o. This variant improves GPT-4o’s correctness by 12% and raises citation F1 from 0.1 to 39.5, demonstrating how the pipeline enhances any off-the-shelf LLM.</li>
<li><strong>OpenScholar-70B (OS-70B)</strong>: The pipeline applied using Llama 3.1 70B as the underlying generator, offering a middle ground between the compact 8B model and proprietary API-based options.</li>
</ul>
</section>
<section id="how-the-8b-model-was-trained" class="level4">
<h4 class="anchored" data-anchor-id="how-the-8b-model-was-trained">How the 8B model was trained</h4>
<p>The OpenScholar-8B model was trained using the same self-feedback pipeline used at inference time, but repurposed for synthetic data generation:</p>
<ol type="1">
<li><strong>Curated abstracts</strong>: Starting from 1 million curated scientific paper abstracts from the datastore.</li>
<li><strong>Synthetic data generation</strong>: The self-feedback loop was used to generate 130,000 candidate training instances, in which the model iteratively refined its own outputs with retrieval feedback.</li>
<li><strong>Instruction tuning</strong>: After quality filtering, the final 13K instruction-tuning dataset (OS_Train_Data) was used to fine-tune Llama 3.1 8B using a modified version of <a href="https://github.com/pytorch/torchtune">torchtune</a> on 8x A100 GPUs.</li>
</ol>
<p>This approach allows a compact 8B model to achieve performance competitive with much larger proprietary models by distilling the quality of the iterative self-feedback pipeline into the model weights.</p>
</section>
<section id="evaluation-scholarqabench" class="level4">
<h4 class="anchored" data-anchor-id="evaluation-scholarqabench">Evaluation: ScholarQABench</h4>
<p>To rigorously evaluate scientific literature synthesis, the authors created <a href="https://github.com/AkariAsai/ScholarQABench">ScholarQABench</a>, the first large-scale multi-domain benchmark for this task:</p>
<ul>
<li><strong>2,967 expert-written queries</strong> and <strong>208 long-form answers</strong> across four domains: computer science, physics, neuroscience, and biomedicine.</li>
<li>Evaluation metrics include <strong>correctness</strong>, <strong>citation accuracy</strong> (are cited papers real and relevant?), <strong>coverage</strong> (does the response address all aspects of the query?), and <strong>writing quality</strong>.</li>
<li>Human evaluations were conducted by <strong>16 experts with PhDs</strong> across 108 questions, providing gold-standard comparisons between model and expert-written responses.</li>
</ul>
<p><strong>Key results on ScholarQABench:</strong></p>
<table class="caption-top table">
<colgroup>
<col style="width: 9%">
<col style="width: 30%">
<col style="width: 23%">
<col style="width: 36%">
</colgroup>
<thead>
<tr class="header">
<th>Model</th>
<th>Correctness vs.&nbsp;GPT-4o</th>
<th>Citation quality</th>
<th>Human preference vs.&nbsp;expert</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>GPT-4o (no retrieval)</td>
<td>baseline</td>
<td>Hallucinates 78–90% of citations</td>
<td>Preferred 32% of the time</td>
</tr>
<tr class="even">
<td>OpenScholar-8B</td>
<td>+6.1%</td>
<td>On par with human experts</td>
<td>Preferred 51% of the time</td>
</tr>
<tr class="odd">
<td>OpenScholar-GPT4o</td>
<td>+12%</td>
<td>Citation F1: 0.1 → 39.5</td>
<td>Preferred 70% of the time</td>
</tr>
</tbody>
</table>
</section>
<section id="genai-use-at-uw-madison" class="level4">
<h4 class="anchored" data-anchor-id="genai-use-at-uw-madison">GenAI use at UW-Madison</h4>
<p>UW–Madison faculty, staff, students, and affiliates are required to follow <a href="https://it.wisc.edu/generative-ai-uw-madison-use-policies/">campus policies relevant to AI use</a>. Uses of <a href="https://it.wisc.edu/statement-on-use-of-generative-ai/">generative AI</a> that are explicitly prohibited by policy include, but are not limited to, the following:</p>
<ul>
<li>Entering any sensitive, restricted or otherwise protected institutional data – including hard-coded passwords – into any generative AI tool or service;</li>
<li>Using AI-generated code for institutional IT systems or services without review by a human to verify the absence of malicious elements;</li>
<li>Using generative AI to violate laws; institutional policies, rules or guidelines; or agreements or contracts.</li>
</ul>
</section>
<section id="potential-use-cases" class="level4">
<h4 class="anchored" data-anchor-id="potential-use-cases">Potential use cases</h4>
<ul>
<li><strong>Literature reviews</strong>: Rapidly synthesize the state of research on a topic with properly cited sources, saving hours of manual search and reading. Particularly useful for getting up to speed in unfamiliar fields.</li>
<li><strong>Research question exploration</strong>: Ask nuanced scientific questions and receive grounded answers that point you to the most relevant papers, helping identify gaps and opportunities in the literature.</li>
<li><strong>Grant writing and proposals</strong>: Quickly gather and cite supporting evidence for research proposals, ensuring claims are backed by real, verifiable literature.</li>
<li><strong>Cross-disciplinary research</strong>: Explore connections between fields (e.g., neuroscience and computer science) by querying across OpenScholar’s multi-domain datastore of 45 million papers.</li>
<li><strong>Teaching and mentoring</strong>: Help students and early-career researchers learn to navigate scientific literature effectively, with a tool that models good citation practices.</li>
</ul>
</section>
<section id="links" class="level4">
<h4 class="anchored" data-anchor-id="links">Links</h4>
<ul>
<li><strong>Paper</strong>: <a href="https://arxiv.org/abs/2411.14199">arXiv:2411.14199</a> | <a href="https://www.nature.com/articles/s41586-025-10072-4">Nature</a></li>
<li><strong>Demo</strong>: <a href="https://openscholar.allen.ai/">openscholar.allen.ai</a></li>
<li><strong>Hugging Face model (8B)</strong>: <a href="https://huggingface.co/OpenSciLM/Llama-3.1_OpenScholar-8B">OpenSciLM/Llama-3.1_OpenScholar-8B</a></li>
<li><strong>Hugging Face retriever</strong>: <a href="https://huggingface.co/OpenSciLM/OpenScholar_Retriever">OpenSciLM/OpenScholar_Retriever</a></li>
<li><strong>GitHub</strong>: <a href="https://github.com/AkariAsai/OpenScholar">AkariAsai/OpenScholar</a></li>
<li><strong>Project page</strong>: <a href="https://openscilm.allen.ai/">openscilm.allen.ai</a></li>
</ul>
</section>
<section id="related-resources" class="level2">
<h2 class="anchored" data-anchor-id="related-resources">Related resources</h2>
<ul>
<li><a href="https://uw-madison-datascience.github.io/ML-X-Nexus/Toolbox/GenAI/NotebookLM.html"><strong>GenAI</strong>: NotebookLM</a>: Another GenAI summarization tool, useful for quickly digesting individual papers and generating audio summaries.</li>
<li><a href="https://uw-madison-datascience.github.io/ML-X-Nexus/Learn/Notebooks/2025-05-07_RAG-Romeo-Juliet.html"><strong>Notebook</strong>: RAG with Romeo and Juliet</a>: A hands-on tutorial on retrieval-augmented generation, the technique that underpins OpenScholar.</li>
<li><a href="https://uw-madison-datascience.github.io/ML-X-Nexus/Applications/Videos/Forums/mlx_2026-02-17.html"><strong>Forum</strong>: Deploying RAG in Bedrock vs.&nbsp;Local</a>: A case study comparing cloud-based and local RAG deployment strategies.</li>
<li><a href="https://uw-madison-datascience.github.io/ML-X-Nexus/Toolbox/Libraries/HuggingFace.html"><strong>Library</strong>: Hugging Face</a>: The platform hosting OpenScholar’s model weights and retriever — learn more about the Hub, pipelines, and model hosting.</li>
</ul>
</section>
<section id="comments" class="level2">
<h2 class="anchored" data-anchor-id="comments">Comments</h2>


</section>

 ]]></description>
  <category>GenAI</category>
  <category>NLP</category>
  <category>LLM</category>
  <category>RAG</category>
  <category>Retrieval</category>
  <category>Foundation models</category>
  <category>Hugging Face</category>
  <guid>https://uw-madison-datascience.github.io/ML-X-Nexus/Toolbox/GenAI/OpenScholar.html</guid>
  <pubDate>Fri, 27 Feb 2026 00:00:00 GMT</pubDate>
  <media:content url="https://uw-madison-datascience.github.io/ML-X-Nexus/images/OpenScholar.png" medium="image" type="image/png" height="62" width="144"/>
</item>
<item>
  <title>Deploying RAG in Bedrock vs. Local: WattBot 2025 Case Study</title>
  <dc:creator>Nils Matteson</dc:creator>
  <dc:creator>Blaise Enuh</dc:creator>
  <dc:creator>Chris Endemann</dc:creator>
  <link>https://uw-madison-datascience.github.io/ML-X-Nexus/Applications/Videos/Forums/mlx_2026-02-17.html</link>
  <description><![CDATA[ 




<p>Many researchers are exploring retrieval-augmented generation (RAG) to build document-grounded, trustworthy AI tools, but it is often unclear how design choices around models, infrastructure, and deployment play out in practice. In this session, we present lessons learned from replicating the winning RAG system from the WattBot 2025 challenge. The challenge focuses on producing citation-backed energy and sustainability estimates for AI workloads from a fixed corpus of 30+ academic papers — or explicitly abstaining when evidence is missing. After a short overview of the winning approach, <a href="https://nilsmatteson.com/">Nils Matteson</a> and <a href="https://blaiseenuh.com/">Blaise Enuh</a> walk through how the system is implemented in practice, including:</p>
<ol type="1">
<li>A cloud deployment using AWS Bedrock</li>
<li>Local, open-source deployments (e.g., Hugging Face models on GB10 and Dell PowerEdge R7725 hardware)</li>
</ol>
<p>The session compares performance, cost, latency, and operational trade-offs across environments. It also includes a Streamlit-based interface demo for those looking to host their own RAG apps.</p>
<p><em>This work was conducted as part of ongoing AI infrastructure evaluation within the <a href="https://it.wisc.edu/about/division-of-information-technology/research-cyberinfrastructure/">Research Cyberinfrastructure (RCI)</a> office in DoIT.</em></p>
<div class="quarto-video ratio ratio-16x9"><iframe data-external="1" src="https://www.youtube.com/embed/WYSzI3WZmKo" title="" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe></div>
<section id="links" class="level3">
<h3 class="anchored" data-anchor-id="links">Links</h3>
<ul>
<li><strong>GitHub</strong>: <a href="https://github.com/matteso1/KohakuRAG_UI/">WattBot in Bedrock and Local</a></li>
<li><strong>Kaggle challenge</strong>: <a href="https://www.kaggle.com/competitions/WattBot2025/overview">WattBot 2025</a></li>
<li><strong>Winning solution</strong>: <a href="https://github.com/KohakuBlueleaf/KohakuRAG">KohakuBlueleaf/KohakuRAG</a></li>
<li><strong>Annual hackathon</strong>: <a href="https://ml-marathon.wisc.edu/">Machine Learning Marathon</a>: Learn about the annual Machine Learning Marathon (3-month AI/ML hackathon) hosted by ML+X each fall. Reach out to <a href="mailto:endemann@wisc.edu">Chris</a> if you’d like to submit a project!</li>
</ul>
</section>
<section id="related-resources" class="level2">
<h2 class="anchored" data-anchor-id="related-resources">Related resources</h2>
<ul>
<li><a href="https://uw-madison-datascience.github.io/ML-X-Nexus/Projects/ML-Marathon/WattBot-2025.html"><strong>Project</strong>: WattBot 2025</a>: Full project page for the WattBot ML Marathon challenge, including challenge design, winning approach, and related resources.</li>
<li><a href="https://uw-madison-datascience.github.io/ML-X-Nexus/Applications/Videos/Forums/mlx_2025-09-09.html"><strong>Talk</strong>: AI’s Environmental Footprint: Insights and Actions</a>: The previous ML+X forum where WattBot was first introduced, covering AI sustainability measurement and the RAG-based challenge design.</li>
<li><a href="https://uw-madison-datascience.github.io/ML-X-Nexus/Learn/Notebooks/2025-05-07_RAG-Romeo-Juliet.html"><strong>Notebook</strong>: Exploring Fact-Based QA with RAG: Romeo and Juliet</a>: Learn how to build an end-to-end RAG pipeline from scratch using sentence-transformers and Hugging Face models.</li>
<li><a href="https://uw-madison-datascience.github.io/ML-X-Nexus/Learn/Workshops/Intro-Amazon_SageMaker.html"><strong>Workshop</strong>: Intro to AWS SageMaker for Predictive ML/AI</a>: Hands-on workshop covering AWS SageMaker and Bedrock for cloud-based ML/AI workflows, including RAG deployment.</li>
<li><a href="https://uw-madison-datascience.github.io/ML-X-Nexus/Toolbox/GenAI/GenAI-at-UW-Madison.html"><strong>Resource</strong>: UW Generative AI Services &amp; Policies</a>: Overview of vetted GenAI tools at UW-Madison, including AWS Bedrock access.</li>
<li><a href="https://uw-madison-datascience.github.io/ML-X-Nexus/Toolbox/GenAI/OpenScholar.html"><strong>GenAI</strong>: OpenScholar</a>: A fully open, retrieval-augmented language model for searching and synthesizing scientific literature — relevant as an alternative RAG architecture for citation-grounded answers.</li>
<li><a href="https://uw-madison-datascience.github.io/ML-X-Nexus/Toolbox/Libraries/HuggingFace.html"><strong>Library</strong>: Hugging Face</a>: The local deployment in this talk uses Hugging Face models — learn more about the platform and how to find and run open-source models.</li>
<li><a href="https://uw-madison-datascience.github.io/ML-X-Nexus/Learn/Notebooks/Quantization-and-Precision.html"><strong>Notebook</strong>: Understanding Quantization and Precision</a>: Learn how quantization and floating-point precision (FP32, FP16, INT8, 4-bit) affect GPU memory and inference speed — directly relevant to running models locally on constrained hardware.</li>
</ul>
</section>
<section id="comments" class="level2">
<h2 class="anchored" data-anchor-id="comments">Comments</h2>


</section>

 ]]></description>
  <category>Videos</category>
  <category>ML+X</category>
  <category>UW-Madison</category>
  <category>RAG</category>
  <category>Retrieval</category>
  <category>LLM</category>
  <category>Cloud</category>
  <category>AWS</category>
  <category>Bedrock</category>
  <category>Hugging Face</category>
  <category>Foundation models</category>
  <category>GenAI</category>
  <category>Sustainability</category>
  <category>Energy</category>
  <category>GPU</category>
  <category>Deep learning</category>
  <category>NLP</category>
  <guid>https://uw-madison-datascience.github.io/ML-X-Nexus/Applications/Videos/Forums/mlx_2026-02-17.html</guid>
  <pubDate>Tue, 17 Feb 2026 00:00:00 GMT</pubDate>
  <media:content url="https://img.youtube.com/vi/WYSzI3WZmKo/maxresdefault.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>Claude Code Cloud Setup Guide (Vertex AI &amp; Bedrock)</title>
  <dc:creator>Chris Endemann</dc:creator>
  <link>https://uw-madison-datascience.github.io/ML-X-Nexus/Learn/Guides/claude-code-cloud-setup.html</link>
  <description><![CDATA[ 




<p>This guide walks UW-Madison researchers and staff through setting up <a href="https://code.claude.com/">Claude Code</a> with a cloud provider — either <a href="https://cloud.google.com/vertex-ai">Google Vertex AI</a> or <a href="https://aws.amazon.com/bedrock/">Amazon Bedrock</a>. It covers Windows and macOS, with provider-specific steps in tabs so you can follow whichever path matches your institutional account.</p>
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Note</span>This guide reflects the author’s understanding as of its last-modified date
</div>
</div>
<div class="callout-body-container callout-body">
<p>AI tools, pricing, features, and contractual terms change frequently. This guide is <strong>community guidance, not official UW-Madison policy</strong>. For the latest institutional policies, data-use agreements, or questions about what data types are permitted with specific tools, consult <a href="https://it.wisc.edu/about/division-of-information-technology/research-cyberinfrastructure/">UW-Madison Research Cyberinfrastructure</a> or your department’s IT office.</p>
</div>
</div>
<div class="callout callout-style-default callout-tip callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Tip</span>New to agentic coding?
</div>
</div>
<div class="callout-body-container callout-body">
<p>See our companion guide, <a href="../../Learn/Blogs/claude-code-best-practices.html">Claude Code Best Practices</a>, for a broader introduction to agentic coding tools and how to use them effectively.</p>
</div>
</div>
<div class="callout callout-style-default callout-important callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Important</span>CLI only — cloud providers don’t work with the Desktop app or Web IDE
</div>
</div>
<div class="callout-body-container callout-body">
<p>Vertex AI and Bedrock routing is only supported in the <strong>Claude Code CLI</strong> (terminal) and <strong>IDE extensions</strong> (VS Code, JetBrains). The <a href="https://code.claude.com/docs/en/desktop">Claude Desktop app</a> (Code tab) and <a href="https://claude.ai/code">Web IDE</a> require a direct Anthropic login (Max plan or API credits) — they cannot be configured to route through a cloud provider at this time. If you need the desktop GUI or web experience, you’ll need a separate Claude Max subscription. This guide covers the CLI path.</p>
</div>
</div>
<div class="callout callout-style-default callout-warning callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Warning</span>Use caution with restricted or sensitive data
</div>
</div>
<div class="callout-body-container callout-body">
<p>UW-Madison does <strong>not</strong> currently have a direct data-use or privacy agreement with Anthropic. While UW has agreements with cloud providers (Google, AWS, Microsoft) for their respective services, those agreements <strong>do not extend to Anthropic’s data handling</strong> — your prompts still reach Anthropic’s infrastructure for inference regardless of which cloud provider routes the request.</p>
<p><strong>What this means in practice:</strong></p>
<ul>
<li><strong>Do not use Claude Code with restricted data</strong> (HIPAA/PHI, FERPA, CUI, export-controlled, or data under a DUA that prohibits third-party processing) until a formal UW-Anthropic agreement is in place.</li>
<li><strong>Avoid running Claude Code on machines where restricted data is stored.</strong> Even with sandboxing enabled, Claude can still <strong>read</strong> files outside your project directory — sandboxing only restricts <em>writes</em>. Anything Claude reads can be sent to Anthropic’s servers as part of a prompt. If you must run Claude Code on a machine with sensitive files, configure <strong>deny rules</strong> to block read access to sensitive paths (see Section 7) — but note that <code>Read</code> deny rules are best-effort and can be bypassed by allowed Bash commands like <code>cat</code>. The safest approach is to keep restricted data off machines where Claude Code runs.</li>
<li><strong>Always enable sandboxing</strong> (Section 8) — it prevents unintended <em>writes</em> outside your project and adds network isolation. These are valuable protections even though read access isn’t fully restricted.</li>
<li><strong>The <a href="https://claude.ai/code">web version</a> offers the strongest isolation</strong> — each task runs in a fresh, ephemeral VM that can only access your cloned GitHub repo. It cannot reach your local filesystem, SSH keys, or other local resources, and storage is wiped when the task completes. This eliminates the local file-read risk entirely. However, the web version requires a paid Anthropic subscription (Pro $20/month through Max $200/month), which currently must be paid out of pocket — UW does not yet have institutional Anthropic billing in place.</li>
<li><strong>For general/non-sensitive research code</strong>, cloud-routed Claude Code is fine to use today — your code is not used for model training under commercial terms, and telemetry is disabled by default for third-party providers.</li>
</ul>
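<p>As a concrete sketch of the deny-rule approach mentioned above, the snippet below writes an example rules file you can merge into <code>~/.claude/settings.json</code>. The directory name is a placeholder, and the exact rule syntax is covered in Section 7, so treat this as illustrative rather than authoritative.</p>

```shell
# Hypothetical example: a deny rule that blocks Claude's Read tool from a
# sensitive directory. "~/restricted-data" is a placeholder path.
# Written to /tmp so it doesn't clobber your real settings; merge the
# "permissions" block into ~/.claude/settings.json yourself.
cat > /tmp/claude-deny-example.json <<'EOF'
{
  "permissions": {
    "deny": [
      "Read(~/restricted-data/**)"
    ]
  }
}
EOF
```

<p>Remember the caveat above: <code>Read</code> deny rules are best-effort, so a rule like this complements, rather than replaces, keeping restricted data off the machine entirely.</p>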
<p>UW-Madison is actively exploring institutional Anthropic licenses, credits, and data agreements to make this easier. We’ll update this guide as the situation evolves.</p>
</div>
</div>
<section id="why-use-a-cloud-provider-instead-of-a-direct-subscription" class="level2">
<h2 class="anchored" data-anchor-id="why-use-a-cloud-provider-instead-of-a-direct-subscription">Why use a cloud provider instead of a direct subscription?</h2>
<p>You can use Claude Code with a personal Claude Max subscription ($100–$200/month, with expanded usage limits) — no cloud setup required. So why go through the extra complexity of routing through a cloud provider?</p>
<p><strong>What cloud providers give you:</strong></p>
<ul>
<li><strong>Institutional billing and cost tracking</strong> — charges go to your UW-Madison cloud project, not your personal credit card. This matters for grant-funded research, shared lab budgets, and institutional procurement. UW-Madison has <a href="../../Toolbox/Compute/UW-Cloud-Services.html">negotiated cloud discounts</a> with both GCP and AWS, including reduced overhead on grants (26% vs 55.5% F&amp;A) and NIH STRIDES pricing.</li>
<li><strong>Enterprise security controls</strong> — both GCP and AWS let you manage IAM permissions, control network traffic, and audit API usage within <em>your</em> cloud project. Note that these controls govern access to <em>your cloud resources</em> — they do not change how Anthropic handles your data once it reaches their infrastructure.</li>
</ul>
<p><strong>What cloud providers do NOT change:</strong></p>
<ul>
<li><strong>Your prompts still reach Anthropic’s infrastructure.</strong> Neither Vertex AI nor Bedrock runs the model inside your cloud project — your provider routes requests to Anthropic’s serving infrastructure. This means your data is processed by Anthropic regardless of which cloud provider you use.</li>
<li><strong>UW’s cloud agreements don’t cover Anthropic’s data handling.</strong> UW-Madison has agreements with Google, AWS, and Microsoft for their respective services, but those agreements <strong>do not extend to Anthropic</strong>. There is no UW-Anthropic data-use agreement in place yet, so routing through a cloud provider does not give your data the same protections it would have for a native cloud service. See the warning above for practical guidance.</li>
<li><strong>Data storage location</strong> is not controlled by your deployment choice. Anthropic stores data in the US, with processing distributed across multiple regions for reliability. If you need geographic control over where <em>inference</em> runs, that’s a separate <a href="https://platform.claude.com/docs/en/build-with-claude/data-residency">data residency</a> feature, not a cloud provider feature.</li>
<li><strong>Training exclusion still applies.</strong> Under Anthropic’s commercial terms (including both Vertex AI and Bedrock), your prompts and outputs are <a href="https://privacy.claude.com/en/articles/10023580-is-my-data-used-for-model-training">not used for model training</a>. This is a positive, but it is distinct from having a comprehensive data-use agreement. See Data Usage &amp; Privacy for full details on retention, telemetry, and opt-outs.</li>
</ul>
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Note
</div>
</div>
<div class="callout-body-container callout-body">
<p><strong>Bottom line:</strong> If you just want to try Claude Code personally, a Claude Max subscription is simpler and often cheaper for heavy individual use. Cloud providers make sense when you need institutional billing or enterprise security controls. However, neither approach currently provides UW-sanctioned data protections for restricted or sensitive data — UW is actively working toward an institutional Anthropic agreement. For now, <strong>avoid running Claude Code on machines that contain restricted data</strong>, since even sandboxing doesn’t prevent reads — and <strong>always enable sandboxing</strong> to at least restrict writes and network access. UW-Madison <a href="../../Toolbox/Compute/UW-Cloud-Services.html">provisions both GCP and AWS accounts</a> for research groups — running Claude Code through either provider is just one of many uses.</p>
</div>
</div>
</section>
<section id="already-set-up-returning-user-quick-start" class="level2">
<h2 class="anchored" data-anchor-id="already-set-up-returning-user-quick-start">Already set up? Returning user quick-start</h2>
<p>If you’ve already completed the full setup and just need to start a new session:</p>
<div class="tabset-margin-container"></div><div class="panel-tabset">
<ul class="nav nav-tabs"><li class="nav-item"><a class="nav-link active" id="tabset-1-1-tab" data-bs-toggle="tab" data-bs-target="#tabset-1-1" aria-controls="tabset-1-1" aria-selected="true" href="">Vertex AI</a></li><li class="nav-item"><a class="nav-link" id="tabset-1-2-tab" data-bs-toggle="tab" data-bs-target="#tabset-1-2" aria-controls="tabset-1-2" aria-selected="false" href="">Amazon Bedrock</a></li></ul>
<div class="tab-content">
<div id="tabset-1-1" class="tab-pane active" aria-labelledby="tabset-1-1-tab">
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb1-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># If your credentials have expired (you'll get auth errors):</span></span>
<span id="cb1-2"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">gcloud</span> auth application-default login</span>
<span id="cb1-3"></span>
<span id="cb1-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Navigate to your project and launch</span></span>
<span id="cb1-5"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">cd</span> ~/projects/my-project   <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># macOS / Linux / WSL2 (recommended for Windows)</span></span>
<span id="cb1-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># cd C:\Users\you\projects\my-project   # Windows PowerShell (no sandbox support)</span></span>
<span id="cb1-7"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">claude</span></span></code></pre></div></div>
</div>
<div id="tabset-1-2" class="tab-pane" aria-labelledby="tabset-1-2-tab">
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb2-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># If your credentials have expired (you'll get auth errors):</span></span>
<span id="cb2-2"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">aws</span> sso login <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--profile</span> your-profile</span>
<span id="cb2-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Or re-export your AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY</span></span>
<span id="cb2-4"></span>
<span id="cb2-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Navigate to your project and launch</span></span>
<span id="cb2-6"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">cd</span> ~/projects/my-project   <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># macOS / Linux / WSL2 (recommended for Windows)</span></span>
<span id="cb2-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># cd C:\Users\you\projects\my-project   # Windows PowerShell (no sandbox support)</span></span>
<span id="cb2-8"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">claude</span></span></code></pre></div></div>
</div>
</div>
</div>
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Note
</div>
</div>
<div class="callout-body-container callout-body">
<p><strong>Windows users:</strong> For the best experience — including OS-level sandboxing — run Claude Code from <strong>WSL2</strong> instead of PowerShell. Native Windows does not support sandboxing. If you haven’t set up WSL2 yet, see Section 8 — Sandboxing (Windows/WSL2) for a full walkthrough.</p>
</div>
</div>
<p><strong>Once inside Claude Code:</strong></p>
<pre class="text"><code>/status       # Confirm your cloud provider is active and project/region are correct
/sandbox      # Enable sandboxing if not already persistent in settings
/cost         # Check token usage at any time during your session</code></pre>
<div class="callout callout-style-default callout-tip callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Tip
</div>
</div>
<div class="callout-body-container callout-body">
<p><strong>Vertex AI:</strong> If you get 401/403 errors, re-run <code>gcloud auth application-default login</code>. If you get 404 “model not found” errors, check your region settings in <code>~/.claude/settings.json</code> (see Section 5). If you get 429 “resource exhausted” errors, you need a quota increase (see Troubleshooting).</p>
<p><strong>Bedrock:</strong> If you get <code>AccessDeniedException</code>, check your IAM permissions and ensure model access is enabled. If you get <code>ValidationException: Model not found</code>, verify your <code>AWS_REGION</code> and model IDs. If credentials expired, re-run <code>aws sso login</code> or refresh your access keys.</p>
</div>
</div>
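<p>Before launching <code>claude</code>, it can save a session restart to pre-flight your cloud credentials from the shell. The sketch below is our own helper — not an official command — and the commented lines show how you might point it at the real <code>gcloud</code> and <code>aws</code> checks:</p>

```shell
# check_cmd NAME CMD...  — run CMD quietly and report pass/fail (our own helper)
check_cmd() {
  name="$1"; shift
  if "$@" >/dev/null 2>&1; then
    echo "$name: OK"
  else
    echo "$name: FAIL"
  fi
}

# Vertex AI: do you hold a valid Application Default Credentials token?
# check_cmd "Vertex ADC" gcloud auth application-default print-access-token
# Bedrock: does the AWS credential chain resolve to an identity?
# check_cmd "AWS identity" aws sts get-caller-identity
```

<p>A <code>FAIL</code> from either commented check maps directly onto the re-auth steps in the Tip above it.</p>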
</section>
<section id="prerequisites" class="level2">
<h2 class="anchored" data-anchor-id="prerequisites">1. Prerequisites</h2>
<p><strong>Both platforms:</strong></p>
<ul>
<li>Internet connection</li>
</ul>
<div class="tabset-margin-container"></div><div class="panel-tabset">
<ul class="nav nav-tabs"><li class="nav-item"><a class="nav-link active" id="tabset-2-1-tab" data-bs-toggle="tab" data-bs-target="#tabset-2-1" aria-controls="tabset-2-1" aria-selected="true" href="">Vertex AI</a></li><li class="nav-item"><a class="nav-link" id="tabset-2-2-tab" data-bs-toggle="tab" data-bs-target="#tabset-2-2" aria-controls="tabset-2-2" aria-selected="false" href="">Amazon Bedrock</a></li></ul>
<div class="tab-content">
<div id="tabset-2-1" class="tab-pane active" aria-labelledby="tabset-2-1-tab">
<ul>
<li>A <strong>UW-Madison GCP project</strong> with billing enabled (e.g., <code>doit-rci-sandbox-gcp-baa4</code>)</li>
</ul>
</div>
<div id="tabset-2-2" class="tab-pane" aria-labelledby="tabset-2-2-tab">
<ul>
<li>A <strong>UW-Madison AWS account</strong> with billing enabled</li>
</ul>
</div>
</div>
</div>
<div class="tabset-margin-container"></div><div class="panel-tabset">
<ul class="nav nav-tabs"><li class="nav-item"><a class="nav-link active" id="tabset-3-1-tab" data-bs-toggle="tab" data-bs-target="#tabset-3-1" aria-controls="tabset-3-1" aria-selected="true" href="">Windows</a></li><li class="nav-item"><a class="nav-link" id="tabset-3-2-tab" data-bs-toggle="tab" data-bs-target="#tabset-3-2" aria-controls="tabset-3-2" aria-selected="false" href="">macOS</a></li></ul>
<div class="tab-content">
<div id="tabset-3-1" class="tab-pane active" aria-labelledby="tabset-3-1-tab">
<ul>
<li>Windows 10 (build 1809+) or Windows 11</li>
<li><strong>PowerShell</strong> (use this — not Git Bash, not CMD)</li>
</ul>
</div>
<div id="tabset-3-2" class="tab-pane" aria-labelledby="tabset-3-2-tab">
<ul>
<li>macOS 13.0 (Ventura) or later</li>
<li><strong>Terminal</strong> (built-in) or any terminal emulator (iTerm2, Warp, etc.)</li>
</ul>
</div>
</div>
</div>
<blockquote class="blockquote">
<p>See the official <a href="https://code.claude.com/docs/en/setup">Claude Code system requirements</a> for the full list (OS versions, RAM, shell, etc.).</p>
</blockquote>
</section>
<section id="install-claude-code" class="level2">
<h2 class="anchored" data-anchor-id="install-claude-code">2. Install Claude Code</h2>
<blockquote class="blockquote">
<p>Official install docs: <a href="https://code.claude.com/docs/en/setup">code.claude.com/docs/en/setup</a></p>
</blockquote>
<div class="tabset-margin-container"></div><div class="panel-tabset">
<ul class="nav nav-tabs"><li class="nav-item"><a class="nav-link active" id="tabset-4-1-tab" data-bs-toggle="tab" data-bs-target="#tabset-4-1" aria-controls="tabset-4-1" aria-selected="true" href="">Windows</a></li><li class="nav-item"><a class="nav-link" id="tabset-4-2-tab" data-bs-toggle="tab" data-bs-target="#tabset-4-2" aria-controls="tabset-4-2" aria-selected="false" href="">macOS</a></li></ul>
<div class="tab-content">
<div id="tabset-4-1" class="tab-pane active" aria-labelledby="tabset-4-1-tab">
<p>Open <strong>PowerShell</strong> and run:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode powershell code-with-copy"><code class="sourceCode powershell"><span id="cb4-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">irm</span> https<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">://</span>claude<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ai</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>install<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ps1</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">iex</span></span></code></pre></div></div>
<div class="callout callout-style-default callout-warning callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Warning
</div>
</div>
<div class="callout-body-container callout-body">
<p><strong>Do NOT use Git Bash (MINGW64)</strong> — the installer does not support it. Use PowerShell only.</p>
</div>
</div>
<p>After installation, you’ll likely see a message that <code>C:\Users\&lt;you&gt;\.local\bin</code> is not in your PATH. Fix this:</p>
<ol type="1">
<li>Press <strong>Win + R</strong>, type <code>sysdm.cpl</code>, press Enter.</li>
<li>Go to <strong>Advanced</strong> tab → <strong>Environment Variables</strong>.</li>
<li>Under <strong>User variables</strong>, select <code>Path</code> → click <strong>Edit</strong>.</li>
<li>Click <strong>New</strong> and add: <code>C:\Users\&lt;your-username&gt;\.local\bin</code></li>
<li>Click <strong>OK</strong> on all dialogs.</li>
<li><strong>Close and reopen PowerShell.</strong></li>
</ol>
<p>Verify:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode powershell code-with-copy"><code class="sourceCode powershell"><span id="cb5-1">claude <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">--</span>version</span></code></pre></div></div>
</div>
<div id="tabset-4-2" class="tab-pane" aria-labelledby="tabset-4-2-tab">
<p>Open <strong>Terminal</strong> and run:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb6-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">curl</span> <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-fsSL</span> https://claude.ai/install.sh <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">|</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">bash</span></span></code></pre></div></div>
<p>Alternatively, if you use Homebrew:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb7-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">brew</span> install <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--cask</span> claude-code</span></code></pre></div></div>
<p>After installation, restart your terminal (or run <code>source ~/.zshrc</code>), then verify:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb8-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">claude</span> <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--version</span></span></code></pre></div></div>
<blockquote class="blockquote">
<p>The native installer places the binary at <code>~/.claude/bin/claude</code> or <code>~/.local/bin/claude</code> and updates your shell profile automatically. If <code>claude</code> isn’t found, ensure one of these paths is in your <code>$PATH</code>.</p>
</blockquote>
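<p>If <code>claude</code> still isn’t found, you can add those directories to <code>PATH</code> yourself. This POSIX-shell sketch (our own helper, not part of the installer) only appends a directory when it’s missing, so it’s safe to keep in <code>~/.zshrc</code> or <code>~/.bashrc</code>:</p>

```shell
# path_contains DIR — succeed if DIR is already one of the PATH entries
path_contains() {
  case ":$PATH:" in
    *":$1:"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Append the installer's candidate bin directories only if absent (idempotent)
for dir in "$HOME/.local/bin" "$HOME/.claude/bin"; do
  path_contains "$dir" || PATH="$PATH:$dir"
done
export PATH
```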
</div>
</div>
</div>
</section>
<section id="install-cloud-cli" class="level2">
<h2 class="anchored" data-anchor-id="install-cloud-cli">3. Install &amp; Configure Your Cloud CLI</h2>
<div class="tabset-margin-container"></div><div class="panel-tabset">
<ul class="nav nav-tabs"><li class="nav-item"><a class="nav-link active" id="tabset-7-1-tab" data-bs-toggle="tab" data-bs-target="#tabset-7-1" aria-controls="tabset-7-1" aria-selected="true" href="">Vertex AI (gcloud)</a></li><li class="nav-item"><a class="nav-link" id="tabset-7-2-tab" data-bs-toggle="tab" data-bs-target="#tabset-7-2" aria-controls="tabset-7-2" aria-selected="false" href="">Amazon Bedrock (AWS CLI)</a></li></ul>
<div class="tab-content">
<div id="tabset-7-1" class="tab-pane active" aria-labelledby="tabset-7-1-tab">
<blockquote class="blockquote">
<p>Official gcloud install docs: <a href="https://cloud.google.com/sdk/docs/install">cloud.google.com/sdk/docs/install</a></p>
</blockquote>
<div class="tabset-margin-container"></div><div class="panel-tabset">
<ul class="nav nav-tabs"><li class="nav-item"><a class="nav-link active" id="tabset-5-1-tab" data-bs-toggle="tab" data-bs-target="#tabset-5-1" aria-controls="tabset-5-1" aria-selected="true" href="">Windows</a></li><li class="nav-item"><a class="nav-link" id="tabset-5-2-tab" data-bs-toggle="tab" data-bs-target="#tabset-5-2" aria-controls="tabset-5-2" aria-selected="false" href="">Windows (WSL2)</a></li><li class="nav-item"><a class="nav-link" id="tabset-5-3-tab" data-bs-toggle="tab" data-bs-target="#tabset-5-3" aria-controls="tabset-5-3" aria-selected="false" href="">macOS</a></li></ul>
<div class="tab-content">
<div id="tabset-5-1" class="tab-pane active" aria-labelledby="tabset-5-1-tab">
<p>Download and install the <a href="https://cloud.google.com/sdk/docs/install">Google Cloud CLI for Windows</a>. The installer adds <code>gcloud</code> to your PATH automatically.</p>
<p><strong>Fix PowerShell Execution Policy (if needed):</strong></p>
<p>If <code>gcloud</code> gives you a “running scripts is disabled” error:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb9" style="background: #f1f3f5;"><pre class="sourceCode powershell code-with-copy"><code class="sourceCode powershell"><span id="cb9-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">Set-ExecutionPolicy</span> RemoteSigned <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>Scope CurrentUser</span></code></pre></div></div>
<p>Type <strong>Y</strong> to confirm.</p>
</div>
<div id="tabset-5-2" class="tab-pane" aria-labelledby="tabset-5-2-tab">
<p>If you’re running Claude Code from WSL2, you need <code>gcloud</code> installed <strong>inside</strong> your WSL2 environment (the Windows installation isn’t reliably usable from inside WSL2, and its credentials live on the Windows side):</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb10" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb10-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sudo</span> apt-get install apt-transport-https ca-certificates gnupg curl</span>
<span id="cb10-2"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">curl</span> https://packages.cloud.google.com/apt/doc/apt-key.gpg <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">|</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sudo</span> gpg <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--dearmor</span> <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-o</span> /usr/share/keyrings/cloud.google.gpg</span>
<span id="cb10-3"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">echo</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.cloud.google.com/apt cloud-sdk main"</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">|</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sudo</span> tee /etc/apt/sources.list.d/google-cloud-sdk.list</span>
<span id="cb10-4"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sudo</span> apt-get update <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">&amp;&amp;</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sudo</span> apt-get install google-cloud-cli</span></code></pre></div></div>
</div>
<div id="tabset-5-3" class="tab-pane" aria-labelledby="tabset-5-3-tab">
<p>Option A — Homebrew (easiest):</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb11" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb11-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">brew</span> install <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--cask</span> google-cloud-sdk</span></code></pre></div></div>
<p>Option B — Official installer:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb12" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb12-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">curl</span> https://sdk.cloud.google.com <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">|</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">bash</span></span></code></pre></div></div>
<p>Then restart your terminal or run:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb13" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb13-1"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">source</span> ~/.zshrc   <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># or source ~/.bashrc</span></span></code></pre></div></div>
</div>
</div>
</div>
<p><strong>Authenticate (both platforms):</strong></p>
<p>These commands are identical on both Windows and macOS:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb14" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb14-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Log in with your UW-Madison Google account</span></span>
<span id="cb14-2"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">gcloud</span> auth login</span>
<span id="cb14-3"></span>
<span id="cb14-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Set your GCP project</span></span>
<span id="cb14-5"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">gcloud</span> config set project YOUR-PROJECT-ID</span>
<span id="cb14-6"></span>
<span id="cb14-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># CRITICAL — this is what Claude Code actually uses to authenticate</span></span>
<span id="cb14-8"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">gcloud</span> auth application-default login</span></code></pre></div></div>
<p>Both <code>login</code> commands open a browser window. Sign in with your UW-Madison email and grant the requested permissions.</p>
<p>Verify everything is configured correctly:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb15" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb15-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">gcloud</span> config list</span></code></pre></div></div>
<p>You should see your UW-Madison email under <code>[core] account</code> and your project ID under <code>[core] project</code>.</p>
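<p>If you want to script that check, the <code>key = value</code> lines that <code>gcloud config list</code> prints are easy to pick apart with a small helper (our own sketch, assuming the usual INI-style output):</p>

```shell
# config_value KEY — pull one value out of gcloud's "key = value" output
config_value() {
  sed -n "s/^$1 = //p"
}

# Usage (requires gcloud):
#   gcloud config list 2>/dev/null | config_value project
#   gcloud config list 2>/dev/null | config_value account
```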
</div>
<div id="tabset-7-2" class="tab-pane" aria-labelledby="tabset-7-2-tab">
<blockquote class="blockquote">
<p>Official AWS CLI install docs: <a href="https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html">docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html</a></p>
</blockquote>
<div class="tabset-margin-container"></div><div class="panel-tabset">
<ul class="nav nav-tabs"><li class="nav-item"><a class="nav-link active" id="tabset-6-1-tab" data-bs-toggle="tab" data-bs-target="#tabset-6-1" aria-controls="tabset-6-1" aria-selected="true" href="">Windows</a></li><li class="nav-item"><a class="nav-link" id="tabset-6-2-tab" data-bs-toggle="tab" data-bs-target="#tabset-6-2" aria-controls="tabset-6-2" aria-selected="false" href="">macOS</a></li></ul>
<div class="tab-content">
<div id="tabset-6-1" class="tab-pane active" aria-labelledby="tabset-6-1-tab">
<p>Download and run the <a href="https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html">AWS CLI MSI installer for Windows</a>. The installer adds <code>aws</code> to your PATH automatically.</p>
<p>Verify:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb16" style="background: #f1f3f5;"><pre class="sourceCode powershell code-with-copy"><code class="sourceCode powershell"><span id="cb16-1">aws <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">--</span>version</span></code></pre></div></div>
</div>
<div id="tabset-6-2" class="tab-pane" aria-labelledby="tabset-6-2-tab">
<p>Option A — Homebrew (easiest):</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb17" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb17-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">brew</span> install awscli</span></code></pre></div></div>
<p>Option B — Official installer:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb18" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb18-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">curl</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"https://awscli.amazonaws.com/AWSCLIV2.pkg"</span> <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-o</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"AWSCLIV2.pkg"</span></span>
<span id="cb18-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sudo</span> installer <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-pkg</span> AWSCLIV2.pkg <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-target</span> /</span></code></pre></div></div>
<p>Verify:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb19" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb19-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">aws</span> <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--version</span></span></code></pre></div></div>
</div>
</div>
</div>
<p><strong>Authenticate (both platforms):</strong></p>
<p>Claude Code uses the standard <a href="https://docs.aws.amazon.com/sdkref/latest/guide/standardized-credentials.html">AWS credential chain</a>. Choose one of these methods:</p>
<p><strong>Option A — AWS SSO (recommended for UW-Madison):</strong></p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb20" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb20-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Configure your SSO profile (one-time setup)</span></span>
<span id="cb20-2"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">aws</span> configure sso</span>
<span id="cb20-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Follow the prompts: SSO start URL, region, account, role</span></span>
<span id="cb20-4"></span>
<span id="cb20-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Log in (do this whenever credentials expire)</span></span>
<span id="cb20-6"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">aws</span> sso login <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--profile</span> your-profile-name</span></code></pre></div></div>
<p><strong>Option B — IAM access keys:</strong></p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb21" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb21-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">aws</span> configure</span>
<span id="cb21-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Enter your AWS Access Key ID, Secret Access Key, and region</span></span></code></pre></div></div>
<p><strong>Option C — Environment variables:</strong></p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb22" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb22-1"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">export</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">AWS_ACCESS_KEY_ID</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"your-access-key"</span></span>
<span id="cb22-2"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">export</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">AWS_SECRET_ACCESS_KEY</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"your-secret-key"</span></span>
<span id="cb22-3"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">export</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">AWS_REGION</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"us-east-1"</span></span></code></pre></div></div>
<div class="callout callout-style-default callout-tip callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Tip
</div>
</div>
<div class="callout-body-container callout-body">
<p>If your institution uses AWS SSO or federated identity, Option A is the most secure — no long-lived credentials on disk. Ask your UW-Madison AWS administrator which method your account uses.</p>
</div>
</div>
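<p>For reference, <code>aws configure sso</code> writes a profile block like the following to <code>~/.aws/config</code>. All values here are placeholders — substitute the ones your administrator gives you — and newer CLI versions may split the SSO fields into a separate <code>[sso-session]</code> block:</p>

```ini
# ~/.aws/config — example SSO profile (placeholder values)
[profile your-profile-name]
sso_start_url  = https://your-org.awsapps.com/start
sso_region     = us-east-1
sso_account_id = 123456789012
sso_role_name  = YourRoleName
region         = us-east-1
output         = json
```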
</div>
</div>
</div>
</section>
<section id="enable-api" class="level2">
<h2 class="anchored" data-anchor-id="enable-api">4. Enable API Access &amp; Request Models</h2>
<div class="tabset-margin-container"></div><div class="panel-tabset">
<ul class="nav nav-tabs"><li class="nav-item"><a class="nav-link active" id="tabset-8-1-tab" data-bs-toggle="tab" data-bs-target="#tabset-8-1" aria-controls="tabset-8-1" aria-selected="true" href="">Vertex AI</a></li><li class="nav-item"><a class="nav-link" id="tabset-8-2-tab" data-bs-toggle="tab" data-bs-target="#tabset-8-2" aria-controls="tabset-8-2" aria-selected="false" href="">Amazon Bedrock</a></li></ul>
<div class="tab-content">
<div id="tabset-8-1" class="tab-pane active" aria-labelledby="tabset-8-1-tab">
<blockquote class="blockquote">
<p>Official Vertex AI setup: <a href="https://code.claude.com/docs/en/google-vertex-ai">code.claude.com/docs/en/google-vertex-ai</a></p>
<p>Google’s Claude on Vertex docs: <a href="https://docs.cloud.google.com/vertex-ai/generative-ai/docs/partner-models/claude">cloud.google.com/vertex-ai/generative-ai/docs/partner-models/claude</a></p>
</blockquote>
<p><strong>Enable the API:</strong></p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb23" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb23-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">gcloud</span> services enable aiplatform.googleapis.com <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--project</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>YOUR-PROJECT-ID</span></code></pre></div></div>
<p><strong>Request access to Claude models:</strong></p>
<ol type="1">
<li>Go to the <a href="https://console.cloud.google.com/vertex-ai/model-garden">Vertex AI Model Garden</a> in the GCP Console.</li>
<li>Search for the Claude model you want (e.g., <strong>Claude Sonnet 4</strong>).</li>
<li>Click the model card and <strong>complete the access request form</strong>.</li>
<li>Approval may take minutes to a couple of days.</li>
</ol>
<p><strong>Verify IAM permissions:</strong></p>
<p>Your GCP account needs the <strong><code>roles/aiplatform.user</code></strong> role (<a href="https://cloud.google.com/vertex-ai/docs/general/access-control">Vertex IAM docs</a>), which includes:</p>
<ul>
<li><code>aiplatform.endpoints.predict</code> (model invocation)</li>
<li><code>aiplatform.endpoints.computeTokens</code> (token counting)</li>
</ul>
<p>If you don’t have this, ask your UW-Madison GCP administrator to grant it.</p>
</div>
<div id="tabset-8-2" class="tab-pane" aria-labelledby="tabset-8-2-tab">
<blockquote class="blockquote">
<p>Official Bedrock setup: <a href="https://code.claude.com/docs/en/amazon-bedrock">code.claude.com/docs/en/amazon-bedrock</a></p>
<p>AWS Bedrock model access docs: <a href="https://docs.aws.amazon.com/bedrock/latest/userguide/model-access.html">docs.aws.amazon.com/bedrock/latest/userguide/model-access.html</a></p>
</blockquote>
<p><strong>Enable model access:</strong></p>
<ol type="1">
<li>Open the <a href="https://console.aws.amazon.com/bedrock/">Amazon Bedrock console</a> and select your region (e.g., <strong>us-east-1</strong>).</li>
<li>Go to <strong>Model catalog</strong> in the left sidebar.</li>
<li>Find the Claude model you want (e.g., <strong>Claude Sonnet 4.6</strong>) and click <strong>Request model access</strong>.</li>
<li>AWS will ask for a brief use-case description. A one-sentence explanation is sufficient. For most Claude models, approval is automatic and takes less than a minute.</li>
</ol>
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Note
</div>
</div>
<div class="callout-body-container callout-body">
<p>The first time you request access to Anthropic models on a new AWS account, you’ll need to complete a First Time Use (FTU) form with use-case details. This is a one-time per-account requirement.</p>
</div>
</div>
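<p>Once access is granted, you can confirm which model IDs are visible in your region. The <code>aws bedrock list-foundation-models</code> call is a real AWS CLI command; the one-line filter helper is our own (Anthropic model IDs start with <code>anthropic.</code>):</p>

```shell
# anthropic_ids — keep only Anthropic model IDs from stdin (one per line)
anthropic_ids() {
  grep -i '^anthropic\.' || true
}

# Usage (requires AWS credentials):
#   aws bedrock list-foundation-models --region us-east-1 \
#     --query 'modelSummaries[].modelId' --output text | tr '\t' '\n' | anthropic_ids
```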
<p><strong>Verify IAM permissions:</strong></p>
<p>Your AWS identity needs permission to invoke Bedrock models. At minimum, create an IAM policy with the following statement (or ask your admin to attach one for you):</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb24" style="background: #f1f3f5;"><pre class="sourceCode json code-with-copy"><code class="sourceCode json"><span id="cb24-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb24-2">  <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"Version"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"2012-10-17"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb24-3">  <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"Statement"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">[</span></span>
<span id="cb24-4">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb24-5">      <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"Effect"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Allow"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb24-6">      <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"Action"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">[</span></span>
<span id="cb24-7">        <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"bedrock:InvokeModel"</span><span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb24-8">        <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"bedrock:InvokeModelWithResponseStream"</span><span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb24-9">        <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"bedrock:ListInferenceProfiles"</span><span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb24-10">        <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"bedrock:ListFoundationModels"</span></span>
<span id="cb24-11">      <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">]</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb24-12">      <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"Resource"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"*"</span></span>
<span id="cb24-13">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb24-14">  <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">]</span></span>
<span id="cb24-15"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span></code></pre></div></div>
<p>If your account uses AWS Marketplace for model access, you may also need <code>aws-marketplace:Subscribe</code> and <code>aws-marketplace:ViewSubscriptions</code> permissions.</p>
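<p>If so, a sketch of the extra statement to add alongside the Bedrock one (whether it is required depends on how your account’s model access was provisioned, so treat this as an assumption to verify with your cloud admin):</p>
<pre class="json"><code>{
  "Effect": "Allow",
  "Action": [
    "aws-marketplace:Subscribe",
    "aws-marketplace:ViewSubscriptions"
  ],
  "Resource": "*"
}</code></pre>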
</div>
</div>
</div>
</section>
<section id="configure-claude-code" class="level2">
<h2 class="anchored" data-anchor-id="configure-claude-code">5. Configure Claude Code</h2>
<p>Edit (or create) your settings file at <code>~/.claude/settings.json</code>.</p>
<table class="caption-top table">
<colgroup>
<col style="width: 47%">
<col style="width: 52%">
</colgroup>
<thead>
<tr class="header">
<th>Platform</th>
<th>Full path</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Windows (PowerShell)</td>
<td><code>C:\Users\&lt;your-username&gt;\.claude\settings.json</code></td>
</tr>
<tr class="even">
<td>Windows (WSL2)</td>
<td><code>/home/&lt;your-username&gt;/.claude/settings.json</code></td>
</tr>
<tr class="odd">
<td>macOS</td>
<td><code>/Users/&lt;your-username&gt;/.claude/settings.json</code></td>
</tr>
</tbody>
</table>
<div class="callout callout-style-default callout-important callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Important
</div>
</div>
<div class="callout-body-container callout-body">
<p><strong>WSL2 has its own home directory.</strong> If you set up Claude Code in both Windows PowerShell and WSL2, you need <code>settings.json</code> in <strong>both</strong> locations — they are completely separate filesystems. The Windows file at <code>C:\Users\you\.claude\settings.json</code> is <strong>not</strong> visible to WSL2. Similarly, <code>gcloud</code> must be installed and authenticated separately inside WSL2 (see Section 3).</p>
</div>
</div>
<div class="tabset-margin-container"></div><div class="panel-tabset">
<ul class="nav nav-tabs"><li class="nav-item"><a class="nav-link active" id="tabset-9-1-tab" data-bs-toggle="tab" data-bs-target="#tabset-9-1" aria-controls="tabset-9-1" aria-selected="true" href="">Vertex AI</a></li><li class="nav-item"><a class="nav-link" id="tabset-9-2-tab" data-bs-toggle="tab" data-bs-target="#tabset-9-2" aria-controls="tabset-9-2" aria-selected="false" href="">Amazon Bedrock</a></li></ul>
<div class="tab-content">
<div id="tabset-9-1" class="tab-pane active" aria-labelledby="tabset-9-1-tab">
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb25" style="background: #f1f3f5;"><pre class="sourceCode json code-with-copy"><code class="sourceCode json"><span id="cb25-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb25-2">  <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"model"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"claude-sonnet-4-6"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb25-3">  <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"autoUpdatesChannel"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"latest"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb25-4">  <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"env"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb25-5">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"CLAUDE_CODE_USE_VERTEX"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"1"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb25-6">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"CLOUD_ML_REGION"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"global"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb25-7">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"ANTHROPIC_VERTEX_PROJECT_ID"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"YOUR-PROJECT-ID"</span></span>
<span id="cb25-8">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">},</span></span>
<span id="cb25-9">  <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"sandbox"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb25-10">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"enabled"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">true</span></span>
<span id="cb25-11">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb25-12"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span></code></pre></div></div>
<p>Replace <code>YOUR-PROJECT-ID</code> with your actual GCP project ID (e.g., <code>doit-rci-sandbox-gcp-baa4</code>).</p>
<p><strong>Region notes:</strong></p>
<ul>
<li><strong><code>global</code></strong> is typically the cheapest and is a good default.</li>
<li>Not all models are available on the global endpoint. If you get 404 “model not found” errors, set per-model region overrides with additional entries in the <code>env</code> block:</li>
</ul>
<pre class="jsonc"><code>"VERTEX_REGION_CLAUDE_3_5_HAIKU": "us-east5",
"VERTEX_REGION_CLAUDE_4_0_SONNET": "us-east5"</code></pre>
</div>
<div id="tabset-9-2" class="tab-pane" aria-labelledby="tabset-9-2-tab">
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb27" style="background: #f1f3f5;"><pre class="sourceCode json code-with-copy"><code class="sourceCode json"><span id="cb27-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb27-2">  <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"model"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"claude-sonnet-4-6"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb27-3">  <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"autoUpdatesChannel"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"latest"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb27-4">  <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"env"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb27-5">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"CLAUDE_CODE_USE_BEDROCK"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"1"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb27-6">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"AWS_REGION"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"us-east-1"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb27-7">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"AWS_PROFILE"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"your-profile-name"</span></span>
<span id="cb27-8">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">},</span></span>
<span id="cb27-9">  <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"sandbox"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb27-10">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"enabled"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">true</span></span>
<span id="cb27-11">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb27-12"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span></code></pre></div></div>
<p>Replace <code>your-profile-name</code> with your AWS CLI profile name (from <code>aws configure sso</code> or <code>aws configure</code>). If you’re using environment variables for credentials instead of a profile, omit <code>AWS_PROFILE</code> and set <code>AWS_ACCESS_KEY_ID</code> and <code>AWS_SECRET_ACCESS_KEY</code> in the <code>env</code> block.</p>
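<p>For reference, a sketch of that static-credential variant of the <code>env</code> block (the key values are placeholders; prefer an SSO profile where possible, since long-lived keys stored in <code>settings.json</code> are easy to leak):</p>
<pre class="jsonc"><code>"env": {
  "CLAUDE_CODE_USE_BEDROCK": "1",
  "AWS_REGION": "us-east-1",
  "AWS_ACCESS_KEY_ID": "AKIA&lt;placeholder&gt;",
  "AWS_SECRET_ACCESS_KEY": "&lt;placeholder&gt;"
}</code></pre>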
<p><strong>Region notes:</strong></p>
<ul>
<li><strong><code>us-east-1</code></strong> has the broadest model availability and is a good default.</li>
<li>Not all models are available in all regions. Check the <a href="https://docs.aws.amazon.com/bedrock/latest/userguide/models-regions.html">Bedrock model support by region</a> page if you get “model not found” errors.</li>
</ul>
<p><strong>Model pinning (recommended):</strong></p>
<p>Bedrock model aliases can change when new versions are released. To avoid surprises, pin specific model versions in your <code>env</code> block:</p>
<pre class="jsonc"><code>"ANTHROPIC_DEFAULT_SONNET_MODEL": "claude-sonnet-4-6",
"ANTHROPIC_DEFAULT_OPUS_MODEL": "claude-opus-4-6",
"ANTHROPIC_DEFAULT_HAIKU_MODEL": "claude-haiku-4-5-20251001"</code></pre>
</div>
</div>
</div>
<div class="callout callout-style-default callout-warning callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Warning
</div>
</div>
<div class="callout-body-container callout-body">
<p><strong>Do not include <code>//</code> comments in <code>settings.json</code>.</strong> The examples elsewhere in this guide use <code>jsonc</code> syntax for readability, but <code>settings.json</code> is parsed as <strong>plain JSON</strong>, which does not support comments. If you copy-paste a block with comments, Claude Code will report “Invalid or malformed JSON” in <code>claude doctor</code>.</p>
</div>
</div>
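<p>A quick way to self-check before launching: feed the file through a strict JSON parser and confirm that comments are rejected. A minimal sketch using inline strings (assumes <code>python3</code> is installed; in practice, point <code>json.tool</code> at your real <code>~/.claude/settings.json</code>):</p>

```shell
# python3 -m json.tool is a strict JSON parser: any // comment is a syntax error.
good='{ "model": "claude-sonnet-4-6" }'
bad='{ "model": "claude-sonnet-4-6" } // my default model'

echo "$good" | python3 -m json.tool > /dev/null 2>&1 && good_ok=yes || good_ok=no
echo "$bad"  | python3 -m json.tool > /dev/null 2>&1 && bad_ok=yes  || bad_ok=no

echo "plain JSON: $good_ok, with comment: $bad_ok"
# prints: plain JSON: yes, with comment: no
```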
<p>The <code>model</code> field sets your default model — <code>claude-sonnet-4-6</code> is a good balance of cost and capability. Change it to <code>claude-opus-4-6</code> if you need maximum reasoning power (at higher cost).</p>
<div class="callout callout-style-default callout-important callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Important</span>Sandbox platform prerequisites
</div>
</div>
<div class="callout-body-container callout-body">
<p>The sandbox config above enables OS-level isolation that restricts Claude’s bash commands to your project directory. It works <strong>out of the box on macOS</strong> (via Apple’s Seatbelt). On <strong>Linux/WSL2</strong>, you must first install <code>bubblewrap</code> and <code>socat</code> — see Section 8 for step-by-step instructions, including how to verify you’re on WSL2 (WSL1 is not supported) and full configuration details. <strong>Native Windows (PowerShell)</strong> does not yet support sandboxing — the setting will be ignored, so you can leave it in place for when support arrives.</p>
</div>
</div>
</section>
<section id="verify-the-setup" class="level2">
<h2 class="anchored" data-anchor-id="verify-the-setup">6. Verify the Setup</h2>
<p>Run these in your terminal (PowerShell on Windows, Terminal on macOS):</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb29" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb29-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">claude</span> <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--version</span></span>
<span id="cb29-2"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">claude</span> doctor</span></code></pre></div></div>
<p>Then launch Claude Code from a project directory:</p>
<div class="tabset-margin-container"></div><div class="panel-tabset">
<ul class="nav nav-tabs"><li class="nav-item"><a class="nav-link active" id="tabset-10-1-tab" data-bs-toggle="tab" data-bs-target="#tabset-10-1" aria-controls="tabset-10-1" aria-selected="true" href="">Windows</a></li><li class="nav-item"><a class="nav-link" id="tabset-10-2-tab" data-bs-toggle="tab" data-bs-target="#tabset-10-2" aria-controls="tabset-10-2" aria-selected="false" href="">macOS</a></li></ul>
<div class="tab-content">
<div id="tabset-10-1" class="tab-pane active" aria-labelledby="tabset-10-1-tab">
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb30" style="background: #f1f3f5;"><pre class="sourceCode powershell code-with-copy"><code class="sourceCode powershell"><span id="cb30-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">cd</span> C<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span>\Users\you\projects\my-project</span>
<span id="cb30-2">claude</span></code></pre></div></div>
</div>
<div id="tabset-10-2" class="tab-pane" aria-labelledby="tabset-10-2-tab">
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb31" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb31-1"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">cd</span> ~/projects/my-project</span>
<span id="cb31-2"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">claude</span></span></code></pre></div></div>
</div>
</div>
</div>
<p>On first launch, Claude Code will present a login method selection screen. Since you’re using a 3rd-party cloud provider, select <strong>option 3 (“3rd-party platform”)</strong> and then choose your provider (e.g., Vertex AI) when prompted. Claude Code will detect your cloud credentials automatically — no API key is needed.</p>
<p>Once inside Claude Code, run <code>/status</code> to confirm your provider is active:</p>
<div class="tabset-margin-container"></div><div class="panel-tabset">
<ul class="nav nav-tabs"><li class="nav-item"><a class="nav-link active" id="tabset-11-1-tab" data-bs-toggle="tab" data-bs-target="#tabset-11-1" aria-controls="tabset-11-1" aria-selected="true" href="">Vertex AI</a></li><li class="nav-item"><a class="nav-link" id="tabset-11-2-tab" data-bs-toggle="tab" data-bs-target="#tabset-11-2" aria-controls="tabset-11-2" aria-selected="false" href="">Amazon Bedrock</a></li></ul>
<div class="tab-content">
<div id="tabset-11-1" class="tab-pane active" aria-labelledby="tabset-11-1-tab">
<pre><code>API provider: Google Vertex AI
GCP project: your-project-id
Default region: global</code></pre>
</div>
<div id="tabset-11-2" class="tab-pane" aria-labelledby="tabset-11-2-tab">
<pre><code>API provider: Amazon Bedrock
AWS region: us-east-1</code></pre>
</div>
</div>
</div>
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Note
</div>
</div>
<div class="callout-body-container callout-body">
<p>When using either cloud provider, the <a href="https://code.claude.com/docs/en/google-vertex-ai"><code>/login</code> and <code>/logout</code> commands are disabled</a> — authentication is handled entirely through your cloud CLI credentials (<code>gcloud</code> or <code>aws</code>).</p>
</div>
</div>
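<p>If <code>/status</code> shows the wrong provider, or Claude Code falls back to prompting for an API key, first confirm that your cloud CLI is actually authenticated. The standard identity checks (run whichever matches your provider) are:</p>
<pre class="bash"><code>gcloud auth application-default print-access-token &gt; /dev/null &amp;&amp; echo "Vertex AI credentials OK"
aws sts get-caller-identity</code></pre>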
</section>
<section id="filesystem-safety-permissions" class="level2">
<h2 class="anchored" data-anchor-id="filesystem-safety-permissions">7. Filesystem Safety &amp; Permissions</h2>
<blockquote class="blockquote">
<p>Official permissions docs: <a href="https://code.claude.com/docs/en/permissions">code.claude.com/docs/en/permissions</a></p>
<p>Sandboxing docs: <a href="https://code.claude.com/docs/en/sandboxing">code.claude.com/docs/en/sandboxing</a></p>
</blockquote>
<section id="understand-risk" class="level3">
<h3 class="anchored" data-anchor-id="understand-risk">Understand the risk before you start</h3>
<div class="callout callout-style-default callout-warning callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Warning</span>Claude Code runs with your user’s full filesystem permissions
</div>
</div>
<div class="callout-body-container callout-body">
<p>This is the single most important thing to understand before using it.</p>
</div>
</div>
<p>When you launch Claude Code from the CLI, it has access to your current working directory — but it is <strong>not strictly confined to it</strong>. Without sandboxing enabled, Claude can read, modify, create, or delete files <strong>anywhere your user account has access</strong>, including parent directories, your home folder, and system files. <a href="https://zenn.dev/tomioka/articles/0496a427f8bcd0?locale=en">Independent testing has confirmed</a> that without sandboxing, Claude can create files in parent directories above the working directory without any special prompt.</p>
<p>This means a poorly worded prompt, an agentic loop, or a <a href="https://www.petefreitag.com/blog/claude-code-permissions/">prompt injection attack</a> embedded in a file Claude reads could cause changes in places you didn’t intend.</p>
<p><strong>Practical recommendations:</strong></p>
<ul>
<li><strong>Avoid running Claude Code on machines where restricted or sensitive data is stored.</strong> Sandboxing restricts <em>writes</em> but not <em>reads</em> — Claude can still read files anywhere on the machine and send their contents to Anthropic’s servers as part of a prompt. UW does not yet have a data-use agreement with Anthropic (see warning above), so even cloud-routed usage sends your prompts to Anthropic’s infrastructure.</li>
<li><strong>If you must run on a machine with sensitive files</strong>, add <strong>deny rules</strong> (see below) to block read access to sensitive paths. But note that <code>Read</code> deny rules are best-effort — allowed Bash commands like <code>cat</code> can still read denied files. Denying both <code>Read</code> and the relevant <code>Bash</code> patterns together provides stronger protection (see <a href="https://www.petefreitag.com/blog/claude-code-permissions/">this security deep-dive</a> for details).</li>
<li><strong>Test only within isolated Git repositories</strong>, with no sensitive data present in the local repo.</li>
<li><strong>Always enable sandboxing</strong> (Section 8) — it restricts <em>writes</em> to your project directory and adds network isolation at the OS level. It’s available on macOS (built-in) and Linux/WSL2 (requires <code>bubblewrap</code>). Native Windows does not yet support it.</li>
</ul>
</section>
<section id="always-launch-from-your-project-directory" class="level3">
<h3 class="anchored" data-anchor-id="always-launch-from-your-project-directory">Always launch from your project directory</h3>
<div class="tabset-margin-container"></div><div class="panel-tabset">
<ul class="nav nav-tabs"><li class="nav-item"><a class="nav-link active" id="tabset-12-1-tab" data-bs-toggle="tab" data-bs-target="#tabset-12-1" aria-controls="tabset-12-1" aria-selected="true" href="">Windows</a></li><li class="nav-item"><a class="nav-link" id="tabset-12-2-tab" data-bs-toggle="tab" data-bs-target="#tabset-12-2" aria-controls="tabset-12-2" aria-selected="false" href="">macOS</a></li></ul>
<div class="tab-content">
<div id="tabset-12-1" class="tab-pane active" aria-labelledby="tabset-12-1-tab">
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb34" style="background: #f1f3f5;"><pre class="sourceCode powershell code-with-copy"><code class="sourceCode powershell"><span id="cb34-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">cd</span> C<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span>\Users\you\projects\my-project</span>
<span id="cb34-2">claude</span></code></pre></div></div>
</div>
<div id="tabset-12-2" class="tab-pane" aria-labelledby="tabset-12-2-tab">
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb35" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb35-1"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">cd</span> ~/projects/my-project</span>
<span id="cb35-2"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">claude</span></span></code></pre></div></div>
</div>
</div>
</div>
<p>Claude Code defaults to operating in the directory where it’s launched. Starting from <code>C:\</code>, <code>/</code>, or your home directory gives it an unnecessarily broad scope.</p>
</section>
<section id="use-git-as-your-safety-net" class="level3">
<h3 class="anchored" data-anchor-id="use-git-as-your-safety-net">Use Git as your safety net</h3>
<p><strong>Always work inside a Git repository.</strong> This is your most important protection:</p>
<ul>
<li>You can instantly see what changed with <code>git diff</code>.</li>
<li>You can revert any unwanted changes with <code>git checkout .</code> or <code>git stash</code>.</li>
<li>Claude Code itself is Git-aware and will generally respect repository boundaries.</li>
</ul>
<p><strong>Before starting a Claude Code session, make sure your working tree is clean</strong> (commit or stash pending changes). That way, anything Claude does can be cleanly reviewed or rolled back.</p>
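<p>That pre-session check can be scripted. A minimal sketch, demonstrated in a throwaway repository so it is safe to run anywhere (assumes <code>git</code> is on your <code>PATH</code>):</p>

```shell
# Create a throwaway repo to demonstrate the clean-tree check.
repo=$(mktemp -d)
cd "$repo" && git init -q .

# An empty `git status --porcelain` means the working tree is clean.
status_before=$(git status --porcelain)
[ -z "$status_before" ] && echo "clean: safe to start a Claude Code session"

# Simulate an uncommitted change; the same check now flags it.
touch scratch.txt
status_after=$(git status --porcelain)
[ -n "$status_after" ] && echo "dirty: commit or stash before starting"
```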
</section>
<section id="use-the-default-permission-mode" class="level3">
<h3 class="anchored" data-anchor-id="use-the-default-permission-mode">Use the default permission mode</h3>
<p>Out of the box, Claude Code asks for your approval before running most commands. <strong>Do not change this unless you understand the implications.</strong> Specifically:</p>
<ul>
<li><strong>Do NOT use <code>--dangerously-skip-permissions</code></strong> unless you’re in an isolated container/VM. This flag bypasses all safety checks.</li>
<li><strong>Do NOT set the mode to <code>bypassPermissions</code></strong> in settings.</li>
</ul>
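<p>If you truly need unattended runs, contain the blast radius first. One hedged pattern is a throwaway container that mounts only the project directory (the image name below is a placeholder and assumes Claude Code is installed inside it; Anthropic’s guidance for this flag is an isolated container, ideally without internet access):</p>
<pre class="bash"><code>docker run -it --rm \
  -v "$PWD":/workspace -w /workspace \
  your-claude-image \
  claude --dangerously-skip-permissions</code></pre>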
</section>
<section id="understand-how-permissions-actually-work" class="level3">
<h3 class="anchored" data-anchor-id="understand-how-permissions-actually-work">Understand how permissions actually work</h3>
<p>In the <strong>default mode</strong>, Claude Code already asks for your approval before running most commands that modify your system (file writes, bash commands, network requests, etc.). Read-only operations like viewing files generally run without prompting.</p>
<p>You can customize this behavior with three types of rules in <code>settings.json</code>:</p>
<ul>
<li><strong><code>deny</code></strong> — Hard block. Claude <strong>cannot</strong> use the tool, period. You won’t even be asked. Deny rules always win over allow rules.</li>
<li><strong><code>allow</code></strong> — Auto-approve. Claude can use the tool <strong>without asking you first</strong>. This is a convenience shortcut — it skips the approval prompt for things you trust.</li>
<li><strong><code>ask</code></strong> — Force a prompt, even if something else would auto-approve it.</li>
</ul>
<p><strong>You do NOT need to add <code>allow</code> rules for Claude Code to work.</strong> Without any custom rules, it will simply ask you to approve each action as it comes up. The default behavior is already safe — <code>deny</code> rules are the ones that add protection, while <code>allow</code> rules just reduce the number of prompts you see.</p>
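<p>A minimal <code>permissions</code> block illustrating all three rule types together (the paths and commands here are illustrative placeholders, not recommendations):</p>
<pre class="jsonc"><code>"permissions": {
  "deny":  ["Read(./secrets/**)"],
  "allow": ["Bash(git diff:*)"],
  "ask":   ["Bash(git push:*)"]
}</code></pre>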
</section>
<section id="add-deny-rules" class="level3">
<h3 class="anchored" data-anchor-id="add-deny-rules">Add <code>deny</code> rules to protect sensitive areas</h3>
<p>Add a <code>permissions</code> block to your <strong><code>~/.claude/settings.json</code></strong> file (the same file you edited in Section 5). Think of <code>deny</code> rules as guardrails — they block Claude from accessing sensitive paths regardless of what it tries to do.</p>
<p>These examples block: SSH keys, <code>.env</code> files, cloud credentials (<code>~/.config/gcloud/</code> and <code>~/.aws/</code>), destructive commands (<code>rm -rf</code>), and network tools (<code>curl</code>, <code>wget</code>). <strong>Replace <code>endemann</code> with your actual username</strong> before pasting into <code>settings.json</code>.</p>
<div class="callout callout-style-default callout-warning callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Warning
</div>
</div>
<div class="callout-body-container callout-body">
<p><strong><code>settings.json</code> is plain JSON — no comments allowed.</strong> The blocks below are ready to copy-paste. Do not add <code>//</code> comments or Claude Code will report “Invalid or malformed JSON” in <code>claude doctor</code>.</p>
</div>
</div>
<div class="tabset-margin-container"></div><div class="panel-tabset">
<ul class="nav nav-tabs"><li class="nav-item"><a class="nav-link active" id="tabset-13-1-tab" data-bs-toggle="tab" data-bs-target="#tabset-13-1" aria-controls="tabset-13-1" aria-selected="true" href="">macOS</a></li><li class="nav-item"><a class="nav-link" id="tabset-13-2-tab" data-bs-toggle="tab" data-bs-target="#tabset-13-2" aria-controls="tabset-13-2" aria-selected="false" href="">Windows (PowerShell)</a></li><li class="nav-item"><a class="nav-link" id="tabset-13-3-tab" data-bs-toggle="tab" data-bs-target="#tabset-13-3" aria-controls="tabset-13-3" aria-selected="false" href="">Linux / WSL2</a></li></ul>
<div class="tab-content">
<div id="tabset-13-1" class="tab-pane active" aria-labelledby="tabset-13-1-tab">
<p>This example assumes a home directory of <code>/Users/endemann</code>. Run <code>whoami</code> to check your username.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb36" style="background: #f1f3f5;"><pre class="sourceCode json code-with-copy"><code class="sourceCode json"><span id="cb36-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb36-2">  <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"permissions"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb36-3">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"deny"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">[</span></span>
<span id="cb36-4">      <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Read(//Users/endemann/.ssh/**)"</span><span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb36-5">      <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Edit(//Users/endemann/.ssh/**)"</span><span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb36-6">      <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Read(//Users/endemann/.env)"</span><span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb36-7">      <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Edit(//Users/endemann/.env)"</span><span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb36-8">      <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Read(//Users/endemann/.config/gcloud/**)"</span><span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb36-9">      <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Edit(//Users/endemann/.config/gcloud/**)"</span><span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb36-10">      <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Read(//Users/endemann/.aws/**)"</span><span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb36-11">      <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Edit(//Users/endemann/.aws/**)"</span><span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb36-12">      <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Bash(rm -rf *)"</span><span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb36-13">      <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Bash(curl:*)"</span><span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb36-14">      <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Bash(wget:*)"</span></span>
<span id="cb36-15">    <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">]</span></span>
<span id="cb36-16">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb36-17"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span></code></pre></div></div>
</div>
<div id="tabset-13-2" class="tab-pane" aria-labelledby="tabset-13-2-tab">
<p>Home directory is <code>C:\Users\endemann</code>. Run <code>echo $env:USERNAME</code> to check your username. Note: use forward slashes in permission rules, even on Windows.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb37" style="background: #f1f3f5;"><pre class="sourceCode json code-with-copy"><code class="sourceCode json"><span id="cb37-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb37-2">  <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"permissions"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb37-3">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"deny"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">[</span></span>
<span id="cb37-4">      <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Read(//C:/Users/endemann/.ssh/**)"</span><span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb37-5">      <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Edit(//C:/Users/endemann/.ssh/**)"</span><span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb37-6">      <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Read(//C:/Users/endemann/.env)"</span><span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb37-7">      <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Edit(//C:/Users/endemann/.env)"</span><span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb37-8">      <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Read(//C:/Users/endemann/.config/gcloud/**)"</span><span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb37-9">      <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Edit(//C:/Users/endemann/.config/gcloud/**)"</span><span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb37-10">      <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Read(//C:/Users/endemann/.aws/**)"</span><span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb37-11">      <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Edit(//C:/Users/endemann/.aws/**)"</span><span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb37-12">      <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Bash(rm -rf *)"</span><span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb37-13">      <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Bash(curl:*)"</span><span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb37-14">      <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Bash(wget:*)"</span></span>
<span id="cb37-15">    <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">]</span></span>
<span id="cb37-16">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb37-17"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span></code></pre></div></div>
</div>
<div id="tabset-13-3" class="tab-pane" aria-labelledby="tabset-13-3-tab">
<p>Home directory is <code>/home/endemann</code>. Run <code>whoami</code> to check your username.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb38" style="background: #f1f3f5;"><pre class="sourceCode json code-with-copy"><code class="sourceCode json"><span id="cb38-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb38-2">  <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"permissions"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb38-3">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"deny"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">[</span></span>
<span id="cb38-4">      <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Read(//home/endemann/.ssh/**)"</span><span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb38-5">      <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Edit(//home/endemann/.ssh/**)"</span><span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb38-6">      <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Read(//home/endemann/.env)"</span><span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb38-7">      <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Edit(//home/endemann/.env)"</span><span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb38-8">      <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Read(//home/endemann/.config/gcloud/**)"</span><span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb38-9">      <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Edit(//home/endemann/.config/gcloud/**)"</span><span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb38-10">      <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Read(//home/endemann/.aws/**)"</span><span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb38-11">      <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Edit(//home/endemann/.aws/**)"</span><span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb38-12">      <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Bash(rm -rf *)"</span><span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb38-13">      <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Bash(curl:*)"</span><span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb38-14">      <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Bash(wget:*)"</span></span>
<span id="cb38-15">    <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">]</span></span>
<span id="cb38-16">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb38-17"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span></code></pre></div></div>
</div>
</div>
</div>
<div class="callout callout-style-default callout-tip callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Tip
</div>
</div>
<div class="callout-body-container callout-body">
<p>The <code>//</code> prefix in permission rules means “absolute path from the filesystem root.” A leading <code>~/</code> is also recognized for home-relative paths (you’ll see it used later in this guide), but the absolute form, like <code>//Users/endemann/.ssh/**</code>, makes it unambiguous exactly whose home directory the rules protect.</p>
</div>
</div>
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Note
</div>
</div>
<div class="callout-body-container callout-body">
<p><strong>Why not just rely on the approval prompt?</strong> Because deny rules are absolute — they work even if you accidentally click “allow” or “always allow” on a prompt. They also protect against <a href="https://www.petefreitag.com/blog/claude-code-permissions/">prompt injection attacks</a> where malicious content in a file tricks Claude into running something harmful. The approval prompt is your first line of defense; deny rules are the backup that can’t be bypassed.</p>
<p><strong>Important caveat:</strong> <code>Read</code> deny rules apply on a “best-effort” basis to built-in tools like Grep and Glob. However, if a Bash command (like <code>cat .env</code>) is allowed, it can still read the file. This is why denying both <code>Read</code> and the relevant <code>Bash</code> commands together gives stronger protection. See <a href="https://www.petefreitag.com/blog/claude-code-permissions/">this security deep-dive</a> for details.</p>
</div>
</div>
</section>
<section id="allow-rules" class="level3">
<h3 class="anchored" data-anchor-id="allow-rules">Add <code>allow</code> rules to reduce prompt fatigue</h3>
<p>These also go in <strong><code>~/.claude/settings.json</code></strong>, inside the same <code>permissions</code> block as your deny rules. By default, Claude Code asks “Do you want to proceed?” before <em>every</em> file write, shell command, and git operation. This gets old fast — in a typical session you might approve the same <code>git add</code>, <code>git commit</code>, and <code>git push</code> sequence a dozen times. You can add <code>allow</code> rules so Claude runs trusted commands without asking:</p>
<p><strong>Minimal — just stop the git and test prompts:</strong></p>
<pre class="jsonc"><code>{
  "permissions": {
    "allow": [
      "Bash(git add:*)",
      "Bash(git commit:*)",
      "Bash(git push:*)",
      "Bash(git checkout:*)",
      "Bash(git log:*)",
      "Bash(git diff:*)",
      "Bash(git status)",
      "Bash(git rm:*)",
      "Bash(gh:*)",
      "Bash(python:*)",
      "Bash(pytest:*)"
    ]
  }
}</code></pre>
<p><strong>More aggressive — also auto-approve file edits:</strong></p>
<pre class="jsonc"><code>{
  "permissions": {
    "allow": [
      "Edit",
      "Write",
      "Bash(git:*)",
      "Bash(gh:*)",
      "Bash(python:*)",
      "Bash(pytest:*)",
      "Bash(npm:*)",
      "Bash(pip:*)"
    ]
  }
}</code></pre>
<div class="callout callout-style-default callout-tip callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Tip</span>The fastest way: use <code>acceptEdits</code> mode or enable sandboxing
</div>
</div>
<div class="callout-body-container callout-body">
<p>Instead of listing individual allow rules, you can:</p>
<ul>
<li><strong>Switch to <code>acceptEdits</code> mode</strong> (<code>Shift+Tab</code> inside a session) — auto-approves all file edits while still prompting for bash commands.</li>
<li><strong>Enable sandboxing</strong> (see Section 8) — this is the best option. Sandboxing auto-approves bash commands that stay within your project directory at the OS level, while actually <em>increasing</em> security.</li>
<li><strong>Use the interactive prompt</strong> — when Claude asks “Do you want to proceed?”, select <strong>“Yes, and don’t ask again for [command pattern]”</strong> to build up your allow list organically during a session.</li>
</ul>
</div>
</div>
<p>You don’t <em>have</em> to add any allow rules — the default “ask before doing” behavior is safe, especially when you’re starting out. But if you find yourself mindlessly hitting “Yes” on every prompt, that’s a sign you should either add allow rules or enable sandboxing.</p>
</section>
<section id="check-your-current-permissions" class="level3">
<h3 class="anchored" data-anchor-id="check-your-current-permissions">Check your current permissions</h3>
<p>Run <code>/permissions</code> inside a Claude Code session at any time to see what rules are active and which settings file they came from. This is especially helpful to verify your deny rules are loaded correctly.</p>
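<p>A malformed <code>settings.json</code> can fail to load without an obvious error, so it’s also worth checking that the file parses at all. One quick sanity check (assuming <code>python3</code> is on your PATH) is to run it through a JSON parser; here we validate a throwaway example file, but you can point the same command at <code>~/.claude/settings.json</code>:</p>

```shell
# Write a throwaway settings file, then check that it parses as JSON.
# Swap the $tmp path for ~/.claude/settings.json to check your real file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
{ "permissions": { "deny": ["Read(//Users/endemann/.ssh/**)"] } }
EOF
python3 -m json.tool "$tmp" > /dev/null && echo "valid JSON"
```

<p>If the file has a stray trailing comma or an unquoted key, the parser reports the line and column of the problem instead.</p>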
</section>
<section id="project-settings" class="level3">
<h3 class="anchored" data-anchor-id="project-settings">Use project-level settings for shared repos</h3>
<p>Claude Code reads <strong>two</strong> main settings files and <strong>merges them</strong>:</p>
<table class="caption-top table">
<colgroup>
<col style="width: 20%">
<col style="width: 23%">
<col style="width: 56%">
</colgroup>
<thead>
<tr class="header">
<th>File</th>
<th>Scope</th>
<th>What to put here</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><code>~/.claude/settings.json</code></td>
<td><strong>User-level</strong> — applies to every project you open</td>
<td>Personal deny rules (SSH keys, cloud creds, <code>.env</code>), model choice, sandbox config</td>
</tr>
<tr class="even">
<td><code>&lt;project-root&gt;/.claude/settings.json</code></td>
<td><strong>Project-level</strong> — applies only when Claude Code is launched inside this repo</td>
<td>Repo-specific deny rules (e.g., don’t read <code>secrets/</code>), shared team guardrails</td>
</tr>
</tbody>
</table>
<p><strong>How merging works:</strong> Claude Code combines both files. If either file denies something, it’s denied — project-level deny rules apply to everyone, even if their personal settings are more permissive. You don’t have to choose one or the other; most teams use both.</p>
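<p>To make the merge concrete, here’s a small sketch (not Claude Code’s actual implementation) that unions the <code>deny</code> arrays from two stand-in settings files, the way your user-level and project-level rules combine:</p>

```shell
# Stand-ins for ~/.claude/settings.json (user) and ./.claude/settings.json (project)
user=$(mktemp); proj=$(mktemp)
echo '{"permissions":{"deny":["Read(//home/me/.ssh/**)"]}}' > "$user"
echo '{"permissions":{"deny":["Read(./secrets/**)"]}}' > "$proj"
# Union the two deny lists: if either file denies something, it stays denied
python3 - "$user" "$proj" <<'EOF'
import json, sys
deny = []
for path in sys.argv[1:]:
    with open(path) as f:
        deny += json.load(f).get("permissions", {}).get("deny", [])
print("\n".join(sorted(set(deny))))
EOF
```

<p>Both rules land in the effective deny list; a more permissive personal file can’t remove a project-level denial.</p>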
<p>The project-level file lives inside your Git repo, so you can commit it and every collaborator automatically inherits the same rules:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb41" style="background: #f1f3f5;"><pre class="sourceCode json code-with-copy"><code class="sourceCode json"><span id="cb41-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb41-2">  <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"permissions"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb41-3">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"deny"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">[</span></span>
<span id="cb41-4">      <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Read(./.env)"</span><span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb41-5">      <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Read(./secrets/**)"</span><span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb41-6">      <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Bash(curl:*)"</span><span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb41-7">      <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Bash(wget:*)"</span></span>
<span id="cb41-8">    <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">]</span></span>
<span id="cb41-9">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb41-10"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span></code></pre></div></div>
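<p>Committing it looks like this (shown in a throwaway repo so the commands are safe to copy; in your own project you’d run only the <code>git add</code> and <code>git commit</code> lines):</p>

```shell
# Demo setup in a disposable repo -- skip these lines in a real project
repo=$(mktemp -d)
cd "$repo"
git init -q
mkdir -p .claude
echo '{"permissions":{"deny":["Read(./.env)"]}}' > .claude/settings.json
# The part you'd actually run in your project:
git add .claude/settings.json
git -c user.name=demo -c user.email=demo@example.com commit -qm "Add shared Claude Code guardrails"
```
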
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Note
</div>
</div>
<div class="callout-body-container callout-body">
<p><strong>Path syntax:</strong> Paths starting with <code>//</code> are absolute (used in user-level settings to protect home-directory files like <code>~/.ssh/</code>). Paths starting with <code>./</code> are relative to the project root (used in project-level settings). Use <code>**</code> to match recursively. See the <a href="https://code.claude.com/docs/en/permissions">permission rules reference</a> for full pattern syntax.</p>
</div>
</div>
</section>
<section id="interactive-safeguards-during-a-session" class="level3">
<h3 class="anchored" data-anchor-id="interactive-safeguards-during-a-session">Interactive safeguards during a session</h3>
<ul>
<li><strong>Press <code>Esc</code></strong> at any time to interrupt Claude Code mid-operation.</li>
<li><strong>Press <code>Shift+Tab</code></strong> to cycle through permission modes during a session.</li>
<li><strong>Use <code>/permissions</code></strong> to view and manage active permissions.</li>
<li><strong>Review every command</strong> before approving it — especially <code>rm</code>, <code>mv</code>, file writes outside your project, and anything involving <code>sudo</code>.</li>
</ul>
</section>
<section id="permission-mode-cheat-sheet" class="level3">
<h3 class="anchored" data-anchor-id="permission-mode-cheat-sheet">Permission mode cheat sheet</h3>
<table class="caption-top table">
<colgroup>
<col style="width: 17%">
<col style="width: 37%">
<col style="width: 45%">
</colgroup>
<thead>
<tr class="header">
<th>Mode</th>
<th>What it does</th>
<th>When to use it</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>default</strong></td>
<td>Asks before most operations</td>
<td>Day-to-day work (recommended)</td>
</tr>
<tr class="even">
<td><strong>acceptEdits</strong></td>
<td>Auto-approves file edits, still asks for bash commands</td>
<td>Trusted refactoring tasks</td>
</tr>
<tr class="odd">
<td><strong>plan</strong></td>
<td>Read-only, no modifications allowed</td>
<td>Exploring a codebase, code review</td>
</tr>
<tr class="even">
<td><strong>dontAsk</strong></td>
<td>Auto-denies anything not explicitly in <code>allow</code> list</td>
<td>Strict automation</td>
</tr>
<tr class="odd">
<td><strong>bypassPermissions</strong></td>
<td>Skips ALL checks</td>
<td>Isolated containers only — <strong>never on your local machine</strong></td>
</tr>
</tbody>
</table>
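<p>If you find yourself switching to the same mode at the start of every session, the startup mode can reportedly also be pinned in <code>~/.claude/settings.json</code> via the <code>permissions.defaultMode</code> key; check the <a href="https://code.claude.com/docs/en/permissions">permission rules reference</a> to confirm the key and the mode names your version supports:</p>

```json
{
  "permissions": {
    "defaultMode": "plan"
  }
}
```
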
</section>
</section>
<section id="enable-sandboxing" class="level2">
<h2 class="anchored" data-anchor-id="enable-sandboxing">8. Sandboxing (Strongly Recommended)</h2>
<div class="callout callout-style-default callout-tip callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Tip
</div>
</div>
<div class="callout-body-container callout-body">
<p>If you followed Section 5, sandboxing is already enabled in your <code>settings.json</code>. This section covers platform prerequisites, how to choose a sandbox mode, and what sandboxing actually does.</p>
</div>
</div>
<p>Sandboxing is the <strong>strongest protection available</strong> against Claude modifying files outside your project. It uses OS-level enforcement (not just Claude’s own judgment) to restrict what bash commands can access — even if a prompt injection bypasses Claude’s decision-making.</p>
<p><strong>What sandboxing does (<a href="https://code.claude.com/docs/en/sandboxing">official docs</a>):</strong></p>
<ul>
<li><strong>Filesystem isolation:</strong> <em>Write</em> access is restricted to the current working directory and its subdirectories — Claude <strong>cannot modify, create, or delete files outside your project directory</strong>. This is enforced at the OS kernel level. <strong>However, Claude can still <em>read</em> files anywhere on the machine</strong> (unless blocked by deny rules), and anything it reads can be included in prompts sent to Anthropic’s servers. This is an important limitation — sandboxing alone is not sufficient to protect sensitive data that exists elsewhere on the same machine.</li>
<li><strong>Network isolation:</strong> Only approved domains can be accessed. New domain requests trigger a permission prompt. This prevents data exfiltration even if Claude is compromised.</li>
</ul>
<div class="callout callout-style-default callout-tip callout-titled">
<div class="callout-header d-flex align-content-center collapsed" data-bs-toggle="collapse" data-bs-target=".callout-22-contents" aria-controls="callout-22" aria-expanded="false" aria-label="Toggle callout">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Tip</span>How to block reads outside your project directory
</div>
<div class="callout-btn-toggle d-inline-block border-0 py-1 ps-1 pe-0 float-end"><i class="callout-toggle"></i></div>
</div>
<div id="callout-22" class="callout-22-contents callout-collapse collapse">
<div class="callout-body-container callout-body">
<p>Add these deny rules to your <code>~/.claude/settings.json</code> to prevent Claude from reading files outside your project:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb42" style="background: #f1f3f5;"><pre class="sourceCode json code-with-copy"><code class="sourceCode json"><span id="cb42-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb42-2">  <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"permissions"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">{</span></span>
<span id="cb42-3">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">"deny"</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">:</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">[</span></span>
<span id="cb42-4">      <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Read(//**)"</span><span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb42-5">      <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Read(~/**)"</span><span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">,</span></span>
<span id="cb42-6">      <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Bash(cat:*)"</span></span>
<span id="cb42-7">    <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">]</span></span>
<span id="cb42-8">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span>
<span id="cb42-9"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">}</span></span></code></pre></div></div>
<ul>
<li><code>Read(//**)</code> — blocks the Read tool from accessing any file via an absolute path (files inside your project use relative paths and are unaffected)</li>
<li><code>Read(~/**)</code> — blocks reads via home directory paths</li>
<li><code>Bash(cat:*)</code> — blocks the most common Bash bypass for Read deny rules</li>
</ul>
<p><strong>Caveats:</strong> <code>Read</code> deny rules are best-effort for built-in tools like Grep and Glob. Other Bash commands beyond <code>cat</code> (e.g., <code>head</code>, <code>tail</code>, <code>less</code>) could also read files — add deny rules for those too if needed. For more granular deny rules targeting specific sensitive paths (SSH keys, cloud credentials, <code>.env</code> files), see Section 7.</p>
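<p>For example, the same <code>Bash(command:*)</code> prefix syntax extends to the other common file-reading commands mentioned above (a sketch, not an exhaustive set; adjust the list to your own threat model):</p>

```json
{
  "permissions": {
    "deny": [
      "Bash(cat:*)",
      "Bash(head:*)",
      "Bash(tail:*)",
      "Bash(less:*)",
      "Bash(more:*)"
    ]
  }
}
```
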
</div>
</div>
</div>
<p><strong>Without sandboxing</strong>, these protections do not exist — Claude operates with your full user permissions, and <a href="https://zenn.dev/tomioka/articles/0496a427f8bcd0?locale=en">can create/modify files in parent directories</a> without any special prompt.</p>
<p>Anthropic reports that sandboxing <a href="https://www.anthropic.com/engineering/claude-code-sandboxing">reduces permission prompts by 84%</a> while <em>increasing</em> security — you get fewer interruptions and better protection.</p>
<p><strong>How to enable it:</strong></p>
<div class="tabset-margin-container"></div><div class="panel-tabset">
<ul class="nav nav-tabs"><li class="nav-item"><a class="nav-link active" id="tabset-15-1-tab" data-bs-toggle="tab" data-bs-target="#tabset-15-1" aria-controls="tabset-15-1" aria-selected="true" href="">macOS</a></li><li class="nav-item"><a class="nav-link" id="tabset-15-2-tab" data-bs-toggle="tab" data-bs-target="#tabset-15-2" aria-controls="tabset-15-2" aria-selected="false" href="">Windows (WSL2)</a></li><li class="nav-item"><a class="nav-link" id="tabset-15-3-tab" data-bs-toggle="tab" data-bs-target="#tabset-15-3" aria-controls="tabset-15-3" aria-selected="false" href="">Windows (native PowerShell)</a></li></ul>
<div class="tab-content">
<div id="tabset-15-1" class="tab-pane active" aria-labelledby="tabset-15-1-tab">
<p>Works <strong>out of the box</strong> using Apple’s Seatbelt framework. Inside a Claude Code session, run:</p>
<pre><code>/sandbox</code></pre>
<p>Choose <strong>“Auto-allow mode”</strong> (recommended) — sandboxed commands run automatically without permission prompts, while anything that would exceed the sandbox boundaries falls back to the normal approval flow.</p>
</div>
<div id="tabset-15-2" class="tab-pane" aria-labelledby="tabset-15-2-tab">
<p>WSL2 uses <a href="https://github.com/containers/bubblewrap">bubblewrap</a> for sandbox isolation — the same mechanism used on native Linux. <strong>WSL1 is not supported</strong> because bubblewrap requires kernel features only available in WSL2.</p>
<p><strong>Step 1 — Confirm you’re on WSL2:</strong></p>
<p>Open a <strong>Windows PowerShell</strong> window (not your WSL terminal) and run:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb44" style="background: #f1f3f5;"><pre class="sourceCode powershell code-with-copy"><code class="sourceCode powershell"><span id="cb44-1">wsl <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>l <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>v</span></code></pre></div></div>
<p>You should see <code>VERSION 2</code> next to your distribution. If you’re on WSL1, <a href="https://learn.microsoft.com/en-us/windows/wsl/install#upgrade-version-from-wsl-1-to-wsl-2">upgrade to WSL2</a> first. Once confirmed, open your <strong>WSL2 terminal</strong> — all remaining steps run inside WSL.</p>
<p><strong>Step 2 — Install Claude Code inside WSL2:</strong></p>
<p>WSL2 is a separate Linux environment — your Windows PowerShell installation of Claude Code doesn’t carry over. Inside your WSL2 terminal, install it with npm:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb45" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb45-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">npm</span> install <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-g</span> @anthropic-ai/claude-code</span></code></pre></div></div>
<p>You’ll also need to reconfigure your cloud CLI credentials inside WSL2, since it has its own filesystem and config:</p>
<div class="tabset-margin-container"></div><div class="panel-tabset">
<ul class="nav nav-tabs"><li class="nav-item"><a class="nav-link active" id="tabset-14-1-tab" data-bs-toggle="tab" data-bs-target="#tabset-14-1" aria-controls="tabset-14-1" aria-selected="true" href="">Vertex AI</a></li><li class="nav-item"><a class="nav-link" id="tabset-14-2-tab" data-bs-toggle="tab" data-bs-target="#tabset-14-2" aria-controls="tabset-14-2" aria-selected="false" href="">Bedrock</a></li></ul>
<div class="tab-content">
<div id="tabset-14-1" class="tab-pane active" aria-labelledby="tabset-14-1-tab">
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb46" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb46-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">gcloud</span> auth login</span>
<span id="cb46-2"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">gcloud</span> auth application-default login</span>
<span id="cb46-3"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">gcloud</span> config set project YOUR_PROJECT_ID</span></code></pre></div></div>
</div>
<div id="tabset-14-2" class="tab-pane" aria-labelledby="tabset-14-2-tab">
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb47" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb47-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">aws</span> configure</span>
<span id="cb47-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Enter your Access Key ID, Secret Access Key, and default region when prompted</span></span></code></pre></div></div>
</div>
</div>
</div>
<p>Verify the Claude Code install:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb48" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb48-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">claude</span> <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--version</span></span></code></pre></div></div>
<p><strong>Step 3 — Copy your settings into WSL2:</strong></p>
<p>WSL2 has its own home directory (<code>/home/&lt;you&gt;/</code>), completely separate from Windows (<code>C:\Users\&lt;you&gt;\</code>). Claude Code won’t see your Windows <code>settings.json</code> — you need a copy inside WSL2:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb49" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb49-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Create the config directory if it doesn't exist</span></span>
<span id="cb49-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mkdir</span> <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-p</span> ~/.claude</span>
<span id="cb49-3"></span>
<span id="cb49-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Copy your Windows settings into WSL2</span></span>
<span id="cb49-5"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">cp</span> /mnt/c/Users/<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">$(</span><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">cmd.exe</span> /C <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"echo %USERNAME%"</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span>/dev/null <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">|</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">tr</span> <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-d</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'\r'</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">)</span>/.claude/settings.json ~/.claude/settings.json</span></code></pre></div></div>
<p>Or if you prefer, create <code>~/.claude/settings.json</code> manually with the same content you used in Section 5 (use <code>nano ~/.claude/settings.json</code>).</p>
<p>Verify the settings look correct:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb50" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb50-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">cat</span> ~/.claude/settings.json</span></code></pre></div></div>
<p><strong>Step 4 — Install sandbox dependencies:</strong></p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb51" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb51-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Ubuntu/Debian (most common WSL distro)</span></span>
<span id="cb51-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sudo</span> apt-get update <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">&amp;&amp;</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sudo</span> apt-get install bubblewrap socat</span>
<span id="cb51-3"></span>
<span id="cb51-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Fedora</span></span>
<span id="cb51-5"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sudo</span> dnf install bubblewrap socat</span></code></pre></div></div>
<p><strong>Step 5 — Launch Claude Code and enable sandboxing:</strong></p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb52" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb52-1"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">cd</span> ~/projects/my-project   <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># navigate to your project first</span></span>
<span id="cb52-2"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">claude</span>                      <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># launch Claude Code</span></span></code></pre></div></div>
<p>Then inside the Claude Code session, run:</p>
<pre><code>/sandbox</code></pre>
<p>Choose <strong>“Auto-allow mode”</strong> — sandboxed commands run automatically, while anything outside sandbox boundaries falls back to the normal approval flow.</p>
<p><strong>Step 6 — Verify it’s working:</strong></p>
<p>After enabling, Claude’s bash commands should run without permission prompts as long as they stay within your project directory. If a command needs to write outside the project or access a new network domain, you’ll still get a prompt.</p>
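<p>If sandboxing doesn’t seem to engage, it’s worth confirming the dependencies are actually visible from your shell (the same environment Claude Code runs in). A quick diagnostic sketch, noting that the <code>bubblewrap</code> package installs a binary named <code>bwrap</code>:</p>

```shell
# Check that the sandbox dependencies are on PATH (WSL2/Linux).
# bubblewrap installs the `bwrap` binary; socat installs `socat`.
for tool in bwrap socat; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found"
  else
    echo "$tool: MISSING - install it (see Step 4)"
  fi
done
```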
<div class="callout callout-style-default callout-tip callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Tip
</div>
</div>
<div class="callout-body-container callout-body">
<p>If you need tools like <code>npm</code>, <code>pip</code>, or <code>kubectl</code> to write outside your project directory (e.g., to <code>~/.npm</code> or <code>~/.kube</code>), grant specific write access in your <code>settings.json</code>:</p>
<pre class="jsonc"><code>{
  "sandbox": {
    "enabled": true,
    "filesystem": {
      "allowWrite": ["~/.npm", "~/.kube", "//tmp"]
    }
  }
}</code></pre>
<p>Path prefixes: <code>//</code> = absolute path, <code>~/</code> = home directory, <code>/</code> = relative to settings file. See the <a href="https://code.claude.com/docs/en/sandboxing">official sandboxing docs</a> for full path syntax.</p>
</div>
</div>
</div>
<div id="tabset-15-3" class="tab-pane" aria-labelledby="tabset-15-3-tab">
<p>Sandboxing is <strong>not yet supported</strong> on native Windows. Use deny rules and allow rules as your primary protection. If you need stronger isolation, consider running Claude Code from WSL2 instead.</p>
</div>
</div>
</div>
<p><strong>Sandbox modes explained:</strong></p>
<table class="caption-top table">
<colgroup>
<col style="width: 23%">
<col style="width: 38%">
<col style="width: 38%">
</colgroup>
<thead>
<tr class="header">
<th>Mode</th>
<th>Behavior</th>
<th>Best for</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>Auto-allow</strong></td>
<td>Sandboxed commands run without prompts. Commands that exceed sandbox boundaries fall back to normal approval flow.</td>
<td>Day-to-day work — fewer interruptions, same security</td>
</tr>
<tr class="even">
<td><strong>Regular permissions</strong></td>
<td>All commands go through standard approval, even when sandboxed.</td>
<td>When you want to review every command regardless</td>
</tr>
</tbody>
</table>
<p>In both modes, the OS-level filesystem and network restrictions are identical. The only difference is whether sandboxed commands are auto-approved.</p>
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Note
</div>
</div>
<div class="callout-body-container callout-body">
<p>Sandboxing applies only to Bash commands and their child processes. The built-in Read/Edit tools are governed by permission rules. Use both together for the strongest security posture — this is what Anthropic calls <a href="https://code.claude.com/docs/en/sandboxing">“defense in depth”</a>.</p>
<p>Some tools are incompatible with sandboxing (e.g., <code>docker</code>, <code>watchman</code>). If a tool fails inside the sandbox, you can exclude it with <code>"excludedCommands"</code> in your <a href="https://code.claude.com/docs/en/sandboxing">sandbox settings</a> so it runs outside the sandbox with normal permission prompts instead.</p>
</div>
</div>
<section id="further-reading-on-permissions-security" class="level3">
<h3 class="anchored" data-anchor-id="further-reading-on-permissions-security">Further reading on permissions &amp; security</h3>
<ul>
<li><a href="https://code.claude.com/docs/en/permissions"><strong>Official permissions docs</strong></a> — full reference for rule syntax, tool names, managed policies</li>
<li><a href="https://code.claude.com/docs/en/sandboxing"><strong>Official sandboxing docs</strong></a> — setup, configuration, and security model for OS-level isolation</li>
<li><a href="https://www.anthropic.com/engineering/claude-code-sandboxing"><strong>Anthropic engineering blog on sandboxing</strong></a> — explains the design rationale and how filesystem + network isolation work together</li>
<li><a href="https://www.petefreitag.com/blog/claude-code-permissions/"><strong>Security deep-dive by Pete Freitag</strong></a> — independent analysis of how permissions actually work under the hood, including gotchas and bypasses</li>
<li><a href="https://claudefa.st/blog/guide/development/permission-management"><strong>Permission modes guide</strong></a> — when to use each mode, with workflow examples</li>
</ul>
</section>
</section>
<section id="billing-cost-management-avoiding-runaway-costs" class="level2">
<h2 class="anchored" data-anchor-id="billing-cost-management-avoiding-runaway-costs">9. Billing, Cost Management &amp; Avoiding Runaway Costs</h2>
<blockquote class="blockquote">
<p>Official cost management docs: <a href="https://code.claude.com/docs/en/costs">code.claude.com/docs/en/costs</a></p>
</blockquote>
<div class="callout callout-style-default callout-warning callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Warning</span>Claude Code on cloud providers is pay-as-you-go
</div>
</div>
<div class="callout-body-container callout-body">
<p>Every token you send and receive is billed to your cloud project. Unlike a Claude Pro/Max subscription with fixed monthly pricing, there is no spending cap unless you set one up yourself. <strong>You can rack up significant charges quickly if you’re not careful.</strong></p>
</div>
</div>
<section id="typical-costs-to-expect" class="level3">
<h3 class="anchored" data-anchor-id="typical-costs-to-expect">Typical costs to expect</h3>
<p>According to <a href="https://code.claude.com/docs/en/costs">Anthropic’s own documentation</a>, the average Claude Code session costs about <strong>$6 per developer per day</strong>, with 90% of users staying under $12/day. However, this is an <em>average</em> — complex tasks, long sessions, or agentic loops can blow past this easily. Monthly costs with Sonnet typically run <strong>$100–$200/developer</strong>, but there’s large variance.</p>
<p>Use the <strong><code>/cost</code></strong> command inside Claude Code at any time to see your current session’s token usage and estimated cost:</p>
<pre><code>/cost</code></pre>
</section>
<section id="understand-what-drives-costs" class="level3">
<h3 class="anchored" data-anchor-id="understand-what-drives-costs">Understand what drives costs</h3>
<p>Token costs scale with <strong>context size</strong> — the more context Claude processes, the more you pay. Key cost drivers:</p>
<ul>
<li><strong>Long conversations:</strong> Claude re-processes the entire conversation history with each message. A session that’s been going for hours costs more per-message than a fresh one.</li>
<li><strong>Large codebases:</strong> If Claude reads many files to understand your project, that’s all input tokens.</li>
<li><strong>Extended thinking:</strong> Enabled by default with a budget of <a href="https://code.claude.com/docs/en/costs">~32K tokens</a>. Thinking tokens are billed as output tokens (the expensive kind). For simple tasks, this is overkill.</li>
<li><strong>Agentic loops (see below):</strong> The single biggest risk for runaway costs.</li>
</ul>
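<p>The “long conversations” driver is easy to underestimate because the cost compounds: each new message re-sends the whole history as input. A toy calculation (illustrative token counts only, and ignoring prompt caching, which lowers real-world input costs but not the scaling):</p>

```python
def cumulative_input_tokens(n_messages, tokens_per_message=500):
    """Total input tokens billed over a conversation where each new
    message re-sends the entire prior history as input."""
    total = 0
    history = 0
    for _ in range(n_messages):
        history += tokens_per_message  # the new message joins the history
        total += history               # the whole history is sent as input
    return total

# 10x the messages costs ~90x the input tokens, not 10x:
print(cumulative_input_tokens(10))   # 27500
print(cumulative_input_tokens(100))  # 2525000
```

<p>This quadratic growth is exactly why <code>/clear</code> and <code>/compact</code> pay off on long sessions.</p>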
</section>
<section id="agentic-loops" class="level3">
<h3 class="anchored" data-anchor-id="agentic-loops">Agentic loops — the #1 cost risk</h3>
<div class="callout callout-style-default callout-warning callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Warning
</div>
</div>
<div class="callout-body-container callout-body">
<p>An <strong>agentic loop</strong> happens when Claude gets stuck in a cycle — repeatedly reading the same files, running the same commands, or auto-compacting and re-expanding its context. This can burn through tokens at an alarming rate.</p>
</div>
</div>
<p><strong>Real-world examples from the community:</strong></p>
<ul>
<li>One user <a href="https://github.com/anthropics/claude-code/issues/9579">reported a bug</a> where an auto-compacting loop consumed <strong>108 million tokens in a single day ($64–$78)</strong>, compared to their normal 12–68M tokens/day.</li>
<li>Runaway spikes of <strong>$235+ over a 4-day period</strong> were documented from that same loop bug.</li>
<li>Version updates have occasionally caused <a href="https://github.com/anthropics/claude-code/issues/16856"><strong>4x faster token consumption</strong></a> on the same tasks.</li>
</ul>
<p><strong>How to protect yourself:</strong></p>
<ul>
<li><strong>Watch for signs:</strong> If Claude seems to be reading the same files over and over, or your <code>/cost</code> jumps unusually fast, press <strong><code>Esc</code></strong> immediately to interrupt.</li>
<li><strong>Use <code>/compact</code></strong> to compress long conversations before they bloat.</li>
<li><strong>Use <code>/clear</code></strong> when switching to unrelated tasks — don’t let stale context accumulate.</li>
<li><strong>Be specific in prompts.</strong> Vague requests like “improve this codebase” trigger broad scanning. “Add input validation to the login function in <code>auth.ts</code>” is much cheaper.</li>
<li><strong>Use plan mode (Shift+Tab)</strong> before expensive operations. Claude outlines its approach for your approval before writing code, preventing costly rework.</li>
<li><strong>Start with Sonnet, not Opus.</strong> Only escalate to Opus for genuinely complex reasoning tasks. Opus costs significantly more per token.</li>
</ul>
</section>
<section id="budget-alerts" class="level3">
<h3 class="anchored" data-anchor-id="budget-alerts">Set up budget alerts (do this NOW)</h3>
<p>Budget alerts won’t automatically stop spending, but they’ll email you when you’re approaching a threshold. <strong>Set this up before your first Claude Code session.</strong></p>
<div class="tabset-margin-container"></div><div class="panel-tabset">
<ul class="nav nav-tabs"><li class="nav-item"><a class="nav-link active" id="tabset-16-1-tab" data-bs-toggle="tab" data-bs-target="#tabset-16-1" aria-controls="tabset-16-1" aria-selected="true" href="">Vertex AI (GCP)</a></li><li class="nav-item"><a class="nav-link" id="tabset-16-2-tab" data-bs-toggle="tab" data-bs-target="#tabset-16-2" aria-controls="tabset-16-2" aria-selected="false" href="">Amazon Bedrock (AWS)</a></li></ul>
<div class="tab-content">
<div id="tabset-16-1" class="tab-pane active" aria-labelledby="tabset-16-1-tab">
<blockquote class="blockquote">
<p>GCP budget alerts: <a href="https://cloud.google.com/billing/docs/how-to/budgets">cloud.google.com/billing/docs/how-to/budgets</a></p>
</blockquote>
<ol type="1">
<li>Go to <a href="https://console.cloud.google.com/billing/budgets"><strong>Billing → Budgets &amp; Alerts</strong></a> in the GCP Console.</li>
<li>Click <strong>Create Budget</strong>.</li>
<li>Scope it to your project (e.g., <code>doit-rci-sandbox-gcp-baa4</code>).</li>
<li>Set a monthly budget amount you’re comfortable with (e.g., $50, $100, $200).</li>
<li>Set alert thresholds at <strong>50%, 80%, and 100%</strong> of your budget.</li>
<li>Add your email as a notification recipient.</li>
</ol>
<div class="callout callout-style-default callout-important callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Important
</div>
</div>
<div class="callout-body-container callout-body">
<p>Budget alerts are <a href="https://cloud.google.com/billing/docs/how-to/budgets"><strong>notifications only</strong></a> — they do NOT automatically stop your services or disable billing. If you hit 100% of your budget at 2 AM, charges will continue until you manually intervene. Check your billing console daily.</p>
</div>
</div>
<p><strong>Set quota limits as a hard ceiling:</strong></p>
<p>Unlike budget alerts, <strong>quotas can actually stop usage.</strong> Go to <a href="https://console.cloud.google.com/apis/api/aiplatform.googleapis.com/quotas"><strong>Quotas &amp; System Limits</strong></a> and set reasonable TPM (tokens per minute) quotas for the Claude models you’re using. If you hit your quota, Claude Code will get 429 errors instead of running up your bill.</p>
</div>
<div id="tabset-16-2" class="tab-pane" aria-labelledby="tabset-16-2-tab">
<blockquote class="blockquote">
<p>AWS Budgets docs: <a href="https://docs.aws.amazon.com/cost-management/latest/userguide/budgets-managing-costs.html">docs.aws.amazon.com/cost-management/latest/userguide/budgets-managing-costs.html</a></p>
</blockquote>
<ol type="1">
<li>Go to <a href="https://console.aws.amazon.com/billing/home#/budgets"><strong>AWS Budgets</strong></a> in the AWS Console.</li>
<li>Click <strong>Create budget</strong>.</li>
<li>Choose <strong>Cost budget</strong> and set a monthly amount (e.g., $50, $100, $200).</li>
<li>Add alert thresholds at <strong>50%, 80%, and 100%</strong> of your budget.</li>
<li>Add your email as a notification recipient.</li>
</ol>
<div class="callout callout-style-default callout-important callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Important
</div>
</div>
<div class="callout-body-container callout-body">
<p>Like GCP, AWS budget alerts are <a href="https://docs.aws.amazon.com/cost-management/latest/userguide/budgets-managing-costs.html"><strong>notifications only</strong></a> by default — they do NOT automatically stop your services. However, AWS does support <a href="https://docs.aws.amazon.com/cost-management/latest/userguide/budgets-controls.html">budget actions</a> that can automatically restrict IAM permissions when a threshold is hit. Consider setting up an action to revoke <code>bedrock:InvokeModel</code> at 100% of budget as a hard stop.</p>
</div>
</div>
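<p>If you do wire up a budget action, the policy it attaches is an ordinary IAM deny. A minimal sketch of such a policy (resource scoping is left wide open here; tighten it for your account):</p>

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "HardStopBedrockAtBudget",
      "Effect": "Deny",
      "Action": [
        "bedrock:InvokeModel",
        "bedrock:InvokeModelWithResponseStream"
      ],
      "Resource": "*"
    }
  ]
}
```

<p>Because an explicit deny overrides any allow, attaching this to your user or role halts all Bedrock invocations until you detach it.</p>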
<p><strong>Bedrock quotas:</strong></p>
<p>Bedrock enforces per-model quotas on RPM (requests per minute) and TPM (tokens per minute). Default quotas are relatively low (e.g., 25 RPM for Opus). View and request increases at <a href="https://console.aws.amazon.com/servicequotas/home/services/bedrock/quotas"><strong>Service Quotas → Amazon Bedrock</strong></a> in the AWS Console.</p>
</div>
</div>
</div>
</section>
<section id="check-your-billing-console-daily" class="level3">
<h3 class="anchored" data-anchor-id="check-your-billing-console-daily">Check your billing console daily</h3>
<p>Make it a habit to check your cloud billing daily while actively using Claude Code. Look for:</p>
<ul>
<li>Unexpected spikes in AI/ML charges</li>
<li>Daily costs that exceed your expectations</li>
<li>Any charges from services you didn’t intentionally use</li>
</ul>
</section>
<section id="cost-saving-tips" class="level3">
<h3 class="anchored" data-anchor-id="cost-saving-tips">Cost-saving tips</h3>
<table class="caption-top table">
<thead>
<tr class="header">
<th>Strategy</th>
<th>Savings</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Use <strong>Sonnet</strong> instead of Opus for routine tasks</td>
<td>Significantly lower per-token cost</td>
</tr>
<tr class="even">
<td>Run <strong><code>/clear</code></strong> between unrelated tasks</td>
<td>Avoids paying to re-process old context</td>
</tr>
<tr class="odd">
<td>Run <strong><code>/compact</code></strong> during long sessions</td>
<td>Compresses context, reduces per-message cost</td>
</tr>
<tr class="even">
<td>Lower extended thinking budget (<code>MAX_THINKING_TOKENS=8000</code>)</td>
<td>Reduces output token costs on simple tasks</td>
</tr>
<tr class="odd">
<td>Keep your <code>CLAUDE.md</code> files lean (&lt;200 lines recommended)</td>
<td>Less context loaded on every message — see below</td>
</tr>
<tr class="even">
<td>Use <strong>plan mode</strong> before big refactors</td>
<td>Catches wrong approaches before expensive execution</td>
</tr>
<tr class="odd">
<td>Set <code>DISABLE_NON_ESSENTIAL_MODEL_CALLS=1</code> in your environment</td>
<td>Skips auxiliary model calls not required for core functionality</td>
</tr>
</tbody>
</table>
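<p>Several of these knobs are environment variables, which you can pin in <code>settings.json</code> so they apply to every session. A sketch with illustrative values; see the <a href="https://code.claude.com/docs/en/settings">settings docs</a> for the full list of supported variables:</p>

```jsonc
{
  "env": {
    // Cap extended thinking on routine work
    "MAX_THINKING_TOKENS": "8000",
    // Skip model calls that aren't required for core functionality
    "DISABLE_NON_ESSENTIAL_MODEL_CALLS": "1"
  }
}
```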
</section>
<section id="pricing-reference" class="level3">
<h3 class="anchored" data-anchor-id="pricing-reference">Pricing reference</h3>
<p>For the latest Claude model pricing, see:</p>
<ul>
<li><a href="https://platform.claude.com/docs/en/about-claude/pricing"><strong>Anthropic Pricing</strong></a> — base token rates by model, including cache and batch discounts</li>
<li><a href="https://code.claude.com/docs/en/costs"><strong>Claude Code Cost Management</strong></a> — Anthropic’s official guide to tracking and reducing Claude Code costs</li>
<li><a href="https://cloud.google.com/vertex-ai/generative-ai/pricing"><strong>Vertex AI Generative AI Pricing</strong></a> — Google’s per-token rates (regional endpoints carry a <a href="https://cloud.google.com/vertex-ai/generative-ai/pricing">10% premium</a> over global)</li>
<li><a href="https://aws.amazon.com/bedrock/pricing/"><strong>Amazon Bedrock Pricing</strong></a> — AWS per-token rates for Claude models</li>
</ul>
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Note
</div>
</div>
<div class="callout-body-container callout-body">
<p>On Vertex AI, the token usage shown on the <a href="https://docs.cloud.google.com/vertex-ai/generative-ai/docs/partner-models/claude/use-claude">GCP Quotas page may be inaccurate</a> for Claude models due to estimation and refund logic. For accurate billing, check the GCP Billing Reports page or use <code>/cost</code> inside Claude Code.</p>
</div>
</div>
</section>
</section>
<section id="data-usage" class="level2">
<h2 class="anchored" data-anchor-id="data-usage">10. Data Usage &amp; Privacy</h2>
<blockquote class="blockquote">
<p>Official data usage docs: <a href="https://code.claude.com/docs/en/data-usage">code.claude.com/docs/en/data-usage</a></p>
</blockquote>
<section id="training-policy" class="level3">
<h3 class="anchored" data-anchor-id="training-policy">Training policy</h3>
<p>Whether your code is used for model training depends on your account type:</p>
<table class="caption-top table">
<thead>
<tr class="header">
<th>Account type</th>
<th>Used for training?</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>API, Vertex AI, Bedrock, Foundry</strong> (commercial)</td>
<td><strong>No</strong> — Anthropic does not train on your prompts or outputs unless you explicitly opt in (e.g., the <a href="https://support.claude.com/en/articles/11174108-about-the-development-partner-program">Development Partner Program</a>)</td>
</tr>
<tr class="even">
<td><strong>Team &amp; Enterprise</strong> (commercial)</td>
<td><strong>No</strong> — same commercial terms</td>
</tr>
<tr class="odd">
<td><strong>Free, Pro, Max</strong> (consumer)</td>
<td><strong>Opt-in</strong> — you choose whether to allow training at <a href="https://claude.ai/settings/data-privacy-controls">claude.ai/settings/data-privacy-controls</a></td>
</tr>
</tbody>
</table>
<p>If you’re accessing Claude Code through a UW-Madison cloud account (Vertex AI or Bedrock), your usage falls under Anthropic’s commercial terms — <strong>your code is not used for training</strong>. However, note that UW-Madison does not yet have a direct data-use agreement with Anthropic. The no-training guarantee comes from Anthropic’s standard commercial terms, not from a UW-negotiated agreement. This distinction matters for restricted or regulated data — see the data sensitivity warning at the top of this guide.</p>
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Note
</div>
</div>
<div class="callout-body-container callout-body">
<p><strong>Important nuances for consumer plans (Free/Pro/Max):</strong></p>
<ul>
<li><strong>Safety exception:</strong> Even with training disabled, conversations flagged for <a href="https://www.anthropic.com/legal/aup">safety review</a> may be used to improve Anthropic’s safety systems (e.g., training safeguard models).</li>
<li><strong>What’s included:</strong> The entire conversation — prompts, outputs, custom styles, and conversation preferences.</li>
<li><strong>What’s excluded:</strong> Raw content from connectors (Google Drive, MCP servers) is <strong>not</strong> included, unless you directly copy that content into your conversation.</li>
<li><strong>Feedback (thumbs up/down):</strong> Submitting feedback stores the full related conversation for up to 5 years (de-linked from your user ID). This data may be used for training regardless of your training setting.</li>
</ul>
</div>
</div>
</section>
<section id="data-retention" class="level3">
<h3 class="anchored" data-anchor-id="data-retention">Data retention</h3>
<table class="caption-top table">
<colgroup>
<col style="width: 50%">
<col style="width: 50%">
</colgroup>
<thead>
<tr class="header">
<th>Account type</th>
<th>Retention period</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>API / Vertex AI / Bedrock / Team / Enterprise</td>
<td>30 days (default)</td>
</tr>
<tr class="even">
<td>Enterprise with Zero Data Retention (ZDR)</td>
<td>0 days (must be enabled per organization)</td>
</tr>
<tr class="odd">
<td>Consumer — training allowed</td>
<td>5 years</td>
</tr>
<tr class="even">
<td>Consumer — training not allowed</td>
<td>30 days</td>
</tr>
</tbody>
</table>
<p>Claude Code also caches sessions locally on your machine for up to 30 days to enable session resumption (configurable).</p>
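<p>The local retention window is adjustable from <code>settings.json</code>. A sketch, assuming the <code>cleanupPeriodDays</code> key from the <a href="https://code.claude.com/docs/en/settings">settings reference</a>:</p>

```jsonc
{
  // Delete locally cached session transcripts after 7 days instead of 30
  "cleanupPeriodDays": 7
}
```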
</section>
<section id="telemetry-and-error-reporting" class="level3">
<h3 class="anchored" data-anchor-id="telemetry-and-error-reporting">Telemetry and error reporting</h3>
<p>Claude Code sends operational metrics and error reports by default when using the direct Claude API:</p>
<table class="caption-top table">
<colgroup>
<col style="width: 33%">
<col style="width: 33%">
<col style="width: 33%">
</colgroup>
<thead>
<tr class="header">
<th>Service</th>
<th>What it sends</th>
<th>Opt-out</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>Statsig</strong> (metrics)</td>
<td>Latency, reliability, usage patterns — <strong>no code or file paths</strong></td>
<td><code>DISABLE_TELEMETRY=1</code></td>
</tr>
<tr class="even">
<td><strong>Sentry</strong> (errors)</td>
<td>Error logs — <strong>no code or file paths</strong></td>
<td><code>DISABLE_ERROR_REPORTING=1</code></td>
</tr>
<tr class="odd">
<td><strong><code>/bug</code> command</strong></td>
<td>Full conversation history including code (only when you run <code>/bug</code>)</td>
<td><code>DISABLE_BUG_COMMAND=1</code></td>
</tr>
<tr class="even">
<td><strong>Session quality surveys</strong></td>
<td>Numeric rating only (1/2/3/dismiss) — <strong>no conversation data</strong></td>
<td><code>CLAUDE_CODE_DISABLE_FEEDBACK_SURVEY=1</code></td>
</tr>
</tbody>
</table>
<div class="callout callout-style-default callout-tip callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Tip
</div>
</div>
<div class="callout-body-container callout-body">
<p><strong>Using Vertex AI, Bedrock, or Foundry?</strong> All non-essential traffic (telemetry, error reporting, <code>/bug</code>, surveys) is <strong>disabled by default</strong> for third-party providers. You don’t need to set any environment variables.</p>
<p>To disable everything at once regardless of provider, set <code>CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1</code>. Environment variables can be set in your <a href="https://code.claude.com/docs/en/settings"><code>settings.json</code></a>.</p>
</div>
</div>
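<p>As a concrete example, the catch-all opt-out from the tip above can be pinned in <code>~/.claude/settings.json</code> so it applies to every session regardless of your shell environment:</p>

```jsonc
{
  "env": {
    // Disables telemetry, error reporting, /bug uploads, and surveys at once
    "CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC": "1"
  }
}
```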
</section>
</section>
<section id="troubleshooting" class="level2">
<h2 class="anchored" data-anchor-id="troubleshooting">11. Troubleshooting</h2>
<div class="tabset-margin-container"></div><div class="panel-tabset">
<ul class="nav nav-tabs"><li class="nav-item"><a class="nav-link active" id="tabset-19-1-tab" data-bs-toggle="tab" data-bs-target="#tabset-19-1" aria-controls="tabset-19-1" aria-selected="true" href="">Vertex AI</a></li><li class="nav-item"><a class="nav-link" id="tabset-19-2-tab" data-bs-toggle="tab" data-bs-target="#tabset-19-2" aria-controls="tabset-19-2" aria-selected="false" href="">Amazon Bedrock</a></li></ul>
<div class="tab-content">
<div id="tabset-19-1" class="tab-pane active" aria-labelledby="tabset-19-1-tab">
<p><strong>“model not found” (404):</strong></p>
<p>The model may not be available on the <code>global</code> endpoint. Try changing your region:</p>
<pre class="jsonc"><code>"CLOUD_ML_REGION": "us-east5"</code></pre>
<p>Or add a model-specific override (see Section 5).</p>
<p><strong>429 “Resource Exhausted”:</strong></p>
<p>You need a quota increase. Go to <a href="https://console.cloud.google.com/apis/api/aiplatform.googleapis.com/quotas">Cloud Console → Quotas</a>, filter by the Claude model, and request an increase.</p>
<p><strong>Permission denied / IAM errors:</strong></p>
<p>Ask your UW-Madison GCP admin to ensure you have the <code>roles/aiplatform.user</code> role on the project.</p>
<p><strong><code>gcloud</code> command not found:</strong></p>
<div class="tabset-margin-container"></div><div class="panel-tabset">
<ul class="nav nav-tabs"><li class="nav-item"><a class="nav-link active" id="tabset-17-1-tab" data-bs-toggle="tab" data-bs-target="#tabset-17-1" aria-controls="tabset-17-1" aria-selected="true" href="">Windows</a></li><li class="nav-item"><a class="nav-link" id="tabset-17-2-tab" data-bs-toggle="tab" data-bs-target="#tabset-17-2" aria-controls="tabset-17-2" aria-selected="false" href="">macOS</a></li></ul>
<div class="tab-content">
<div id="tabset-17-1" class="tab-pane active" aria-labelledby="tabset-17-1-tab">
<p>Make sure the Google Cloud SDK is installed and restart PowerShell.</p>
</div>
<div id="tabset-17-2" class="tab-pane" aria-labelledby="tabset-17-2-tab">
<p>If you installed via Homebrew, run <code>source "$(brew --prefix)/share/google-cloud-sdk/path.zsh.inc"</code> or restart your terminal. If you used the curl installer, make sure you ran <code>source ~/.zshrc</code> after install.</p>
</div>
</div>
</div>
<p><strong>Authentication expired:</strong></p>
<p>Re-run (both platforms):</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb57" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb57-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">gcloud</span> auth application-default login</span></code></pre></div></div>
</div>
<div id="tabset-19-2" class="tab-pane" aria-labelledby="tabset-19-2-tab">
<p><strong><code>AccessDeniedException</code>:</strong></p>
<p>Your IAM user/role doesn’t have the required Bedrock permissions. Ensure your policy includes <code>bedrock:InvokeModel</code> and <code>bedrock:InvokeModelWithResponseStream</code>. See Section 4 for the full policy.</p>
<p><strong><code>ValidationException: Model not found</code>:</strong></p>
<p>Check that: (1) you’ve enabled model access for the specific Claude model in the Bedrock console, (2) your <code>AWS_REGION</code> is set to a region where the model is available, and (3) your model ID is correct. Try <code>us-east-1</code> if unsure.</p>
<p><strong><code>ResourceNotFoundException</code>:</strong></p>
<p>You haven’t completed model access approval. Go to the <a href="https://console.aws.amazon.com/bedrock/">Bedrock Model catalog</a> and request access.</p>
<p><strong>429 “ThrottlingException”:</strong></p>
<p>You’ve hit your Bedrock quota. Request an increase at <a href="https://console.aws.amazon.com/servicequotas/home/services/bedrock/quotas">Service Quotas → Amazon Bedrock</a>. Default quotas are low (e.g., 25 RPM for Opus).</p>
<p><strong><code>aws</code> command not found:</strong></p>
<div class="tabset-margin-container"></div><div class="panel-tabset">
<ul class="nav nav-tabs"><li class="nav-item"><a class="nav-link active" id="tabset-18-1-tab" data-bs-toggle="tab" data-bs-target="#tabset-18-1" aria-controls="tabset-18-1" aria-selected="true" href="">Windows</a></li><li class="nav-item"><a class="nav-link" id="tabset-18-2-tab" data-bs-toggle="tab" data-bs-target="#tabset-18-2" aria-controls="tabset-18-2" aria-selected="false" href="">macOS</a></li></ul>
<div class="tab-content">
<div id="tabset-18-1" class="tab-pane active" aria-labelledby="tabset-18-1-tab">
<p>Make sure the AWS CLI is installed and restart PowerShell.</p>
</div>
<div id="tabset-18-2" class="tab-pane" aria-labelledby="tabset-18-2-tab">
<p>If you installed via Homebrew, restart your terminal. If you used the pkg installer, ensure <code>/usr/local/bin</code> is in your <code>$PATH</code>.</p>
</div>
</div>
</div>
<p><strong>Credentials expired (SSO):</strong></p>
<p>Re-run:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb58" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb58-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">aws</span> sso login <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--profile</span> your-profile-name</span></code></pre></div></div>
<div class="callout callout-style-default callout-tip callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Tip
</div>
</div>
<div class="callout-body-container callout-body">
<p>Claude Code supports <a href="https://code.claude.com/docs/en/amazon-bedrock">automatic credential refresh</a> for AWS SSO. Configure <code>awsAuthRefresh</code> in your settings so Claude Code automatically re-authenticates when credentials expire mid-session, instead of failing with auth errors.</p>
</div>
</div>
</div>
</div>
</div>
<section id="common-issues-both-providers" class="level3">
<h3 class="anchored" data-anchor-id="common-issues-both-providers">Common issues (both providers)</h3>
<p><strong><code>claude</code> command not found:</strong></p>
<div class="tabset-margin-container"></div><div class="panel-tabset">
<ul class="nav nav-tabs"><li class="nav-item"><a class="nav-link active" id="tabset-20-1-tab" data-bs-toggle="tab" data-bs-target="#tabset-20-1" aria-controls="tabset-20-1" aria-selected="true" href="">Windows</a></li><li class="nav-item"><a class="nav-link" id="tabset-20-2-tab" data-bs-toggle="tab" data-bs-target="#tabset-20-2" aria-controls="tabset-20-2" aria-selected="false" href="">macOS</a></li></ul>
<div class="tab-content">
<div id="tabset-20-1" class="tab-pane active" aria-labelledby="tabset-20-1-tab">
<p>Ensure <code>C:\Users\&lt;you&gt;\.local\bin</code> is in your user PATH (see Section 2). Restart PowerShell after editing PATH.</p>
</div>
<div id="tabset-20-2" class="tab-pane" aria-labelledby="tabset-20-2-tab">
<p>Ensure <code>~/.claude/bin</code> or <code>~/.local/bin</code> is in your <code>$PATH</code>. Restart your terminal or run <code>source ~/.zshrc</code>.</p>
</div>
</div>
</div>
<p><strong>Run diagnostics:</strong></p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb59" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb59-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">claude</span> doctor</span></code></pre></div></div>
<p>This checks your installation, authentication, and configuration for common issues.</p>
</section>
</section>
<section id="quick-reference" class="level2">
<h2 class="anchored" data-anchor-id="quick-reference">12. Quick Reference</h2>
<div class="tabset-margin-container"></div><div class="panel-tabset">
<ul class="nav nav-tabs"><li class="nav-item"><a class="nav-link active" id="tabset-21-1-tab" data-bs-toggle="tab" data-bs-target="#tabset-21-1" aria-controls="tabset-21-1" aria-selected="true" href="">Vertex AI + Windows</a></li><li class="nav-item"><a class="nav-link" id="tabset-21-2-tab" data-bs-toggle="tab" data-bs-target="#tabset-21-2" aria-controls="tabset-21-2" aria-selected="false" href="">Vertex AI + macOS</a></li><li class="nav-item"><a class="nav-link" id="tabset-21-3-tab" data-bs-toggle="tab" data-bs-target="#tabset-21-3" aria-controls="tabset-21-3" aria-selected="false" href="">Bedrock + Windows</a></li><li class="nav-item"><a class="nav-link" id="tabset-21-4-tab" data-bs-toggle="tab" data-bs-target="#tabset-21-4" aria-controls="tabset-21-4" aria-selected="false" href="">Bedrock + macOS</a></li></ul>
<div class="tab-content">
<div id="tabset-21-1" class="tab-pane active" aria-labelledby="tabset-21-1-tab">
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb60" style="background: #f1f3f5;"><pre class="sourceCode powershell code-with-copy"><code class="sourceCode powershell"><span id="cb60-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Install Claude Code</span></span>
<span id="cb60-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">irm</span> https<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">://</span>claude<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ai</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>install<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ps1</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">iex</span></span>
<span id="cb60-3"></span>
<span id="cb60-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Install gcloud (if needed)</span></span>
<span id="cb60-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Download from https://cloud.google.com/sdk/docs/install</span></span>
<span id="cb60-6"></span>
<span id="cb60-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Authenticate with GCP</span></span>
<span id="cb60-8">gcloud auth login</span>
<span id="cb60-9">gcloud auth application-default login</span>
<span id="cb60-10"></span>
<span id="cb60-11"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Launch Claude Code</span></span>
<span id="cb60-12"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">cd</span> C<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span>\your\project</span>
<span id="cb60-13">claude</span></code></pre></div></div>
</div>
<div id="tabset-21-2" class="tab-pane" aria-labelledby="tabset-21-2-tab">
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb61" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb61-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Install Claude Code</span></span>
<span id="cb61-2"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">curl</span> <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-fsSL</span> https://claude.ai/install.sh <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">|</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">bash</span></span>
<span id="cb61-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># or: brew install --cask claude-code</span></span>
<span id="cb61-4"></span>
<span id="cb61-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Install gcloud (if needed)</span></span>
<span id="cb61-6"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">brew</span> install <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--cask</span> google-cloud-sdk</span>
<span id="cb61-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># or: curl https://sdk.cloud.google.com | bash</span></span>
<span id="cb61-8"></span>
<span id="cb61-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Authenticate with GCP</span></span>
<span id="cb61-10"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">gcloud</span> auth login</span>
<span id="cb61-11"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">gcloud</span> auth application-default login</span>
<span id="cb61-12"></span>
<span id="cb61-13"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Launch Claude Code</span></span>
<span id="cb61-14"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">cd</span> ~/your/project</span>
<span id="cb61-15"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">claude</span></span></code></pre></div></div>
</div>
<div id="tabset-21-3" class="tab-pane" aria-labelledby="tabset-21-3-tab">
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb62" style="background: #f1f3f5;"><pre class="sourceCode powershell code-with-copy"><code class="sourceCode powershell"><span id="cb62-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Install Claude Code</span></span>
<span id="cb62-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">irm</span> https<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">://</span>claude<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ai</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>install<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ps1</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">iex</span></span>
<span id="cb62-3"></span>
<span id="cb62-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Install AWS CLI (if needed)</span></span>
<span id="cb62-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Download from https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html</span></span>
<span id="cb62-6"></span>
<span id="cb62-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Authenticate with AWS</span></span>
<span id="cb62-8">aws sso login <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">--</span>profile your-profile</span>
<span id="cb62-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># or: aws configure</span></span>
<span id="cb62-10"></span>
<span id="cb62-11"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Launch Claude Code</span></span>
<span id="cb62-12"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">cd</span> C<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span>\your\project</span>
<span id="cb62-13">claude</span></code></pre></div></div>
</div>
<div id="tabset-21-4" class="tab-pane" aria-labelledby="tabset-21-4-tab">
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb63" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb63-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Install Claude Code</span></span>
<span id="cb63-2"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">curl</span> <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-fsSL</span> https://claude.ai/install.sh <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">|</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">bash</span></span>
<span id="cb63-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># or: brew install --cask claude-code</span></span>
<span id="cb63-4"></span>
<span id="cb63-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Install AWS CLI (if needed)</span></span>
<span id="cb63-6"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">brew</span> install awscli</span>
<span id="cb63-7"></span>
<span id="cb63-8"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Authenticate with AWS</span></span>
<span id="cb63-9"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">aws</span> sso login <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--profile</span> your-profile</span>
<span id="cb63-10"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># or: aws configure</span></span>
<span id="cb63-11"></span>
<span id="cb63-12"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Launch Claude Code</span></span>
<span id="cb63-13"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">cd</span> ~/your/project</span>
<span id="cb63-14"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">claude</span></span></code></pre></div></div>
</div>
</div>
</div>
<section id="in-session-commands-all-providers" class="level3">
<h3 class="anchored" data-anchor-id="in-session-commands-all-providers">In-session commands (all providers)</h3>
<pre class="text"><code>/status        # Confirm your cloud provider is active
/cost          # Check current session token usage and cost
/compact       # Compress conversation context to save tokens
/clear         # Wipe context when switching tasks
/permissions   # View and manage active permission rules
/sandbox       # Check or enable OS-level sandboxing
Esc            # Interrupt Claude mid-operation
Shift+Tab      # Cycle through permission modes</code></pre>
</section>
</section>
<section id="next-steps" class="level2">
<h2 class="anchored" data-anchor-id="next-steps">13. Next Steps — Your First Session</h2>
<p>You’re set up. Here’s what to do the first time you launch Claude Code in a real project.</p>
<section id="set-up-your-claudemd" class="level3">
<h3 class="anchored" data-anchor-id="set-up-your-claudemd">Set up your <code>CLAUDE.md</code></h3>
<p><a href="https://code.claude.com/docs/en/memory"><code>CLAUDE.md</code></a> is a markdown file in your project root that gives Claude persistent context — build commands, test commands, architectural conventions, anything Claude can’t infer from the code alone. It loads automatically at the start of every session.</p>
<p>Run <code>/init</code> inside Claude Code to generate a starter file, but <strong>treat the output as a draft, not a finished product</strong>. Manually curate it down to only what Claude actually needs. Research from ETH Zurich (<a href="https://arxiv.org/abs/2602.11988">Banerjee et al., 2025</a>) found that LLM-generated context files can actually <em>decrease</em> agent performance compared to no context file at all — the auto-generated content adds noise that dilutes the instructions that matter.</p>
<div class="callout callout-style-default callout-tip callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Tip</span>Less is more for <code>CLAUDE.md</code>
</div>
</div>
<div class="callout-body-container callout-body">
<p>Anthropic’s own guidance is blunt: <em>“For each line, ask: ‘Would removing this cause Claude to make mistakes?’ If not, cut it.”</em> Claude Code triggers a <a href="https://github.com/anthropics/claude-code/issues/2766">performance warning</a> when <code>CLAUDE.md</code> exceeds ~40,000 characters. Community consensus recommends <strong>under 200 lines</strong>. Key principles:</p>
<ul>
<li><strong>Don’t send an LLM to do a linter’s job.</strong> Code style rules belong in your linter/formatter config, not in <code>CLAUDE.md</code>.</li>
<li><strong>Use <a href="https://code.claude.com/docs/en/skills">skills</a> for specialized knowledge.</strong> Skills load on demand — a <code>/deploy</code> skill only enters context when relevant, keeping <code>CLAUDE.md</code> lean.</li>
<li><strong>Use <a href="https://code.claude.com/docs/en/memory"><code>.claude/rules/</code></a> for scoped rules.</strong> Rules that only apply to certain file types (e.g., <code>.tsx</code> conventions) can live in context-specific rule files instead of the global <code>CLAUDE.md</code>.</li>
<li><strong>Periodically prune.</strong> As your project evolves, remove instructions that Claude follows correctly without being told.</li>
</ul>
<p>For more detail on <code>CLAUDE.md</code> authoring, prompt techniques, and workflow patterns, see our companion guide: <a href="../../Learn/Blogs/claude-code-best-practices.html">Claude Code Best Practices</a>.</p>
</div>
</div>
</section>
<section id="pick-the-right-model-for-the-task" class="level3">
<h3 class="anchored" data-anchor-id="pick-the-right-model-for-the-task">Pick the right model for the task</h3>
<p>Your <code>settings.json</code> sets a default model, but you can switch mid-session with <code>/model</code>. A general rule of thumb:</p>
<table class="caption-top table">
<colgroup>
<col style="width: 30%">
<col style="width: 43%">
<col style="width: 26%">
</colgroup>
<thead>
<tr class="header">
<th>Model</th>
<th>Best for</th>
<th>Cost</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>Sonnet</strong></td>
<td>Day-to-day coding, refactoring, test writing, most tasks</td>
<td>Lowest</td>
</tr>
<tr class="even">
<td><strong>Opus</strong></td>
<td>Complex multi-step reasoning, architecture decisions, subtle bugs</td>
<td>Highest</td>
</tr>
<tr class="odd">
<td><strong>Haiku</strong></td>
<td>Quick lookups, simple edits, boilerplate generation</td>
<td>Very low</td>
</tr>
</tbody>
</table>
<p>Start with <strong>Sonnet</strong> — it handles the vast majority of tasks well. Escalate to Opus only when you notice Sonnet struggling with complex reasoning. Use Haiku for high-volume, low-complexity work where cost matters.</p>
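<p>For reference, the default lives under the <code>model</code> key in <code>settings.json</code>; a sketch, noting that the short alias shown is an assumption and some providers may require a full model ID from their catalog:</p>

```jsonc
{
  // Default for new sessions; switch anytime with /model
  "model": "sonnet"
}
```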
</section>
<section id="build-good-habits-early" class="level3">
<h3 class="anchored" data-anchor-id="build-good-habits-early">Build good habits early</h3>
<ul>
<li><strong>Commit before you start.</strong> A clean Git state means you can always <code>git diff</code> to see what Claude changed, or <code>git checkout .</code> to revert.</li>
<li><strong>Be specific.</strong> “Add input validation to the <code>login()</code> function in <code>auth.py</code>” is cheaper and more reliable than “improve the auth module.”</li>
<li><strong>Use <code>/compact</code> and <code>/clear</code></strong> to manage context. Long sessions get expensive and Claude’s attention degrades as context grows (<a href="https://arxiv.org/abs/2307.03172">Liu et al., 2024</a>).</li>
<li><strong>Press <code>Esc</code></strong> the moment something looks wrong. Don’t let an agentic loop run up your bill — see Section 9.</li>
</ul>
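<p>The commit-first habit can be rehearsed in a throwaway repo; file names and contents here are illustrative:</p>

```shell
# Checkpoint, simulate an agent edit, review, and revert
cd "$(mktemp -d)"
git init -q
echo "def login(): pass" > auth.py
git add -A
git -c user.name=demo -c user.email=demo@example.com \
    commit -qm "checkpoint before Claude session"

echo "def login(user): pass" > auth.py   # stand-in for Claude's edits
git diff --stat                          # review exactly what changed
git checkout -- .                        # revert everything if needed
```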
</section>
</section>
<section id="further-reading" class="level2">
<h2 class="anchored" data-anchor-id="further-reading">Further Reading</h2>
<ul>
<li><a href="https://code.claude.com/docs/en/setup">Claude Code Setup Docs</a></li>
<li><a href="https://code.claude.com/docs/en/google-vertex-ai">Claude Code on Vertex AI</a></li>
<li><a href="https://code.claude.com/docs/en/amazon-bedrock">Claude Code on Amazon Bedrock</a></li>
<li><a href="https://code.claude.com/docs/en/permissions">Claude Code Permissions</a></li>
<li><a href="https://code.claude.com/docs/en/sandboxing">Claude Code Sandboxing</a></li>
<li><a href="https://code.claude.com/docs/en/data-usage">Claude Code Data Usage</a></li>
<li><a href="https://code.claude.com/docs/en/costs">Claude Code Cost Management</a></li>
<li><a href="https://console.cloud.google.com/vertex-ai/model-garden">Vertex AI Model Garden</a></li>
<li><a href="https://console.aws.amazon.com/bedrock/">Amazon Bedrock Model Catalog</a></li>
<li><a href="https://cloud.google.com/billing/docs/how-to/budgets">GCP Budget Alerts</a></li>
<li><a href="https://docs.aws.amazon.com/cost-management/latest/userguide/budgets-managing-costs.html">AWS Budgets</a></li>
<li><a href="../../Toolbox/Compute/UW-Cloud-Services.html">UW-Madison Cloud Services</a></li>
</ul>


</section>

 ]]></description>
  <category>Guides</category>
  <category>GenAI</category>
  <category>LLM</category>
  <category>Agentic coding</category>
  <category>Cloud</category>
  <category>GCP</category>
  <category>AWS</category>
  <category>Bedrock</category>
  <guid>https://uw-madison-datascience.github.io/ML-X-Nexus/Learn/Guides/claude-code-cloud-setup.html</guid>
  <pubDate>Sun, 15 Feb 2026 00:00:00 GMT</pubDate>
  <media:content url="https://uw-madison-datascience.github.io/ML-X-Nexus/images/claudecode.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>Intro to AWS SageMaker for Predictive ML/AI</title>
  <dc:creator>Chris Endemann</dc:creator>
  <link>https://uw-madison-datascience.github.io/ML-X-Nexus/Learn/Workshops/Intro-Amazon_SageMaker.html</link>
  <description><![CDATA[ 




<p>This introductory <a href="https://carpentries-incubator.github.io/ML_with_AWS_SageMaker/">AWS SageMaker workshop</a> teaches core workflows for running predictive ML/AI models in AWS SageMaker, an AWS-managed machine learning environment. Participants learn to set up data, configure SageMaker Notebooks, manage code repositories, train and tune models, and control resource costs within AWS, with real-world guidance on choosing appropriate CPU and GPU resources and scaling models efficiently.</p>
<div class="callout callout-style-default callout-tip callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Tip</span>UW-Madison Cloud Users
</div>
</div>
<div class="callout-body-container callout-body">
<p>A personal AWS account is fine for this workshop. However, for <strong>long-term research use</strong>, we recommend switching to a <strong>UW-provisioned AWS account</strong>. You’ll get institutional pricing via <a href="https://internet2.edu/cloud/cloud-solutions-community/net-plus/">Internet2 NET+</a>, <a href="https://rsp.wisc.edu/proposalprep/cloudComputeInfo.cfm">lower overhead on grants</a> (26% instead of 55.5% — saving ~$2,950 per $10k in cloud costs), data protection agreements (including BAA for HIPAA), and dedicated support from the <a href="https://kb.wisc.edu/page.php?id=109785">Public Cloud Team</a>. NIH-funded researchers can get additional discounts through the <a href="https://kb.wisc.edu/109813">STRIDES Initiative</a>.</p>
<p><strong><a href="https://kb.wisc.edu/sbsedirbs/page.php?id=104090">Request a UW AWS account</a></strong> | <strong><a href="https://kb.wisc.edu/page.php?id=109785">Why use a UW account?</a></strong> | <strong><a href="https://uw-madison-datascience.github.io/ML-X-Nexus/Toolbox/Compute/UW-Cloud-Services.html">Full details: UW Cloud Services</a></strong></p>
</div>
</div>
<section id="cost-estimate" class="level4">
<h4 class="anchored" data-anchor-id="cost-estimate">Cost estimate</h4>
<p>Running through this workshop should cost approximately <strong>$5-$10</strong> on AWS, assuming moderate usage of GPU instances and a few parallel jobs (i.e., sticking to the lesson materials). For new AWS accounts, the <strong>AWS Free Tier</strong> may cover some of these costs, including 250 hours per month of the <code>ml.t2.medium</code> instance for the first two months, as well as some limited S3 storage. New users may be able to complete certain parts of the workshop for free or at a significantly reduced cost. We recommend monitoring usage through the AWS Billing Dashboard to stay within the free tier and manage any extra expenses effectively.</p>
</section>
<section id="prerequisites" class="level4">
<h4 class="anchored" data-anchor-id="prerequisites">Prerequisites</h4>
<ul>
<li><a href="https://uw-madison-datascience.github.io/ML-X-Nexus/Learn/Workshops/Intro-ML_Sklearn.html"><strong>Workshop</strong>: Intro to Machine Learning</a></li>
<li><a href="https://uw-madison-datascience.github.io/ML-X-Nexus/Learn/Workshops/Intro-Python_Gapminder.html"><strong>Workshop</strong>: Basic Python Programming</a></li>
</ul>
</section>
<section id="estimated-time-to-complete" class="level4">
<h4 class="anchored" data-anchor-id="estimated-time-to-complete">Estimated time to complete</h4>
<p><strong>3-5 hours</strong>: Based on running through training, tuning, and experimenting with example code setups.</p>
</section>
<section id="related-resources" class="level2">
<h2 class="anchored" data-anchor-id="related-resources">Related resources</h2>
<ul>
<li><a href="https://uw-madison-datascience.github.io/ML-X-Nexus/Toolbox/Compute/UW-Cloud-Services.html"><strong>Compute</strong>: UW-Madison Cloud Services (AWS, GCP, Azure)</a> – Institutional discounts, lower grant overhead, data protections, research credits, and how to request a UW cloud account.</li>
<li><a href="https://kb.wisc.edu/101516">Public Cloud Team Office Hours</a> – Drop-in hours on Thursdays, 2–3:15 PM via Zoom. Get answers to cloud-related questions from the RCI and Public Cloud Team.</li>
<li><a href="https://aws.amazon.com/free">AWS Free Tier Guide</a>: An overview of the AWS Free Tier, including limitations and expected costs for beginner users.</li>
<li><a href="https://researchci.it.wisc.edu/introduction-to-aws-for-researchers/">Introduction to AWS for Researchers (RCI)</a> – UW-Madison RCI’s guide to getting started with AWS for research.</li>
<li><a href="https://uw-madison-datascience.github.io/ML-X-Nexus/Learn/Workshops/Intro-GCP.html"><strong>Compute</strong>: Intro to GCP for ML &amp; AI</a> – Parallel workshop covering similar cloud ML concepts using GCP infrastructure.</li>
<li><a href="https://uw-madison-datascience.github.io/ML-X-Nexus/Toolbox/Compute/BadgerCompute.html"><strong>Compute</strong>: BadgerCompute</a> – UW–Madison’s lightweight, NetID-authenticated Jupyter service for short interactive sessions and classroom use. Sessions are capped at 4 hours, which can still compare favorably with free-tier Colab runtimes.</li>
<li><a href="https://uw-madison-datascience.github.io/ML-X-Nexus/Toolbox/Compute/GoogleColab.html"><strong>Compute</strong>: Google Colab</a> – Learn how to use Google Colab for machine learning workflows.</li>
<li><a href="https://uw-madison-datascience.github.io/ML-X-Nexus/Toolbox/Compute/CHTC.html"><strong>Compute</strong>: Center for High Throughput Computing (CHTC)</a> – Learn how to use CHTC for machine learning jobs.</li>
</ul>
</section>
<section id="comments" class="level2">
<h2 class="anchored" data-anchor-id="comments">Comments</h2>


</section>

 ]]></description>
  <category>Workshops</category>
  <category>Code-along</category>
  <category>Carpentries</category>
  <category>Compute</category>
  <category>AWS</category>
  <category>GPU</category>
  <category>Cloud</category>
  <category>RAG</category>
  <category>SageMaker</category>
  <category>Bedrock</category>
  <guid>https://uw-madison-datascience.github.io/ML-X-Nexus/Learn/Workshops/Intro-Amazon_SageMaker.html</guid>
  <pubDate>Fri, 07 Nov 2025 00:00:00 GMT</pubDate>
  <media:content url="https://uw-madison-datascience.github.io/ML-X-Nexus/images/SageMaker.png" medium="image" type="image/png" height="144" width="144"/>
</item>
<item>
  <title>Efficient KV-Cache Compression for Long-Context and Reasoning Models</title>
  <dc:creator>Zefan Cai</dc:creator>
  <link>https://uw-madison-datascience.github.io/ML-X-Nexus/Applications/Videos/Forums/mlx_2025-11-04.html</link>
  <description><![CDATA[ 




<p>Large language models (LLMs) increasingly handle very long input contexts, and their inference relies on storing key-value (KV) caches for past tokens to avoid redundant computation. However, as context length grows, the memory footprint of full KV caches becomes a major bottleneck. In this talk, Zefan Cai (CS PhD Student, UW-Madison, advised by Prof.&nbsp;Junjie Hu) presents two complementary approaches to compressing the KV cache, highlighting the underlying principles, trade-offs, and practical benefits for inference efficiency.</p>
<section id="pyramid-kv" class="level4">
<h4 class="anchored" data-anchor-id="pyramid-kv">Pyramid KV</h4>
<p>Pyramid KV is motivated by the observation that in transformer-based LLMs, attention flows from broad scopes in lower layers to narrow, focused contexts in higher layers (“pyramidal information funneling”). By allocating more cache budget in lower layers and gradually reducing it in higher layers, Pyramid KV achieves near-full performance while retaining only ~12% of the full KV cache on long-context benchmarks.</p>
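<p>The core idea can be sketched as a simple per-layer budget schedule — more cache for lower layers, less for higher ones. The linear schedule below is only an illustration of the principle, not the paper’s exact allocation rule:</p>

```python
# Allocate a fixed total KV-cache token budget across transformer
# layers, decreasing linearly from the lowest layer to the highest.
# (Illustrative schedule only -- Pyramid KV's actual rule may differ.)
def pyramid_budgets(total_tokens: int, num_layers: int, min_ratio: float = 0.2):
    # Linear weights from 1.0 (layer 0) down to min_ratio (top layer).
    weights = [1.0 - (1.0 - min_ratio) * layer / (num_layers - 1)
               for layer in range(num_layers)]
    scale = total_tokens / sum(weights)
    return [round(w * scale) for w in weights]

budgets = pyramid_budgets(total_tokens=2048, num_layers=8)
print(budgets)  # lower layers keep the most cached tokens
```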
</section>
<section id="r-kv-redundancy-aware-kv-cache-compression" class="level4">
<h4 class="anchored" data-anchor-id="r-kv-redundancy-aware-kv-cache-compression">R-KV: Redundancy-aware KV Cache Compression</h4>
<p>Building upon Pyramid KV, R-KV targets reasoning-heavy tasks (e.g., chain-of-thought) where long outputs produce very large KV caches. R-KV identifies and prunes redundant tokens in the cache, enabling roughly a 90% memory saving and ~6.6x throughput improvement, while preserving or even slightly improving accuracy compared to the full cache.</p>
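<p>The pruning principle can be sketched in a few lines: walk through cached token vectors and drop any that are near-duplicates (high cosine similarity) of vectors already kept. This is a toy illustration of redundancy-aware pruning, not R-KV’s actual scoring method:</p>

```python
# Toy redundancy pruning: keep a cached vector only if it is not a
# near-duplicate (cosine similarity >= threshold) of one already kept.
import numpy as np

def prune_redundant(kv: np.ndarray, threshold: float = 0.95) -> np.ndarray:
    kept = []
    for vec in kv:
        v = vec / np.linalg.norm(vec)
        if all(float(v @ k) < threshold for k in kept):
            kept.append(v)
    return np.array(kept)

rng = np.random.default_rng(0)
base = rng.normal(size=(4, 64))
# Duplicate each vector with tiny noise -> 8 vectors, only 4 distinct.
kv = np.vstack([base, base + 1e-3 * rng.normal(size=base.shape)])
print(prune_redundant(kv).shape[0])
```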
<div class="quarto-video ratio ratio-16x9"><iframe data-external="1" src="https://www.youtube.com/embed/hyEJi5N4p3Q" title="" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe></div>
</section>
<section id="related-resources" class="level2">
<h2 class="anchored" data-anchor-id="related-resources">Related resources</h2>
<ul>
<li><a href="https://uw-madison-datascience.github.io/ML-X-Nexus/Applications/Videos/ML4MI/2023-09-11_Exploring-Generative-AI-An-Intro-to-LLMs-and-Diffusion-Models_Kangwook-Lee.html"><strong>Talk</strong>: Exploring Generative AI: An Introduction to Large Language Models and Diffusion Models</a>: An introductory overview of LLMs covering next-word prediction, GPT models, and parameter-efficient fine-tuning with LoRA.</li>
<li><a href="https://uw-madison-datascience.github.io/ML-X-Nexus/Applications/Videos/Forums/mlx_2025-09-09.html"><strong>Talk</strong>: AI’s Environmental Footprint: Insights and Actions</a>: Benchmarking the energy, water, and carbon costs of LLM inference—motivation for why KV-cache compression and inference efficiency matter.</li>
<li><a href="https://uw-madison-datascience.github.io/ML-X-Nexus/Learn/Books/Intro-Deeplearning_SimonJDPrince.html"><strong>Book</strong>: Understanding Deep Learning</a>: A modern overview of deep learning fundamentals including transformer architecture, with interactive Colab notebooks.</li>
</ul>
</section>
<section id="comments" class="level2">
<h2 class="anchored" data-anchor-id="comments">Comments</h2>


</section>

 ]]></description>
  <category>Videos</category>
  <category>ML+X</category>
  <category>UW-Madison</category>
  <category>LLM</category>
  <category>Deep learning</category>
  <category>NLP</category>
  <category>GenAI</category>
  <category>Foundation models</category>
  <category>GPU</category>
  <guid>https://uw-madison-datascience.github.io/ML-X-Nexus/Applications/Videos/Forums/mlx_2025-11-04.html</guid>
  <pubDate>Tue, 04 Nov 2025 00:00:00 GMT</pubDate>
  <media:content url="https://img.youtube.com/vi/hyEJi5N4p3Q/maxresdefault.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>Google Colab</title>
  <dc:creator>Chris Endemann</dc:creator>
  <link>https://uw-madison-datascience.github.io/ML-X-Nexus/Toolbox/Compute/GoogleColab.html</link>
  <description><![CDATA[ 




<p><a href="https://colab.research.google.com/">Google Colab</a> is a cloud-based Jupyter notebook environment that runs entirely in the browser. It allows you to write and execute Python code without installing anything locally, making it a popular choice for machine learning, data analysis, and teaching. Colab integrates directly with Google Drive, supports GPU and TPU acceleration, and makes it easy to share notebooks and collaborate with others.</p>
<section id="plans-and-compute-units" class="level2">
<h2 class="anchored" data-anchor-id="plans-and-compute-units">Plans and compute units</h2>
<p>While free-tier performance is often sufficient for teaching, tutorials, and lightweight experiments, paid plans offer more predictable runtime windows, stronger GPU availability, and improved overall stability for sustained machine learning workloads. Colab Pro is often the most practical choice for researchers and students who use Colab regularly, balancing cost, runtime, and GPU access without committing to the higher price of Pro+ or worrying about “pay as you go” charges.</p>
<table class="caption-top table">
<colgroup>
<col style="width: 9%">
<col style="width: 9%">
<col style="width: 24%">
<col style="width: 27%">
<col style="width: 12%">
<col style="width: 18%">
</colgroup>
<thead>
<tr class="header">
<th>Plan</th>
<th>Cost</th>
<th>Compute units</th>
<th>Typical runtime</th>
<th>Memory</th>
<th>GPU access</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Free</td>
<td>$0</td>
<td>–</td>
<td>Up to ~12 hours <em>under ideal conditions</em> (often much less; under 4 hours is not uncommon), ~90 min idle timeout</td>
<td>~12 GB</td>
<td>Shared GPUs (commonly T4/K80), no guarantees</td>
</tr>
<tr class="even">
<td>Pay As You Go</td>
<td>variable</td>
<td>Purchase as needed</td>
<td>Depends on units purchased</td>
<td>Varies</td>
<td>Access to faster GPUs and more memory when available</td>
</tr>
<tr class="odd">
<td>Colab Pro</td>
<td>$9.99/month</td>
<td>100 units/month</td>
<td>Often 12–24 hours, ~180 min idle timeout</td>
<td>~25 GB</td>
<td>More predictable access to T4/P100 GPUs and high-memory VMs</td>
</tr>
<tr class="even">
<td>Colab Pro+</td>
<td>~$49.99/month</td>
<td>~500–600 units/month</td>
<td>Up to ~24 hours, ~180 min idle timeout</td>
<td>~25 GB</td>
<td>Priority access to premium GPUs (T4/P100/V100) and background execution</td>
</tr>
<tr class="odd">
<td>Colab Enterprise</td>
<td>custom</td>
<td>Custom</td>
<td>Custom</td>
<td>Custom</td>
<td>Integrated with GCP services (BigQuery, Vertex AI)</td>
</tr>
</tbody>
</table>
<p>For the most up-to-date prices, check <a href="https://colab.research.google.com/signup">colab.research.google.com/signup</a></p>
</section>
<section id="data-storage-and-mounting-google-drive" class="level2">
<h2 class="anchored" data-anchor-id="data-storage-and-mounting-google-drive">Data storage and mounting Google Drive</h2>
<p>Colab notebooks themselves are stored in Google Drive, but any files you upload during a session are temporary and deleted once the session ends. To persist data between sessions, mount your Google Drive into the notebook runtime:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> google.colab <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> drive</span>
<span id="cb1-2">drive.mount(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'/content/drive'</span>)</span></code></pre></div></div>
<p>Once mounted, your Drive files are available under <code>/content/drive/MyDrive/</code>. For example:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> pandas <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> pd</span>
<span id="cb2-2">df <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pd.read_csv(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'/content/drive/MyDrive/data.csv'</span>)</span></code></pre></div></div>
<p>This approach is essential for storing training data, saving model checkpoints, or writing outputs that need to persist after the notebook shuts down. For larger datasets, connecting to cloud storage services like Google Cloud Storage (GCS) or AWS S3 is also possible using their Python SDKs.</p>
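<p>For example, a training loop can write checkpoints or metrics under the mounted Drive path so they survive the session. The snippet below is a minimal sketch — it falls back to a local folder when the Drive mount isn’t present, so it also runs outside Colab:</p>

```python
# Persist outputs by writing them under the mounted Drive path.
# Falls back to a local "outputs" folder when Drive isn't mounted.
import json
from pathlib import Path

drive_root = Path("/content/drive/MyDrive")
out_dir = (drive_root if drive_root.exists() else Path("outputs")) / "checkpoints"
out_dir.mkdir(parents=True, exist_ok=True)

metrics = {"epoch": 3, "val_loss": 0.217}
(out_dir / "metrics.json").write_text(json.dumps(metrics))
print(out_dir / "metrics.json")
```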
</section>
<section id="best-practices-and-limitations" class="level2">
<h2 class="anchored" data-anchor-id="best-practices-and-limitations">Best practices and limitations</h2>
<p>While Google Colab is one of the easiest ways to experiment with machine learning, it has several limitations to consider:</p>
<ul>
<li>Session timeouts cannot be disabled and will interrupt long-running jobs.</li>
<li>GPU availability is shared and unpredictable in the free tier.</li>
<li>Persistent storage requires integrating with Google Drive or another external service.</li>
<li>Environment customization is limited compared to running Jupyter on your own server or cloud instance.</li>
</ul>
<p>Because of these constraints, Colab is best suited for:</p>
<ul>
<li>Rapid prototyping of notebooks and model experiments</li>
<li>Teaching and workshops</li>
<li>Exploratory data analysis and visualization</li>
<li>Small to medium-scale deep learning tasks</li>
</ul>
<p>For more control, longer runtimes, or production workflows, platforms like AWS SageMaker, Google Vertex AI, or campus HPC systems (e.g., CHTC) are better suited.</p>
</section>
<section id="related-resources" class="level2">
<h2 class="anchored" data-anchor-id="related-resources">Related resources</h2>
<ul>
<li><a href="https://uw-madison-datascience.github.io/ML-X-Nexus/Toolbox/Compute/BadgerCompute.html"><strong>Compute</strong>: BadgerCompute</a> – UW–Madison’s lightweight, NetID-authenticated Jupyter service for short interactive sessions and classroom use. Sessions are capped at 4 hours, which can still exceed what free-tier Colab often delivers in practice.</li>
<li><a href="https://uw-madison-datascience.github.io/ML-X-Nexus/Learn/Workshops/Intro-Amazon_SageMaker.html"><strong>Compute</strong>: Intro to AWS SageMaker for Predictive ML/AI</a> – Learn how to launch and scale machine learning workflows in the cloud using AWS SageMaker.</li>
<li><a href="https://uw-madison-datascience.github.io/ML-X-Nexus/Toolbox/Compute/CHTC.html"><strong>Compute</strong>: Center for High Throughput Computing (CHTC)</a> – Learn how to use CHTC for machine learning jobs.</li>
</ul>
</section>
<section id="comments" class="level2">
<h2 class="anchored" data-anchor-id="comments">Comments</h2>


</section>

 ]]></description>
  <category>Compute</category>
  <category>Jupyter</category>
  <category>Google</category>
  <category>GPU</category>
  <guid>https://uw-madison-datascience.github.io/ML-X-Nexus/Toolbox/Compute/GoogleColab.html</guid>
  <pubDate>Fri, 26 Sep 2025 00:00:00 GMT</pubDate>
  <media:content url="https://uw-madison-datascience.github.io/ML-X-Nexus/images/GoogleColab.png" medium="image" type="image/png" height="89" width="144"/>
</item>
<item>
  <title>BadgerCompute</title>
  <dc:creator>Chris Endemann</dc:creator>
  <link>https://uw-madison-datascience.github.io/ML-X-Nexus/Toolbox/Compute/BadgerCompute.html</link>
  <description><![CDATA[ 




<p><a href="https://badgercompute.wisc.edu/">BadgerCompute</a> is UW–Madison’s browser-based interactive computing service built on JupyterHub. It provides on-demand access to CPUs, memory, and GPUs without requiring any local software installation or server setup. Researchers, instructors, and students at UW-Madison can write and run code, visualize data, and develop workflows directly from their web browser using only a NetID.</p>
<p>BadgerCompute is similar in spirit to Google Colab: both offer hosted Jupyter notebook environments for writing and executing code interactively. However, BadgerCompute is campus-supported, requires NetID authentication, and is subject to UW data policies. It is free to use for UW affiliates, but comes with a 4-hour runtime limit, limited storage (20 GB), and no guaranteed GPU access.</p>
<section id="gpu-access-and-runtime-limits" class="level2">
<h2 class="anchored" data-anchor-id="gpu-access-and-runtime-limits">GPU access and runtime limits</h2>
<p>GPU availability is limited and not guaranteed, but when available it is often sufficient for small to medium deep learning tasks, accelerated data analysis, or exploratory workflows. Each session runs in a containerized environment with common data science tools already installed. Sessions have the following limitations:</p>
<ul>
<li>Maximum runtime is four hours.</li>
<li>Sessions without an active browser connection shut down automatically after ten minutes.</li>
<li>The service can support roughly 80–100 concurrent users.</li>
<li>GPU capacity is shared and may not be available during peak usage times.</li>
</ul>
</section>
<section id="data-storage-limitations" class="level2">
<h2 class="anchored" data-anchor-id="data-storage-limitations">Data storage limitations</h2>
<p>BadgerCompute is designed for interactive computing, not data storage. Its storage model is deliberately minimal and ephemeral:</p>
<ul>
<li>Each BadgerCompute user is allocated 20 GB of persistent storage. However, this storage is retained only for 30 days after your last login. If you do not log into BadgerCompute within 30 days, your data will be automatically deleted.
<ul>
<li>As an alternative, <a href="https://uw-madison-datascience.github.io/ML-X-Nexus/Toolbox/Compute/GoogleColab.html">Google Colab</a> provides persistent storage via integrations with Google Drive.</li>
</ul></li>
<li>In addition, the default folder when you log into BadgerCompute is NOT persistent; only the <code>~/work</code> directory is. Save files there to keep them between sessions. See our <a href="https://badgercompute.wisc.edu/docs">documentation</a> for more details.</li>
<li>BadgerCompute is NOT suitable for work with restricted data.</li>
</ul>
</section>
<section id="getting-started" class="level2">
<h2 class="anchored" data-anchor-id="getting-started">Getting started</h2>
<p>To use BadgerCompute, you must:</p>
<ol type="1">
<li>Have an active UW–Madison NetID.</li>
<li>Complete the <a href="https://canvas.wisc.edu/enroll/JR887K">BadgerCompute Certification Course</a> on Canvas and wait 24 hours for access.</li>
</ol>
</section>
<section id="working-in-jupyterlab" class="level2">
<h2 class="anchored" data-anchor-id="working-in-jupyterlab">Working in JupyterLab</h2>
<p>When your session launches, you will see the standard JupyterLab interface:</p>
<ul>
<li>A file browser for navigating directories and uploading or downloading files</li>
<li>A launcher for creating new notebooks, terminals, or text files</li>
<li>A notebook interface for writing and running code interactively</li>
<li>A terminal for executing shell commands directly</li>
</ul>
<p>Only the <code>~/work</code> directory is persistent. Any files saved elsewhere are deleted when the session ends.</p>
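<p>In practice, that means routing all outputs through <code>~/work</code> at save time. A minimal sketch (it falls back to the current directory so it also runs outside BadgerCompute):</p>

```python
# Save results under the persistent ~/work directory; files written
# anywhere else are deleted when the session ends.
from pathlib import Path

work = Path.home() / "work"
save_dir = (work if work.exists() else Path(".")) / "results"
save_dir.mkdir(parents=True, exist_ok=True)

(save_dir / "notes.txt").write_text("kept between sessions")
print(save_dir / "notes.txt")
```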
</section>
<section id="best-practices-and-limitations" class="level2">
<h2 class="anchored" data-anchor-id="best-practices-and-limitations">Best practices and limitations</h2>
<p>Because BadgerCompute is a shared resource with limited capacity, plan your workflows with the following in mind:</p>
<ul>
<li>Sessions end automatically after four hours and cannot be extended.</li>
<li>Sessions without an active browser connection end after ten minutes.</li>
<li>GPU access depends on demand and may not be available.</li>
<li>Storage is temporary and limited. As an alternative, <a href="https://uw-madison-datascience.github.io/ML-X-Nexus/Toolbox/Compute/GoogleColab.html">Google Colab</a> provides persistent storage via integrations with Google Drive.</li>
<li>Availability is not guaranteed for classes or large workshops. For larger courses, coordinate in advance or consider alternatives.</li>
</ul>
</section>
<section id="when-to-use-badgercompute-vs.-other-platforms" class="level2">
<h2 class="anchored" data-anchor-id="when-to-use-badgercompute-vs.-other-platforms">When to use BadgerCompute vs.&nbsp;other platforms</h2>
<p>BadgerCompute is most useful for:</p>
<ul>
<li>Rapid prototyping of data analysis or machine learning workflows</li>
<li>Teaching and demonstrations without requiring software installation</li>
<li>Exploratory data analysis and small-scale model development</li>
<li>Short tasks that benefit from GPU acceleration</li>
</ul>
<p>For more intensive work — such as training large models, running distributed jobs, executing long-running tasks, or hosting large datasets — platforms like <a href="https://uw-madison-datascience.github.io/ML-X-Nexus/Toolbox/Compute/CHTC.html">CHTC</a>, <a href="https://uw-madison-datascience.github.io/ML-X-Nexus/Learn/Workshops/Intro-Amazon_SageMaker.html">AWS</a>, GCP, or local HPC clusters are more appropriate. Users may choose to start exploratory work in BadgerCompute (or Google Colab) and transition to these systems when needed.</p>
</section>
<section id="learn-more-and-get-help" class="level2">
<h2 class="anchored" data-anchor-id="learn-more-and-get-help">Learn more and get help</h2>
<ul>
<li><strong>Documentation</strong>: <a href="https://badgercompute.wisc.edu/docs/">badgercompute.wisc.edu/docs/</a></li>
<li><strong>Community forum</strong>: <a href="https://badgercompute.wisc.edu/docs/get-help/">badgercompute.wisc.edu/docs/get-help/</a></li>
</ul>
<p>BadgerCompute is supported by DoIT, CHTC, and the Data Science Institute (DSI) as part of UW–Madison’s research computing ecosystem.</p>
</section>
<section id="related-resources" class="level2">
<h2 class="anchored" data-anchor-id="related-resources">Related resources</h2>
<ul>
<li><a href="https://uw-madison-datascience.github.io/ML-X-Nexus/Learn/Workshops/Intro-Amazon_SageMaker.html"><strong>Compute</strong>: Intro to AWS SageMaker for Predictive ML/AI</a> – Learn how to launch and scale machine learning workflows in the cloud using AWS SageMaker.</li>
<li><a href="https://uw-madison-datascience.github.io/ML-X-Nexus/Toolbox/Compute/GoogleColab.html"><strong>Compute</strong>: Google Colab</a> – Learn how to use Google Colab for machine learning workflows.</li>
<li><a href="https://uw-madison-datascience.github.io/ML-X-Nexus/Toolbox/Compute/CHTC.html"><strong>Compute</strong>: Center for High Throughput Computing (CHTC)</a> – Learn how to use CHTC for machine learning jobs.</li>
</ul>
</section>
<section id="comments" class="level2">
<h2 class="anchored" data-anchor-id="comments">Comments</h2>


</section>

 ]]></description>
  <category>Compute</category>
  <category>UW-Madison</category>
  <category>GPU</category>
  <category>Jupyter</category>
  <guid>https://uw-madison-datascience.github.io/ML-X-Nexus/Toolbox/Compute/BadgerCompute.html</guid>
  <pubDate>Wed, 24 Sep 2025 00:00:00 GMT</pubDate>
  <media:content url="https://uw-madison-datascience.github.io/ML-X-Nexus/images/BadgerCompute.png" medium="image" type="image/png" height="144" width="144"/>
</item>
<item>
  <title>BioTrove</title>
  <dc:creator>Chris Endemann</dc:creator>
  <link>https://uw-madison-datascience.github.io/ML-X-Nexus/Toolbox/Data/BioTrove.html</link>
  <description><![CDATA[ 




<p><a href="https://baskargroup.github.io/BioTrove/">BioTrove</a> is the largest publicly accessible biodiversity image dataset, containing <strong>161.9 million images</strong> spanning approximately <strong>366,000 species</strong> across three kingdoms: Animalia, Fungi, and Plantae. Curated from <a href="https://uw-madison-datascience.github.io/ML-X-Nexus/Toolbox/Data/iNaturalist.html">iNaturalist</a> research-grade observations, BioTrove provides an unprecedented resource for training and evaluating AI models in biodiversity and ecology. It was published as a <strong>Spotlight</strong> paper at the NeurIPS 2024 Datasets and Benchmarks track.</p>
<section id="what-makes-biotrove-valuable-for-ai" class="level3">
<h3 class="anchored" data-anchor-id="what-makes-biotrove-valuable-for-ai">What makes BioTrove valuable for AI?</h3>
<p>BioTrove addresses a critical gap in AI for biodiversity: the lack of large-scale, curated, and openly available training data. While earlier datasets such as TREEOFLIFE-10M offered strong species diversity, BioTrove is roughly 16x larger while maintaining comparable taxonomic breadth.</p>
<p>Each image is annotated with:</p>
<ul>
<li>Scientific names and common names</li>
<li>Full taxonomic hierarchy (kingdom, phylum, class, order, family, genus, species)</li>
<li>Image URLs and metadata for reproducible access</li>
</ul>
</section>
<section id="taxonomic-coverage" class="level3">
<h3 class="anchored" data-anchor-id="taxonomic-coverage">Taxonomic coverage</h3>
<p>BioTrove covers eleven major taxonomic groups, including Aves (birds), Insecta (insects), Plantae (plants), Fungi, Mammalia (mammals), Reptilia, Amphibia, Arachnida, Mollusca, Actinopterygii (ray-finned fish), and Animalia (other animals).</p>
</section>
<section id="key-subsets-and-benchmarks" class="level3">
<h3 class="anchored" data-anchor-id="key-subsets-and-benchmarks">Key subsets and benchmarks</h3>
<ul>
<li><strong>BioTrove-Train (~40M images, ~33K species)</strong>: A curated training subset focused on seven taxonomic categories (Aves, Arachnida, Insecta, Plantae, Fungi, Mollusca, Reptilia) chosen for their biodiversity impact and underrepresentation in standard image models.</li>
<li><strong>BioTrove-Balanced (~112K images)</strong>: Up to 500 species per category with 50 images each, for balanced evaluation.</li>
<li><strong>BioTrove-Unseen</strong>: Species with fewer than 30 instances, for testing generalization to rare or unseen species.</li>
<li><strong>BioTrove-LifeStages</strong>: Evaluates recognition across developmental stages (egg, larva, pupa, adult) for five insect species.</li>
</ul>
</section>
<section id="pretrained-models-biotrove-clip" class="level3">
<h3 class="anchored" data-anchor-id="pretrained-models-biotrove-clip">Pretrained models (BioTrove-CLIP)</h3>
<p>Three CLIP-based models were trained on BioTrove-Train and released on Hugging Face:</p>
<ul>
<li><strong>BT-CLIP-O</strong>: ViT-B/16 initialized from OpenCLIP</li>
<li><strong>BT-CLIP-B</strong>: ViT-B/16 initialized from BioCLIP</li>
<li><strong>BT-CLIP-M</strong>: ViT-L/14 initialized from MetaCLIP</li>
</ul>
<p>These models are useful for biodiversity-focused image classification, retrieval, and zero-shot species identification.</p>
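<p>Under the hood, zero-shot identification reduces to cosine similarity between one image embedding and one text embedding per candidate species — the model picks the closest label. The sketch below uses random vectors as stand-ins for real BioTrove-CLIP outputs:</p>

```python
# Zero-shot classification: compare one image embedding against a text
# embedding per candidate label and take the most similar one.
# Random vectors stand in for real CLIP embeddings here.
import numpy as np

rng = np.random.default_rng(0)
dim = 512
image_emb = rng.normal(size=dim)
labels = ["Danaus plexippus", "Apis mellifera", "Quercus alba"]
text_embs = rng.normal(size=(len(labels), dim))

# Make one label's embedding deliberately close to the image embedding.
text_embs[1] = image_emb + 0.1 * rng.normal(size=dim)

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

sims = normalize(text_embs) @ normalize(image_emb)
print(labels[int(np.argmax(sims))])  # -> Apis mellifera
```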
</section>
<section id="key-applications" class="level3">
<h3 class="anchored" data-anchor-id="key-applications">Key applications</h3>
<ul>
<li><strong>Pest control and crop monitoring</strong>: Training models to identify pest species and agricultural threats</li>
<li><strong>Biodiversity assessment</strong>: Large-scale species identification and population monitoring</li>
<li><strong>Environmental conservation</strong>: Detecting ecological changes and supporting wildlife monitoring</li>
<li><strong>Fine-grained classification</strong>: Building models that distinguish visually similar species</li>
<li><strong>Zero-shot species recognition</strong>: Leveraging CLIP-based models for identifying species not seen during training</li>
</ul>
</section>
<section id="access" class="level3">
<h3 class="anchored" data-anchor-id="access">Access</h3>
<p>BioTrove metadata and tools are available on <a href="https://github.com/baskargroup/BioTrove/">GitHub</a>, with dataset cards and pretrained models on <a href="https://huggingface.co/datasets/BGLab/BioTrove">Hugging Face</a>. The BioTrove library includes scripts for downloading, filtering, and preprocessing the data into ML-ready image-text pairs.</p>
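<p>A typical workflow is metadata-first: filter the metadata to a taxon of interest, then fetch only those images by URL. The sketch below uses a toy DataFrame with illustrative column names — the real schema is documented on the dataset card:</p>

```python
# Filter toy BioTrove-style metadata to one kingdom before downloading
# images. Column names here are illustrative, not the actual schema.
import pandas as pd

meta = pd.DataFrame({
    "scientific_name": ["Danaus plexippus", "Amanita muscaria", "Quercus alba"],
    "kingdom": ["Animalia", "Fungi", "Plantae"],
    "img_url": ["https://example.org/1.jpg",
                "https://example.org/2.jpg",
                "https://example.org/3.jpg"],
})

fungi = meta[meta["kingdom"] == "Fungi"]
print(list(fungi["scientific_name"]))  # -> ['Amanita muscaria']
```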
</section>
<section id="related-resources" class="level2">
<h2 class="anchored" data-anchor-id="related-resources">Related resources</h2>
<ul>
<li><a href="https://uw-madison-datascience.github.io/ML-X-Nexus/Toolbox/Data/iNaturalist.html"><strong>Data</strong>: iNaturalist</a>: The source platform for BioTrove’s research-grade observations.</li>
<li><a href="https://uw-madison-datascience.github.io/ML-X-Nexus/Toolbox/Data/INQUIRE.html"><strong>Data</strong>: INQUIRE</a>: A retrieval benchmark built on iNaturalist data.</li>
<li><a href="https://uw-madison-datascience.github.io/ML-X-Nexus/Projects/ML-Marathon/BioTrove-Clustering.html"><strong>Project</strong>: Clustering the BioTrove Dataset</a>: An ML Marathon challenge using BioTrove for unsupervised species clustering.</li>
<li><a href="https://arxiv.org/abs/2406.17720"><strong>Paper</strong>: BioTrove: A Large Curated Image Dataset Enabling AI for Biodiversity (NeurIPS 2024)</a></li>
<li><a href="https://huggingface.co/datasets/BGLab/BioTrove"><strong>Hugging Face</strong>: BioTrove dataset card</a></li>
<li><a href="https://huggingface.co/BGLab/BioTrove-CLIP"><strong>Hugging Face</strong>: BioTrove-CLIP models</a></li>
<li><a href="https://github.com/baskargroup/BioTrove/"><strong>GitHub</strong>: BioTrove tools and scripts</a></li>
</ul>
</section>
<section id="comments" class="level2">
<h2 class="anchored" data-anchor-id="comments">Comments</h2>


</section>

 ]]></description>
  <category>Data</category>
  <category>Image data</category>
  <category>Computer vision</category>
  <category>Biology</category>
  <category>Ecology</category>
  <category>Biodiversity</category>
  <category>Deep learning</category>
  <category>CLIP</category>
  <category>Hugging Face</category>
  <category>Citizen science</category>
  <guid>https://uw-madison-datascience.github.io/ML-X-Nexus/Toolbox/Data/BioTrove.html</guid>
  <pubDate>Thu, 11 Sep 2025 00:00:00 GMT</pubDate>
  <media:content url="https://uw-madison-datascience.github.io/ML-X-Nexus/images/Clustering-Biotrove.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>Clustering the BioTrove Dataset</title>
  <dc:creator>Chris Endemann</dc:creator>
  <link>https://uw-madison-datascience.github.io/ML-X-Nexus/Projects/ML-Marathon/BioTrove-Clustering.html</link>
  <description><![CDATA[ 




<p>Clustering the BioTrove Dataset was featured in the <a href="https://ml-marathon.wisc.edu/">2025 Machine Learning Marathon (MLM25)</a>. This challenge asks participants to discover genus- and species-level structure in biodiversity images using unsupervised and self-supervised learning methods.</p>
<section id="challenge-design" class="level3">
<h3 class="anchored" data-anchor-id="challenge-design">Challenge design</h3>
<ul>
<li><strong>Task</strong>: Cluster biodiversity images to recover taxonomic structure (genus and species groupings) without explicit labels.</li>
<li><strong>Domain</strong>: Biodiversity and ecology – automated species identification can support pest control, crop monitoring, biodiversity assessment, and environmental conservation.</li>
<li><strong>Data</strong>: Images drawn from <a href="https://uw-madison-datascience.github.io/ML-X-Nexus/Toolbox/Data/BioTrove.html">BioTrove</a>, the largest publicly accessible biodiversity image dataset (161.9 million images, ~366K species), curated from iNaturalist with research-grade annotations.</li>
<li><strong>Methods</strong>: Contrastive learning, autoencoders, CLIP-based embeddings, and other unsupervised/semi-supervised approaches.</li>
</ul>
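<p>A common baseline for this task is embed-then-cluster: map each image into an embedding space (e.g., with a CLIP model), run k-means, and check how well the clusters align with taxa. The sketch below substitutes synthetic, well-separated blobs for real image embeddings:</p>

```python
# Embed-then-cluster baseline: k-means on embedding vectors.
# Synthetic blobs stand in for CLIP image embeddings here.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three synthetic "genera", 50 embeddings each, well separated.
centers = rng.normal(scale=10, size=(3, 32))
X = np.vstack([c + rng.normal(size=(50, 32)) for c in centers])

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
# Each block of 50 points should land in a single cluster.
print([len(set(labels[i:i + 50])) for i in (0, 50, 100)])
```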
</section>
<section id="links" class="level3">
<h3 class="anchored" data-anchor-id="links">Links</h3>
<ul>
<li><strong>Kaggle challenge</strong>: <a href="https://www.kaggle.com/competitions/biotrove-clustering/overview">Clustering the BioTrove Dataset</a></li>
<li><strong>Winning writeup</strong>: <a href="https://www.kaggle.com/competitions/biotrove-clustering/writeups/1-it-all-depends-on-a-good-embedding">1st place: It All Depends on a Good Embedding</a></li>
<li><strong>BioTrove project</strong>: <a href="https://baskargroup.github.io/BioTrove/">baskargroup.github.io/BioTrove</a></li>
</ul>
</section>
<section id="related-resources" class="level2">
<h2 class="anchored" data-anchor-id="related-resources">Related resources</h2>
<ul>
<li><a href="https://uw-madison-datascience.github.io/ML-X-Nexus/Toolbox/Data/BioTrove.html"><strong>Data</strong>: BioTrove</a>: Learn more about the BioTrove dataset, including its construction from iNaturalist and available CLIP embeddings.</li>
<li><a href="https://uw-madison-datascience.github.io/ML-X-Nexus/Toolbox/Data/iNaturalist.html"><strong>Data</strong>: iNaturalist</a>: The citizen-science platform underlying BioTrove’s research-grade biodiversity observations.</li>
<li><a href="https://uw-madison-datascience.github.io/ML-X-Nexus/Toolbox/Libraries/HuggingFace.html"><strong>Library</strong>: Hugging Face</a>: BioTrove-CLIP models and the dataset are hosted on Hugging Face.</li>
<li><a href="https://uw-madison-datascience.github.io/ML-X-Nexus/Learn/Workshops/Intro-Deeplearning_PyTorch.html"><strong>Workshop</strong>: Intro to Deep Learning with PyTorch</a>: Build foundational deep learning skills relevant to contrastive learning and embedding-based clustering.</li>
<li><strong>Paper</strong>: <a href="https://arxiv.org/abs/2406.17720">BioTrove: A Large Curated Image Dataset Enabling AI for Biodiversity (NeurIPS 2024)</a></li>
</ul>
</section>
<section id="comments" class="level2">
<h2 class="anchored" data-anchor-id="comments">Comments</h2>


</section>

 ]]></description>
  <category>Projects</category>
  <category>ML Marathon</category>
  <category>MLM25</category>
  <category>Computer vision</category>
  <category>Clustering</category>
  <category>Unsupervised learning</category>
  <category>Biodiversity</category>
  <category>Image data</category>
  <category>Deep learning</category>
  <category>CLIP</category>
  <guid>https://uw-madison-datascience.github.io/ML-X-Nexus/Projects/ML-Marathon/BioTrove-Clustering.html</guid>
  <pubDate>Thu, 11 Sep 2025 00:00:00 GMT</pubDate>
  <media:content url="https://uw-madison-datascience.github.io/ML-X-Nexus/images/Clustering-Biotrove.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>Brain-to-Text ’25: Decoding Speech from Neural Activity</title>
  <dc:creator>Chris Endemann</dc:creator>
  <link>https://uw-madison-datascience.github.io/ML-X-Nexus/Projects/ML-Marathon/Brain-to-Text.html</link>
  <description><![CDATA[ 




<p>Brain-to-Text ’25 was featured in the <a href="https://ml-marathon.wisc.edu/">2025 Machine Learning Marathon (MLM25)</a>. This Kaggle competition challenges participants to decode intracortical neural activity during attempted speech into text – aiming to restore communication for people with paralysis.</p>
<section id="challenge-design" class="level3">
<h3 class="anchored" data-anchor-id="challenge-design">Challenge design</h3>
<ul>
<li><strong>Task</strong>: Decode neural recordings from speech-related brain regions into the words a participant is attempting to say.</li>
<li><strong>Domain</strong>: Brain-computer interfaces and neural speech decoding.</li>
<li><strong>Data</strong>: A new intracortical speech neuroscience dataset provided for the competition.</li>
<li><strong>Methods</strong>: The 2024 edition’s top approaches combined RNN ensembles with fine-tuned large language models. The baseline achieved a 9.7% word error rate; the top entrant reached 5.8%.</li>
</ul>
</section>
<section id="links" class="level3">
<h3 class="anchored" data-anchor-id="links">Links</h3>
<ul>
<li><strong>Kaggle challenge</strong>: <a href="https://www.kaggle.com/competitions/brain-to-text-25">Brain-to-Text ’25</a></li>
</ul>
</section>
<section id="related-resources" class="level2">
<h2 class="anchored" data-anchor-id="related-resources">Related resources</h2>
<ul>
<li><a href="https://uw-madison-datascience.github.io/ML-X-Nexus/Learn/Workshops/Intro-Deeplearning_PyTorch.html"><strong>Workshop</strong>: Intro to Deep Learning with PyTorch</a>: Learn RNNs and sequence modeling fundamentals in PyTorch — key building blocks for neural speech decoding.</li>
<li><a href="https://uw-madison-datascience.github.io/ML-X-Nexus/Learn/Workshops/Intro-TextAnalysis_Python.html"><strong>Workshop</strong>: Intro to Natural Language Processing (NLP)</a>: Brush up on text processing and language model basics relevant to the text decoding target.</li>
<li><a href="https://uw-madison-datascience.github.io/ML-X-Nexus/Toolbox/Libraries/HuggingFace.html"><strong>Library</strong>: Hugging Face</a>: Top solutions fine-tuned LLMs from Hugging Face — learn how to discover and use open-source models.</li>
</ul>
</section>

 ]]></description>
  <category>Projects</category>
  <category>ML Marathon</category>
  <category>MLM25</category>
  <category>Deep learning</category>
  <category>Signal processing</category>
  <category>Time-series</category>
  <category>NLP</category>
  <category>Neuroscience</category>
  <category>RNN</category>
  <guid>https://uw-madison-datascience.github.io/ML-X-Nexus/Projects/ML-Marathon/Brain-to-Text.html</guid>
  <pubDate>Thu, 11 Sep 2025 00:00:00 GMT</pubDate>
  <media:content url="https://uw-madison-datascience.github.io/ML-X-Nexus/images/Brain-to-Text25.png" medium="image" type="image/png" height="72" width="144"/>
</item>
<item>
  <title>MaveDB: Protein Variant Effect Prediction</title>
  <dc:creator>Chris Endemann</dc:creator>
  <link>https://uw-madison-datascience.github.io/ML-X-Nexus/Projects/ML-Marathon/MaveDB-Variant-Effect.html</link>
  <description><![CDATA[ 




<p>The MaveDB challenge was featured in the <a href="https://ml-marathon.wisc.edu/">2025 Machine Learning Marathon (MLM25)</a>. Participants explored protein language models and other ML methods to predict variant effects using data from <a href="https://www.mavedb.org/">MaveDB</a>, an open-source database of multiplexed assays of variant effect (MAVEs) containing over 7 million variant effect measurements.</p>
<section id="challenge-design" class="level3">
<h3 class="anchored" data-anchor-id="challenge-design">Challenge design</h3>
<ul>
<li><strong>Task</strong>: Predict the functional impact of protein variants using deep mutational scanning data.</li>
<li><strong>Domain</strong>: Computational biology – understanding how single amino acid changes affect protein function is critical for clinical variant interpretation and protein engineering.</li>
<li><strong>Methods</strong>: Protein language models (e.g., ESM), fine-tuning strategies, and variant effect predictors.</li>
</ul>
</section>
<section id="links" class="level3">
<h3 class="anchored" data-anchor-id="links">Links</h3>
<ul>
<li><strong>MaveDB</strong>: <a href="https://www.mavedb.org/">mavedb.org</a></li>
<li><strong>ML Marathon</strong>: <a href="https://ml-marathon.wisc.edu/">ml-marathon.wisc.edu</a></li>
</ul>
</section>
<section id="related-resources" class="level2">
<h2 class="anchored" data-anchor-id="related-resources">Related resources</h2>
<ul>
<li><a href="https://uw-madison-datascience.github.io/ML-X-Nexus/Toolbox/Libraries/HuggingFace.html"><strong>Library</strong>: Hugging Face</a>: Protein language models like ESM are hosted on Hugging Face — learn how to discover and fine-tune open-source models.</li>
<li><a href="https://uw-madison-datascience.github.io/ML-X-Nexus/Learn/Workshops/Intro-Deeplearning_PyTorch.html"><strong>Workshop</strong>: Intro to Deep Learning with PyTorch</a>: Learn deep learning fundamentals in PyTorch, the framework underlying most protein language models.</li>
</ul>
</section>

 ]]></description>
  <category>Projects</category>
  <category>ML Marathon</category>
  <category>MLM25</category>
  <category>Deep learning</category>
  <category>Protein language models</category>
  <category>Foundation models</category>
  <guid>https://uw-madison-datascience.github.io/ML-X-Nexus/Projects/ML-Marathon/MaveDB-Variant-Effect.html</guid>
  <pubDate>Thu, 11 Sep 2025 00:00:00 GMT</pubDate>
  <media:content url="https://uw-madison-datascience.github.io/ML-X-Nexus/images/MaveDB.png" medium="image" type="image/png" height="72" width="144"/>
</item>
<item>
  <title>AI’s Environmental Footprint: Insights and Actions</title>
  <dc:creator>Chris Endemann</dc:creator>
  <dc:creator>Nidhal Jegham</dc:creator>
  <link>https://uw-madison-datascience.github.io/ML-X-Nexus/Applications/Videos/Forums/mlx_2025-09-09.html</link>
  <description><![CDATA[ 




<p>This forum explores how ML/AI practitioners can measure and reduce the environmental costs of AI. It pairs two complementary efforts: one that retrieves emissions and cost data from sustainability reports using RAG, and another that benchmarks energy, water, and carbon footprints across large language models.</p>
<section id="wattbot-estimating-ai-emissions-and-costs-with-rag-chris-endemann-0224" class="level4">
<h4 class="anchored" data-anchor-id="wattbot-estimating-ai-emissions-and-costs-with-rag-chris-endemann-0224">WattBot: Estimating AI Emissions and Costs with RAG — Chris Endemann <a href="https://www.youtube.com/watch?v=2dCQS1jAbUo&amp;t=144s" target="_blank">02:24</a></h4>
<p>Chris introduces WattBot, a Kaggle challenge and retrieval-augmented generation (RAG) framework for estimating AI emissions and compute costs. Using 35+ papers and 300+ curated Q&amp;A pairs, teams build systems that return citation-backed answers or explicitly state when evidence is missing—promoting transparency and reproducibility in sustainability reporting.</p>
<ul>
<li><strong>Kaggle challenge</strong>: <a href="https://www.kaggle.com/competitions/WattBot2025/overview">kaggle.com/competitions/WattBot2025/overview</a></li>
</ul>
</section>
<section id="how-hungry-is-ai-benchmarking-energy-water-and-carbon-footprint-of-llm-inference-nidhal-jegham-0907" class="level4">
<h4 class="anchored" data-anchor-id="how-hungry-is-ai-benchmarking-energy-water-and-carbon-footprint-of-llm-inference-nidhal-jegham-0907">How Hungry is AI? Benchmarking Energy, Water, and Carbon Footprint of LLM Inference — Nidhal Jegham <a href="https://www.youtube.com/watch?v=2dCQS1jAbUo&amp;t=547s" target="_blank">09:07</a></h4>
<p>Nidhal presents a reproducible framework to estimate per-request energy, water, and carbon use for open and proprietary LLMs. The method combines hardware assumptions (A100–H200 GPUs), data center multipliers (PUE, WUE, CIF), and a DEA-style efficiency score that balances model accuracy against environmental cost.</p>
<ul>
<li><strong>Preprint</strong>: <a href="https://arxiv.org/abs/2505.09598">arXiv:2505.09598</a></li>
<li><strong>Dashboard</strong>: <a href="https://app.powerbi.com/view?r=eyJrIjoiZjVmOTI0MmMtY2U2Mi00ZTE2LTk2MGYtY2ZjNDMzODZkMjlmIiwidCI6IjQyNmQyYThkLTljY2QtNDI1NS04OTNkLTA2ODZhMzJjMTY4ZCIsImMiOjF9">Power BI dashboard</a></li>
</ul>
<div class="quarto-video ratio ratio-16x9"><iframe data-external="1" src="https://www.youtube.com/embed/2dCQS1jAbUo" title="" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe></div>
</section>
<section id="key-points" class="level3">
<h3 class="anchored" data-anchor-id="key-points">Key points</h3>
<ul>
<li>Data center efficiency and GPU generation (A100–H200) drive impact as much as model size.</li>
<li>Environmental multipliers like PUE (Power Usage Effectiveness) and WUE (Water Usage Effectiveness) are critical for cross-site comparisons.</li>
<li>Efficiency gains are not absolute: under the Jevons paradox, lower per-query cost can increase overall usage.</li>
<li>U.S. regulation remains minimal, making voluntary transparency efforts (like Mistral’s) especially important.</li>
<li>Renewable energy sourcing and liquid cooling are among the most actionable interventions.</li>
<li>Academic and industry collaborations can close data gaps through open benchmarking.</li>
<li>Aggregate usage, not single-query cost, drives total environmental footprint.</li>
<li>Reporting environmental impact alongside accuracy metrics is an emerging best practice.</li>
</ul>
</section>
<section id="related-resources" class="level2">
<h2 class="anchored" data-anchor-id="related-resources">Related resources</h2>
<ul>
<li><a href="https://uw-madison-datascience.github.io/ML-X-Nexus/Projects/ML-Marathon/WattBot-2025.html"><strong>Project</strong>: WattBot 2025</a>: Full project page for the WattBot ML Marathon challenge, including challenge design, winning approach, and related resources.</li>
<li><a href="https://uw-madison-datascience.github.io/ML-X-Nexus/Applications/Videos/Forums/mlx_2026-02-17.html"><strong>Talk</strong>: Deploying RAG in Bedrock vs.&nbsp;Local: WattBot 2025 Case Study</a>: See how the winning WattBot RAG system was deployed in AWS Bedrock and locally with open-source models.</li>
<li><a href="https://uw-madison-datascience.github.io/ML-X-Nexus/Learn/Notebooks/2025-05-07_RAG-Romeo-Juliet.html"><strong>Notebook</strong>: Exploring RAG with Romeo and Juliet</a>: Learn how to build an end-to-end retrieval augmented generation (RAG) pipeline using Shakespeare’s Romeo and Juliet as example text.</li>
<li><a href="https://uw-madison-datascience.github.io/ML-X-Nexus/Applications/Videos/Forums/"><strong>Video Archive</strong>: ML+X forum archive</a>: Check out other recorded forums from ML+X.</li>
<li><a href="https://ml-marathon.wisc.edu/">Machine Learning Marathon</a>: Learn about the annual Machine Learning Marathon (3-month AI/ML hackathon) hosted by ML+X each fall.</li>
</ul>
</section>

 ]]></description>
  <category>Videos</category>
  <category>ML+X</category>
  <category>UW-Madison</category>
  <category>Trustworthy AI</category>
  <category>Sustainability</category>
  <category>Energy</category>
  <category>Benchmarking</category>
  <category>LLM</category>
  <category>RAG</category>
  <category>Retrieval</category>
  <category>Cloud</category>
  <category>GPU</category>
  <guid>https://uw-madison-datascience.github.io/ML-X-Nexus/Applications/Videos/Forums/mlx_2025-09-09.html</guid>
  <pubDate>Tue, 09 Sep 2025 00:00:00 GMT</pubDate>
  <media:content url="https://img.youtube.com/vi/2dCQS1jAbUo/maxresdefault.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>WattBot 2025: Estimating AI Emissions with RAG</title>
  <dc:creator>Chris Endemann</dc:creator>
  <link>https://uw-madison-datascience.github.io/ML-X-Nexus/Projects/ML-Marathon/WattBot-2025.html</link>
  <description><![CDATA[ 




<p>WattBot was an “Active” challenge in the <a href="https://ml-marathon.wisc.edu/">2025 Machine Learning Marathon (MLM25)</a>. Teams built retrieval-augmented generation (RAG) systems to extract credible, citation-backed emissions and cost estimates for AI workloads from a corpus of 35+ peer-reviewed papers and 300+ curated Q&amp;A pairs. Systems were expected to return citation-grounded answers or explicitly abstain when evidence was missing – promoting transparency and reproducibility in sustainability reporting.</p>
<section id="challenge-design" class="level3">
<h3 class="anchored" data-anchor-id="challenge-design">Challenge design</h3>
<ul>
<li><strong>Task</strong>: Given a natural-language question about AI energy use, water consumption, or carbon emissions, retrieve relevant passages from the provided corpus and generate a citation-backed answer.</li>
<li><strong>Evaluation</strong>: Answers were scored on factual accuracy, proper citation, and appropriate abstention when evidence was insufficient.</li>
<li><strong>Corpus</strong>: 35+ academic papers covering AI sustainability, energy benchmarking, and environmental impact reporting.</li>
</ul>
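<p>To make the abstention criterion concrete, here is a toy scoring rule. The weights and rules below are purely hypothetical and are not the actual competition metric; the point is that abstaining pays off only when the corpus genuinely lacks evidence.</p>

```python
# Toy abstention-aware scoring rule. Weights are hypothetical,
# NOT the actual WattBot competition metric.
def score_answer(correct, cited, abstained, evidence_exists):
    """Return 0-100 points for one answer (illustrative rule only)."""
    if abstained:
        # Abstention is rewarded only when no supporting evidence exists.
        return 100 if not evidence_exists else 0
    points = 70 if correct else 0   # factual accuracy dominates
    points += 30 if cited else 0    # citation grounding
    return points
```

Under a rule like this, a confident wrong answer scores worse than a justified abstention, matching the challenge’s emphasis on transparency.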
</section>
<section id="winning-approach" class="level3">
<h3 class="anchored" data-anchor-id="winning-approach">Winning approach</h3>
<p>The winning solution by <a href="https://github.com/KohakuBlueleaf/KohakuRAG">KohakuBlueleaf</a> used a RAG pipeline that was later replicated and deployed in both AWS Bedrock and locally with open-source Hugging Face models. See the follow-up talk below for deployment details.</p>
</section>
<section id="links" class="level3">
<h3 class="anchored" data-anchor-id="links">Links</h3>
<ul>
<li><strong>Kaggle challenge</strong>: <a href="https://www.kaggle.com/competitions/WattBot2025/overview">WattBot 2025</a></li>
<li><strong>Winning solution</strong>: <a href="https://github.com/KohakuBlueleaf/KohakuRAG">KohakuBlueleaf/KohakuRAG</a></li>
<li><strong>Deployment repo</strong>: <a href="https://github.com/matteso1/KohakuRAG_UI/">WattBot in Bedrock and Local</a></li>
</ul>
</section>
<section id="related-resources" class="level2">
<h2 class="anchored" data-anchor-id="related-resources">Related resources</h2>
<ul>
<li><a href="https://uw-madison-datascience.github.io/ML-X-Nexus/Applications/Videos/Forums/mlx_2026-02-17.html"><strong>Talk</strong>: Deploying RAG in Bedrock vs.&nbsp;Local: WattBot 2025 Case Study</a>: Follow-up ML+X forum where the winning RAG system was deployed in AWS Bedrock and locally with open-source models.</li>
<li><a href="https://uw-madison-datascience.github.io/ML-X-Nexus/Applications/Videos/Forums/mlx_2025-09-09.html"><strong>Talk</strong>: AI’s Environmental Footprint: Insights and Actions</a>: The ML+X forum where WattBot was first introduced alongside LLM energy benchmarking work.</li>
<li><a href="https://uw-madison-datascience.github.io/ML-X-Nexus/Learn/Notebooks/2025-05-07_RAG-Romeo-Juliet.html"><strong>Notebook</strong>: Exploring Fact-Based QA with RAG: Romeo and Juliet</a>: Build an end-to-end RAG pipeline from scratch – a great starting point before tackling WattBot.</li>
<li><a href="https://uw-madison-datascience.github.io/ML-X-Nexus/Learn/Workshops/Intro-Amazon_SageMaker.html"><strong>Workshop</strong>: Intro to AWS SageMaker for Predictive ML/AI</a>: Covers AWS SageMaker and Bedrock for cloud-based ML/AI workflows, including RAG deployment.</li>
<li><a href="https://uw-madison-datascience.github.io/ML-X-Nexus/Learn/Notebooks/Quantization-and-Precision.html"><strong>Notebook</strong>: Understanding Quantization and Precision</a>: Learn how quantization (e.g., 4-bit) reduces model size and memory requirements – relevant to the local deployment approach used in this project.</li>
</ul>
</section>

 ]]></description>
  <category>Projects</category>
  <category>ML Marathon</category>
  <category>MLM25</category>
  <category>RAG</category>
  <category>Retrieval</category>
  <category>LLM</category>
  <category>NLP</category>
  <category>Sustainability</category>
  <category>Energy</category>
  <category>GenAI</category>
  <guid>https://uw-madison-datascience.github.io/ML-X-Nexus/Projects/ML-Marathon/WattBot-2025.html</guid>
  <pubDate>Tue, 09 Sep 2025 00:00:00 GMT</pubDate>
  <media:content url="https://uw-madison-datascience.github.io/ML-X-Nexus/images/WattBot.png" medium="image" type="image/png" height="72" width="144"/>
</item>
<item>
  <title>TorchAudio</title>
  <dc:creator>Andrew Piela</dc:creator>
  <link>https://uw-madison-datascience.github.io/ML-X-Nexus/Toolbox/Libraries/torchaudio.html</link>
  <description><![CDATA[ 




<p><a href="https://docs.pytorch.org/audio/stable/index.html">TorchAudio</a> is the PyTorch team’s audio library for bringing modern audio signal processing into deep learning workflows, with GPU-friendly tools for audio I/O, feature extraction, and augmentation. Its I/O utilities decode formats such as WAV, MP3, and FLAC into PyTorch tensors of shape <code>(channels, time)</code> with <code>torchaudio.load</code>, and write tensors back to audio files with <code>torchaudio.save</code>. It also provides differentiable transforms (STFT, mel/CQT spectrograms, MFCC) and SoX-based effects for augmentation (pitch/tempo changes, masking). Because it is built on PyTorch, it slots cleanly into <code>DataLoader</code> and <code>nn.Module</code> pipelines, making it well suited to speech recognition, music transcription, and other audio ML systems.</p>
<section id="key-features" class="level4">
<h4 class="anchored" data-anchor-id="key-features">Key features</h4>
<ul>
<li><strong>Tensor-first transforms</strong>
<ul>
<li><code>MelSpectrogram</code>, CQT, <code>MFCC</code>, <code>Resample</code>, <code>AmplitudeToDB</code>.</li>
</ul></li>
<li><strong>Audio I/O</strong>
<ul>
<li>Load/save WAV/MP3/FLAC straight to <code>torch.Tensor</code> (CPU/GPU-ready).</li>
</ul></li>
<li><strong>Augmentation</strong>
<ul>
<li>Pitch/tempo changes, masking, and noise via SoX effects.</li>
</ul></li>
<li><strong>Performance</strong>: Supports batching and GPU acceleration through PyTorch and works well with <code>DataLoader</code>.</li>
</ul>
</section>
<section id="integration-and-compatibility" class="level2">
<h2 class="anchored" data-anchor-id="integration-and-compatibility">Integration and compatibility</h2>
<p>TorchAudio integrates with various machine learning frameworks and libraries, making it versatile for a range of tasks.</p>
<ul>
<li><strong>Frameworks Supported</strong>: PyTorch</li>
<li><strong>Compatible Libraries</strong>: NumPy, SciPy, librosa (complementary analysis), pretty_midi (export MIDI)</li>
<li><strong>Installation Instructions</strong>: <code>pip install torchaudio</code></li>
</ul>
</section>
<section id="use-cases" class="level2">
<h2 class="anchored" data-anchor-id="use-cases">Use cases</h2>
<p>Here are some examples of how TorchAudio can be applied to different machine learning tasks.</p>
<ul>
<li><strong>Use Case 1</strong>: WAV-to-MIDI transcription
<ul>
<li>Preprocess audio to log-mel/CQT tensors, then train CNN/CRNN models for frame-wise note/onset prediction.</li>
</ul></li>
<li><strong>Use Case 2</strong>: Data augmentation
<ul>
<li>Pitch/tempo shifts to increase robustness.</li>
</ul></li>
</ul>
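<p>The augmentation use case can be sketched with TorchAudio’s SoX-effects interface. This is a minimal sketch rather than tuned settings: <code>apply_effects_tensor</code> and the <code>pitch</code>/<code>tempo</code> effect names follow the TorchAudio docs, while the specific shift amounts below are illustrative.</p>

```python
# Sketch of pitch/tempo augmentation via TorchAudio's SoX effects.
# The effect values are illustrative, not tuned hyperparameters.

def make_effect_chains():
    """Example SoX effect chains: pitch shift (in cents) and tempo change."""
    return [
        [["pitch", "200"], ["rate", "22050"]],  # up ~2 semitones, then restore rate
        [["tempo", "1.10"]],                    # 10% faster without changing pitch
    ]

def augment_file(path):
    """Apply each effect chain to one audio file (requires torchaudio)."""
    import torchaudio

    w, sr = torchaudio.load(path)  # tensor of shape (channels, time)
    return [
        torchaudio.sox_effects.apply_effects_tensor(w, sr, chain)
        for chain in make_effect_chains()
    ]

# e.g. augmented = augment_file("path/to/audio.wav")
```

Generating several augmented variants per clip this way is a cheap path to the robustness gains mentioned above.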
</section>
<section id="tutorials-and-resources" class="level2">
<h2 class="anchored" data-anchor-id="tutorials-and-resources">Tutorials and resources</h2>
<section id="getting-started" class="level4">
<h4 class="anchored" data-anchor-id="getting-started">Getting started</h4>
<ul>
<li><p><strong><a href="https://www.youtube.com/watch?v=3mju52xBFK8">Official Tutorial</a></strong></p></li>
<li><p>Code snippet (loads audio as a tensor, downmixes to mono, resamples so every file shares the same sample rate, and converts it to a log-mel spectrogram ready for model input):</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># wav -&gt; log-mel spectrogram tensor (which would be model input)</span></span>
<span id="cb1-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> torchaudio</span>
<span id="cb1-3"></span>
<span id="cb1-4">w, sr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torchaudio.load(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"path/to/audio.wav"</span>)</span>
<span id="cb1-5">w <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> w.mean(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, keepdim<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>) <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> w.size(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span> w</span>
<span id="cb1-6"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> sr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">22050</span>: w <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torchaudio.transforms.Resample(sr, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">22050</span>)(w)<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> sr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">22050</span></span>
<span id="cb1-7"></span>
<span id="cb1-8">mel <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torchaudio.transforms.MelSpectrogram(sample_rate<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>sr, n_fft<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2048</span>, hop_length<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">512</span>, n_mels<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">128</span>)</span>
<span id="cb1-9">X_db <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torchaudio.transforms.AmplitudeToDB(stype<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"power"</span>)(mel(w))</span>
<span id="cb1-10"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(X_db.shape)  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># shape would be (1, 128, T)  </span></span></code></pre></div></div></li>
</ul>
</section>
<section id="high-level-tips-for-effective-use" class="level4">
<h4 class="anchored" data-anchor-id="high-level-tips-for-effective-use">High-level tips for effective use</h4>
<ul>
<li><strong>Optimization</strong>: Precompute log-mel spectrograms to speed up training.</li>
<li><strong>Memory Management</strong>: Use modest <code>n_mels</code> and <code>hop_length</code> values, and batch by time frames.</li>
<li><strong>Common Pitfalls</strong>: Inconsistent sample rates and hop lengths are a frequent source of bugs; keep them identical between training and inference.</li>
</ul>
</section>
<section id="related-libraries-tools" class="level4">
<h4 class="anchored" data-anchor-id="related-libraries-tools">Related libraries &amp; tools</h4>
<ul>
<li><strong>librosa</strong>: Extra music information retrieval utilities, such as chroma features and beat tracking.</li>
<li><strong>pretty_midi</strong>: Converts frame-wise predictions into MIDI files for listening and evaluation.</li>
</ul>
</section>
</section>
<section id="related-resources" class="level2">
<h2 class="anchored" data-anchor-id="related-resources">Related resources</h2>
<ul>
<li><a href="https://uw-madison-datascience.github.io/ML-X-Nexus/Applications/Blogs/blog-music-identification.html"><strong>Blog</strong>: What Tune Is That? A Humanities Application of Deep Learning</a>: A Nexus community post applying deep learning to audio and music identification.</li>
<li><a href="https://uw-madison-datascience.github.io/ML-X-Nexus/Learn/Workshops/Intro-Deeplearning_PyTorch.html"><strong>Workshop</strong>: Intro to Deep Learning with PyTorch</a>: TorchAudio is built on PyTorch — start here if you’re new to the framework.</li>
<li><a href="https://docs.pytorch.org/audio/stable/index.html">TorchAudio Documentation</a>: Includes official API as well as tutorials.</li>
<li><a href="https://www.youtube.com/watch?v=3mju52xBFK8">Getting Started With Torchaudio | PyTorch Tutorial (YouTube)</a>: AssemblyAI video walking through TorchAudio basics, including resampling and loading an audio dataset.</li>
<li><a href="https://magenta.tensorflow.org/datasets/maestro">The MAESTRO Dataset</a>: Popular dataset containing hundreds of paired audio and MIDI recordings that can be processed with TorchAudio and used for training.</li>
</ul>
</section>

 ]]></description>
  <category>Libraries</category>
  <category>Audio data</category>
  <category>PyTorch</category>
  <category>Music transcription</category>
  <category>Deep learning</category>
  <category>Signal processing</category>
  <guid>https://uw-madison-datascience.github.io/ML-X-Nexus/Toolbox/Libraries/torchaudio.html</guid>
  <pubDate>Tue, 26 Aug 2025 00:00:00 GMT</pubDate>
  <media:content url="https://uw-madison-datascience.github.io/ML-X-Nexus/images/pexels-pixabay-257904.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>GeoDeepDive: Unlocking Knowledge from Scientific Literature</title>
  <dc:creator>Devanshi Jain</dc:creator>
  <link>https://uw-madison-datascience.github.io/ML-X-Nexus/Learn/Notebooks/geodeepdive.html</link>
  <description><![CDATA[ 




<p><a href="https://colab.research.google.com/github/UW-Madison-DataScience/ML-X-Nexus/blob/main/Learn/Notebooks/geodeepdive.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" class="img-fluid"></a></p>
<section id="overview" class="level2">
<h2 class="anchored" data-anchor-id="overview">Overview</h2>
<p><a href="https://geodeepdive.org/">GeoDeepDive (GDD)</a> is a cyberinfrastructure project designed to accelerate scientific discovery by extracting information from the vast and growing body of published scientific literature. While its roots are in geology, its applications span any domain that relies on published texts, including biology, materials science, medicine, and social sciences.</p>
<p>At its core, GDD is a massive database of over 15 million scientific documents (articles, theses, reports) that have been processed through a high-performance computing pipeline. This pipeline performs:</p>
<ul>
<li><strong>Optical Character Recognition (OCR)</strong> to convert scanned PDFs into machine-readable text.</li>
<li><strong>Natural Language Processing (NLP)</strong> to parse sentences, identify parts of speech, and perform named entity recognition (e.g., finding mineral names, locations, species).</li>
<li><strong>Relation Extraction</strong> to find and catalog relationships between entities (e.g., “mineral X is found at location Y”).</li>
</ul>
<p>The result is not just a collection of texts, but a structured, queryable knowledge graph. Researchers can use GDD’s public API to ask complex questions that would be impossible to answer by manual literature review, such as “find all papers that mention a specific fossil and its geological age” or “extract all measured values of a particular chemical compound.”</p>
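<p>A query like the ones above takes only a few lines of Python. The sketch below is minimal and unofficial: the <code>snippets</code> endpoint and <code>term</code>/<code>full_results</code> parameters follow the public GDD API, but treat the response layout (<code>success.data</code>) as an assumption to verify against the API documentation.</p>

```python
# Minimal sketch of a GeoDeepDive snippets query.
# Endpoint and parameter names follow the public GDD API; the response
# layout used in fetch_snippets() is an assumption to verify in the docs.
from urllib.parse import urlencode

API_BASE = "https://geodeepdive.org/api/snippets"

def build_snippets_url(term, full_results=False):
    """Build a snippets-search URL for a given search term."""
    params = {"term": term}
    if full_results:
        params["full_results"] = "true"
    return f"{API_BASE}?{urlencode(params)}"

def fetch_snippets(term):
    """Fetch matching snippets (requires the requests package and network access)."""
    import requests

    resp = requests.get(build_snippets_url(term), timeout=30)
    resp.raise_for_status()
    return resp.json().get("success", {}).get("data", [])

# e.g. for doc in fetch_snippets("stishovite")[:3]: print(doc.get("title"))
```

The same pattern generalizes to other entities: swap the term for a fossil, compound, or location name to sweep the literature programmatically.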
</section>
<section id="prerequisites" class="level2">
<h2 class="anchored" data-anchor-id="prerequisites">Prerequisites</h2>
<ul>
<li><strong>Basic familiarity with Python</strong> and making HTTP requests.</li>
<li>A <strong>GitHub account</strong> (to use GDD’s public API).</li>
<li>An understanding of basic <strong>NLP concepts</strong> (Token, Sentence, Named Entity) is helpful but not strictly required to run the example.</li>
</ul>
</section>
<section id="key-concepts-and-definitions" class="level2">
<h2 class="anchored" data-anchor-id="key-concepts-and-definitions">Key Concepts and Definitions</h2>
<table class="caption-top table">
<colgroup>
<col style="width: 50%">
<col style="width: 50%">
</colgroup>
<thead>
<tr class="header">
<th style="text-align: left;">Concept</th>
<th style="text-align: left;">Definition</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td style="text-align: left;"><strong>Document</strong></td>
<td style="text-align: left;">Any processed text unit in the GDD database, typically a scientific publication.</td>
</tr>
<tr class="even">
<td style="text-align: left;"><strong>NLP</strong></td>
<td style="text-align: left;">Natural Language Processing, the field of AI concerned with interactions between computers and human language.</td>
</tr>
<tr class="odd">
<td style="text-align: left;"><strong>Named Entity Recognition (NER)</strong></td>
<td style="text-align: left;">An NLP task to identify and classify key information (entities) in text into predefined categories like persons, organizations, locations, etc. In GDD, these are often scientific terms.</td>
</tr>
<tr class="even">
<td style="text-align: left;"><strong>API (Application Programming Interface)</strong></td>
<td style="text-align: left;">A set of rules and tools that allows different software applications to communicate with each other. GDD provides an API to query its database programmatically.</td>
</tr>
<tr class="odd">
<td style="text-align: left;"><strong>JSON</strong></td>
<td style="text-align: left;">JavaScript Object Notation, a lightweight data-interchange format that is easy for humans to read and write and for machines to parse and generate. It is the primary format for data returned by the GDD API.</td>
</tr>
</tbody>
</table>
</section>
<section id="tutorial-querying-the-geodeepdive-api-for-mineral-mentions" class="level2">
<h2 class="anchored" data-anchor-id="tutorial-querying-the-geodeepdive-api-for-mineral-mentions">Tutorial: Querying the GeoDeepDive API for Mineral Mentions</h2>
<p>This tutorial will guide you through a simple example of using Python to query the GeoDeepDive API to find sentences that mention the mineral “stishovite.”</p>
<section id="step-1-get-your-github-token" class="level3">
<h3 class="anchored" data-anchor-id="step-1-get-your-github-token">Step 1: Get Your GitHub Token</h3>
<p>The GDD API uses GitHub OAuth for authentication. You need to generate a personal access token.</p>
<ol type="1">
<li>Go to your GitHub <a href="https://github.com/settings/profile">Settings</a>.</li>
<li>Navigate to <strong>Developer settings</strong> &gt; <strong>Personal access tokens</strong> &gt; <strong>Tokens (classic)</strong>.</li>
<li>Click <strong>Generate new token (classic)</strong>. Give it a descriptive note (e.g., “GeoDeepDive API”).</li>
<li>Select the <code>public_repo</code> scope. This is sufficient.</li>
<li>Click <strong>Generate token</strong> and <strong>copy the token immediately</strong> (you won’t see it again!).</li>
</ol>
</section>
<section id="step-2-set-up-your-python-environment" class="level3">
<h3 class="anchored" data-anchor-id="step-2-set-up-your-python-environment">Step 2: Set Up Your Python Environment</h3>
<p>We’ll use the <code>requests</code> library to make HTTP calls. Let’s install it:</p>
<div id="7ab02fdf" class="cell" data-execution_count="1">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span>pip install requests</span></code></pre></div></div>
</div>
</section>
<section id="step-3-configure-your-authentication" class="level3">
<h3 class="anchored" data-anchor-id="step-3-configure-your-authentication">Step 3: Configure Your Authentication</h3>
<p>Now, let’s set up your authentication. Replace the placeholders with your actual GitHub credentials:</p>
<div id="d9462cbf" class="cell" data-execution_count="2">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> requests</span>
<span id="cb2-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> json</span>
<span id="cb2-3"></span>
<span id="cb2-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Replace these with your actual GitHub credentials</span></span>
<span id="cb2-5">GITHUB_USERNAME <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"YourGitHubUsername"</span> </span>
<span id="cb2-6">GITHUB_TOKEN <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"YOUR_GITHUB_TOKEN"</span>  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Replace with the token you generated</span></span>
<span id="cb2-7"></span>
<span id="cb2-8"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Authentication configured successfully!"</span>)</span></code></pre></div></div>
</div>
</section>
<section id="step-4-query-the-geodeepdive-api" class="level3">
<h3 class="anchored" data-anchor-id="step-4-query-the-geodeepdive-api">Step 4: Query the GeoDeepDive API</h3>
<p>Let’s search for documents mentioning the mineral “stishovite”:</p>
<div id="5f6c7171" class="cell" data-execution_count="3">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># The public endpoint for the GDD API</span></span>
<span id="cb3-2">url <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"https://geodeepdive.org/api/articles"</span></span>
<span id="cb3-3"></span>
<span id="cb3-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># The parameters for our query. We want sentences about 'stishovite'</span></span>
<span id="cb3-5">params <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {</span>
<span id="cb3-6">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"term"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"stishovite"</span>,   <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># The word or phrase to search for</span></span>
<span id="cb3-7">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"full_results"</span>: <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>,   <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Get full details, including sentences</span></span>
<span id="cb3-8">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"sentences"</span>: <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>       <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Include the sentences in the response</span></span>
<span id="cb3-9">}</span>
<span id="cb3-10"></span>
<span id="cb3-11"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Make the GET request to the API with authentication</span></span>
<span id="cb3-12">response <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> requests.get(url, params<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>params, auth<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>(GITHUB_USERNAME, GITHUB_TOKEN))</span>
<span id="cb3-13"></span>
<span id="cb3-14"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Check if the request was successful</span></span>
<span id="cb3-15"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> response.status_code <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">200</span>:</span>
<span id="cb3-16">    data <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> response.json()</span>
<span id="cb3-17">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Found </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>data[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'success'</span>][<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'total'</span>]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> documents mentioning 'stishovite'.</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\n</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb3-18">    </span>
<span id="cb3-19">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Loop through the first few documents and print relevant sentences</span></span>
<span id="cb3-20">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> i, doc <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">enumerate</span>(data[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'success'</span>][<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'data'</span>][:<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>]):  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Look at first 3 docs</span></span>
<span id="cb3-21">        <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Document </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>i<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>doc[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'_gddid'</span>]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb3-22">        <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"   Title: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>doc<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>get(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'title'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'No title available'</span>)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb3-23">        <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"   Sentences found:"</span>)</span>
<span id="cb3-24">        </span>
<span id="cb3-25">        stishovite_sentences <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [s <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> s <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> doc[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'sentences'</span>] <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'stishovite'</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> s[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'text'</span>].lower()]</span>
<span id="cb3-26">        </span>
<span id="cb3-27">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> j, sentence <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">enumerate</span>(stishovite_sentences[:<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>]):  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Show first 2 sentences per doc</span></span>
<span id="cb3-28">            <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"     </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>j<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">. </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>sentence[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'text'</span>]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb3-29">        </span>
<span id="cb3-30">        <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"   Total sentences with 'stishovite': </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(stishovite_sentences)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb3-31">        <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\n</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"─"</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">80</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\n</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb3-32"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span>:</span>
<span id="cb3-33">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Error: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>response<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>status_code<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb3-34">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(response.text)</span></code></pre></div></div>
</div>
</section>
<section id="step-5-advanced-query---filter-by-journal" class="level3">
<h3 class="anchored" data-anchor-id="step-5-advanced-query---filter-by-journal">Step 5: Advanced Query - Filter by Journal</h3>
<p>Let’s narrow the search to papers published in particular journals:</p>
<div id="fc69d4fb" class="cell" data-execution_count="4">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Search for stishovite in specific journals</span></span>
<span id="cb4-2">advanced_params <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {</span>
<span id="cb4-3">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"term"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"stishovite"</span>,</span>
<span id="cb4-4">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"journal"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"science,nature,geology"</span>,  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Filter by journal names</span></span>
<span id="cb4-5">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"full_results"</span>: <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>,</span>
<span id="cb4-6">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"sentences"</span>: <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>,</span>
<span id="cb4-7">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"limit"</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Limit to 5 results</span></span>
<span id="cb4-8">}</span>
<span id="cb4-9"></span>
<span id="cb4-10">advanced_response <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> requests.get(url, params<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>advanced_params, auth<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>(GITHUB_USERNAME, GITHUB_TOKEN))</span>
<span id="cb4-11"></span>
<span id="cb4-12"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> advanced_response.status_code <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">200</span>:</span>
<span id="cb4-13">    advanced_data <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> advanced_response.json()</span>
<span id="cb4-14">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"🔍 Found </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>advanced_data[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'success'</span>][<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'total'</span>]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> documents in specified journals."</span>)</span>
<span id="cb4-15">    </span>
<span id="cb4-16">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> advanced_data[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'success'</span>][<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'total'</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>:</span>
<span id="cb4-17">        <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\n</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">📊 Journal distribution:"</span>)</span>
<span id="cb4-18">        journals <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {}</span>
<span id="cb4-19">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> doc <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> advanced_data[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'success'</span>][<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'data'</span>]:</span>
<span id="cb4-20">            journal <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> doc.get(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'journal'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Unknown'</span>)</span>
<span id="cb4-21">            journals[journal] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> journals.get(journal, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb4-22">        </span>
<span id="cb4-23">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> journal, count <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> journals.items():</span>
<span id="cb4-24">            <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"   </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>journal<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>count<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> documents"</span>)</span>
<span id="cb4-25">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span>:</span>
<span id="cb4-26">        <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"No documents found in the specified journals."</span>)</span>
<span id="cb4-27"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span>:</span>
<span id="cb4-28">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Advanced query error: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>advanced_response<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>status_code<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span></code></pre></div></div>
</div>
</section>
<section id="step-6-export-results-optional" class="level3">
<h3 class="anchored" data-anchor-id="step-6-export-results-optional">Step 6: Export Results (Optional)</h3>
<p>Let’s export the results to a JSON file for further analysis:</p>
<div id="22b7986f" class="cell" data-execution_count="5">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> json</span>
<span id="cb5-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> datetime <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> datetime</span>
<span id="cb5-3"></span>
<span id="cb5-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Export the results</span></span>
<span id="cb5-5"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> response.status_code <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">200</span>:</span>
<span id="cb5-6">    export_data <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {</span>
<span id="cb5-7">        <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"query"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"stishovite"</span>,</span>
<span id="cb5-8">        <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"execution_date"</span>: datetime.now().isoformat(),</span>
<span id="cb5-9">        <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"total_documents"</span>: data[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'success'</span>][<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'total'</span>],</span>
<span id="cb5-10">        <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"sample_documents"</span>: data[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'success'</span>][<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'data'</span>][:<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>]  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># First 5 documents</span></span>
<span id="cb5-11">    }</span>
<span id="cb5-12">    </span>
<span id="cb5-13">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">with</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">open</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'geodeepdive_results.json'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'w'</span>) <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> f:</span>
<span id="cb5-14">        json.dump(export_data, f, indent<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>)</span>
<span id="cb5-15">    </span>
<span id="cb5-16">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"💾 Results exported to 'geodeepdive_results.json'"</span>)</span>
<span id="cb5-17">    </span>
<span id="cb5-18">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Show a preview</span></span>
<span id="cb5-19">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\n</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">📋 Preview of exported data:"</span>)</span>
<span id="cb5-20">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Total documents: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>export_data[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'total_documents'</span>]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb5-21">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Sample size: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(export_data[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'sample_documents'</span>])<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span></code></pre></div></div>
</div>
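<p>As a quick sanity check, a small helper (not part of the original tutorial code) can reload the exported file and report what was saved:</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">import json

def load_results(path="geodeepdive_results.json"):
    """Reload an exported results file and return (total, sample_size)."""
    with open(path) as f:
        saved = json.load(f)
    return saved["total_documents"], len(saved["sample_documents"])
</code></pre></div>
<p>After running the export cell, <code>load_results()</code> should return the same totals that were printed above.</p>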
</section>
</section>
<section id="summary" class="level2">
<h2 class="anchored" data-anchor-id="summary">Summary</h2>
<p>GeoDeepDive is a powerful tool for moving beyond simple keyword searches to true knowledge extraction. By providing programmatic access to a deeply processed corpus of scientific literature, it enables researchers to ask complex, data-driven questions at a scale that manual literature review cannot match.</p>
<ul>
<li><strong>Key Takeaway:</strong> GDD turns unstructured text into structured, queryable data.</li>
<li><strong>What we accomplished:</strong> We successfully queried the GeoDeepDive API, retrieved scientific documents mentioning “stishovite,” filtered results by journal, and exported the data for further analysis.</li>
</ul>
</section>
<section id="additional-resources" class="level2">
<h2 class="anchored" data-anchor-id="additional-resources">Additional Resources</h2>
<ul>
<li><a href="https://geodeepdive.org/">GeoDeepDive Official Website</a></li>
<li><a href="https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/managing-your-personal-access-tokens">GitHub Guide for Personal Access Tokens</a></li>
<li><a href="https://requests.readthedocs.io/">Requests: HTTP for Humans (Python Library Docs)</a></li>
</ul>
<p><strong>Note:</strong> Remember to keep your GitHub token secure and never share it publicly. For production use, consider using environment variables or secure secret management.</p>
</section>
<section id="related-resources" class="level2">
<h2 class="anchored" data-anchor-id="related-resources">Related resources</h2>
<ul>
<li><a href="https://uw-madison-datascience.github.io/ML-X-Nexus/Learn/Workshops/Intro-TextAnalysis_Python.html"><strong>Workshop</strong>: Intro to Text Analysis / NLP</a>: Covers NLP fundamentals like tokenization and named entity recognition used by GeoDeepDive’s pipeline.</li>
<li><a href="https://uw-madison-datascience.github.io/ML-X-Nexus/Toolbox/Libraries/PyTesseract.html"><strong>Library</strong>: Pytesseract</a>: An OCR tool for extracting text from images — a key step in GeoDeepDive’s document processing.</li>
<li><a href="https://uw-madison-datascience.github.io/ML-X-Nexus/Toolbox/Data/Gutenberg.html"><strong>Data</strong>: Project Gutenberg</a>: Another large-scale text corpus useful for NLP and text mining research.</li>
</ul>
</section>
<section id="comments" class="level2">
<h2 class="anchored" data-anchor-id="comments">Comments</h2>


</section>

 ]]></description>
  <category>Notebooks</category>
  <category>Data</category>
  <category>NLP</category>
  <category>OCR</category>
  <category>Text analysis</category>
  <category>Retrieval</category>
  <guid>https://uw-madison-datascience.github.io/ML-X-Nexus/Learn/Notebooks/geodeepdive.html</guid>
  <pubDate>Thu, 21 Aug 2025 00:00:00 GMT</pubDate>
  <media:content url="https://uw-madison-datascience.github.io/ML-X-Nexus/images/geodeepdive_pipeline.jpg" medium="image" type="image/jpeg"/>
</item>
</channel>
</rss>
