Center for High Throughput Computing (CHTC)

Categories: Compute, UW-Madison, GPU, HPC, HTC

Author: Chris Endemann

Published: June 25, 2024

Established in 2006, the Center for High Throughput Computing (CHTC) is UW-Madison’s core computational service provider for large-scale computing. CHTC offers two main systems — a High Throughput Computing (HTC) pool and a High Performance Computing (HPC) cluster — along with GPUs, high-memory servers, data storage, personalized consulting, and classroom support, all at no cost to UW-Madison researchers.

CHTC services are open to UW-Madison staff, students, faculty, and external collaborators (with a UW faculty/staff sponsor).

Tip: Get started now

Request an account to start using CHTC. A Research Computing Facilitator will follow up to discuss your computational needs and help you get set up.

Why use CHTC for ML/AI?

If you’re a UW-Madison researcher looking to train models, run experiments, or scale ML workflows beyond your laptop, CHTC is the first place to look:

  • Free GPU access — A100s (40 and 80 GB), H100s (80 GB), and more are available at no cost through the GPU Lab.
  • No cloud bills — Unlike AWS/GCP/Azure, you don’t pay per hour. CHTC is funded institutionally.
  • Scale to thousands of jobs — HTCondor is designed for massive throughput: submit thousands of jobs from a single submit file for hyperparameter sweeps, cross-validation, or batch inference.
  • Dedicated support — Research Computing Facilitators provide one-on-one consultations, help you optimize workflows, and troubleshoot issues.
  • Handles most model sizes — The GPU Lab’s high-VRAM GPUs (up to 80 GB on H100s) can run inference on models up to ~70B parameters with quantization, or fine-tune smaller models on a single GPU.
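As a sketch of what "thousands of jobs from a single submit file" looks like, the HTCondor submit file below queues one job per learning rate (the script and file names are illustrative placeholders, not CHTC defaults):

```
# train.sub -- hyperparameter sweep sketch (illustrative names)
executable     = train.sh
arguments      = $(lr)
log            = sweep.log
output         = out/$(Process).out
error          = err/$(Process).err
request_cpus   = 1
request_memory = 4GB
request_disk   = 2GB

# One job per learning rate listed below
queue lr from (
    0.001
    0.01
    0.1
)
```

Submitting with `condor_submit train.sub` queues three independent jobs, each receiving its own `lr` value as a command-line argument.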

CHTC vs. cloud: when to use which

CHTC and UW cloud services (AWS, GCP, Azure) are complementary — many researchers use both. Here’s how they compare for ML/AI workloads:

| Factor | CHTC | Cloud (AWS, GCP, Azure) |
|---|---|---|
| Cost | Free for all UW researchers | Pay per hour; UW accounts get discounted rates and lower grant overhead |
| GPU availability | Shared queue — wait times during peak periods | On-demand (subject to quota) |
| Job time limits | 72 hrs default (CPU); varies for GPU; checkpointing needed for long runs | No time limits — jobs run as long as needed |
| Multi-GPU / NVLink | Not available — single-GPU or data-parallel only | Available on demand — needed for frontier-scale training |
| Interactive use | Limited — 4 hr interactive GPU sessions for testing; primarily a batch system | Full interactive access — Jupyter, VS Code, persistent VMs |
| Persistent services | Not supported — batch jobs only | Host model endpoints, RAG apps, dashboards, databases |
| Software environment | Docker/Apptainer containers — you build and manage them | Prebuilt ML containers + bring-your-own; managed platforms (Vertex AI, SageMaker) handle more plumbing |
| Scaling | Thousands of independent jobs via HTCondor — excellent for hyperparameter sweeps, cross-validation | Scale up (bigger machines) or out (more machines); managed orchestration |
| Support | Free 1-on-1 Research Computing Facilitator consultations | UW Public Cloud Team support; platform documentation |
| Data compliance | On-campus infrastructure | Requires risk assessment for sensitive/restricted data |

Use CHTC when you…

  • Have batch workloads that fit on a single GPU (most ML training and inference)
  • Want to run hundreds or thousands of parallel experiments at no cost
  • Can work within job time limits (with checkpointing for longer runs)
  • Are comfortable with (or willing to learn) command-line job submission

Use cloud when you…

  • Need NVLink multi-GPU for frontier-scale training (70B+ parameters, full fine-tuning of 13B+ models)
  • Need persistent services — model endpoints, RAG applications, interactive dashboards
  • Have a hard deadline and can’t wait in a shared queue
  • Need specific hardware CHTC doesn’t have (e.g., TPUs, B200s)
  • Want managed ML platforms (Vertex AI, SageMaker, Azure ML) to handle infrastructure

Use both when you…

  • Develop and test on CHTC, then scale to cloud for final large-scale runs
  • Run batch experiments on CHTC but deploy trained models as cloud endpoints
  • Use CHTC for free GPU experimentation while reserving cloud budget for specialized hardware
Tip: The learning curve is real, but help is available

CHTC is a batch system, not a notebook environment. You’ll write submit files, package environments in containers, and manage file transfers. That said, CHTC’s Research Computing Facilitators provide free 1-on-1 help and are genuinely great — reach out early and often. Most researchers find that after the initial setup, CHTC becomes a core part of their workflow.

Available hardware

HTC system

The HTC system is CHTC’s primary resource, managed by HTCondor. It is optimized for running many independent jobs in parallel — ideal for hyperparameter searches, cross-validation folds, batch inference, Monte Carlo simulations, and similar workflows.

  • ~30,000 CPU cores across shared execute nodes
  • Several high-memory servers with terabytes of RAM for memory-intensive single-node jobs
  • Dozens of GPUs available through the GPU Lab
  • 200 Gbps backbone network connectivity

GPU Lab

The GPU Lab is a pool of shared GPU servers within the HTC system. Any CHTC user can opt in. Available GPU hardware includes:

| GPU | VRAM | Notes |
|---|---|---|
| A100 | 40 GB or 80 GB | Strong general-purpose ML GPU; widely available in the GPU Lab |
| H100 | 80 GB | Higher performance and memory bandwidth; available via DSI H100 cluster (DSI Affiliates get priority) |
| L40 | 48 GB | Available via DSI L40 cluster |

To request GPUs in your jobs, add GPU requirements to your HTCondor submit file. See the GPU jobs guide for details.
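A minimal sketch of the relevant submit-file lines is below. The `+WantGPULab` and `+GPUJobLength` attributes follow CHTC's GPU Lab conventions, and the VRAM threshold is illustrative; confirm current attribute names and values in the GPU jobs guide before relying on them:

```
# Request one GPU with at least 40 GB of VRAM (threshold is illustrative)
request_gpus = 1
require_gpus = (GlobalMemoryMb >= 40000)

# Opt in to the GPU Lab and declare a job-length category (CHTC convention)
+WantGPULab   = true
+GPUJobLength = "short"
```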

Tip: Backfill for short jobs

If your jobs run in under 4–6 hours (or can checkpoint that frequently), you can access additional GPU servers beyond the GPU Lab as backfill — significantly increasing your available capacity. CHTC strongly recommends this for short or checkpointable jobs.

HPC cluster

The HPC cluster is a traditional tightly coupled compute cluster managed by Slurm. It is suited for parallel jobs (e.g., MPI-based simulations) that need fast, low-latency communication between nodes. The HPC cluster also supports Apptainer containers and software management via Spack.

What model sizes can CHTC handle?

The GPU you need depends on your model size and what you’re doing with it (inference vs. training vs. fine-tuning). Here’s a rough guide based on CHTC’s available hardware:

| Task | Model size | Precision | Min VRAM needed | CHTC GPU options |
|---|---|---|---|---|
| Inference | < 7B | FP16 | up to ~14 GB | A100 40 GB, L40 48 GB, A100 80 GB, H100 80 GB |
| Inference | 7–13B | FP16 | ~14–26 GB | A100 40 GB, L40 48 GB, A100 80 GB, H100 80 GB |
| Inference | 13–30B | INT4 (quantized) | ~10–20 GB | A100 40 GB, L40 48 GB, A100 80 GB, H100 80 GB |
| Inference | 30–70B | INT4 (quantized) | ~20–40 GB | L40 48 GB, A100 80 GB, H100 80 GB |
| Inference | 70B | INT4 (quantized) | ~40–45 GB | L40 48 GB, A100 80 GB, H100 80 GB |
| Fine-tuning (LoRA/QLoRA) | 7–13B | Mixed (BF16 + INT4 base) | ~20–30 GB | A100 40 GB, L40 48 GB, A100 80 GB, H100 80 GB |
| Fine-tuning (LoRA/QLoRA) | 30–70B | QLoRA (INT4 base) | ~40–80 GB | A100 80 GB, H100 80 GB |
| Full fine-tuning | 7B | BF16 | ~60–70 GB | A100 80 GB, H100 80 GB |
| Full fine-tuning | 13B+ | BF16 | > 80 GB (multi-GPU) | Needs NVLink — use cloud |
| Training from scratch | > 1B | BF16 | Multi-GPU w/ NVLink | Needs NVLink — use cloud |

Key takeaways:

  • Quantization extends your reach significantly. A 70B model at full FP16 precision needs ~140 GB of VRAM (impossible on a single GPU), but quantized to INT4 it fits in ~40 GB — within range of an A100 80 GB or H100.
  • LoRA/QLoRA makes fine-tuning practical on single GPUs. You don’t need to update all parameters — parameter-efficient methods let you fine-tune large models with a fraction of the memory.
  • Inference is cheaper than training. Running a model forward (inference) requires roughly 2 bytes per parameter at FP16, while training requires 2–4x more for gradients and optimizer states.
  • Beyond 80 GB, you need multi-GPU with NVLink — which CHTC doesn’t offer. For those workloads, see UW cloud services.
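The weights-only arithmetic above is easy to sketch. The helper below multiplies parameter count by bytes per parameter, with a ~20% overhead factor for activations and KV cache that is a crude assumption, not a CHTC figure:

```python
def vram_gb(n_params_billion, bytes_per_param, overhead=1.2):
    """Rough inference VRAM estimate: weights only, plus ~20%
    overhead for activations/KV cache (a crude assumption)."""
    return n_params_billion * bytes_per_param * overhead

# 70B model: FP16 is 2 bytes/param, INT4 is 0.5 bytes/param
print(round(vram_gb(70, 2.0)))   # ~168 GB -- exceeds any single GPU
print(round(vram_gb(70, 0.5)))   # ~42 GB -- fits an A100 80 GB or H100
```

The same quick check before submitting a job can save a queue round-trip on an out-of-memory failure.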
Note

These are rough estimates. Actual VRAM usage depends on batch size, sequence length, framework overhead, and the specific model architecture. When in doubt, start with a small test on an interactive GPU job.

Job time limits

| Job type | Runtime limit | Notes |
|---|---|---|
| Standard HTC jobs | 72 hours | Default; contact facilitators if you need longer |
| GPU Lab batch jobs | Varies by GPU tier | Check the GPU Lab page for current limits |
| Interactive GPU jobs | 4 hours, 1 GPU | For testing and debugging |
| Backfill jobs (group-owned GPUs) | No guaranteed runtime | Jobs may be preempted; best for short or checkpointable work |
Tip: Long training runs

If your training exceeds the job time limit, implement checkpointing — save model weights periodically so you can resume from the last checkpoint in a new job. CHTC’s machine learning guide covers this workflow. You can also contact facilitators about extended runtimes.
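The resume-from-checkpoint pattern can be sketched in a few lines of plain Python. The file name and state dictionary are placeholders; in a real PyTorch workflow you would save `model.state_dict()` with `torch.save` instead of pickling a dict:

```python
import os
import pickle

CKPT = "checkpoint.pkl"  # placeholder path; transfer this file back with the job

def load_state():
    # Resume from the last checkpoint if one was transferred in with the job
    if os.path.exists(CKPT):
        with open(CKPT, "rb") as f:
            return pickle.load(f)
    return {"epoch": 0, "weights": None}

def save_state(state):
    # Write to a temp file, then rename atomically, so a preempted
    # job never leaves behind a half-written checkpoint
    tmp = CKPT + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, CKPT)

state = load_state()
for epoch in range(state["epoch"], 10):
    # ... one epoch of training goes here ...
    state = {"epoch": epoch + 1, "weights": state["weights"]}
    save_state(state)  # checkpoint every epoch
```

If the job is killed at any epoch, resubmitting it picks up from the last saved epoch rather than starting over.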

Running ML jobs on CHTC

CHTC’s machine learning guide is the best starting point for deep learning workflows. Key considerations:

  • Software environment: CHTC supports containers (Docker/Apptainer) for packaging your ML environment — PyTorch, TensorFlow, JAX, etc. This gives you full control over your software stack.
  • Data transfer: Use /home for small files (< 100 MB) and /staging for large datasets. HTCondor’s file transfer mechanism handles up to ~500 MB total per job; for larger data, see the file availability guide.
  • Scaling out: HTCondor makes it easy to submit thousands of parallel jobs — useful for hyperparameter sweeps, k-fold cross-validation, or processing many datasets independently.
  • Checkpointing: For long training runs, save checkpoints regularly so jobs can resume if they hit time limits or are preempted on backfill hardware.
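As a sketch of the data-transfer point, a submit file might stage a large input from /staging rather than /home. The path and file name below are placeholders, and CHTC's current guides may recommend a different transfer protocol, so check the file availability guide first:

```
# Pull a large input from /staging via HTCondor's file transfer plugins
transfer_input_files    = file:///staging/username/dataset.tar.gz
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
```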

Data storage

| Location | Purpose | Default quota |
|---|---|---|
| /home | Small files, scripts, submit files | 20 GB |
| /staging | Large datasets and model files | Request as needed |

Temporary working space can support up to hundreds of terabytes for active jobs. For long-term storage needs, talk to your CHTC facilitator about options.

Scaling beyond CHTC

If CHTC’s local resources aren’t enough, you can opt into additional compute pools:

  • UW Grid — Access additional campus resources beyond the CHTC pool.
  • OS Pool (Open Science Pool) — An NSF-supported network of 100+ universities, national labs, and research collaborations. Jobs that run under ~10 hours and use less than ~20 GB of data per job are good candidates.

These options let you burst to significantly more CPUs and GPUs without any additional cost.

Citing CHTC in publications and proposals

Publications

CHTC asks that you cite their services in any publications that benefited from CHTC resources:

Center for High Throughput Computing. (2006). Center for High Throughput Computing. doi:10.21231/GNT1-HW21

Grant proposals

CHTC provides boilerplate language and letters of support for grant proposals. Key points to include: all standard CHTC services are free, CHTC has 20+ full-time staff, and local resources total ~30,000 CPU cores with access to national-scale computing through the OS Pool.

Getting help

  • Email: chtc@cs.wisc.edu
  • Office hours: Tuesdays, 10:30 AM – 12:00 PM (virtual via Zoom — see a facilitator’s email signature or your CHTC login message for the link)
  • Appointments: Email to arrange a meeting outside of office hours.
  • Guides: HTC guides | HPC guides | GPU jobs guide
  • Location: WI Institute for Discovery, 333 N Randall Ave, Room 2262
