Center for High Throughput Computing (CHTC)
Established in 2006, the Center for High Throughput Computing (CHTC) is UW-Madison’s core computational service provider for large-scale computing. CHTC offers two main systems — a High Throughput Computing (HTC) pool and a High Performance Computing (HPC) cluster — along with GPUs, high-memory servers, data storage, personalized consulting, and classroom support, all at no cost to UW-Madison researchers.
CHTC services are open to UW-Madison staff, students, faculty, and external collaborators (with a UW faculty/staff sponsor).
Request an account to start using CHTC. A Research Computing Facilitator will follow up to discuss your computational needs and help you get set up.
Why use CHTC for ML/AI?
If you’re a UW-Madison researcher looking to train models, run experiments, or scale ML workflows beyond your laptop, CHTC is the first place to look:
- Free GPU access — A100s (40 and 80 GB), H100s (80 GB), and more are available at no cost through the GPU Lab.
- No cloud bills — Unlike AWS/GCP/Azure, you don’t pay per hour. CHTC is funded institutionally.
- Scale to thousands of jobs — HTCondor is designed for massive throughput: submit thousands of jobs from a single submit file for hyperparameter sweeps, cross-validation, or batch inference.
- Dedicated support — Research Computing Facilitators provide one-on-one consultations, help you optimize workflows, and troubleshoot issues.
- Handles most model sizes — The GPU Lab’s high-VRAM GPUs (up to 80 GB on H100s) can run inference on models up to ~70B parameters with quantization, or fine-tune smaller models on a single GPU.
CHTC vs. cloud: when to use which
CHTC and UW cloud services (AWS, GCP, Azure) are complementary — many researchers use both. Here’s how they compare for ML/AI workloads:
| Factor | CHTC | Cloud (AWS, GCP, Azure) |
|---|---|---|
| Cost | Free for UW-Madison researchers | Pay per hour; UW accounts get discounted rates and lower grant overhead |
| GPU availability | Shared queue — wait times during peak periods | On-demand (subject to quota) |
| Job time limits | 72 hrs default (CPU); varies for GPU; checkpointing needed for long runs | No time limits — jobs run as long as needed |
| Multi-GPU / NVLink | Not available — single-GPU or data-parallel only | Available on demand — needed for frontier-scale training |
| Interactive use | Limited — 4 hr interactive GPU sessions for testing; primarily a batch system | Full interactive access — Jupyter, VS Code, persistent VMs |
| Persistent services | Not supported — batch jobs only | Host model endpoints, RAG apps, dashboards, databases |
| Software environment | Docker/Apptainer containers — you build and manage them | Prebuilt ML containers + bring-your-own; managed platforms (Vertex AI, SageMaker) handle more plumbing |
| Scaling | Thousands of independent jobs via HTCondor — excellent for hyperparameter sweeps, cross-validation | Scale up (bigger machines) or out (more machines); managed orchestration |
| Support | Free 1-on-1 Research Computing Facilitator consultations | UW Public Cloud Team support; platform documentation |
| Data compliance | On-campus infrastructure | Requires risk assessment for sensitive/restricted data |
Use CHTC when you…
- Have batch workloads that fit on a single GPU (most ML training and inference)
- Want to run hundreds or thousands of parallel experiments at no cost
- Can work within job time limits (with checkpointing for longer runs)
- Are comfortable with (or willing to learn) command-line job submission
Use cloud when you…
- Need NVLink multi-GPU for frontier-scale training (70B+ parameters, full fine-tuning of 13B+ models)
- Need persistent services — model endpoints, RAG applications, interactive dashboards
- Have a hard deadline and can’t wait in a shared queue
- Need specific hardware CHTC doesn’t have (e.g., TPUs, B200s)
- Want managed ML platforms (Vertex AI, SageMaker, Azure ML) to handle infrastructure
Use both when you…
- Develop and test on CHTC, then scale to cloud for final large-scale runs
- Run batch experiments on CHTC but deploy trained models as cloud endpoints
- Use CHTC for free GPU experimentation while reserving cloud budget for specialized hardware
CHTC is a batch system, not a notebook environment. You’ll write submit files, package environments in containers, and manage file transfers. That said, CHTC’s Research Computing Facilitators provide free 1-on-1 help and are genuinely great — reach out early and often. Most researchers find that after the initial setup, CHTC becomes a core part of their workflow.
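To make the batch workflow concrete, here is a sketch of a minimal HTCondor submit file. The script, input files, and container image names are hypothetical; resource values are illustrative, not recommendations:

```
# train.sub -- minimal single-job submit file (all file names illustrative)
universe             = container
container_image      = docker://pytorch/pytorch:latest
executable           = train.sh
transfer_input_files = train.py, data.tar.gz

request_cpus   = 1
request_memory = 8GB
request_disk   = 10GB

log    = job.log
output = job.out
error  = job.err
queue
```

Submitting is then a single command (`condor_submit train.sub`), and HTCondor handles scheduling, file transfer, and log collection.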
Available hardware
HTC system
The HTC system is CHTC’s primary resource, managed by HTCondor. It is optimized for running many independent jobs in parallel — ideal for hyperparameter searches, cross-validation folds, batch inference, Monte Carlo simulations, and similar workflows.
- ~30,000 CPU cores across shared execute nodes
- Several high-memory servers with terabytes of RAM for memory-intensive single-node jobs
- Dozens of GPUs available through the GPU Lab
- 200 Gbps backbone network connectivity
GPU Lab
The GPU Lab is a pool of shared GPU servers within the HTC system. Any CHTC user can opt in. Available GPU hardware includes:
| GPU | VRAM | Notes |
|---|---|---|
| A100 | 40 GB or 80 GB | Strong general-purpose ML GPU; widely available in the GPU Lab |
| H100 | 80 GB | Higher performance and memory bandwidth; available via DSI H100 cluster (DSI Affiliates get priority) |
| L40 | 48 GB | Available via DSI L40 cluster |
To request GPUs in your jobs, add GPU requirements to your HTCondor submit file. See the GPU jobs guide for details.
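As a sketch, GPU requests are a few extra lines in the submit file. The `+WantGPULab` and `+GPUJobLength` attributes follow CHTC's GPU Lab conventions, and `gpus_minimum_memory` is an HTCondor submit command; check the GPU jobs guide for the attribute names and values currently in use:

```
# Added to a submit file to request one GPU Lab GPU (values illustrative)
request_gpus        = 1
+WantGPULab         = true
+GPUJobLength       = "short"
gpus_minimum_memory = 40000   # require at least ~40 GB of VRAM (value in MB)
```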
If your jobs run in under 4–6 hours (or can checkpoint that frequently), you can access additional GPU servers beyond the GPU Lab as backfill — significantly increasing your available capacity. CHTC strongly recommends this for short or checkpointable jobs.
HPC cluster
The HPC cluster is a traditional shared-memory cluster managed with SLURM. It is suited for tightly coupled parallel jobs (e.g., MPI-based simulations) that need fast communication between nodes. The HPC cluster also supports Apptainer containers and software management via Spack.
What model sizes can CHTC handle?
The GPU you need depends on your model size and what you’re doing with it (inference vs. training vs. fine-tuning). Here’s a rough guide based on CHTC’s available hardware:
| Task | Model size | Precision | Min VRAM needed | CHTC GPU options |
|---|---|---|---|---|
| Inference | < 7B | FP16 | up to ~14 GB | A100 40 GB, L40 48 GB, A100 80 GB, H100 80 GB |
| Inference | 7–13B | FP16 | ~14–26 GB | A100 40 GB, L40 48 GB, A100 80 GB, H100 80 GB |
| Inference | 13–30B | INT4 (quantized) | ~10–20 GB | A100 40 GB, L40 48 GB, A100 80 GB, H100 80 GB |
| Inference | 30–70B | INT4 (quantized) | ~20–40 GB | L40 48 GB, A100 80 GB, H100 80 GB |
| Inference | 70B | INT4 (quantized) | ~40–45 GB | L40 48 GB, A100 80 GB, H100 80 GB |
| Fine-tuning (LoRA/QLoRA) | 7–13B | Mixed (BF16 + INT4 base) | ~20–30 GB | A100 40 GB, L40 48 GB, A100 80 GB, H100 80 GB |
| Fine-tuning (LoRA/QLoRA) | 30–70B | QLoRA (INT4 base) | ~40–80 GB | A100 80 GB, H100 80 GB |
| Full fine-tuning | 7B | BF16 | ~60–70 GB | A100 80 GB, H100 80 GB |
| Full fine-tuning | 13B+ | BF16 | > 80 GB (multi-GPU) | Needs NVLink — use cloud |
| Training from scratch | > 1B | BF16 | Multi-GPU w/ NVLink | Needs NVLink — use cloud |
Key takeaways:
- Quantization extends your reach significantly. A 70B model at full FP16 precision needs ~140 GB of VRAM (impossible on a single GPU), but quantized to INT4 it fits in ~40 GB — within range of an A100 80 GB or H100.
- LoRA/QLoRA makes fine-tuning practical on single GPUs. You don’t need to update all parameters — parameter-efficient methods let you fine-tune large models with a fraction of the memory.
- Inference is cheaper than training. Running a model forward (inference) requires roughly 2 bytes per parameter at FP16; training additionally stores gradients and optimizer states, which typically multiply the weight memory several times over.
- Beyond 80 GB, you need multi-GPU with NVLink — which CHTC doesn’t offer. For those workloads, see UW cloud services.
These are rough estimates. Actual VRAM usage depends on batch size, sequence length, framework overhead, and the specific model architecture. When in doubt, start with a small test on an interactive GPU job.
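The back-of-envelope arithmetic behind these numbers is simple: parameters times bytes per parameter, plus some headroom. A small sketch (the 20% overhead factor is an assumption standing in for activations and framework buffers):

```python
def vram_gb(params_billions, bytes_per_param, overhead=1.2):
    """Rough VRAM estimate in GB: parameter count x precision width,
    plus ~20% headroom for activations and framework buffers (assumed)."""
    return params_billions * bytes_per_param * overhead

# 70B model at FP16 (2 bytes/param): far beyond any single GPU (~168 GB)
print(round(vram_gb(70, 2.0)))
# Same model quantized to INT4 (0.5 bytes/param): fits an 80 GB A100/H100 (~42 GB)
print(round(vram_gb(70, 0.5)))
```

This is only a starting point; a real run should still be sized empirically with a short test job.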
Job time limits
| Job type | Runtime limit | Notes |
|---|---|---|
| Standard HTC jobs | 72 hours | Default; contact facilitators if you need longer |
| GPU Lab batch jobs | Varies by GPU tier | Check the GPU Lab page for current limits |
| Interactive GPU jobs | 4 hours, 1 GPU | For testing and debugging |
| Backfill jobs (group-owned GPUs) | No guaranteed runtime | Jobs may be preempted; best for short or checkpointable work |
If your training exceeds the job time limit, implement checkpointing — save model weights periodically so you can resume from the last checkpoint in a new job. CHTC’s machine learning guide covers this workflow. You can also contact facilitators about extended runtimes.
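The resume-from-checkpoint pattern can be sketched in framework-free Python. In real training you would save model and optimizer state with your framework's own mechanism (e.g., PyTorch's save/load functions) rather than a JSON dict; the file name and checkpoint interval here are illustrative:

```python
import json
import os

CKPT_PATH = "checkpoint.json"  # illustrative checkpoint file name

def train(total_steps, ckpt_path=CKPT_PATH, ckpt_every=100):
    """Run (or resume) a toy training loop, checkpointing every `ckpt_every` steps."""
    # Resume from the last checkpoint if a previous job hit its time limit
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            state = json.load(f)
    else:
        state = {"step": 0}

    for step in range(state["step"], total_steps):
        # ... one real training step would go here ...
        state["step"] = step + 1
        if state["step"] % ckpt_every == 0:
            tmp = ckpt_path + ".tmp"
            with open(tmp, "w") as f:
                json.dump(state, f)
            os.replace(tmp, ckpt_path)  # atomic rename: never leaves a half-written checkpoint
    return state
```

Writing to a temporary file and renaming it means a job preempted mid-write still leaves a valid checkpoint behind, which matters on backfill hardware where preemption is expected.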
Running ML jobs on CHTC
CHTC’s machine learning guide is the best starting point for deep learning workflows. Key considerations:
- Software environment: CHTC supports containers (Docker/Apptainer) for packaging your ML environment — PyTorch, TensorFlow, JAX, etc. This gives you full control over your software stack.
- Data transfer: Use `/home` for small files (< 100 MB) and `/staging` for large datasets. HTCondor's file transfer mechanism handles up to ~500 MB total per job; for larger data, see the file availability guide.
- Scaling out: HTCondor makes it easy to submit thousands of parallel jobs — useful for hyperparameter sweeps, k-fold cross-validation, or processing many datasets independently.
- Checkpointing: For long training runs, save checkpoints regularly so jobs can resume if they hit time limits or are preempted on backfill hardware.
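The scaling-out point above is where HTCondor shines: one submit file can fan out into one job per hyperparameter setting via the `queue` statement. A sketch (script and file names are hypothetical):

```
# sweep.sub -- one job per learning rate listed in lr_values.txt (names illustrative)
executable = train.sh
arguments  = --lr $(lr)

request_cpus   = 1
request_memory = 8GB

log    = sweep_$(lr).log
output = sweep_$(lr).out
error  = sweep_$(lr).err

queue lr from lr_values.txt
```

A `lr_values.txt` with 500 lines yields 500 independent jobs from a single `condor_submit`.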
Data storage
| Location | Purpose | Default quota |
|---|---|---|
| /home | Small files, scripts, submit files | 20 GB |
| /staging | Large datasets and model files | Request as needed |
Temporary working space can support up to hundreds of terabytes for active jobs. For long-term storage needs, talk to your CHTC facilitator about options.
Scaling beyond CHTC
If CHTC’s local resources aren’t enough, you can opt into additional compute pools:
- UW Grid — Access additional campus resources beyond the CHTC pool.
- OS Pool (Open Science Pool) — An NSF-supported network of 100+ universities, national labs, and research collaborations. Jobs that run under ~10 hours and use less than ~20 GB of data per job are good candidates.
These options let you burst to significantly more CPUs and GPUs without any additional cost.
Citing CHTC in publications and proposals
Publications
CHTC asks that you cite their services in any publications that benefited from CHTC resources:
Center for High Throughput Computing. (2006). Center for High Throughput Computing. doi:10.21231/GNT1-HW21
Grant proposals
CHTC provides boilerplate language and letters of support for grant proposals. Key points to include: all standard CHTC services are free, CHTC has 20+ full-time staff, and local resources total ~30,000 CPU cores with access to national-scale computing through the OS Pool.
Getting help
- Email: chtc@cs.wisc.edu
- Office hours: Tuesdays, 10:30 AM – 12:00 PM (virtual via Zoom — see a facilitator’s email signature or your CHTC login message for the link)
- Appointments: Email to arrange a meeting outside of office hours.
- Guides: HTC guides | HPC guides | GPU jobs guide
- Location: WI Institute for Discovery, 333 N Randall Ave, Room 2262