UW-Madison Cloud Services (AWS, GCP, Azure)
UW-Madison offers enterprise cloud computing through contracts with Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure. These services are managed by the UW Public Cloud Team, a cross-disciplinary group of operations, cybersecurity, and research cyberinfrastructure (RCI) professionals.
Using a UW-provisioned cloud account — rather than a personal one — gives you access to institutional pricing discounts, lower overhead on grants, data protection agreements, security monitoring, and dedicated support. If you’re doing any research or university work in the cloud, start here.
Why run ML/AI in the cloud?
You have ML/AI code that works on your laptop. But at some point you need more — a bigger GPU (or several), a dataset that won’t fit on disk, or the ability to run dozens of training experiments overnight. You could invest in local hardware or compete for time on a shared HPC cluster, but cloud platforms let you rent exactly the hardware you need, for exactly as long as you need it, and then shut it down.
Cloud vs. university HPC clusters
Most universities offer shared HPC clusters with GPUs. These are excellent resources — but they have tradeoffs worth understanding:
| Factor | University HPC | Cloud (AWS, GCP, Azure) |
|---|---|---|
| Cost | Free or subsidized | Pay per hour |
| GPU availability | Shared queue; wait times during peak periods and per-job runtime limits (often 24–72 hrs) that may require checkpointing long training runs | On-demand (subject to quota); jobs run as long as needed |
| Hardware variety | Fixed hardware refresh cycle (3–5 years) | Latest GPUs available immediately (A100, H100, B200) |
| Scaling | Limited by cluster size | Spin up hundreds of jobs in parallel |
| Multi-GPU / NVLink | Sometimes available, depends on cluster | Available on demand — essential for training, fine-tuning, or serving large LLMs that don’t fit in a single GPU’s memory |
| Job orchestration | Writing scheduler scripts, packaging environments, and wiring up parallel job arrays can take significant refactoring | Managed ML platforms (Vertex AI, SageMaker, Azure ML) handle provisioning, parallelism, and teardown |
| Software environment | Module system; some clusters support containers — research computing staff can often help with setup | Prebuilt containers for common ML frameworks (PyTorch, TensorFlow, XGBoost); bring your own Docker image for full control |
The short version: use your university cluster when it has the hardware you need and the queue isn't blocking you. Use the cloud when you need hardware your cluster doesn't have, need to scale beyond what the queue allows, or need a specific software environment you can't easily get on campus. Many researchers use both — develop and test on HPC, then scale to cloud for large experiments or specialized hardware.
When does model size justify cloud compute?
Not every model needs cloud hardware. Here’s a rough guide:
| Model scale | Parameters | Example models | Where to run |
|---|---|---|---|
| Small | < 10M | Logistic regression, small CNNs, XGBoost | Laptop — HPC or cloud adds overhead without much benefit |
| Medium | 10M–500M | ResNets, BERT-base, mid-sized transformers | HPC with a single GPU (RTX 2080 Ti, L40) or cloud (T4, L4) |
| Large | 500M–10B | GPT-2, LLaMA-7B, fine-tuning large transformers | HPC with A100 (40/80 GB) or cloud — both work well |
| Very large | 10B–70B | LLaMA-70B, Mixtral | HPC with H100/H200 (80–141 GB) or cloud |
| Frontier | 70B+ | GPT-4-scale, mixture-of-experts models | Cloud — requires multi-node NVLink clusters beyond what most HPC queues offer |
CHTC’s GPU Lab covers more than you might think. The GPU Lab includes A100s (40 and 80 GB), H100s (80 GB), and H200s (141 GB) — enough VRAM to run inference on models up to ~70B parameters with quantization, or to fine-tune smaller models on a single high-memory GPU. For many UW researchers, this hardware handles “large model” workloads without needing cloud. Note that CHTC GPUs are not NVLink-connected, so multi-GPU parallelism is limited to methods that don’t require fast inter-GPU communication. Jobs have time limits (12 hrs for short, 24 hrs for medium, 7 days for long jobs), so plan your runs accordingly.
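A quick way to sanity-check whether a model fits on a given GPU is to estimate its inference memory footprint: parameter count times bytes per parameter, plus some headroom for activations and KV cache. The sketch below uses an illustrative 20% overhead factor — an assumption, not a measured value:

```python
def vram_gb(params_billions: float, bits_per_param: int, overhead: float = 1.2) -> float:
    """Rough inference memory estimate: params * bytes/param * overhead factor.

    The overhead factor (default 1.2) is an illustrative assumption covering
    activations and KV cache; real usage varies by batch size and context length.
    """
    bytes_total = params_billions * 1e9 * (bits_per_param / 8)
    return bytes_total * overhead / 1e9  # bytes -> GB

# A 70B-parameter model:
print(round(vram_gb(70, 16), 1))  # fp16: ~168 GB -- needs multiple GPUs
print(round(vram_gb(70, 4), 1))   # 4-bit quantized: ~42 GB -- fits one H100 (80 GB)
```

This is why the ~70B-with-quantization figure above works: at 4 bits per parameter the model drops from well over an H200's 141 GB down to roughly 42 GB.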
Cloud becomes the clear choice when you need:
- NVLink multi-GPU or multi-node setups for frontier-scale training or inference,
- long-running services, such as RAG applications or model endpoints, that must stay up beyond HPC job time limits, or
- a way around queue wait times that are blocking a deadline.
LLM APIs: skip the infrastructure entirely
For many GenAI tasks, you don’t need to provision GPUs at all. Services like the OpenAI API, Google’s Vertex AI, and Amazon Bedrock let you call frontier models (GPT-4o, Gemini, Claude, etc.) with a simple API request — no GPU provisioning, no model hosting. LLM API calls cost fractions of a cent each and are often the fastest, most cost-effective path. See GenAI at UW-Madison for available services.
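To see why "fractions of a cent" holds, you can estimate a call's cost from its token counts and the provider's per-million-token rates. The rates below are hypothetical placeholders for illustration — check your provider's current pricing page:

```python
def api_call_cost(prompt_tokens: int, completion_tokens: int,
                  usd_per_m_input: float = 2.50,
                  usd_per_m_output: float = 10.00) -> float:
    """Estimate the cost of one LLM API call from token counts.

    Default rates are hypothetical examples, not actual published prices.
    """
    return (prompt_tokens * usd_per_m_input
            + completion_tokens * usd_per_m_output) / 1_000_000

# A 1,000-token prompt with a 500-token response:
print(f"${api_call_cost(1000, 500):.4f}")  # $0.0075 -- well under a cent
```

Even at thousands of calls, this is often far cheaper than keeping a GPU instance running to serve a self-hosted model.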
A note on cloud costs
Cloud computing is not free, but it’s worth putting costs in context:
- Hardware is expensive and ages fast. A single A100 GPU costs ~$15,000 and is outdated within a few years. Cloud lets you rent the latest hardware by the hour.
- You pay only for what you use. Stop a VM and the meter stops — valuable for bursty research workloads. A single T4 GPU instance runs ~$1–3/hr. Fine-tuning a small model on a moderate dataset might cost $10–50.
- Managed services save development time. You don’t have to write scheduling logic, package custom containers, or maintain orchestration infrastructure — managed ML platforms handle that plumbing so you can focus on the ML.
- Budgets and alerts keep you safe. All three platforms offer billing dashboards and budget alerts to prevent surprise bills.
The key habit: choose the right machine size, stop resources when idle, and monitor spending.
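Using the rough figures above (~$15,000 to buy an A100, ~$3/hr to rent a GPU instance), a simple break-even calculation shows why renting favors bursty research workloads. Both numbers are order-of-magnitude estimates from this page, not vendor quotes:

```python
def breakeven_hours(purchase_usd: float, rental_usd_per_hr: float) -> float:
    """Hours of continuous rental at which buying would have been cheaper.

    Ignores power, cooling, and admin costs of owned hardware (which push
    the real break-even point even further out).
    """
    return purchase_usd / rental_usd_per_hr

hours = breakeven_hours(15_000, 3.0)
print(f"{hours:.0f} GPU-hours (~{hours / 24:.0f} days of continuous use)")
# 5000 GPU-hours (~208 days) -- bursty workloads rarely get close
```

If your GPUs would sit idle most of the time, renting wins; if you will saturate them for years, owned or HPC hardware wins.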
Cloud isn’t the right fit for every workload. If you want to avoid cloud costs, UW’s CHTC offers free GPU access for batch jobs (though jobs are queued and have runtime limits). Many researchers use a mix of both.
There is a learning curve, as with any new tool. But UW-developed workshop materials are available to help you get started — see the Related resources at the bottom of this page.
Why use a UW-provisioned account?
A self-provisioned cloud account (e.g., one you create directly with Google or AWS) is a personal agreement between you and the vendor — it is not covered by UW-Madison’s institutional contracts. By going through the UW Public Cloud Team, you get:
- Negotiated pricing: UW contracts leverage Internet2 NET+ agreements and institutional reseller rates. For example, GCP accounts include a network egress waiver (up to 15% of your total bill), and Azure accounts receive ~3.5% off retail pricing.
- Lower overhead on grants: Normally, UW adds 55.5% in overhead (F&A) to cloud expenses on grants. With a UW cloud account, that drops to 26% — so for every $10,000 you spend on cloud computing, you save about $2,950 in overhead. See the Cloud Computing Pilot for details.
- NIH STRIDES discounts: NIH-funded researchers get additional cloud pricing discounts (on top of the UW contract rates) through the STRIDES Initiative. The UW cloud team can transition you into or out of STRIDES at any time — no data migration needed.
- Business Associates Agreement (BAA): UW’s contracts include a BAA that governs vendor access to your data, which is critical for HIPAA-regulated health data.
- Security monitoring: UW accounts benefit from Security Command Center monitoring with alerts escalated to the UW Cybersecurity Operations Team (CSOC).
- Baseline security configuration: Accounts come pre-configured to meet CIS benchmark standards with NetID authentication built in.
- Dedicated support: Get help from the DoIT Cloud Team via email (cloud-services@cio.wisc.edu), office hours, and in-person/video consultations.
For the full breakdown, see Why Should I Use a UW Madison Public Cloud Account? on the UW KnowledgeBase.
Paying for cloud compute with grant money
If you’re using grant funding to pay for cloud compute — from NIH, NSF, DOE, or any other sponsor — a UW-provisioned account can significantly reduce what your grant actually pays.
Lower overhead (Cloud Computing Pilot)
UW-Madison normally adds 55.5% in overhead (formally called “F&A” or “facilities & administrative costs”) to cloud expenses on grants. The Cloud Computing Pilot cuts that to 26% when you use a UW-provisioned cloud account. In practice, that means for every $10,000 in cloud spending, you’ll pay ~$2,600 in overhead instead of ~$5,550 — a savings of about $2,950.
- Applies to new proposals and awards (including new funding increments).
- You must use a UW cloud account — costs paid via purchasing card or personal accounts are charged the full 55.5%.
- RSP provides budget templates to help you plan proposals with the reduced rate.
- Contact RSP with questions about grant compliance.
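The overhead arithmetic above can be turned into a quick budgeting check — the two rates come straight from this section:

```python
def overhead_savings(direct_cloud_usd: float,
                     standard_rate: float = 0.555,
                     pilot_rate: float = 0.26) -> float:
    """F&A dollars saved by using a UW cloud account under the Cloud Computing Pilot."""
    return round(direct_cloud_usd * standard_rate
                 - direct_cloud_usd * pilot_rate, 2)

print(overhead_savings(10_000))  # 2950.0 -- matches the ~$2,950 figure above
```

For proposal budgeting, remember the reduced rate applies only to spending routed through a UW-provisioned account.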
NIH STRIDES Initiative
If you have NIH funding specifically, you can get additional cloud discounts on top of the standard UW rates through the STRIDES Initiative. STRIDES covers AWS, GCP, and Azure:
- Discounted pricing on cloud services, layered on top of UW’s institutional rates.
- Professional service consultations and technical support from STRIDES partners.
- No data or configuration changes needed — the UW cloud team can transition you in or out at any time.
How to request a UW cloud account
To get started with any of the three platforms:
- Get a DoIT Billing Customer ID — you’ll need this to tie your cloud usage to a funding source.
- Fill out the UW-Madison Cloud Account Request Form — this covers AWS, GCP, and Azure. Indicate your intended data types and use case.
- For sensitive/restricted data — you must complete a Cybersecurity risk assessment before processing HIPAA, FERPA, or other regulated data in the cloud.
Research credits & training
Research credits
All three cloud providers offer credit programs for academic researchers:
| Platform | Program | Amount | Eligibility |
|---|---|---|---|
| GCP | Cloud Research Credits | Up to $5,000 (faculty/postdocs); $1,000 (PhD students) | Faculty, postdocs, non-profit researchers, PhD students |
| AWS | Cloud Credit for Research | Varies by proposal | Researchers at accredited institutions; students may receive up to $5,000 |
| Azure | Azure for Research | Varies by proposal | Faculty, researchers, and graduate students at accredited institutions |
| Azure | Azure Quantum Credits | Up to $10,000 | Project-by-project basis; evaluated on research, educational, or commercial value |
These programs accept applications on a rolling basis. You’ll need a research proposal describing your intended cloud usage and the specific services you plan to use.
Free cloud training
Each platform offers free, self-paced training to help you get started:
- GCP: UW-Madison has a limited number of seats for Google Cloud Skills Boost — reach out to the Public Cloud Team at cloud-services@cio.wisc.edu to request access.
- AWS: AWS Skill Builder offers 600+ free courses covering compute, ML, and more.
- Azure: Microsoft Learn provides free, structured learning paths for Azure services.
Data protection & compliance
UW-Madison classifies institutional data into four risk categories: Restricted, Sensitive, Internal, and Public. Cloud eligibility depends on data classification:
| Data type | Cloud eligible? | Requirements |
|---|---|---|
| Public / Internal | Yes | Standard UW cloud account |
| Sensitive | Yes, with assessment | Cybersecurity risk assessment required |
| Restricted (HIPAA, etc.) | Yes, with assessment | Risk assessment + risk executive approval + HIPAA-eligible services |
Key compliance resources:
- Data classification policy
- Data elements allowed in public cloud
- GCP for sensitive and restricted data
- Shared responsibility model for cloud platforms
- HIPAA Security Program
- SMPH researchers using Azure: contact platformx-support@mailplus.wisc.edu about Platform X for HIPAA workloads.
Getting help
- Office hours: The RCI and Public Cloud Team hold drop-in hours on Thursdays, 2–3:15 PM via Zoom. Open to the entire UW community.
- Cloud Community: Join the UW Cloud Community group — they meet every other month to share cloud computing experiences and tips.
- Email: cloud-services@cio.wisc.edu
- Public Cloud KnowledgeBase: kb.wisc.edu — FAQs, pricing info, and how-to guides.
- ML+X Community: Join ML+X for monthly meetings on machine learning and AI.