National Research Platform (NRP)

Compute
Notebooks
Code-along
GPU
LLM
API
Kubernetes
Jupyter
Education
NSF
Free open-weight LLM endpoints plus free Kubernetes GPU compute for InCommon-affiliated researchers. Reference + Python code-along for hitting the NRP-hosted LLM API.
Author: Chris Endemann

Published: May 13, 2026

Free frontier-LLM endpoints for researchers — plus free GPU compute on a multi-site Kubernetes cluster. The National Research Platform (NRP) is NSF-supported and open at no cost to researchers and educators at InCommon-affiliated institutions (UW–Madison is in). Usage is governed by the Acceptable Use Policy and per-user fair-use limits.

For ML/AI work, NRP gives you:

  1. Hosted LLM endpoints — call open-weight models through an OpenAI-compatible API at https://ellm.nrp-nautilus.io/v1. No GPU on your side; NRP runs the inference. See the model table below.
  2. General compute for everything else — your own GPU pods (RTX 2080 Ti / 3090 / 4090, A10, A100) for training, fine-tuning, or non-LLM work. Most users go through the hosted JupyterHub or Coder (VS Code in browser); classrooms can request private JupyterHub deployments via support@nationalresearchplatform.org.
Warning: NRP is not for sensitive or restricted data

NRP Nautilus has no HIPAA-eligible storage and the Acceptable Use Policy prohibits processing HIPAA / PHI, PII, CUI, FERPA student records, or any data covered by statute or data-use agreement. This applies to both what you store on the cluster and what you send into the LLM endpoint — your prompts can be cached across users on the shared deployment.

If your work involves restricted data, route it through UW–Madison’s institutional resources instead:

Public and de-identified research data is fine on NRP.

Get started

  1. Sign in at nrp.ai via CILogon. Pick one identity provider and stick with it — authentik binds your account to the first IdP you use, and switching later gives a “Permission denied” error. I typically use the Google login option.
  2. Join Matrix (nrp.ai/contact) for support and account promotion. Admin requests, LLM-flag enablement, and most help happen in the Nautilus Support channel — DMs to admins are rejected.
  3. Get into a namespace — required for both LLM API and GPU/compute access (an account alone has no resources). Three patterns:
    • Solo researcher / faculty / postdoc: ask in Nautilus Support to be promoted to admin, then create your own namespace at nrp.ai/namespaces. NRP’s convention is <institution>-<group> (e.g., wisc-ml-marathon); admins may rename to fit. Mention LLM use in your request and they’ll enable the LLM flag at the same time.
    • Lab or research group: the PI becomes the namespace admin and adds members.
    • Student: ask your advisor to add you to their existing namespace.
  4. Generate an LLM token at nrp.ai/llmtoken. If you forgot to mention LLM use up front and your namespace isn’t flagged for it, ask in Nautilus Support — it’s a one-line toggle. See Calling the LLM endpoint from Python below for what to put in the Group / Alias fields and how to wire the token into a .env file.
  5. For general compute (anything beyond the LLM endpoint — training, fine-tuning, your own notebooks): most users go through the hosted JupyterHub or Coder (browser-based, no install). Only if you want to write Kubernetes YAML directly: install kubectl + kubelogin and drop the NRP config at ~/.kube/config — full walkthrough in Getting Started.

Read the Acceptable Use Policy and Cluster Policies before submitting workloads — there are real resource-utilization rules and you can be banned for violating them.

Available LLMs and fair-use limits

The catalog rotates as the open-weights frontier moves (main = generally supported, evaluating = may change). Full per-model cards on the models page; see lifecycle & changelog before pinning a model name in production scripts.

Model            Status      Params  Context       Modalities           Max concurrent
qwen3            main        397B    1.01M         image, video         16
qwen3-small      main        27B     1.01M         image, video         8
gpt-oss          main        120B    131K          text                 16
gemma            main        31B     262K          image, video         8
qwen3-embedding  main        8B      (not listed)  image, video         16
gemma-small      evaluating  ~8B     131K          image, video, audio  8
kimi             evaluating  1T      262K          image, video         2
glm-5            evaluating  744B    203K          text                 4
minimax-m2       evaluating  230B    205K          text                 8
olmo             evaluating  32B     65K           text                 8

The Max concurrent column is transposed directly from NRP’s Fair Use Policy, which publishes the same numbers in the inverse layout (concurrency → list of models). “Max concurrent” is the number of requests you can have open at the same instant under one access token: not per minute, not per day, just simultaneously. A single-threaded script that sends one request, waits, then sends the next has concurrency 1 and never hits the limit, however long it runs. Parallel or asyncio code that fires N requests at once has concurrency N, and hits the limit as soon as N exceeds the model’s cap. There is no published requests-per-minute, tokens-per-minute, or daily token cap; concurrency is the only quantified throttle.
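One common way to stay under a concurrency cap in asyncio code is a semaphore. The sketch below is illustrative rather than NRP-provided: run_bounded and fake_request are hypothetical names, and fake_request stands in for a real API call so the example runs without a token or network access.

```python
import asyncio

async def run_bounded(make_request, prompts, max_concurrent):
    """Run make_request(prompt) for every prompt, never more than max_concurrent at once."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(prompt):
        async with semaphore:  # waits here while max_concurrent calls are in flight
            return await make_request(prompt)

    # gather returns results in the same order as the inputs
    return await asyncio.gather(*(guarded(p) for p in prompts))


# Stand-in for a real API call; it records the peak number of simultaneous calls.
in_flight = 0
peak = 0

async def fake_request(prompt):
    global in_flight, peak
    in_flight += 1
    peak = max(peak, in_flight)
    await asyncio.sleep(0.01)  # pretend the model is generating
    in_flight -= 1
    return prompt.upper()

prompts = [f"prompt {i}" for i in range(20)]
results = asyncio.run(run_bounded(fake_request, prompts, max_concurrent=8))
print(len(results), "responses, peak concurrency", peak)
```

With the openai package, make_request would typically be a coroutine that awaits chat.completions.create on an AsyncOpenAI client. Inside a notebook, where an event loop is already running, use `await run_bounded(...)` instead of asyncio.run.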

Requests using ≥35% of a model’s context drop to 1 concurrent request, and the sum of in-flight context must stay under 35%.
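A quick arithmetic check can flag requests likely to cross that threshold before you send them. This is a rough sketch under stated assumptions: the helper names are hypothetical, counting prompt tokens plus the output cap is a guess at how usage is measured, and 131,000 is only an approximation of the 131K listed for gpt-oss.

```python
LARGE_CONTEXT_FRACTION = 0.35  # threshold quoted in the fair-use text above

def context_fraction(prompt_tokens, max_output_tokens, context_window):
    """Fraction of the context window a single request could occupy (assumed: prompt + output cap)."""
    return (prompt_tokens + max_output_tokens) / context_window

def is_large_context_request(prompt_tokens, max_output_tokens, context_window):
    """True when the request reaches the 35% mark and should be sent on its own."""
    fraction = context_fraction(prompt_tokens, max_output_tokens, context_window)
    return fraction >= LARGE_CONTEXT_FRACTION

GPT_OSS_CONTEXT = 131_000  # approximate: the model table lists 131K

print(is_large_context_request(10_000, 2_000, GPT_OSS_CONTEXT))   # small request
print(is_large_context_request(45_000, 4_000, GPT_OSS_CONTEXT))   # large request
```

Token counts depend on each model's tokenizer, so treat the inputs as estimates and leave some margin.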

If you need to go beyond these limits in exceptional circumstances (deadlines, workshops, other high-volume needs), you must contact admins in advance in the Nautilus Support channel to arrange a session. Otherwise, NRP recommends deploying and running your own LLM on the cluster via vLLM or SGLang.

Point any OpenAI-compatible client at NRP’s API with your token: desktop apps (Chatbox, Cherry Studio), coding CLIs (Claude Code, OpenCode, Crush, Kimi CLI, Copilot CLI), or Python — see client configurations and API access docs. NRP also hosts browser chat UIs (Open WebUI, LibreChat), though as of May 2026 they’ve been intermittently unreliable.

Calling the LLM endpoint from Python

A short code-along. To run it yourself, you’ll need:

  1. An LLM token. Visit nrp.ai/llmtoken (you need to be in an LLM-flagged namespace — see Get started above). Two fields:

    • Group: dropdown of your LLM-flagged namespaces — pick the one this token should belong to (e.g., wisc-ml-marathon). Tokens are scoped to one namespace; if you later need one for a different group, you mint a new token.
    • Alias: free-text label for your own bookkeeping — NRP doesn’t validate it or show it anywhere else. lastname-dev is a good default (e.g., endemann-dev); use something more specific (e.g., endemann-laptop, endemann-nexus-demo) if you’ll have multiple tokens you want to tell apart and revoke independently.

    Click to mint. The page shows a long opaque string — copy it and treat it like a password. Don’t paste it into source code or commit it.

  2. A way to load the token into os.environ["NRP_LLM_TOKEN"] without typing it into a notebook cell that gets saved or shared. Two paths:

    • Local Jupyter / VS Code / etc. — Create a .env file in the same directory as this notebook:

      # .env  (do not commit this file)
      NRP_LLM_TOKEN=paste-your-long-token-string-here

      Add .env to .gitignore. The python-dotenv package in the install line below reads .env automatically into os.environ when you call load_dotenv().

    • Google Colab — Don’t paste the token into a cell; Colab autosaves to Drive in real time and your token would ride along. Use Colab’s built-in Secrets panel instead:

      1. Open the notebook (the Secrets panel only appears once a notebook is loaded, not on the Colab home screen).
      2. In the far-left vertical icon strip, click the 🔑 key icon (it’s the literal key shape — not the hamburger, which is Table of Contents).
      3. Click Add new secret, name it NRP_LLM_TOKEN, paste the token value, and toggle Notebook access on.

      Then swap the dotenv block in the install cell for:

      from google.colab import userdata
      os.environ["NRP_LLM_TOKEN"] = userdata.get("NRP_LLM_TOKEN").strip()

      The secret persists across Colab sessions, never appears in notebook output, and isn’t shared if you share the notebook.
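Either path ends with the token in os.environ. A small guard can turn a missing or blank variable into a readable error instead of a KeyError or an opaque authentication failure. This is a minimal sketch; read_token is a hypothetical helper name, not part of any NRP tooling.

```python
import os

def read_token(env=None, name="NRP_LLM_TOKEN"):
    """Return the token from the environment, or raise a readable error."""
    env = os.environ if env is None else env
    token = (env.get(name) or "").strip()  # drop stray whitespace or newlines from pasting
    if not token:
        raise RuntimeError(
            f"{name} is not set. Add it to .env (local) or to Colab Secrets "
            "before creating the client."
        )
    return token

# Demo with a plain dict standing in for os.environ; only the token length is printed.
print(len(read_token({"NRP_LLM_TOKEN": " example-token\r\n"})))
```

Call it after the environment has been populated (after load_dotenv() locally, or after the Colab Secrets lookup) and pass the result as api_key.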

Install and configure the client.

# pip install --quiet openai python-dotenv requests
import os
from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()
client = OpenAI(
    api_key=os.environ["NRP_LLM_TOKEN"].strip(),  # .strip() avoids "Illegal header value" if a \r\n snuck in during paste
    base_url="https://ellm.nrp-nautilus.io/v1",
)

List available models. Expect qwen3, gpt-oss, kimi, gemma, qwen3-embedding, etc. — the live set rotates.

for m in client.models.list().data:
    print(m.id)
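Because the live set rotates, a script can pick the first available model from a preference list instead of hard-coding one name. A minimal sketch follows; pick_model is a hypothetical helper, and the served list is an illustrative snapshot rather than live data.

```python
def pick_model(available_ids, preferences):
    """Return the first preferred model that is currently served, else raise."""
    available = set(available_ids)
    for name in preferences:
        if name in available:
            return name
    raise LookupError(
        f"None of {list(preferences)} are served; available: {sorted(available)}"
    )

# With the live client you would pass: [m.id for m in client.models.list().data]
served = ["qwen3", "gpt-oss", "gemma", "qwen3-embedding"]  # illustrative snapshot
print(pick_model(served, ["kimi", "gpt-oss"]))
```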

A chat completion. gpt-oss is a good first call — text-only, low GPU footprint, stable across model updates.

completion = client.chat.completions.create(
    model="gpt-oss",
    messages=[
        {"role": "system", "content": "You are a concise teaching assistant for a graduate ML course."},
        {"role": "user", "content": "In 2-3 sentences, when should I use LoRA instead of full fine-tuning?"},
    ],
)
print(completion.choices[0].message.content)
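The endpoint is shared and best-effort, so long-running scripts may want to retry a failed call after a short pause. The sketch below shows generic exponential backoff with jitter. call_with_retries is a hypothetical helper, and it catches every exception for brevity; real code would probably narrow that to transient errors such as connection failures.

```python
import random
import time

def call_with_retries(fn, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on an exception, wait base_delay * 2**n plus jitter, then try again."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            sleep(delay)

# Usage with the client configured above (not executed here):
# completion = call_with_retries(
#     lambda: client.chat.completions.create(
#         model="gpt-oss",
#         messages=[{"role": "user", "content": "Hello"}],
#     )
# )
```

The sleep parameter is injectable so the helper can be tested without real waiting. The openai client also has its own max_retries setting, which may already cover simple cases.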

If your client requires max_tokens, set it to roughly 1/4 to 1/3 of the model’s context window. The parameter caps the generated output, not the total context.

Toggle reasoning mode. Several models default to “thinking” mode, which adds latency. Disable via extra_body:

completion = client.chat.completions.create(
    model="qwen3-small",
    messages=[{"role": "user", "content": "One-line definition of overfitting."}],
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(completion.choices[0].message.content)

Exact keys vary by model — see each model card for glm-5, kimi, etc.

Multimodal image input. gemma, qwen3, qwen3-small, gemma-small, and kimi accept images as image_url content blocks (URL or base64 data URI). gemma-small is the only catalogued model that also accepts audio (ASR / speech-to-text).

import base64, requests
from IPython.display import Image, display

# Wikimedia blocks the default `python-requests` User-Agent and returns HTML
# instead of the image. Pass any descriptive UA and it serves the JPEG.
img = requests.get(
    "https://upload.wikimedia.org/wikipedia/commons/3/3a/Cat03.jpg",
    timeout=30,
    headers={"User-Agent": "Nexus-NRP-card/1.0 (https://uw-madison-datascience.github.io/ML-X-Nexus/)"},
).content

display(Image(data=img, width=320))  # preview the image inline before sending it

b64 = base64.b64encode(img).decode("ascii")
completion = client.chat.completions.create(
    model="gemma",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "In one sentence: what animal and what is it doing?"},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
        ],
    }],
)
print(completion.choices[0].message.content)
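For images already on disk, the same content block can be built from a file path. This is a small sketch that uses only the standard library; image_block is a hypothetical helper name, and the MIME type is guessed from the file extension, which is an assumption that will not hold for misnamed files.

```python
import base64
import mimetypes
from pathlib import Path

def image_block(path):
    """Build an image_url content block (base64 data URI) from a local image file."""
    path = Path(path)
    mime, _ = mimetypes.guess_type(path.name)  # e.g. "image/jpeg" for .jpg
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"Cannot infer an image MIME type from {path.name!r}")
    b64 = base64.b64encode(path.read_bytes()).decode("ascii")
    return {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{b64}"}}

# Usage (not executed here): drop the block into a message's content list
# content = [{"type": "text", "text": "Describe this image."}, image_block("cat.jpg")]
```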

Audio input (ASR / speech-to-text). gemma-small is the only catalogued model that accepts audio. Toy example using a short public-domain MLK clip from a Hugging Face fixture dataset:

audio = requests.get(
    "https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac",
    timeout=30,
).content
b64 = base64.b64encode(audio).decode("ascii")

completion = client.chat.completions.create(
    model="gemma-small",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Transcribe the speech in this clip verbatim."},
            {"type": "audio_url", "audio_url": {"url": f"data:audio/flac;base64,{b64}"}},
        ],
    }],
)
print(completion.choices[0].message.content)

If gemma-small rejects FLAC on your account, try a WAV or MP3 source — vLLM’s audio support varies by build. The OpenAI-compatible content-block shape stays the same; only the data:audio/<fmt>;base64,... MIME hint changes.

Embeddings for RAG. qwen3-embedding is the embedding model; do not call it for chat completions.

emb = client.embeddings.create(
    model="qwen3-embedding",
    input=[
        "Retrieval-augmented generation pairs a vector store with an LLM.",
        "LoRA inserts low-rank adapter matrices into a frozen base model.",
    ],
)
print(f"dim={len(emb.data[0].embedding)}, first 5={emb.data[0].embedding[:5]}")

Vectors plug into any vector DB (Chroma, Qdrant, pgvector, etc.).
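For a handful of documents you can skip the vector DB and rank by cosine similarity directly. The sketch below uses toy 3-dimensional vectors in place of real embeddings; the function names are illustrative.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_vec, doc_vecs):
    """Return document indices sorted from most to least similar to the query."""
    scores = [cosine_similarity(query_vec, d) for d in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)

# Toy vectors standing in for emb.data[i].embedding
docs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
query = [0.9, 0.1, 0.0]
print(rank_by_similarity(query, docs))  # -> [0, 2, 1]
```

In a real retrieval step, the query and the documents would both be embedded with the same model before ranking.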

Cache isolation for sensitive prompts.

NRP’s gateway caches prompt → response pairs across everyone who hits the same model. That’s normally a speed win — if a second user sends the literal same prompt as the first, they get the first’s response instantly, no GPU time spent. The flip side: if your prompt contains private content ("Summarize this internal memo: ..."), an unrelated user who happens to type the same prompt could pull your cached response back out.

A cache_salt is a secret random string you mix into every request so your cache entries are scoped to you. Other users sending the same prompt won’t have your salt, so they don’t match your cache (and you don’t match theirs).

Treat it like a second API key. Generate it once per project (NRP recommends ≥256 bits of entropy, i.e. 32 random bytes, base64-encoded). Save it the same way you saved NRP_LLM_TOKEN. Reuse the same value across runs so that your own repeated prompts still hit the cache entries you created in earlier runs.

Generate the salt one time (run this in a throwaway cell or terminal and copy the output):

import base64, os
print(base64.b64encode(os.urandom(32)).decode("ascii"))
# Prints a 44-character base64 string ending in "=" (32 random bytes); yours will differ.

Add it alongside the token in your .env (or Colab Secrets):

NRP_LLM_TOKEN=...
NRP_CACHE_SALT=paste-your-generated-salt-here

Then pass it on each request:

completion = client.chat.completions.create(
    model="gpt-oss",
    messages=[{"role": "user", "content": "Summarize this internal memo: ..."}],
    extra_body={"cache_salt": os.environ["NRP_CACHE_SALT"]},
)
print(completion.choices[0].message.content)
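When several calls share other extra_body options, a tiny wrapper can keep the salt from being forgotten. This is a sketch with hypothetical helper names (new_cache_salt, salted); it uses the standard library's secrets module, an alternative to os.urandom that also draws from the operating system's random source.

```python
import base64
import secrets

def new_cache_salt(num_bytes=32):
    """Return num_bytes of OS-level randomness, base64-encoded (32 bytes = 256 bits)."""
    return base64.b64encode(secrets.token_bytes(num_bytes)).decode("ascii")

def salted(extra_body, salt):
    """Return a copy of extra_body with cache_salt added; the input is left untouched."""
    merged = dict(extra_body or {})
    merged["cache_salt"] = salt
    return merged

# Example: combine the salt with the thinking toggle shown earlier.
# "example-salt" is a placeholder; in practice pass os.environ["NRP_CACHE_SALT"].
body = salted({"chat_template_kwargs": {"enable_thinking": False}}, "example-salt")
print(sorted(body))
```

Pass the result as extra_body=... on each request.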

Cluster compute essentials

NRP is Kubernetes — workloads are described in YAML. Key rules:

  • Containers are stateless. Use persistent volumes for anything you can’t lose.
  • Interactive pods: 6 hr max, capped at 2 GPUs / 32 GB RAM / 16 CPU (see Cluster Policies for the current numbers).
  • Batch work goes in a Job. sleep infinity is bannable.
  • Long-running services use a Deployment (no GPU), auto-deleted after 2 weeks.
  • GPUs must run >40% utilized. A100 quota is 0 by default and requires the A100 access request.
  • Storage idle 6+ months is purged.
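To make the Job rule concrete, here is a rough sketch that builds a one-GPU batch Job manifest as a Python dict and prints it as JSON, which kubectl accepts alongside YAML. Everything specific is an assumption for illustration: the helper name, job name, namespace, image, command, and resource sizes are hypothetical, and the cluster's own policies decide what is actually allowed.

```python
import json

def gpu_job_manifest(name, namespace, image, command, gpus=1, cpu="4", memory="16Gi"):
    """Build a minimal Kubernetes batch/v1 Job that requests GPUs."""
    resources = {"cpu": cpu, "memory": memory, "nvidia.com/gpu": gpus}
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "backoffLimit": 0,  # do not restart a failed run automatically
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "main",
                        "image": image,
                        "command": command,  # runs to completion, then the pod exits
                        "resources": {"requests": resources, "limits": resources},
                    }],
                }
            },
        },
    }

manifest = gpu_job_manifest(
    name="train-demo",                       # hypothetical job name
    namespace="wisc-ml-marathon",            # example namespace from this guide
    image="ghcr.io/example/train:latest",    # hypothetical image
    command=["python", "train.py"],          # hypothetical entry point
)
print(json.dumps(manifest, indent=2))
```

Saving the output to a file such as job.json and running kubectl apply -f job.json would submit it, assuming kubectl is configured as described in Get started.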

How NRP fits alongside other UW–Madison resources

For UW–Madison researchers and teams building LLM- or VLM-based applications, NRP complements rather than replaces other UW options. Two ways to think about the choice:

Hosted LLM endpoints (call the model via API)

  • NRP — free, open-weight models (Qwen3, GPT-OSS, Kimi, etc.) via OpenAI-compatible endpoint. Per-user concurrency limits; no service-level agreement; best-effort capacity.
  • UW Cloud Services — proprietary frontier via Bedrock / Gemini API / Azure OpenAI; pay-per-token, fastest latency at a fee. UW-provisioned accounts cut grant overhead from 55.5% to 26%, get Internet2 NET+ pricing, and are HIPAA-eligible.
  • GenAI at UW–Madison — UW-vetted free chat UIs (Google Gemini, Microsoft 365 Copilot, NotebookLM). End-user tools, not programmatic APIs — useful for ad-hoc analysis, not for building applications.

General compute (run your own code on GPUs)

  • NRP — hosted JupyterHub, Coder, and self-managed GPU pods (RTX 2080 Ti / 3090 / 4090, A10, A100) on a free, multi-site Kubernetes cluster.
  • CHTC — free GPU Lab (A100s, H100s, H200s) on a traditional high-throughput computing model: borrow GPUs for limited-duration compute, with batching for jobs that need to run longer than a single allocation. Good for training, fine-tuning, parameter sweeps, and one-off inference up to ~70B with quantization on a single 80 GB GPU. Not a model-serving platform; no NVLink multi-GPU.
  • UW Cloud Services — hyperscaler hardware (B200s, NVLink multi-GPU, world-class latency) and managed ML platforms (SageMaker, Vertex AI, Azure ML) for notebooks and training. Also the route for persistent endpoints that stay up beyond HPC job time limits. Same UW-account pricing benefits as above.

Getting help
