Summary and Setup

Workshop Overview

This workshop introduces foundational workflows in Amazon SageMaker: data setup, code repository setup, model training, and hyperparameter tuning within AWS’s managed environment. You’ll learn how to use SageMaker notebooks to control data pipelines, manage training and tuning jobs, and evaluate model performance. We’ll also cover strategies for scaling training and tuning efficiently, including how to choose between CPUs and GPUs and when to consider parallelized workflows (i.e., using multiple instances).

To keep costs manageable, this workshop provides tips for tracking and monitoring AWS expenses, so your experiments remain affordable. AWS isn’t free, but it is cost-effective for typical ML workflows: training roughly 100 models on a small dataset (under 10 GB) can cost under $20, which puts it within reach of many research projects.
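As a rough illustration of how an estimate like the one above can be reasoned about, the sketch below multiplies total instance time by an hourly rate. The function name, the 10-minute job length, and the $0.25/hour rate are hypothetical values chosen for illustration, not actual AWS prices; check the current SageMaker pricing page for the instance type you plan to use. The sketch also assumes cost scales linearly with instance time and ignores storage and data transfer charges.

```python
def estimate_training_cost(n_jobs, minutes_per_job, hourly_rate_usd, instances_per_job=1):
    """Estimate compute cost in USD for a batch of training jobs.

    Assumes cost is simply (total instance-hours) x (hourly rate).
    All inputs are supplied by the caller; none are real AWS prices.
    """
    if min(n_jobs, minutes_per_job, hourly_rate_usd, instances_per_job) < 0:
        raise ValueError("all inputs must be non-negative")
    instance_hours = n_jobs * instances_per_job * minutes_per_job / 60
    return round(instance_hours * hourly_rate_usd, 2)


# Hypothetical scenario: 100 jobs, 10 minutes each, one instance per job,
# at an assumed (not real) rate of $0.25 per instance-hour.
cost = estimate_training_cost(n_jobs=100, minutes_per_job=10, hourly_rate_usd=0.25)
print(f"Estimated compute cost: ${cost:.2f}")
```

Plugging in your own job counts and the published rate for your instance type gives a quick sanity check before launching a large tuning run.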

What This Workshop Does Not Cover (Yet)

Currently, this workshop does not include:

AWS Lambda for serverless function deployment
MLflow or other MLOps tools for experiment tracking
Additional AWS services beyond the core SageMaker ML workflows

If there’s a specific ML workflow or AWS service you’d like to see included in this curriculum, we’re open to developing more content to meet the needs of researchers and ML practitioners at UW–Madison (and at other research institutions). Please contact endemann@wisc.edu with suggestions or requests.

Accounts and Initial Setup

GitHub Account

If you don’t already have a GitHub account, sign up for GitHub to create a free account. A GitHub account will be required to fork and interact with the lesson repository.

AWS Account

If you don’t have an AWS account, please follow these steps:

Note: Hackathon attendees can skip this step since we are providing you with the account.

Go to the AWS Free Tier page and click Create a Free Account.
Complete the sign-up process. AWS offers a free tier with limited monthly usage. Some services, including SageMaker, may incur charges beyond free-tier limits, so be mindful of usage during the workshop. If you follow along with the materials, you can expect to incur around $10 in compute fees (e.g., from training and tuning several different models with GPU enabled at times).

Once your AWS account is set up, log in to the AWS Management Console to get started with SageMaker.

Data Sets

For this workshop, you will need the Titanic dataset. Please download the following files by right-clicking each and selecting Save as. Make sure to save them as .csv files:

Save these files to a location where they can easily be accessed. In the first episode, you will create an S3 bucket and upload this data to use with SageMaker.
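Before uploading, it can help to confirm that each download really is a readable CSV and not, say, an HTML page saved by mistake. The following is a minimal sketch using only the Python standard library. The helper name summarize_csv and the two file names in the loop are hypothetical placeholders; substitute the names of the files you actually saved.

```python
import csv
from pathlib import Path


def summarize_csv(path):
    """Return (number_of_data_rows, column_names) for a CSV file with a header row."""
    path = Path(path)
    if path.suffix.lower() != ".csv":
        raise ValueError(f"{path.name} does not have a .csv extension")
    with path.open(newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader, None)  # first row is treated as the header
        if header is None:
            raise ValueError(f"{path.name} is empty")
        n_rows = sum(1 for _ in reader)  # count the remaining rows
    return n_rows, header


# Hypothetical file names: replace with the files you downloaded.
for name in ["titanic_train.csv", "titanic_test.csv"]:
    if Path(name).exists():
        n_rows, columns = summarize_csv(name)
        print(f"{name}: {n_rows} rows, {len(columns)} columns")
    else:
        print(f"{name}: not found in the current directory")
```

If a file reports a single column or far fewer rows than expected, it was probably not saved in CSV format and should be downloaded again.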

Workshop Repository Setup

You will need a copy of our AWS_helpers repo on GitHub to explore how to manage your repo in AWS. This setup will allow you to follow along with the workshop and test out the Interacting with Repositories episode.

To do this:

Go to the AWS_helpers GitHub repository.
Click Fork (top right) to create your own copy of the repository under your GitHub account.
Once forked, you don’t need to do anything else. We’ll clone this fork once we start working in the AWS Jupyter environment using…

PYTHON

!git clone https://github.com/YOUR_USERNAME/AWS_helpers.git
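The leading ! in the command above is Jupyter/IPython syntax for running a shell command, so it only works inside a notebook cell. If you ever need the same step from a plain Python script, one possible approach is sketched below. The helper build_clone_command is a hypothetical name introduced here for illustration, and "octocat" stands in for your own GitHub username; the sketch only builds and prints the command, leaving the actual clone commented out.

```python
import subprocess


def build_clone_command(username, repo="AWS_helpers"):
    """Build the git clone command (as an argument list) for a user's fork."""
    if not username or username == "YOUR_USERNAME":
        raise ValueError("replace YOUR_USERNAME with your actual GitHub username")
    return ["git", "clone", f"https://github.com/{username}/{repo}.git"]


# "octocat" is a placeholder username used only for illustration.
command = build_clone_command("octocat")
print(" ".join(command))

# Uncomment to actually run the clone (requires git and network access):
# subprocess.run(command, check=True)
```

Passing the command as a list to subprocess.run avoids shell quoting issues, and check=True raises an error if git exits with a failure.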