Overview of Amazon SageMaker


Data Storage: Setting up S3


  • Use S3 for scalable, cost-effective, and flexible storage.
  • Storage on the notebook's EC2 instance is less commonly used, but may be suitable for small, temporary datasets.
  • Track your S3 storage costs, data transfer, and requests to manage expenses.
  • Regularly delete unused data or buckets to avoid ongoing costs (a cleanup sketch follows this list).
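
The cleanup bullet above can be scripted. A minimal sketch using boto3 (the AWS SDK for Python); the bucket name and "old-results/" prefix are placeholders, not values from the workshop:

```python
import boto3

s3 = boto3.client("s3")
bucket = "my-workshop-bucket"  # hypothetical bucket name

# Estimate how much data a prefix holds before deciding whether to keep it.
paginator = s3.get_paginator("list_objects_v2")
total_bytes = 0
for page in paginator.paginate(Bucket=bucket, Prefix="old-results/"):
    for obj in page.get("Contents", []):
        total_bytes += obj["Size"]
print(f"old-results/ uses {total_bytes / 1e9:.2f} GB")

# Delete objects you no longer need so they stop accruing storage charges.
for page in paginator.paginate(Bucket=bucket, Prefix="old-results/"):
    for obj in page.get("Contents", []):
        s3.delete_object(Bucket=bucket, Key=obj["Key"])
```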

Notebooks as Controllers


  • Use a minimal SageMaker notebook instance as a controller to manage larger, resource-intensive tasks.
  • Launch training and tuning jobs on scalable instances using the SageMaker SDK (see the sketch after this list).
  • Tags can help track costs effectively, especially in multi-project or team settings.
  • Use the SageMaker SDK documentation to explore additional options for managing compute resources in AWS.
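
The controller pattern looks roughly like the sketch below, assuming the SageMaker Python SDK and an execution role are available in the notebook; the training script, framework version, and tag values are illustrative assumptions:

```python
import sagemaker
from sagemaker.sklearn.estimator import SKLearn

session = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = session.default_bucket()

# The notebook stays small; the heavy lifting runs on the instance type
# requested below, and the tag makes the job's cost easy to attribute.
estimator = SKLearn(
    entry_point="train.py",        # hypothetical training script
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",  # larger than the controller notebook
    framework_version="1.2-1",
    sagemaker_session=session,
    tags=[{"Key": "Project", "Value": "workshop-demo"}],
)
estimator.fit({"train": f"s3://{bucket}/data/train.csv"})
```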

Accessing and Managing Data in S3 with SageMaker Notebooks


  • Load data from S3 directly into memory to avoid redundant copies on the notebook instance and keep processing efficient.
  • Periodically check storage usage and costs to manage S3 budgets.
  • Upload analysis results from SageMaker back to S3 to maintain an organized workflow (both directions are sketched after this list).
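
A minimal sketch of reading from and writing to S3 with boto3 and pandas; the bucket name and object keys are placeholders:

```python
import io

import boto3
import pandas as pd

s3 = boto3.client("s3")
bucket = "my-workshop-bucket"  # hypothetical bucket name

# Read a CSV straight into memory instead of copying it onto the notebook's disk.
obj = s3.get_object(Bucket=bucket, Key="data/train.csv")
df = pd.read_csv(io.BytesIO(obj["Body"].read()))

# ... analysis happens here ...

# Write results back to S3 so the notebook instance stays disposable.
buffer = io.StringIO()
df.describe().to_csv(buffer)
s3.put_object(Bucket=bucket, Key="results/summary.csv", Body=buffer.getvalue())
```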

Using a GitHub Personal Access Token (PAT) to Push/Pull from a SageMaker Notebook


  • Use a GitHub PAT for HTTPS-based authentication in temporary SageMaker notebook instances (see the sketch after this list).
  • Securely enter sensitive information in notebooks using getpass.
  • Converting .ipynb files to .py files helps with cleaner version control and easier review of changes.
  • Adding .ipynb files to .gitignore keeps your repository organized and reduces storage.
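
A minimal sketch of that workflow run from a notebook cell; the username, repository, notebook filename, and branch are placeholders, and the PAT is read interactively so it never lands in the notebook or on disk:

```python
import getpass
import subprocess

user = "your-github-username"  # hypothetical
repo = "your-repo"             # hypothetical
token = getpass.getpass("GitHub PAT: ")

# Embed the PAT in the HTTPS remote URL only for this push.
remote = f"https://{user}:{token}@github.com/{user}/{repo}.git"

# Convert the notebook to a .py script for cleaner diffs, then commit and push.
subprocess.run(["jupyter", "nbconvert", "--to", "script", "analysis.ipynb"], check=True)
subprocess.run(["git", "add", "analysis.py"], check=True)
subprocess.run(["git", "commit", "-m", "Update analysis script"], check=True)
subprocess.run(["git", "push", remote, "HEAD:main"], check=True)
```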

Training Models in SageMaker: Intro


  • Environment initialization: Setting up a SageMaker session, defining roles, and specifying the S3 bucket are essential for managing data and running jobs in SageMaker (these steps are sketched after this list).
  • Local vs. managed training: Always test your code locally (on a smaller scale) before scaling things up. This avoids wasting resources on buggy code that doesn’t produce reliable results.
  • Estimator classes: SageMaker provides framework-specific Estimator classes (e.g., XGBoost, PyTorch, SKLearn) to streamline training setups, each suited to different model types and workflows.
  • Custom scripts vs. built-in images: Custom training scripts offer flexibility with preprocessing and custom logic, while built-in images are optimized for rapid deployment and simpler setups.
  • Training data channels: Using TrainingInput ensures SageMaker manages data efficiently, especially for distributed setups where data needs to be synchronized across multiple instances.
  • Distributed training options: Data parallelism (splitting data across instances) is common for many models, while model parallelism (splitting the model across instances) is useful for very large models that exceed instance memory.
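
A minimal sketch tying these pieces together with a built-in XGBoost image and a TrainingInput channel; the container version, hyperparameters, instance type, and S3 paths are illustrative assumptions:

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = session.default_bucket()

# Built-in image route: retrieve a managed XGBoost container instead of
# supplying a custom training script.
image_uri = image_uris.retrieve("xgboost", session.boto_region_name, version="1.7-1")

estimator = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type="ml.m5.large",
    output_path=f"s3://{bucket}/models/",
    sagemaker_session=session,
)
estimator.set_hyperparameters(objective="binary:logistic", num_round=100)

# TrainingInput lets SageMaker manage how the data channel is delivered
# (and, in distributed setups, synchronized) to the training instances.
train_input = TrainingInput(f"s3://{bucket}/data/train.csv", content_type="text/csv")
estimator.fit({"train": train_input})
```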

Training Models in SageMaker: PyTorch Example


  • Efficient data handling: The .npz format is optimized for efficient loading, reducing I/O overhead and enabling batch compatibility for PyTorch’s DataLoader.
  • GPU training: While beneficial for larger models or datasets, GPUs may introduce overhead for smaller tasks; selecting the right instance type is critical for cost-efficiency.
  • Data parallelism vs. model parallelism: Data parallelism splits data across instances and synchronizes model weights, suitable for typical neural network tasks. Model parallelism, which divides model layers, is ideal for very large models that exceed memory capacity.
  • SageMaker configuration: By adjusting instance counts and types, SageMaker supports scalable training setups. Start with CPU training and scale up to GPUs or distributed setups as performance demands (see the sketch after this list).
  • Testing locally first: Before deploying large-scale training in SageMaker, test locally with a smaller setup to ensure code correctness and efficient resource usage.
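
A minimal sketch of the corresponding PyTorch Estimator; the entry-point script, framework and Python versions, instance choices, and hyperparameters are assumptions rather than the workshop's exact values:

```python
import sagemaker
from sagemaker.pytorch import PyTorch

session = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = session.default_bucket()

estimator = PyTorch(
    entry_point="train_nn.py",     # hypothetical script that loads the .npz data
    role=role,
    framework_version="2.1",
    py_version="py310",
    instance_count=1,              # raise for data-parallel training
    instance_type="ml.m5.xlarge",  # start on CPU; switch to e.g. ml.g4dn.xlarge for GPU
    hyperparameters={"epochs": 10, "batch-size": 64},
    sagemaker_session=session,
)
estimator.fit({"train": f"s3://{bucket}/data/train_data.npz"})
```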

Hyperparameter Tuning in SageMaker: Neural Network Example


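A minimal sketch of SageMaker's HyperparameterTuner wrapped around a PyTorch estimator like the one in the previous section; the metric name, regex, and parameter ranges are assumptions and must match what the training script actually logs:

```python
from sagemaker.tuner import (
    HyperparameterTuner,
    ContinuousParameter,
    IntegerParameter,
)

tuner = HyperparameterTuner(
    estimator=estimator,  # estimator and bucket as defined in the PyTorch sketch
    objective_metric_name="validation:loss",
    objective_type="Minimize",
    metric_definitions=[
        {"Name": "validation:loss", "Regex": "val_loss=([0-9\\.]+)"}
    ],
    hyperparameter_ranges={
        "lr": ContinuousParameter(1e-4, 1e-1),
        "batch-size": IntegerParameter(32, 256),
    },
    max_jobs=8,           # total training jobs to run
    max_parallel_jobs=2,  # how many run at once; fewer keeps costs easier to watch
)
tuner.fit({"train": f"s3://{bucket}/data/train_data.npz"})
```
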
Resource Management and Monitoring


  • Always stop or delete notebook instances when not in use to avoid charges (a sketch for stopping idle instances follows this list).
  • Regularly clean up unused S3 buckets and objects to save on storage costs.
  • Monitor your expenses through the AWS Billing Dashboard and set up alerts.
  • Use tags (set up earlier in the workshop) to track and monitor costs by resource.
  • Following best practices for AWS resource management can significantly reduce costs and improve efficiency.
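
A minimal sketch that finds running notebook instances and stops them with boto3; the same steps can be done by hand in the SageMaker console at the end of a work session:

```python
import boto3

sm = boto3.client("sagemaker")

# List notebook instances that are currently running, then stop each one.
response = sm.list_notebook_instances(StatusEquals="InService")
for nb in response["NotebookInstances"]:
    name = nb["NotebookInstanceName"]
    print(f"Stopping {name} ...")
    sm.stop_notebook_instance(NotebookInstanceName=name)
```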