Overview of Amazon SageMaker
Data Storage: Setting up S3
- Use S3 for scalable, cost-effective, and flexible storage.
- Storing data on an EC2 instance's local storage is less common, but may be suitable for small, temporary datasets.
- Track your S3 storage costs, data transfer, and requests to manage expenses.
- Regularly delete unused data or buckets to avoid ongoing costs.
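As a rough illustration of the last two points, a minimal boto3 sketch for estimating a bucket's size and removing an unused object; the bucket and key names are placeholders, not values from the workshop:

```python
import boto3

s3 = boto3.client("s3")
bucket = "my-example-bucket"  # hypothetical bucket name

# Sum object sizes to estimate how much storage the bucket is using.
paginator = s3.get_paginator("list_objects_v2")
total_bytes = 0
for page in paginator.paginate(Bucket=bucket):
    for obj in page.get("Contents", []):
        total_bytes += obj["Size"]
print(f"Approximate bucket size: {total_bytes / 1e9:.2f} GB")

# Delete an object you no longer need to avoid ongoing storage costs.
s3.delete_object(Bucket=bucket, Key="old-results/tmp.csv")
```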
Notebooks as Controllers
- Use a minimal SageMaker notebook instance as a controller to manage larger, resource-intensive tasks.
- Launch training and tuning jobs on scalable instances using the SageMaker SDK.
- Tags can help track costs effectively, especially in multi-project or team settings.
- Use the SageMaker SDK documentation to explore additional options for managing compute resources in AWS.
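A hedged sketch of the controller pattern described above, assuming a hypothetical `train.py` script and placeholder bucket: the notebook itself stays on a minimal instance, while the training job runs on a larger instance requested through the SageMaker SDK, with a tag attached for cost tracking.

```python
import sagemaker
from sagemaker.sklearn.estimator import SKLearn

session = sagemaker.Session()
role = sagemaker.get_execution_role()

estimator = SKLearn(
    entry_point="train.py",            # hypothetical training script
    role=role,
    instance_type="ml.m5.xlarge",      # larger instance used only for the job
    instance_count=1,
    framework_version="1.2-1",
    sagemaker_session=session,
    tags=[{"Key": "Project", "Value": "workshop-demo"}],  # tags help track costs
)

# The heavy lifting happens on the training instance, not the notebook.
estimator.fit({"train": "s3://my-example-bucket/train/"})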
Accessing and Managing Data in S3 with SageMaker Notebooks
- Load data from S3 directly into memory to keep storage use and processing efficient.
- Periodically check storage usage and costs to manage S3 budgets.
- Use SageMaker to upload analysis results and maintain an organized workflow.
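A minimal sketch of reading a CSV from S3 straight into memory and writing results back under an organized prefix; the bucket name and keys are placeholders:

```python
import boto3
import pandas as pd
from io import StringIO

bucket = "my-example-bucket"  # hypothetical bucket name
s3 = boto3.client("s3")

# Read directly into a DataFrame without saving a copy on the notebook's disk.
obj = s3.get_object(Bucket=bucket, Key="data/train.csv")
df = pd.read_csv(obj["Body"])

# ... analysis steps here ...

# Upload results to keep the workflow organized under a results/ prefix.
csv_buffer = StringIO()
df.describe().to_csv(csv_buffer)
s3.put_object(Bucket=bucket, Key="results/summary.csv", Body=csv_buffer.getvalue())
```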
Using a GitHub Personal Access Token (PAT) to Push/Pull from a SageMaker Notebook
- Use a GitHub PAT for HTTPS-based authentication in temporary SageMaker notebook instances.
- Securely enter sensitive information in notebooks using `getpass`.
- Converting `.ipynb` files to `.py` files helps with cleaner version control and easier review of changes.
- Adding `.ipynb` files to `.gitignore` keeps your repository organized and reduces storage.
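Tying the `getpass` and notebook-conversion points together, a hedged sketch of what this can look like in a Jupyter cell; the username, repository, and notebook names are placeholders:

```python
import getpass

username = "your-github-username"          # placeholder
token = getpass.getpass("GitHub PAT: ")    # prompt without echoing the token

# Push over HTTPS, supplying the token in the URL for this one command only.
!git push https://{username}:{token}@github.com/{username}/my-repo.git main

# Convert the notebook to a .py script for cleaner diffs before committing.
!jupyter nbconvert --to script my_notebook.ipynb
```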
Training Models in SageMaker: Intro
- Environment initialization: Setting up a SageMaker session, defining roles, and specifying the S3 bucket are essential for managing data and running jobs in SageMaker.
- Local vs. managed training: Always test your code locally (on a smaller scale) before scaling things up. This avoids wasting resources on buggy code that doesn’t produce reliable results.
- Estimator classes: SageMaker provides framework-specific Estimator classes (e.g., XGBoost, PyTorch, SKLearn) to streamline training setups, each suited to different model types and workflows.
- Custom scripts vs. built-in images: Custom training scripts offer flexibility with preprocessing and custom logic, while built-in images are optimized for rapid deployment and simpler setups.
- Training data channels: Using `TrainingInput` ensures SageMaker manages data efficiently, especially for distributed setups where data needs to be synchronized across multiple instances.
- Distributed training options: Data parallelism (splitting data across instances) is common for many models, while model parallelism (splitting the model across instances) is useful for very large models that exceed instance memory.
Training Models in SageMaker: PyTorch Example
- Efficient data handling: The `.npz` format is optimized for efficient loading, reducing I/O overhead and enabling batch compatibility for PyTorch's `DataLoader`.
- GPU training: While beneficial for larger models or datasets, GPUs may introduce overhead for smaller tasks; selecting the right instance type is critical for cost-efficiency.
- Data parallelism vs. model parallelism: Data parallelism splits data across instances and synchronizes model weights, suitable for typical neural network tasks. Model parallelism, which divides model layers, is ideal for very large models that exceed memory capacity.
- SageMaker configuration: Adjusting instance counts and types lets you scale training setups. Starting with CPU training and moving to GPUs or distributed setups as needed allows for performance optimization.
- Testing locally first: Before deploying large-scale training in SageMaker, test locally with a smaller setup to ensure code correctness and efficient resource usage.
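A minimal sketch of the `.npz` loading pattern above inside a hypothetical training script, with batching via `DataLoader` and optional GPU placement; the file and array names are assumptions:

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

# Load the preprocessed arrays from a single compressed file.
data = np.load("train_data.npz")                 # hypothetical file prepared earlier
X = torch.tensor(data["X"], dtype=torch.float32)
y = torch.tensor(data["y"], dtype=torch.float32)

dataset = TensorDataset(X, y)
loader = DataLoader(dataset, batch_size=64, shuffle=True)

# Use a GPU if one is available on the chosen instance type.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
for features, labels in loader:
    features, labels = features.to(device), labels.to(device)
    # forward/backward pass would go here
    break
```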
Hyperparameter Tuning in SageMaker: Neural Network Example
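A hedged sketch of a tuning job using `HyperparameterTuner` around a PyTorch estimator; the script name, metric regex, and parameter ranges are illustrative assumptions, not the workshop's exact values:

```python
import sagemaker
from sagemaker.pytorch import PyTorch
from sagemaker.tuner import HyperparameterTuner, ContinuousParameter, IntegerParameter

session = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = session.default_bucket()

estimator = PyTorch(
    entry_point="train_nn.py",        # hypothetical neural network training script
    role=role,
    framework_version="2.0.1",
    py_version="py310",
    instance_type="ml.m5.xlarge",
    instance_count=1,
    sagemaker_session=session,
)

tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name="val_loss",
    objective_type="Minimize",
    # The regex must match whatever your training script prints for the metric.
    metric_definitions=[{"Name": "val_loss", "Regex": "val_loss=([0-9\\.]+)"}],
    hyperparameter_ranges={
        "lr": ContinuousParameter(1e-4, 1e-1),
        "batch-size": IntegerParameter(32, 256),
    },
    max_jobs=10,
    max_parallel_jobs=2,
)

tuner.fit({"train": f"s3://{bucket}/train/"})
```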
Resource Management and Monitoring
- Always stop or delete notebook instances when not in use to avoid charges.
- Regularly clean up unused S3 buckets and objects to save on storage costs.
- Monitor your expenses through the AWS Billing Dashboard and set up alerts.
- Use tags (set up earlier in the workshop) to track and monitor costs by resource.
- Following best practices for AWS resource management can significantly reduce costs and improve efficiency.
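As a rough sketch of the cleanup steps above, stopping a notebook instance and emptying an unused bucket with boto3; the resource names are placeholders:

```python
import boto3

sm = boto3.client("sagemaker")
s3 = boto3.resource("s3")

# Stop the notebook instance so it no longer accrues compute charges.
sm.stop_notebook_instance(NotebookInstanceName="my-workshop-notebook")

# Empty and delete a bucket you no longer need to stop ongoing storage costs.
bucket = s3.Bucket("my-example-bucket")
bucket.objects.all().delete()
bucket.delete()
```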