TorchAudio
About this resource
TorchAudio is an audio library for incorporating modern audio signal processing into deep learning workflows. Developed by the PyTorch team, it offers GPU-friendly tools for audio I/O, feature extraction, and augmentation. Its I/O layer decodes formats such as WAV, MP3, and FLAC into PyTorch tensors of shape (channels, time) with torchaudio.load and writes tensors back to audio with torchaudio.save. TorchAudio also provides differentiable transforms (spectrograms via STFT, Mel spectrograms, MFCC) and SoX-based effects for augmentation (pitch/tempo changes, masking). Because it is built on PyTorch, it slots cleanly into DataLoader and nn.Module pipelines, making it well suited for researchers and practitioners building speech recognition, music transcription, and other audio ML systems.
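As a minimal sketch of that I/O path (the file paths are placeholders):

```python
import torchaudio

# Decode an audio file into a float tensor of shape (channels, time)
# along with its sample rate.
waveform, sample_rate = torchaudio.load("path/to/audio.wav")
print(waveform.shape, sample_rate)  # e.g. torch.Size([2, 441000]) 44100

# Write the tensor back out as a WAV file.
torchaudio.save("path/to/copy.wav", waveform, sample_rate)
```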
Key features
- Feature 1: Tensor-first transforms
  - Spectrogram, MelSpectrogram, MFCC, Resample, AmplitudeToDB.
- Feature 2: Audio I/O
  - Load and save WAV/MP3/FLAC directly as torch.Tensor objects, ready for CPU or GPU.
- Feature 3: Augmentation
  - Pitch/tempo changes, masking, and noise via SoX effects (see the augmentation sketch after this list).
- Performance: Supports batched, GPU-accelerated processing through PyTorch and integrates smoothly with DataLoader.
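Below is a minimal augmentation sketch combining the SoX effects and masking mentioned above. The effect values are illustrative rather than tuned recommendations, and apply_effects_tensor requires TorchAudio's SoX support, which is platform-dependent.

```python
import torchaudio

waveform, sr = torchaudio.load("path/to/audio.wav")

# SoX-based pitch shift (in cents) and tempo change; the trailing
# "rate" effect keeps the output sample rate explicit.
effects = [["pitch", "200"], ["tempo", "1.1"], ["rate", str(sr)]]
augmented, sr = torchaudio.sox_effects.apply_effects_tensor(waveform, sr, effects)

# SpecAugment-style masking applied to a mel spectrogram.
mel = torchaudio.transforms.MelSpectrogram(sample_rate=sr)(augmented)
mel = torchaudio.transforms.FrequencyMasking(freq_mask_param=30)(mel)
mel = torchaudio.transforms.TimeMasking(time_mask_param=40)(mel)
```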
Integration and compatibility
TorchAudio integrates with the broader PyTorch ecosystem and with common scientific Python libraries, making it versatile for a range of tasks; a short interop sketch follows the list below.
- Frameworks Supported: PyTorch
- Compatible Libraries: NumPy, SciPy, librosa (complementary analysis), pretty_midi (export MIDI)
- Installation Instructions: `pip install torchaudio`
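The NumPy/librosa interop listed above amounts to converting between torch.Tensor and ndarray; a minimal sketch, assuming a mono or stereo input file:

```python
import torch
import torchaudio
import librosa

waveform, sr = torchaudio.load("path/to/audio.wav")

# torch.Tensor -> NumPy: librosa expects a 1-D float array.
y = waveform.mean(0).numpy()
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# NumPy -> torch.Tensor to re-enter a PyTorch pipeline.
mfcc_tensor = torch.from_numpy(mfcc)
```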
Use cases
Here are some examples of how TorchAudio can be applied to different machine learning tasks.
- Use Case 1: WAV-to-MIDI transcription
  - Preprocess audio into log-mel or CQT tensors, then train CNN/CRNN models for frame-wise note/onset prediction (a Dataset sketch follows this list).
- Use Case 2: Data augmentation
- Pitch/tempo shifts to increase robustness.
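As a rough sketch of the preprocessing stage in Use Case 1, the Dataset below turns a list of WAV paths into log-mel tensors suitable for a CNN/CRNN. The file list and the frame-wise targets are hypothetical placeholders; a real transcription setup would derive labels from aligned MIDI.

```python
import torch
import torchaudio
from torch.utils.data import Dataset

class LogMelDataset(Dataset):
    """Yields (log-mel, frame-wise target) pairs; targets here are placeholders."""

    def __init__(self, wav_paths, sample_rate=22050, n_mels=128):
        self.wav_paths = wav_paths
        self.sample_rate = sample_rate
        self.mel = torchaudio.transforms.MelSpectrogram(
            sample_rate=sample_rate, n_fft=2048, hop_length=512, n_mels=n_mels
        )
        self.to_db = torchaudio.transforms.AmplitudeToDB(stype="power")

    def __len__(self):
        return len(self.wav_paths)

    def __getitem__(self, idx):
        w, sr = torchaudio.load(self.wav_paths[idx])
        w = w.mean(0, keepdim=True)  # downmix to mono
        if sr != self.sample_rate:
            w = torchaudio.functional.resample(w, sr, self.sample_rate)
        x = self.to_db(self.mel(w))  # (1, n_mels, T)
        y = torch.zeros(x.size(-1))  # placeholder frame-wise targets
        return x, y
```

Variable-length clips need a custom collate_fn (or fixed-length cropping) before batching with DataLoader.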
Tutorials and resources
Getting started
Code snippet (this code loads audio as a tensor, resamples it so every file has the same sample rate, and converts it to a log-mel spectrogram for model input):

```python
# wav -> log-mel spectrogram tensor (model input)
import torchaudio

# Load audio as a (channels, time) tensor plus its sample rate.
w, sr = torchaudio.load("path/to/audio.wav")

# Downmix to mono if the file has more than one channel.
w = w.mean(0, keepdim=True) if w.size(0) > 1 else w

# Resample so every file shares the same sample rate.
if sr != 22050:
    w = torchaudio.transforms.Resample(sr, 22050)(w)
    sr = 22050

# Mel spectrogram, then convert power values to decibels.
mel = torchaudio.transforms.MelSpectrogram(sample_rate=sr, n_fft=2048, hop_length=512, n_mels=128)
X_db = torchaudio.transforms.AmplitudeToDB(stype="power")(mel(w))
print(X_db.shape)  # (1, 128, T)
```
High-level tips for effective use
- Optimization: Precompute log-mel spectrograms and cache them to disk for faster training.
- Memory Management: Keep n_mels moderate and hop_length large enough to limit spectrogram size, and batch by a fixed number of time frames.
- Common Pitfalls: Inconsistent sample rates or hop lengths between training and inference silently change the model's input geometry; keep them identical in both (see the config sketch after this list).
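One way to avoid the sample-rate/hop-length pitfall is to define the feature parameters in a single place and build the transform from that one source at both training and inference time. This is a sketch of a project convention, not a TorchAudio API:

```python
import torchaudio

# Single source of truth for feature parameters, imported by both the
# training script and the inference script.
FEATURES = dict(sample_rate=22050, n_fft=2048, hop_length=512, n_mels=128)

def make_mel_transform():
    return torchaudio.transforms.MelSpectrogram(**FEATURES)
```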
Questions?
If you have any lingering questions about this resource, feel free to post them on the ML+X Nexus Q&A on GitHub.
See also
- Documentation: TorchAudio Documentation: Includes the official API reference as well as tutorials.
- YouTube Tutorial: Getting Started With Torchaudio | PyTorch Tutorial: This video from the AssemblyAI YouTube channel walks through TorchAudio's basic features, including resampling and working with an audio dataset.
- Dataset: The MAESTRO Dataset: Popular dataset containing roughly 200 hours of paired audio and MIDI piano performances that can be processed with TorchAudio and used for training.