TorchAudio
About this resource
TorchAudio is an audio library for incorporating modern audio signal processing into deep learning workflows. Developed by the PyTorch team, it offers GPU-friendly tools for audio I/O, feature extraction, and augmentation. Its I/O layer decodes formats such as WAV, MP3, and FLAC into PyTorch tensors of shape (channels, time) with torchaudio.load and writes tensors back to audio with torchaudio.save. TorchAudio also provides differentiable transforms (spectrograms via STFT, Mel spectrograms, MFCC) and SoX-based effects for augmentation (pitch/tempo changes, masking). Because it is built on PyTorch, it slots cleanly into DataLoader and nn.Module pipelines, making it well suited for researchers and practitioners building speech recognition, music transcription, and other audio ML systems.
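As a minimal sketch of that I/O path (the file paths are placeholders):

```python
import torchaudio

# Decode an audio file into a float tensor of shape (channels, time)
# along with its sample rate.
waveform, sample_rate = torchaudio.load("path/to/audio.wav")
print(waveform.shape, sample_rate)  # e.g. torch.Size([2, 441000]) 44100

# Write the tensor back out as a WAV file.
torchaudio.save("path/to/copy.wav", waveform, sample_rate)
```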
Key features
- Feature 1: Tensor-first transforms
  - Spectrogram, MelSpectrogram, MFCC, Resample, AmplitudeToDB.
- Feature 2: Audio I/O
  - Load and save WAV/MP3/FLAC directly as torch.Tensor objects, ready for CPU or GPU.
- Feature 3: Augmentation
  - Pitch/tempo changes, masking, and noise via SoX effects (see the augmentation sketch after this list).
- Performance: Supports batched, GPU-accelerated processing through PyTorch and integrates smoothly with DataLoader.
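Below is a minimal augmentation sketch combining the SoX effects and masking mentioned above. The effect values are illustrative rather than tuned recommendations, and apply_effects_tensor requires TorchAudio's SoX support, which is platform-dependent.

```python
import torchaudio

waveform, sr = torchaudio.load("path/to/audio.wav")

# SoX-based pitch shift (in cents) and tempo change; the trailing
# "rate" effect keeps the output sample rate explicit.
effects = [["pitch", "200"], ["tempo", "1.1"], ["rate", str(sr)]]
augmented, sr = torchaudio.sox_effects.apply_effects_tensor(waveform, sr, effects)

# SpecAugment-style masking applied to a mel spectrogram.
mel = torchaudio.transforms.MelSpectrogram(sample_rate=sr)(augmented)
mel = torchaudio.transforms.FrequencyMasking(freq_mask_param=30)(mel)
mel = torchaudio.transforms.TimeMasking(time_mask_param=40)(mel)
```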
Integration and compatibility
TorchAudio integrates with the broader PyTorch ecosystem and with common scientific Python libraries, making it versatile for a range of tasks; a short interop sketch follows the list below.
- Frameworks Supported: PyTorch
- Compatible Libraries: NumPy, SciPy, librosa (complementary analysis), pretty_midi (export MIDI)
- Installation Instructions: `pip install torchaudio`
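The NumPy/librosa interop listed above amounts to converting between torch.Tensor and ndarray; a minimal sketch, assuming a mono or stereo input file:

```python
import torch
import torchaudio
import librosa

waveform, sr = torchaudio.load("path/to/audio.wav")

# torch.Tensor -> NumPy: librosa expects a 1-D float array.
y = waveform.mean(0).numpy()
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# NumPy -> torch.Tensor to re-enter a PyTorch pipeline.
mfcc_tensor = torch.from_numpy(mfcc)
```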
Use cases
Here are some examples of how TorchAudio can be applied to different machine learning tasks.
- Use Case 1: WAV-to-MIDI transcription
  - Preprocess audio into log-mel or CQT tensors, then train CNN/CRNN models for frame-wise note/onset prediction (a Dataset sketch follows this list).
- Use Case 2: Data augmentation
- Pitch/tempo shifts to increase robustness.
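As a rough sketch of the preprocessing stage in Use Case 1, the Dataset below turns a list of WAV paths into log-mel tensors suitable for a CNN/CRNN. The file list and the frame-wise targets are hypothetical placeholders; a real transcription setup would derive labels from aligned MIDI.

```python
import torch
import torchaudio
from torch.utils.data import Dataset

class LogMelDataset(Dataset):
    """Yields (log-mel, frame-wise target) pairs; targets here are placeholders."""

    def __init__(self, wav_paths, sample_rate=22050, n_mels=128):
        self.wav_paths = wav_paths
        self.sample_rate = sample_rate
        self.mel = torchaudio.transforms.MelSpectrogram(
            sample_rate=sample_rate, n_fft=2048, hop_length=512, n_mels=n_mels
        )
        self.to_db = torchaudio.transforms.AmplitudeToDB(stype="power")

    def __len__(self):
        return len(self.wav_paths)

    def __getitem__(self, idx):
        w, sr = torchaudio.load(self.wav_paths[idx])
        w = w.mean(0, keepdim=True)  # downmix to mono
        if sr != self.sample_rate:
            w = torchaudio.functional.resample(w, sr, self.sample_rate)
        x = self.to_db(self.mel(w))  # (1, n_mels, T)
        y = torch.zeros(x.size(-1))  # placeholder frame-wise targets
        return x, y
```

Variable-length clips need a custom collate_fn (or fixed-length cropping) before batching with DataLoader.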
Tutorials and resources
Getting started
Code snippet (this code loads audio as a tensor, resamples it so every file has the same sample rate, and converts it to a log-mel spectrogram for model input):

```python
# wav -> log-mel spectrogram tensor (model input)
import torchaudio

# Load audio as a (channels, time) tensor plus its sample rate.
w, sr = torchaudio.load("path/to/audio.wav")

# Downmix to mono if the file has more than one channel.
w = w.mean(0, keepdim=True) if w.size(0) > 1 else w

# Resample so every file shares the same sample rate.
if sr != 22050:
    w = torchaudio.transforms.Resample(sr, 22050)(w)
    sr = 22050

# Mel spectrogram, then convert power values to decibels.
mel = torchaudio.transforms.MelSpectrogram(sample_rate=sr, n_fft=2048, hop_length=512, n_mels=128)
X_db = torchaudio.transforms.AmplitudeToDB(stype="power")(mel(w))
print(X_db.shape)  # (1, 128, T)
```
High-level tips for effective use
- Optimization: Precompute log-mel spectrograms and cache them to disk for faster training.
- Memory Management: Keep n_mels moderate and hop_length large enough to limit spectrogram size, and batch by a fixed number of time frames.
- Common Pitfalls: Inconsistent sample rates or hop lengths between training and inference silently change the model's input geometry; keep them identical in both (see the config sketch after this list).
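One way to avoid the sample-rate/hop-length pitfall is to define the feature parameters in a single place and build the transform from that one source at both training and inference time. This is a sketch of a project convention, not a TorchAudio API:

```python
import torchaudio

# Single source of truth for feature parameters, imported by both the
# training script and the inference script.
FEATURES = dict(sample_rate=22050, n_fft=2048, hop_length=512, n_mels=128)

def make_mel_transform():
    return torchaudio.transforms.MelSpectrogram(**FEATURES)
```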
Questions?
If you have any lingering questions about this resource, feel free to post them on the ML+X Nexus Q&A on GitHub.
See also
- Documentation: TorchAudio Documentation: Includes the official API reference as well as tutorials.
- YouTube Tutorial: Getting Started With Torchaudio | PyTorch Tutorial: This video from the AssemblyAI YouTube channel walks through TorchAudio's basic features, including resampling and working with an audio dataset.
- Dataset: The MAESTRO Dataset: Popular dataset containing roughly 200 hours of paired audio and MIDI piano performances that can be processed with TorchAudio and used for training.