Here’s a step-by-step guide to setting up a music generator with Hugging Face, starting from a short audio clip as input. I’ll take the explanation all the way down to models, code, and internals.
1. Choose a Model
On Hugging Face, you’ll find two main families of models for music/audio generation:
- AudioLM / MusicLM-style models (Google’s work; some community ports exist).
- Autoregressive transformer models that generate audio codec tokens (e.g. suno/bark, facebook/musicgen).
👉 For music generation, the most stable option right now is Meta’s MusicGen, which Hugging Face hosts (e.g. facebook/musicgen-small, used below).
These can generate new music either from text prompts or by conditioning on a short audio sample (your clip).
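If you want to see which checkpoints are available before picking one, you can query the Hub from Python. A quick sketch using huggingface_hub (installed alongside transformers); the search terms and limit are just illustrative, and the .id attribute assumes a reasonably recent huggingface_hub version:
from huggingface_hub import list_models

# list Meta's MusicGen checkpoints on the Hub (small / medium / large / melody variants)
for m in list_models(author="facebook", search="musicgen", limit=10):
    print(m.id)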
2. Install Dependencies
First set up your Python environment:
pip install torch torchaudio
pip install transformers accelerate
pip install librosa soundfile
(Optional: if you want GPU acceleration, make sure you install a CUDA-enabled PyTorch build.)
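Before going further, it’s worth confirming that PyTorch actually sees your GPU; if this prints False, everything below still works, just slowly on CPU:
import torch

# True only if you installed a CUDA build of PyTorch and a compatible GPU is visible
print(torch.cuda.is_available())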
3. Load the Model & Processor
from transformers import AutoProcessor, MusicgenForConditionalGeneration
import torchaudio
# load processor + model
processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")
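If you do have a GPU, move the model onto it now so generation runs at a reasonable speed; this step is optional, and the code below falls back to CPU automatically:
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)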
4. Prepare Your Input Clip
You mentioned starting with a short sample. MusicGen supports audio conditioning: your clip is encoded into EnCodec tokens and used as a prompt that the model continues.
import librosa
# load your reference audio (e.g., a 10-second WAV);
# MusicGen expects 32 kHz mono, so resample on load
audio, sr = librosa.load("sample_clip.wav", sr=32000, mono=True)
# librosa returns a 1-D NumPy float array, which the processor accepts directly;
# there is no need to convert it to a torch tensor or add a batch dimension yourself
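Optionally, trim and normalize the reference before encoding it; a few seconds of prompt is usually enough for conditioning. A minimal sketch (the 5-second cut-off and peak normalization are arbitrary choices, not requirements of MusicGen):
import numpy as np

# keep only the first 5 seconds of the reference clip
audio = audio[: 5 * sr]

# peak-normalize so the prompt is neither very quiet nor clipped
peak = np.max(np.abs(audio))
if peak > 0:
    audio = audio / peak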
5. Run Music Generation
Now feed both text prompt and reference audio:
inputs = processor(
    audio=audio,
    sampling_rate=32000,
    text=["A lo-fi hip hop beat with relaxing vibes"],
    padding=True,
    return_tensors="pt",
)

# keep the inputs on the same device as the model (a no-op on CPU)
inputs = inputs.to(model.device)

# Generate roughly 20 seconds of new audio: MusicGen emits ~50 EnCodec tokens per second,
# so 20 s ≈ 1000 new tokens (not 20 * 32000 — that would be raw waveform samples)
output = model.generate(**inputs, do_sample=True, max_new_tokens=1000)

# save output: shape is [batch, channels, samples], so take the first (only) item
torchaudio.save("generated_music.wav", output[0].cpu(), 32000)
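Output quality depends a lot on the decoding settings. Here is a hedged variant of the generate call showing the main knobs; guidance_scale is MusicGen’s classifier-free-guidance strength (the Transformers examples use around 3), the other values are just starting points, and the sampling rate is read from the model config instead of being hard-coded:
# read the output sampling rate from the model rather than hard-coding 32000
sampling_rate = model.config.audio_encoder.sampling_rate

output = model.generate(
    **inputs,
    do_sample=True,       # sampling usually sounds better than greedy decoding for music
    guidance_scale=3,     # higher = sticks closer to the text prompt
    temperature=1.0,      # >1.0 more adventurous, <1.0 more conservative
    max_new_tokens=1000,  # ~20 seconds at ~50 tokens per second
)
torchaudio.save("generated_music_v2.wav", output[0].cpu(), sampling_rate)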
6. How It Works Internally
- MusicgenForConditionalGeneration wraps an autoregressive Transformer decoder trained on roughly 20k hours of licensed music, together with a T5 text encoder and the EnCodec audio codec (you can inspect all three directly; see the sketch after this list).
- The processor converts raw audio → discrete tokens (quantized via EnCodec) and tokenizes the text prompt.
- When you pass a clip, generation is conditioned on those audio tokens, blended with your text prompt.
- generate() calls Hugging Face’s standard decoding utilities (greedy search, sampling, beam search).
- The generated tokens are then decoded back into a waveform by EnCodec.
The source code for MusicgenForConditionalGeneration lives here:
- Hugging Face Transformers: src/transformers/models/musicgen
- EnCodec tokenizer: facebookresearch/encodec
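If you want to poke at those pieces from Python, the composite model exposes them as attributes; a small exploratory sketch (attribute names as in recent Transformers releases — check your installed version if they differ):
# the three components behind MusicgenForConditionalGeneration
print(type(model.text_encoder))   # T5 encoder for the text prompt
print(type(model.audio_encoder))  # EnCodec model that tokenizes / decodes audio
print(type(model.decoder))        # autoregressive decoder over codec tokens

# the token rate that links duration to max_new_tokens (~50 tokens per second)
print(model.config.audio_encoder.frame_rate)
print(model.config.audio_encoder.sampling_rate)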
7. Quick Web Demo (Optional)
If you want to try it in the browser without writing any code, Hugging Face Spaces hosts ready-made demos: 👉 MusicGen Space
✅ With this setup, you can:
- Generate music purely from text.
- Use your short audio clip as a “style transfer” seed.
- Mix both to extend the clip or create variations (see the sketch below).
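Because decoding is stochastic when do_sample=True, re-running generation gives you different takes on the same clip and prompt. A small sketch for batching out a few variations (the seed values and filenames are arbitrary):
import torch

for i in range(3):
    torch.manual_seed(i)  # fix the seed per take so each variation is reproducible
    variation = model.generate(**inputs, do_sample=True, max_new_tokens=1000)
    torchaudio.save(f"variation_{i}.wav", variation[0].cpu(), 32000)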
Do you want me to also show how to train/fine-tune your own MusicGen model on custom clips (so it mimics your dataset’s style), or do you want to stick with inference only?