Here’s a step-by-step guide to set up a music generator with Hugging Face, starting from a short audio clip as input. I’ll walk the explanation all the way down to model choice, code, and internals.


1. Choose a Model

On Hugging Face, you’ll find two main families of models for music/audio generation:

  • Autoregressive transformer models that generate discrete audio tokens (e.g. facebook/musicgen, suno/bark; Google’s AudioLM/MusicLM follow the same idea, with some community ports on the Hub).
  • Diffusion-based models (e.g. Riffusion, AudioLDM).

👉 For music generation, the most stable option right now is Meta’s MusicGen, which Hugging Face hosts as the facebook/musicgen-small, facebook/musicgen-medium and facebook/musicgen-large checkpoints (plus facebook/musicgen-melody for melody conditioning).

These can generate new music from text prompts, and can also take a short audio sample (your clip) as a prompt to continue.
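
If you want to browse candidates programmatically, the huggingface_hub client can list them; a minimal sketch (the search term and author filter are just one way to narrow things down):

from huggingface_hub import HfApi

# List a few MusicGen checkpoints hosted by Meta on the Hub.
api = HfApi()
for m in api.list_models(search="musicgen", author="facebook", limit=5):
    print(m.id)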


2. Install Dependencies

First set up your Python environment:

pip install torch torchaudio
pip install transformers accelerate
pip install librosa soundfile

(Optional: if you want GPU acceleration, make sure you install CUDA-enabled PyTorch.)
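
A quick way to check that your install actually sees the GPU (purely optional):

import torch

# Prints the installed PyTorch version and whether a CUDA device is visible.
print(torch.__version__)
print("CUDA available:", torch.cuda.is_available())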


3. Load the Model & Processor

from transformers import AutoProcessor, MusicgenForConditionalGeneration
import torchaudio

# load processor + model
processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")
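
Optionally move the model to the GPU before generating. This is a small optional sketch; if you use it, the processed inputs in step 5 must live on the same device (the code there moves them via model.device):

import torch

# Optional: run on GPU if one is available; CPU works but generation is slow.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)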

4. Prepare Your Input Clip

You mentioned starting with a short sample. MusicGen supports audio-prompted generation: your clip is encoded into audio tokens and used as a prefix that the model continues (for melody-style conditioning, look at facebook/musicgen-melody instead):

import librosa

# load your reference audio (e.g., 10-second WAV)
audio, sr = librosa.load("sample_clip.wav", sr=32000, mono=True)

# the processor accepts the raw 1-D NumPy array directly,
# so no manual batching or tensor conversion is needed
print(f"Loaded {audio.shape[0] / sr:.1f} s of audio at {sr} Hz")
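
If your reference clip is long, you may want to keep only the first few seconds as the prompt, since MusicGen’s total budget (prompt plus generated audio) is about 30 seconds; an optional trim:

# Optional: use only the first ~10 seconds of the clip as the audio prompt
prompt_seconds = 10
audio = audio[: prompt_seconds * sr]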

5. Run Music Generation

Now feed both text prompt and reference audio:

inputs = processor(
    audio=audio,                      # raw 1-D NumPy array from librosa
    sampling_rate=32000,
    text=["A lo-fi hip hop beat with relaxing vibes"],
    padding=True,
    return_tensors="pt"
)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

# MusicGen generates roughly 50 audio tokens per second, so ~20 s ≈ 1000 new tokens
# (the model was trained on 30-second chunks, so keep prompt + generation under ~30 s)
output = model.generate(**inputs, do_sample=True, guidance_scale=3.0, max_new_tokens=1000)

# save output
torchaudio.save("generated_music.wav", output[0].cpu(), 32000)
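
If you only want text-to-music, drop the audio arguments; same pattern, just without the clip (the prompt text is an arbitrary example):

# Text-only generation: no audio prompt
text_inputs = processor(
    text=["A cinematic orchestral theme with soaring strings"],
    padding=True,
    return_tensors="pt"
)
text_inputs = {k: v.to(model.device) for k, v in text_inputs.items()}
text_output = model.generate(**text_inputs, do_sample=True, guidance_scale=3.0, max_new_tokens=1000)
torchaudio.save("text_only_music.wav", text_output[0].cpu(), 32000)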

6. How It Works Internally

  • MusicgenForConditionalGeneration bundles three parts: a T5 text encoder, a Transformer decoder (a language model over audio tokens, trained on roughly 20K hours of licensed music), and the EnCodec neural audio codec.
  • The processor tokenizes your text prompt and packs the raw waveform; inside the model, EnCodec quantizes that waveform into discrete audio tokens.
  • When you pass a clip, its tokens become a prefix that the decoder continues, while the text embedding is injected via cross-attention.
  • generate() reuses transformers’ standard decoding machinery; MusicGen sounds best with sampling plus classifier-free guidance (do_sample=True, guidance_scale), though greedy and beam search are also available.
  • The generated tokens are decoded back into a waveform by EnCodec (see the small round-trip sketch after this list).

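To make the token pipeline concrete, here is a small round-trip sketch through the model’s built-in codec. It assumes the model and audio variables from the earlier steps and that model.audio_encoder exposes the transformers EncodecModel (true for the current MusicGen port, but treat the exact output shapes as illustrative):

import torch

# Encode the clip into discrete tokens and decode back, using MusicGen's own codec.
codec = model.audio_encoder                     # EnCodec neural audio codec

with torch.no_grad():
    # EnCodec expects (batch, channels, time)
    wav = torch.tensor(audio).unsqueeze(0).unsqueeze(0).to(model.device)
    enc = codec.encode(wav)                     # codebook indices: the "audio tokens" (~50 frames/s)
    print("token tensor shape:", tuple(enc.audio_codes.shape))
    rec = codec.decode(enc.audio_codes, enc.audio_scales)[0]   # tokens back to a waveform
    print("reconstructed waveform shape:", tuple(rec.shape))
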
The source code for MusicgenForConditionalGeneration lives in the transformers repository, under src/transformers/models/musicgen/modeling_musicgen.py.


7. Quick Web Demo (Optional)

If you want to try it in the browser without coding, Hugging Face Spaces hosts ready-made demos: 👉 search for “MusicGen” on Spaces (the official facebook/MusicGen Space is a good start).


✅ With this setup, you can:

  • Generate music purely from text.
  • Use your short audio clip as a prompt that the model continues (for melody-style “style transfer”, switch to facebook/musicgen-melody).
  • Mix both to extend the clip or create variations (e.g. by batching several prompts, as in the sketch below).
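
For instance, to get several variations in one pass you can batch multiple prompts against the same clip; a quick sketch reusing the objects defined above (the prompt texts are placeholders):

# Batch several text prompts with the same audio prefix to get variations in one call.
prompts = [
    "A lo-fi hip hop beat with relaxing vibes",
    "The same groove with warm jazz chords",
    "An upbeat electronic take on the idea",
]
batch = processor(
    audio=[audio] * len(prompts),   # repeat the clip once per prompt
    sampling_rate=32000,
    text=prompts,
    padding=True,
    return_tensors="pt",
)
batch = {k: v.to(model.device) for k, v in batch.items()}
variations = model.generate(**batch, do_sample=True, guidance_scale=3.0, max_new_tokens=1000)

for i, wav in enumerate(variations):
    torchaudio.save(f"variation_{i}.wav", wav.cpu(), 32000)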

Do you want me to also show how to train/fine-tune your own MusicGen model on custom clips (so it mimics your dataset’s style), or do you want to stick with inference only?