Generative Audio Jingle Creator

v20260501

audio-jingle

A comprehensive audio generation skill designed for professional multimedia content creation. It handles three types of audio output: music (jingles, beds), voiceovers (TTS), and sound effects (SFX). The skill intelligently routes requests to advanced models like Suno V5, Udio, MiniMax TTS, and ElevenLabs, producing a single, high-quality MP3 or WAV file ready for use in marketing, video, or podcast production.

Audio Generation Music Composition TTS Sound Design Media

Get Skill

146 downloads

Overview

Audio Jingle Skill

Three sub-modes. The active project's audioKind decides which one runs:

`audioKind`	Models we route to	Plan focus
`music`	Suno V5 (default), Udio, Lyria 2	genre + tempo + instrumentation
`speech`	MiniMax TTS (default), Fish, ElevenLabs V3	script + voice + pacing
`sfx`	ElevenLabs SFX (default), AudioCraft	texture + impact + duration

Resource map

audio-jingle/
├── SKILL.md
└── example.html

Workflow

Step 0 — Read the project metadata

audioKind, audioModel, audioDuration (seconds), and (for speech) voice. Branch by audioKind and use the values verbatim — no clarifying form unless something is marked (unknown — ask).

Important: voice is provider-specific. For minimax-tts, --voice must be a valid MiniMax voice_id (for example male-qn-qingse), not a natural-language description. If you only have a prose voice brief ("warm female narrator", "neutral Mandarin"), keep that in your plan but omit --voice so the daemon's default voice id applies, or ask the user to choose a specific id.

Step 1 — Plan

Music

Genre + reference artists (1-2)
Tempo (BPM) + key
Instrumentation (3-5 instruments max)
Vocals: yes / no / hummed / choir
Mood arc (intro → chorus → outro)

Speech

Script (final, not draft — TTS runs verbatim)
Voice target + pacing For MiniMax this means a real voice_id, not prose in --voice
Pronunciation hints for proper nouns / acronyms

SFX

Texture (impact / whoosh / ambience / foley)
Duration + envelope (sharp attack vs. gentle swell)
Layering note (single hit vs. stacked)

State the plan in 2-3 sentences before dispatching.

Step 2 — Compose the prompt

Use the format the upstream model prefers. Bind audioDuration to the API parameter directly; never put "make it 30 seconds" in prose.

Step 3 — Dispatch via the media contract

Use the unified dispatcher — do not call provider APIs by hand:

node "$OD_BIN" media generate \
  --project "$OD_PROJECT_ID" \
  --surface audio \
  --audio-kind "<music|speech|sfx>" \
  --model "<audioModel from metadata>" \
  --duration <audioDuration seconds> \
  [--voice "<provider voice id (speech only)>"] \
  --output "<short-slug>-<duration>s.mp3" \
  --prompt "<assembled prompt from Step 2 — for speech, the literal script>"

The command prints one line of JSON: {"file": {"name": "...", ...}}. The bytes land in the project; the FileViewer renders the audio transport controls automatically.

Step 4 — Hand off

Reply with: plan summary, the filename returned by the dispatcher, and one sentence on what to try if the user wants a variation (e.g. "swap tempo from 92 to 108 BPM" rather than "make it different").

Hard rules

TTS runs your script literally. Proof it before dispatching — even one stray comma changes the cadence.
MiniMax TTS rejects free-form voice prose in --voice. Use a real MiniMax voice_id (for example male-qn-qingse) or omit the flag and let the daemon's default voice apply.
Music: under 30s = single section; 30–90s = intro + body; 90s+ = full arc. Don't try to fit a 3-act song into 15 seconds.
SFX: prefer one well-described layer over a paragraph of "make it cool" — generators reward specific texture words.
Save the file every turn. The audio viewer shows transport controls the moment the file lands.

Info

Category Artificial Intelligence

Name audio-jingle

Version v20260501

Size 3.94KB

Source nexu-io/open-design

Updated At 2026-05-02