Projects 8 min read

Stars Press Start: 21 Passes to Teach TTS to Sing

A 105-second music video built in 2 days and 21 passes: a Nevada stargazer sees constellations as arcade sprites, gets abducted by stick-figure aliens, and rides a Tempest vector tunnel to Bach's Toccata in D minor reborn as chiptune rave.

Stars Press Start: 21 Passes to Teach TTS to Sing

Stars Press Start: 21 Passes to Teach TTS to Sing

105 seconds. Seven voices. 21 passes. One beam of light.

The pitch was simple: a Weird-Al-style comedy music video where a man on a Nevada porch sees constellations as arcade games, gets abducted by stick-figure aliens, rides through a Tempest vector tunnel, and returns at dawn to a blinking pixel star. Bach's Toccata in D minor, reborn as chiptune rave, with an ethereal female duet partner and a Vincent-Price alien monologue at the end.

The challenge, over 21 iterations, was making a text-to-speech voice sound musical — not just read fast, but sing, land, move someone. Here's what actually worked.

The Film

Five acts. 105 seconds. 128 BPM.

I. THE PORCH. Nevada dusk, sage-smoke air, warm concrete. A stargazer with a star chart, watching the universe boot up. "Stars press Start." The hook appears first, before anything else earns it.

II. THE GAMES. The rock kicks in. Cassiopeia eats the dark like Pac-Man quarters. Orion's belt is Tetris that never stops. Invaders march the Milky Way in programmed parade. Frogger hops the ecliptic. The arcade roll call arrives at full speed: Galaga, Defender, Joust, Dig Dug, Kong, Contra, Pitfall, Paperboy, Missile Command. Then the Konami code.

III. THE BEAM. A dramatic stop. "Wait. Not a satellite. Not a plane. That light knows frequencies I can't name." The alien contact chord. Full EDM rave deferred by one more breath.

IV. THE VOYAGE. The drop. Stick-figure aliens at a Game Boy dashboard. Garden hose spiraling past Jupiter. Tempest vector tunnels. The bliss riff arrives at the peak — 24 Fibonacci-positioned notes floating and rushing at once. Then the philosophical bridge with Bella's duet voice answering: "Stars are old. Games are older still."

V. THE RETURN. Dawn cracks the sage. A single pixel blinks in the corner of the sky — a stuck star winking like a save-point goodbye. A menacing voice from somewhere in the alien dark: the game was always already playing. Press... Start. Fade to black.

The Lyrics Carry the Song

Pass 2 traded rich imagery for catchy hooks and became meaningless. Pass 3 restored "every pixel is a promise to the algorithmic mind" and suddenly the voice had something to deliver.

The lesson: short hooks repeat well, but the verses between them must carry weight. "Stars press Start" anchors the film six times. Everything else — the sage-smoke air, Cassiopeia eating the dark like quarters in a slot, the garden hose spiraling below gravity's joke — has to earn each return of that line.

TTS Can't Sing, So It Has to Rap

ElevenLabs v3 at speed 2.40 plus ffmpeg atempo 1.5× equaled effective 3.6× delivery. Result: chipmunk. Pass 8 fixed this: 1.25× natural speed plus atempo 2.0 for a cleaner 2.5×.

Pass 9 dropped atempo entirely and introduced three-voice Daniel self-stacking: lead plus pitched +2 semitones at −11 dB plus pitched −3 semitones at −9 dB via rubberband with formants preserved. One voice with weight.

Pass 11 pulled the self-layer volumes back to −16/−14 dB because they muddied meaning in quieter sections. Pass 12 introduced voice delivery routing: four paths across the film.

  • CLEAN — contemplative verses, no stack
  • STACK — dense rapid-fire sections
  • HOOK — pitch-stacked harmonies on the six "Stars press Start" anchors
  • PRICE — villain_shadow with cathedral reverb and −1 semitone pitch-down for the alien monologue

The Duet Partner

Daniel sits slightly right in the stereo field. Bella sits slightly left, pitched +2 semitones with heavy reverb, answering his lines across the stereo field. Not background chorus. Not lead. A duet partner.

Her 21 fragments thread through the bridge and the voyage peak. She is why the bridge works. Pass 10's one change. Every subsequent pass kept her.

Math as Music Theory

Three composition techniques from this project that will appear in every future Napkin Films score:

Algorithmic velocity curves. Every melody note's velocity is computed by a double-sigmoid function that accents phrase starts and endings while dipping through the middle — mimicking human emphasis patterns, but mathematically exact. No flat-velocity chiptune.

Walking chromatic bass. The bass in each bar walks from this bar's root to next bar's root through scale and chromatic approach tones. Eight notes per bar. Constant forward motion. Bach did this. Every competent jazz bassist does this. Chiptune rarely does.

24 notes of bliss in 4 bars. Fibonacci-normalized positions within each 16-step bar: [0, 2, 4, 7, 10, 13]. Position 10/16 ≈ 0.625 ≈ 1/φ — the golden ratio gets the velocity accent. Six notes per bar is a 3:2 polyrhythm over 4/4. Feels rushed and floating at once. This is the Act 4 voyage peak.

The Neapolitan Moment

One Eb major chord where you expected Dm. A harmonic surprise that took thirty seconds to write and makes the Act 4 bridge feel like a moment, not a continuation. Bach used these constantly. Nobody applies them in chiptune because chiptune is supposed to be simple. Simplicity and surprise are the same skill.

Gospel in the Chord

73 humming clips — "mmm" from narrator_warm, narrator_deep, child_curious, elder_wise — each pitch-shifted via rubberband to the current bar's actual chord tone (root, third, fifth, or seventh of the Dm7/Fmaj7/Am7/B♭maj7/C7 progression). When the pad plays Fmaj7, three voices hum F, A, and C. Real harmony, not generic wordless vowels.

SFX as Composition

40 classic video game sounds — waka-waka, Tetris drop, invader march, Frogger hop, coin ting, 1UP, Konami code, save-point fanfare, alien contact chord — all recreated in ChipForge (square waves, LFSR noise, pitch sweeps). All tuned to D minor. All placed on bar downbeats at 128 BPM. When the video says Pac-Man, a waka-waka fires on the downbeat, in key. Punctuation, not sprinkles.

The Score

D minor. 10 ChipForge channels. J. S. Bach's Toccata and Fugue in D minor BWV 565 (c. 1704, public domain) as the foundation — the D–C♯–D mordent, the descending flourish, the fugue subject — recurring melodic cells across all five acts.

Style progression: Classical → Rock → EDM Rave across the film. 48 bars of main content at 128 BPM plus a Twinkle-style coda equivalent plus the alien outro. Five acts. 224 total discrete audio events at 2.13 per second, every one of them mathematically placed and in key.

The Cast

Character Persona Provider Role
Narrator hero_determined ElevenLabs Daniel All verses, hooks
Duet partner narrator_warm ElevenLabs Bella 21 pitched counter-fragments
Alien villain_shadow ElevenLabs Josh Beam scene, closing monologue
Gospel chorus four personas ElevenLabs 73 chord-tuned humming clips
BG ensemble five personas ElevenLabs 42 ambient whispers and ad-libs

21 Passes in 2 Days

Every stem stays archived. archive_film_pass.sh snapshots every version's audio stems via hardlinks. Iterating is cheap: change one line in the scene file, regenerate only the affected voice stem (cached TTS reused if the text is unchanged), remix in four seconds. 21 passes across 2 days was only possible because the pipeline ran surgical re-edits, not full re-renders.

Day 1 fixed structure — hook-based lyrics, voice stack, walking bass, Neapolitan chord, 7th voicings. Day 2 fixed texture — 40 chiptune SFX, Bella duet, gospel hums, Vincent Price outro, silence trimming, beat alignment, 300–500 ms breathing room before every line. The math was right from pass 3. The voice took until pass 21. Which sounds about right for anything worth making.

The Stack

Same as every Napkin Films production:

  • Animation: Python + PIL, stick figures drawn frame by frame, 854×480 at 12fps
  • Music: ChipForge, numpy-based chiptune synthesis (Bach public domain foundation, original AI composition)
  • Voices: ElevenLabs v3, seven character personas, formant-preserving pitch shifting via rubberband
  • Assembly: FFmpeg, custom finalmix script with per-act ducking windows and compressor on voice bus
  • Direction: Claude Code, Opus 4.6, 21 refinement passes

No GPU. No subscriptions beyond ElevenLabs. No animation software.

What This Proves

The previous Napkin Films entries tested drama, philosophy, comedy, and music. This one tests whether a TTS voice can be pushed to feel like a performance — not just clear delivery, but rhythm, weight, and something that sounds like it was composed for the music underneath it.

The answer is yes, at pass 21. Not at pass 1. The stack is fast enough that 21 passes in 2 days is not a grind — it's iteration. Each pass is 30 minutes of rendering plus 10 minutes of listening plus 5 minutes of directing the next change. The work happens in the listening.

Watch Stars Press Start on YouTube. The engine is open source (GPL-3.0). The film is Creative Commons (CC BY-NC 4.0). The ChipForge score is CC BY-SA 4.0. The Bach foundation is public domain.

Not a star chart. This is a rave.

If you want the full origin story of the Napkin Films stack, Four Films From Code covers how the architecture works and why constraint is the feature. Day two's batch is in HUMANS.EXE and Ten Thousand Days. The day before this one: Deduct Yourself: A Tax Day Hallucination — Grieg's Mountain King, a bunny oracle, and double-time rap over ten passes. The Agentic Development post explains the agent mode workflow underneath all of it.


Produced with Napkin Films and ChipForge, open source tools built by Joshua Ayson and AI agents at Organic Arts LLC.